Preface


In this decade, the field of parallel processing has exploded! This conference 
provides clear proof: in the early 1980's on the order of 100 to 125 papers were 
submitted annually; in contrast, over 400 papers were submitted this year. Such 
growth is both gratifying and challenging. On the one hand, it indicates a high and 
growing level of interest in parallel processing techniques and technologies. The 
resulting parallel systems are sorely needed to provide computing resources for ever 
more demanding applications in science, medicine, commerce, and industry. 


On the other hand, such growth also challenges the logistics implicit in 
organizing the conference. Consider, for example, the paper selection process: (1) 
papers are submitted to the program co-chairs, (2) the co-chairs reassign some 
papers to other program areas (based on the co-chairs' mutually agreed definitions 
of the subject areas), (3) at least three reviews are solicited for each paper, (4) 
as the program selection date approaches, additional reviews are solicited (on a
"crash" basis) for those papers with fewer than three reviews, (5) each program
co-chair makes a tentative selection of papers and organizes them into potential
sessions, (6) the three co-chairs merge their tentative programs, coalescing 
overlapping sessions, deleting orphan papers (those good papers that don't quite fit 
with others to form cohesive sessions), and continue to raise the acceptance 
standards to meet the conference size constraints, and finally (7) acceptance and 
rejection letters are sent to the authors. The difficulty of this process increases 
in proportion to the number of submissions, and in inverse proportion to the paper 
acceptance rate. 


For many conferences, a typical paper acceptance rate is 60% to 70%. In our
case, we were limited by the conference facilities, and by the desire that the
proceedings be portable, to a 45% acceptance rate. While such a low rate indicates
that the papers that comprise these proceedings are of extremely high quality, it
also indicates that many good papers had to be rejected. The impact is probably
more profound on new entrants to the field and foreign (i.e., non-native English
speaking) authors than upon experienced authors, which may give the appearance that
the field has become "in-bred". This year's program committee has tried to avoid
any such prejudice, but given our limitations, the program can only be viewed as a 
best approximation to the rapidly evolving state of the art of parallel processing 
in 1986. 


It has been our privilege to have received support from a large and dedicated 
assembly of reviewers as listed on the next page. Without their rapid assistance, 
the conference could not occur. It is also a pleasure to identify the secretaries 
that have handled the various filing, copying, correspondence, and countless other 
chores required to develop the program. Our sincere thanks go to: Jenine 
Abarbanel, Vicki Adame, Lauren Hall, Alice Harris, Shirin Mistry, and Elaine Smiles. 
The continuous support and encouragement of the Conference General Chairman,
Professor Tse-yun Feng, has been most gratifying. Most of all, we thank the authors
for taking the time and effort to share their work with the parallel processing 
community. 


Program Co-Chairs: 
Kai Hwang 

Steve Jacobs 

Earl Swartzlander 
AN AUTOMORPHISM OF A CLASS OF
INTERCONNECTION NETWORKS 
ABSTRACT


A class of interconnection schemes, commonly referred to as multistage
interconnection networks, has been proposed as a means of providing simultaneous
connections among processing elements in multiprocessor systems. In this
paper, we present an automorphism, with which the logical structure of the 
interconnection networks can be changed. We show that the resulting logical 
structure of a network is isomorphic to its physical structure. For the purpose 
of demonstration, three popular networks are examined: baseline network, 
omega network and indirect n-cube network. 


I. INTRODUCTION 


A class of multistage interconnection networks has been proposed, and
many of them have been implemented, for providing simultaneous,
reconfigurable connections among processing elements in multiprocessor
systems. These networks include the baseline network [1]-[3], flip network [4],
omega network [5], and many others [6]-[8]. Advantages of these networks
include cost effectiveness, logarithmic communication delay, modular
expansibility, and partitionability. The primary goal of this work is to show that
for each network of this class, there exists a multiplicity of logical network
structures, which are isomorphic to its physical network structure. More
specifically, we will present a one-to-one correspondence, which turns out to
be an automorphism, changing logical structures of these networks. To
illustrate this idea, three popular multistage interconnection networks are
chosen for demonstration: baseline network, indirect n-cube network and
omega network.

The rest of this work is organized as follows. In Section II, we will describe
the integration of the networks, including network components and general 
network configuration. In Section III, we will present the definition of a 
one-to-one correspondence. Section IV contains the proofs, showing that this 
correspondence is an automorphism with respect to the three networks. 


II. PRELIMINARIES


2.1 Configuration of Networks 

The multistage interconnection networks to be examined are characterized 
by three attributes: (1) switching element, (2) arrangement of switching 
elements, and (3) permutation pattern between stages of switching elements. 

(1) Switching element. These networks employ a 2x2 crossbar switch as a
building block, as shown in Figure 1. This 2x2 switching element has
two inputs and two outputs, denoted $X_1$, $X_2$, $Y_1$, and $Y_2$. It has the
capability of connecting the input $X_1$ to either the output $Y_1$ or the
output $Y_2$, depending on the value of some routing bit of the input $X_1$. If
the routing bit is 0, the input is connected to the output $Y_1$, and if the
routing bit is 1, the connection is made to the output $Y_2$. Input $X_2$ of the
switch behaves similarly with a routing bit.

(2) Arrangement of switching elements. These networks are composed of a
logarithmic number of stages of switching elements; a network of size
$N=2^n$ comprises n stages of switching elements. Furthermore, each stage
comprises N/2 switching elements, as described above. Consequently, an
interconnection network of such a configuration has N network inputs
and N network outputs and contains $(N/2)\log_2 N$ switching elements.

(3) Permutation pattern between stages of switching elements. In a
network, two adjacent stages of switching elements are cascaded by a
set of N communication links, from the outputs of switching elements of
the preceding stage to the inputs of switching elements of the
succeeding stage. In a network of size N, there are $(\log_2 N)+1$ levels of
communication links. Each level is associated with a specific
permutation pattern. It should be noted, however, that these networks
differ in the permutation patterns between stages of switching
elements. In Section IV the permutation patterns of each individual
network will be described.
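
The component counts stated in the three attributes above can be tabulated with a short sketch. The function name `network_dimensions` and the dictionary keys are illustrative choices, not notation from the paper; only the counts themselves come from the text.

```python
def network_dimensions(n: int) -> dict:
    """Component counts for a network of size N = 2**n built from 2x2 switches."""
    if n < 1:
        raise ValueError("n must be at least 1")
    N = 1 << n
    return {
        "inputs": N,
        "outputs": N,
        "stages": n,                      # log2(N) stages of switching elements
        "switches_per_stage": N // 2,
        "total_switches": (N // 2) * n,   # (N/2) * log2(N)
        "link_levels": n + 1,             # levels 0 .. n
        "links_per_level": N,
    }


if __name__ == "__main__":
    # A network of size 16 (n = 4), the size used in the paper's figures.
    print(network_dimensions(4))
```

For n = 4 this gives 4 stages of 8 switching elements each and 5 levels of 16 links.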


2.2 Addressing of networks 

To facilitate describing the configurations of these interconnection networks,
a uniform addressing scheme is presented. As shown in Figure 2, in a network
of size $N=2^n$ the network inputs are addressed in a sequence from 0 to N-1,
from top to bottom. In a similar way, the network outputs are addressed from
0 to N-1. Recall that an interconnection network of size N, as described in the
previous subsection, contains $n=\log_2 N$ stages of switching elements and n+1
levels of links. The addressing schemes of stages and switching elements are
as follows. These n stages are labeled in a sequence from 0 to n-1
with 0 for the leftmost stage and n-1 for the rightmost stage. Similarly, the
levels of links are labeled in a sequence from 0 to n. In each stage, the N/2
switching elements are addressed from 0 to N/2-1, each of which can also be
represented by an (n-1)-bit binary number of the form $s_{n-1}s_{n-2}\ldots s_1$. In each
stage, each switching element is associated with four communication links; for
a switching element labeled by $s_{n-1}s_{n-2}\ldots s_1$, the upper link to the input $X_1$ is
addressed by the n-bit binary number $s_{n-1}s_{n-2}\ldots s_1 0$, and the lower one to the
input $X_2$ is addressed by $s_{n-1}s_{n-2}\ldots s_1 1$. Similarly, the two links from outputs
$Y_1$ and $Y_2$ are addressed by $s_{n-1}s_{n-2}\ldots s_1 0$ and $s_{n-1}s_{n-2}\ldots s_1 1$, respectively.
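
As a small illustration of this addressing scheme, the sketch below converts between a switching-element address and the addresses of its two links on one side. The helper names `link_addresses` and `switch_of_link` are hypothetical; the rule itself (append 0 for the upper link, 1 for the lower link) is the one described above.

```python
def link_addresses(s: int) -> tuple[int, int]:
    """Return the (upper, lower) n-bit link addresses of switching element s.

    s is the (n-1)-bit switch address; the upper link appends a 0 bit and
    the lower link appends a 1 bit.
    """
    return (s << 1), (s << 1) | 1


def switch_of_link(a: int) -> tuple[int, int]:
    """Return (switch address, port) for link address a; port 0 is the upper link."""
    return a >> 1, a & 1


if __name__ == "__main__":
    # Switch 101 (binary) owns links 1010 and 1011 on each side.
    upper, lower = link_addresses(0b101)
    print(format(upper, "04b"), format(lower, "04b"))
```

The same pair of addresses applies on the input side and on the output side of a switching element.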

From a topological point of view, these interconnection networks differ
in their permutation patterns between stages of switching elements. To
describe the permutation patterns of an interconnection network, the
following notation will be adopted:

$$\theta_i(Y) = X.$$

For a network of size $N=2^n$, $\theta_i$ specifies the permutation of communication
links of level i (from the outputs of stage i-1 to the inputs of stage i of
switching elements). However, in the case of i=0, the input domain of $\theta_i$ is the
set of network inputs. Similarly, in the case of i=n, the output domain of $\theta_i$ is
the set of network outputs. For instance, if the permutation pattern of level i is
a perfect shuffle permutation [12], then we have

$$\theta_i(Y) = (2Y + \lfloor 2Y/N \rfloor) \bmod N,$$

for all the outputs of switching elements of stage i-1. That is, the output Y of
stage i-1 is connected to the input $(2Y + \lfloor 2Y/N \rfloor) \bmod N$ of stage i.
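
The perfect shuffle formula can be checked numerically. The sketch below evaluates the arithmetic form given above and compares it with a one-position left rotation of the n-bit address, which is the usual bit-level reading of a perfect shuffle; the function names are illustrative.

```python
def perfect_shuffle(y: int, n: int) -> int:
    """Arithmetic form: (2Y + floor(2Y / N)) mod N with N = 2**n."""
    N = 1 << n
    return (2 * y + (2 * y) // N) % N


def rotate_left(y: int, n: int) -> int:
    """Bit form: y_{n-1} y_{n-2} ... y_0  ->  y_{n-2} ... y_0 y_{n-1}."""
    N = 1 << n
    return ((y << 1) & (N - 1)) | (y >> (n - 1))


if __name__ == "__main__":
    n = 4
    for y in range(1 << n):
        assert perfect_shuffle(y, n) == rotate_left(y, n)
    print([perfect_shuffle(y, n) for y in range(1 << n)])
```

Both forms agree for every address, so the shuffle can be implemented either with integer arithmetic or with shifts and masks.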


Figure 1. A 2x2 switching element (two panels: routing bit of $X_1$ is 0; routing bit of $X_1$ is 1).

Figure 2. Configuration of a multistage interconnection
network of size $N=2^n$.


III. THE DEFINITION OF $\xi$


It is the purpose of this section to define a one-to-one correspondence,
which turns out to be an automorphism, with respect to the multistage
interconnection networks. Also, we will present two functions, which are used
to rearrange the switching elements and the topology of the five networks.

DEFINITION. Let S be the set of n-bit binary numbers from 0 to $2^n-1$, or
$S=\{0,1,\ldots,2^n-1\}$. Let C be a constant, $0 \le C \le 2^n-1$. $\xi_C$ is a mapping from S to S,
and $\xi_C(X)=X \oplus C$, for every $X \in S$.
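
A minimal sketch of the mapping in the definition above, written as bitwise XOR with a constant. The function name `xi` is an illustrative spelling of the paper's symbol. The loop checks two properties that follow directly from XOR: the mapping is one-to-one on S, and applying it twice returns the original address.

```python
def xi(c: int, x: int) -> int:
    """Relabel address x by XOR with the constant c."""
    return x ^ c


if __name__ == "__main__":
    n = 4
    S = range(1 << n)
    for c in S:
        image = [xi(c, x) for x in S]
        assert sorted(image) == list(S)               # one-to-one and onto S
        assert all(xi(c, xi(c, x)) == x for x in S)   # its own inverse
    print([xi(0b0101, x) for x in S])
```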

With the definition above, we now present the statement of the problem to be 
investigated as follows. 

STATEMENT OF THE PROBLEM: If we relabel the network inputs and
outputs of a multistage interconnection network by $\xi_P$ and $\xi_Q$, respectively,
then there exists a way of rearranging the switching elements and links such
that the topology of the rearranged network is equivalent to that of the
original network. That is, the two topologies are isomorphic.

To illustrate the statement above, an example is given in Figure 3. Consider a
baseline network of size 16, whose network inputs and outputs are relabeled
through $\xi_P$ and $\xi_Q$, respectively. What we intend to show is that the two
network topologies are isomorphic. For convenience, hereafter $\xi_{PQ}$ denotes
"apply $\xi_P$ to the network inputs and $\xi_Q$ to the network outputs of a multistage
network."
Before presenting the proof that $\xi_{PQ}$ is an automorphism onto the
multistage interconnection network, we further introduce two functions, which
will be used to rearrange switching elements and their associated links:

(1) MOVE $\mu_i$: This is a one-to-one correspondence, which is used to
rename the switching elements of stage i, according to some mapping
rule.

(2) TWIST $\tau_i$: This is used to interchange a pair of inputs or outputs of a
switching element of stage i. Associated with $\tau_i$ there are two control
bits $c_x$ and $c_y$, which are used to specify the interchange on the input
side and the output side of a switching element, respectively. As
illustrated in Figure 4, there are four possible operations of $\tau_i$ on a
switching element:
1. No-twist. When $c_x=0$ and $c_y=0$, no operation is performed on the
switching element;
2. Left-twist. When $c_x=1$ and $c_y=0$, the input side of the switching
element is twisted 180° counterclockwise;
3. Right-twist. When $c_x=0$ and $c_y=1$, the output side of the
switching element is twisted 180° clockwise;
4. Vertical-flip. When $c_x=1$ and $c_y=1$, the switching element is flipped
vertically.

Figure 3. A baseline network whose network inputs and outputs are permuted by $\xi_{PQ}$.
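
The four TWIST operations can be sketched as a function of the two control bits. Here a switching element is modeled simply as a pair of (upper, lower) input labels and a pair of (upper, lower) output labels; that representation and the parameter names `c_in` and `c_out` are assumptions made for illustration.

```python
TWIST_NAMES = {
    (0, 0): "no-twist",
    (1, 0): "left-twist",
    (0, 1): "right-twist",
    (1, 1): "vertical-flip",
}


def twist(inputs, outputs, c_in, c_out):
    """Interchange the (upper, lower) inputs if c_in is 1 and the outputs if c_out is 1."""
    if c_in:
        inputs = (inputs[1], inputs[0])
    if c_out:
        outputs = (outputs[1], outputs[0])
    return inputs, outputs


if __name__ == "__main__":
    for bits, name in TWIST_NAMES.items():
        print(name, twist(("a", "b"), ("c", "d"), *bits))
```

A vertical-flip is the composition of a left-twist and a right-twist, which is why two independent control bits are enough to describe all four cases.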


IV. THE AUTOMORPHISM OF NETWORKS


In this section, we prove that $\xi_{PQ}$ is an automorphism onto the three
multistage interconnection networks: baseline network, omega network, and
indirect n-cube network. (However, only the proof for the baseline network is
presented, and the other two proofs for the omega network and the indirect
n-cube network are omitted.) Throughout the discussion, we assume that
$X=x_{n-1}x_{n-2}\ldots x_0$ is an input and $Y=y_{n-1}y_{n-2}\ldots y_0$ is an output of a switching
element. For uniformity of notation, we define
$\tau_{-1}(\mu_{-1}(Y))=\xi_P(Y)$ and $\tau_n(\mu_n(X))=\xi_Q(X)$; that is, network inputs are considered
as the outputs of a virtual stage, Stage -1, and similarly network outputs are
considered as the inputs of a virtual stage, Stage n.


Figure 4. Four combinations of the two control bits of $\tau_i$ on a switching
element: no-twist, left-twist, right-twist, and vertical-flip.


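4.1 Baseline Network


The permutation patterns of a baseline network of size $N=2^n$ are defined by:

$$\theta_i(Y) = x_{n-1}x_{n-2}\ldots x_0 =
\begin{cases}
y_{n-1}y_{n-2}\ldots y_0, & i=0, \\
y_{n-1}\ldots y_{n-i+1}\,y_0\,y_{n-i}\ldots y_1, & 1 \le i \le n-1, \\
y_{n-1}y_{n-2}\ldots y_0, & i=n.
\end{cases} \tag{1}$$

Let P and Q be two n-bit numbers, $P=p_{n-1}p_{n-2}\ldots p_0$ and $Q=q_{n-1}q_{n-2}\ldots q_0$. The
function of $\mu_i$ is to move a switching element $(z_{n-1}z_{n-2}\ldots z_1)$ of stage i to a new
location

$$\begin{cases}
(z_{n-1}z_{n-2}\ldots z_1) \oplus (p_{n-1}p_{n-2}\ldots p_1), & i=0, \\
(z_{n-1}z_{n-2}\ldots z_1) \oplus (q_{n-1}\ldots q_{n-i}\,p_{n-1}\ldots p_{i+1}), & 1 \le i \le n-2, \\
(z_{n-1}z_{n-2}\ldots z_1) \oplus (q_{n-1}q_{n-2}\ldots q_1), & i=n-1.
\end{cases}$$

In terms of binary addresses of the communication links associated with
switching elements of stage i, the function of $\mu_i$ can be explicitly described by

$$\mu_i(X) =
\begin{cases}
(x_{n-1}x_{n-2}\ldots x_0) \oplus (p_{n-1}p_{n-2}\ldots p_1 0), & i=0, \\
(x_{n-1}x_{n-2}\ldots x_0) \oplus (q_{n-1}\ldots q_{n-i}\,p_{n-1}\ldots p_{i+1} 0), & 1 \le i \le n-2, \\
(x_{n-1}x_{n-2}\ldots x_0) \oplus (q_{n-1}q_{n-2}\ldots q_1 0), & i=n-1,
\end{cases}$$

and

$$\mu_i(Y) =
\begin{cases}
(y_{n-1}y_{n-2}\ldots y_0) \oplus (p_{n-1}p_{n-2}\ldots p_1 0), & i=0, \\
(y_{n-1}y_{n-2}\ldots y_0) \oplus (q_{n-1}\ldots q_{n-i}\,p_{n-1}\ldots p_{i+1} 0), & 1 \le i \le n-2, \\
(y_{n-1}y_{n-2}\ldots y_0) \oplus (q_{n-1}q_{n-2}\ldots q_1 0), & i=n-1.
\end{cases} \tag{2}$$

In addition, $\tau_i$ is defined by

$$\tau_i(X) = (x_{n-1}x_{n-2}\ldots x_0) \oplus (0\ldots 0\,p_i), \quad 0 \le i \le n-1,$$

and

$$\tau_i(Y) = (y_{n-1}y_{n-2}\ldots y_0) \oplus (0\ldots 0\,q_{n-i-1}), \quad 0 \le i \le n-1. \tag{3}$$

The equivalent function of $\tau_i$, as defined above, is to left-twist all the
switching elements at stage i if $p_i=1$ and right-twist the switching elements if
$q_{n-i-1}=1$. With $\tau_i$ and $\mu_i$ above, we now prove that there always exists an
equivalence relationship between a baseline network and its network
rearranged by $\xi_{PQ}$.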

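THEOREM 1. $\xi_{PQ}$, $0 \le P,Q \le 2^n-1$, is an automorphism onto a baseline network
of size $2^n$.

Proof: Our approach is to show that after the switching elements of stage
i-1 are rearranged by $\tau_{i-1}$ and $\mu_{i-1}$, and the switching elements of stage i are
rearranged by $\tau_i$ and $\mu_i$, the rules of (1) still hold for all the links of level i.
Thus we have

$$\theta_i(\tau_{i-1}(\mu_{i-1}(Y))) =
\begin{cases}
(y_{n-1} \oplus p_{n-1})(y_{n-2} \oplus p_{n-2})\ldots(y_0 \oplus p_0), & i=0, \\
(y_0 \oplus q_{n-1})(y_{n-1} \oplus p_{n-1})\ldots(y_1 \oplus p_1), & i=1, \\
(y_{n-1} \oplus q_{n-1})\ldots(y_{n-i+1} \oplus q_{n-i+1})(y_0 \oplus q_{n-i})(y_{n-i} \oplus p_{n-1})\ldots(y_1 \oplus p_i), & 2 \le i \le n-1, \\
(y_{n-1} \oplus q_{n-1})(y_{n-2} \oplus q_{n-2})\ldots(y_0 \oplus q_0), & i=n.
\end{cases}$$

Furthermore, we have

$$\tau_i(\mu_i(X)) =
\begin{cases}
(x_{n-1} \oplus p_{n-1})(x_{n-2} \oplus p_{n-2})\ldots(x_0 \oplus p_0), & i=0, \\
(x_{n-1} \oplus q_{n-1})(x_{n-2} \oplus p_{n-1})\ldots(x_0 \oplus p_1), & i=1, \\
(x_{n-1} \oplus q_{n-1})\ldots(x_{n-i} \oplus q_{n-i})(x_{n-i-1} \oplus p_{n-1})\ldots(x_0 \oplus p_i), & 2 \le i \le n-1, \\
(x_{n-1} \oplus q_{n-1})(x_{n-2} \oplus q_{n-2})\ldots(x_0 \oplus q_0), & i=n.
\end{cases}$$

As a result, we have $\theta_i(\tau_{i-1}(\mu_{i-1}(Y)))=\tau_i(\mu_i(X))$, for all $0 \le i \le n$. □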
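
The identity at the end of the proof can also be checked by brute force for small n. The sketch below is only a numerical sanity check under one reading of equations (1) to (3): level i keeps the top i-1 address bits and rotates the remaining low bits right by one, MOVE XORs the address with the top i bits of Q followed by bits $p_{n-1}\ldots p_{i+1}$ and a trailing 0, and TWIST XORs the last bit with $p_i$ on the input side and $q_{n-i-1}$ on the output side. All function names are hypothetical.

```python
def bit(v: int, k: int) -> int:
    return (v >> k) & 1


def theta_baseline(i: int, y: int, n: int) -> int:
    """Level-i permutation: identity for i = 0 and i = n, otherwise keep the
    top i-1 bits and rotate the low n-i+1 bits right by one position."""
    if i == 0 or i == n:
        return y
    w = n - i + 1
    low = y & ((1 << w) - 1)
    high = y ^ low
    return high | (low >> 1) | ((low & 1) << (w - 1))


def mu_mask(i: int, P: int, Q: int, n: int) -> int:
    """XOR mask of MOVE at stage i: q_{n-1}..q_{n-i} p_{n-1}..p_{i+1} 0."""
    return ((Q >> (n - i)) << (n - i)) | ((P >> (i + 1)) << 1)


def out_mask(i: int, P: int, Q: int, n: int) -> int:
    """Combined MOVE and TWIST mask on the outputs of stage i (stage -1 is virtual)."""
    return P if i == -1 else mu_mask(i, P, Q, n) | bit(Q, n - i - 1)


def in_mask(i: int, P: int, Q: int, n: int) -> int:
    """Combined MOVE and TWIST mask on the inputs of stage i (stage n is virtual)."""
    return Q if i == n else mu_mask(i, P, Q, n) | bit(P, i)


def preserves_baseline(n: int, P: int, Q: int) -> bool:
    """Check that every link of every level still obeys the level permutation."""
    for i in range(n + 1):
        for y in range(1 << n):
            lhs = theta_baseline(i, y ^ out_mask(i - 1, P, Q, n), n)
            rhs = theta_baseline(i, y, n) ^ in_mask(i, P, Q, n)
            if lhs != rhs:
                return False
    return True


if __name__ == "__main__":
    n = 4
    print(all(preserves_baseline(n, P, Q)
              for P in range(1 << n) for Q in range(1 << n)))
```

The check works because each level permutation only rearranges bit positions, so it commutes with XOR masks once the masks themselves are permuted consistently.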


For instance, Figure 5 illustrates a baseline network of size 16 rearranged by
$\xi_{PQ}$, with $\mu$ and $\tau$ as defined by (2) and (3). Note that the rearranged
network is topologically equivalent to the baseline network.

Figure 5. Configuration of a rearranged baseline network.


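4.2 Omega Network


The permutation patterns of an omega network of size $N=2^n$ are described
by:

$$\theta_i(Y) = x_{n-1}x_{n-2}\ldots x_0 =
\begin{cases}
y_{n-2}y_{n-3}\ldots y_0\,y_{n-1}, & 0 \le i \le n-1, \\
y_{n-1}y_{n-2}\ldots y_0, & i=n.
\end{cases} \tag{4}$$

Let P and Q be two n-bit numbers, $P=p_{n-1}p_{n-2}\ldots p_0$ and $Q=q_{n-1}q_{n-2}\ldots q_0$. The
function of $\mu_i$ is to move a switching element $(z_{n-1}z_{n-2}\ldots z_1)$ of stage i to a new
location

$$\begin{cases}
(z_{n-1}z_{n-2}\ldots z_1) \oplus (p_{n-2}p_{n-3}\ldots p_0), & i=0, \\
(z_{n-1}z_{n-2}\ldots z_1) \oplus (p_{n-i-2}\ldots p_0\,q_{n-1}\ldots q_{n-i}), & 1 \le i \le n-2, \\
(z_{n-1}z_{n-2}\ldots z_1) \oplus (q_{n-1}q_{n-2}\ldots q_1), & i=n-1.
\end{cases}$$

In terms of binary addresses of the communication links, $\mu_i$ can be described
by

$$\mu_i(X) =
\begin{cases}
(x_{n-1}x_{n-2}\ldots x_0) \oplus (p_{n-2}p_{n-3}\ldots p_0 0), & i=0, \\
(x_{n-1}x_{n-2}\ldots x_0) \oplus (p_{n-i-2}\ldots p_0\,q_{n-1}\ldots q_{n-i} 0), & 1 \le i \le n-2, \\
(x_{n-1}x_{n-2}\ldots x_0) \oplus (q_{n-1}q_{n-2}\ldots q_1 0), & i=n-1,
\end{cases}$$

and

$$\mu_i(Y) =
\begin{cases}
(y_{n-1}y_{n-2}\ldots y_0) \oplus (p_{n-2}p_{n-3}\ldots p_0 0), & i=0, \\
(y_{n-1}y_{n-2}\ldots y_0) \oplus (p_{n-i-2}\ldots p_0\,q_{n-1}\ldots q_{n-i} 0), & 1 \le i \le n-2, \\
(y_{n-1}y_{n-2}\ldots y_0) \oplus (q_{n-1}q_{n-2}\ldots q_1 0), & i=n-1.
\end{cases} \tag{5}$$

In addition, we define $\tau_i$ by

$$\tau_i(X) = (x_{n-1}x_{n-2}\ldots x_0) \oplus (0\ldots 0\,p_{n-i-1}), \quad 0 \le i \le n-1,$$

and

$$\tau_i(Y) = (y_{n-1}y_{n-2}\ldots y_0) \oplus (0\ldots 0\,q_{n-i-1}), \quad 0 \le i \le n-1. \tag{6}$$

The equivalent function of $\tau_i$, as defined above, is to left-twist all the
switching elements at stage i if $p_{n-i-1}=1$ and right-twist the switching
elements if $q_{n-i-1}=1$. With $\tau_i$ and $\mu_i$ above, there always
exists an equivalence relationship between an omega network and its
network rearranged by $\xi_{PQ}$.

THEOREM 2. $\xi_{PQ}$, $0 \le P,Q \le 2^n-1$, is an automorphism onto an omega network
of size $2^n$.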
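
Since the proof of Theorem 2 is omitted, a brute-force check for small n may be a useful sanity test. The sketch assumes the following reading of the omega definitions: levels 0 to n-1 are perfect shuffles (left rotation of the address) and level n is the identity; MOVE at stage i XORs the address with $p_{n-i-2}\ldots p_0\,q_{n-1}\ldots q_{n-i}$ followed by a 0; TWIST XORs the last bit with $p_{n-i-1}$ on the input side and $q_{n-i-1}$ on the output side. The function names are hypothetical.

```python
def bit(v: int, k: int) -> int:
    return (v >> k) & 1


def theta_omega(i: int, y: int, n: int) -> int:
    """Perfect shuffle for levels 0..n-1, identity for level n."""
    if i == n:
        return y
    return ((y << 1) & ((1 << n) - 1)) | (y >> (n - 1))


def mu_mask(i: int, P: int, Q: int, n: int) -> int:
    """XOR mask of MOVE at stage i: p_{n-i-2}..p_0 q_{n-1}..q_{n-i} 0."""
    low_p = P & ((1 << (n - 1 - i)) - 1)
    top_q = Q >> (n - i)
    return (low_p << (i + 1)) | (top_q << 1)


def out_mask(i: int, P: int, Q: int, n: int) -> int:
    """Mask on the outputs of stage i; stage -1 stands for the network inputs."""
    return P if i == -1 else mu_mask(i, P, Q, n) | bit(Q, n - i - 1)


def in_mask(i: int, P: int, Q: int, n: int) -> int:
    """Mask on the inputs of stage i; stage n stands for the network outputs."""
    return Q if i == n else mu_mask(i, P, Q, n) | bit(P, n - i - 1)


def preserves_omega(n: int, P: int, Q: int) -> bool:
    return all(
        theta_omega(i, y ^ out_mask(i - 1, P, Q, n), n)
        == theta_omega(i, y, n) ^ in_mask(i, P, Q, n)
        for i in range(n + 1)
        for y in range(1 << n)
    )


if __name__ == "__main__":
    n = 4
    print(all(preserves_omega(n, P, Q)
              for P in range(1 << n) for Q in range(1 << n)))
```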


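4.3 Indirect n-Cube Network

In an indirect n-cube network of size $N=2^n$ the permutation patterns of
the levels of links are defined by:

$$\theta_i(Y) = x_{n-1}x_{n-2}\ldots x_0 =
\begin{cases}
y_{n-1}y_{n-2}\ldots y_0, & i=0, \\
y_{n-1}\ldots y_2\,y_0\,y_1, & i=1, \\
y_{n-1}\ldots y_{i+1}\,y_0\,y_{i-1}\ldots y_1\,y_i, & 2 \le i \le n-2, \\
y_0\,y_{n-2}\ldots y_1\,y_{n-1}, & i=n-1, \\
y_0\,y_{n-1}y_{n-2}\ldots y_1, & i=n.
\end{cases} \tag{7}$$

Let P and Q be two n-bit numbers, $P=p_{n-1}p_{n-2}\ldots p_0$ and $Q=q_{n-1}q_{n-2}\ldots q_0$.
The function of $\mu_i$ is to move a switching element $(z_{n-1}z_{n-2}\ldots z_1)$ of stage i to a
new location

$$\begin{cases}
(z_{n-1}z_{n-2}\ldots z_1) \oplus (p_{n-1}p_{n-2}\ldots p_1), & i=0, \\
(z_{n-1}z_{n-2}\ldots z_1) \oplus (p_{n-1}\ldots p_{i+1}\,q_{i-1}\ldots q_0), & 1 \le i \le n-2, \\
(z_{n-1}z_{n-2}\ldots z_1) \oplus (q_{n-2}q_{n-3}\ldots q_0), & i=n-1.
\end{cases}$$

In terms of the binary addresses of communication links associated with
switching elements of stage i, we have

$$\mu_i(X) =
\begin{cases}
(x_{n-1}x_{n-2}\ldots x_0) \oplus (p_{n-1}p_{n-2}\ldots p_1 0), & i=0, \\
(x_{n-1}x_{n-2}\ldots x_0) \oplus (p_{n-1}\ldots p_{i+1}\,q_{i-1}\ldots q_0 0), & 1 \le i \le n-2, \\
(x_{n-1}x_{n-2}\ldots x_0) \oplus (q_{n-2}q_{n-3}\ldots q_0 0), & i=n-1,
\end{cases}$$

and

$$\mu_i(Y) =
\begin{cases}
(y_{n-1}y_{n-2}\ldots y_0) \oplus (p_{n-1}p_{n-2}\ldots p_1 0), & i=0, \\
(y_{n-1}y_{n-2}\ldots y_0) \oplus (p_{n-1}\ldots p_{i+1}\,q_{i-1}\ldots q_0 0), & 1 \le i \le n-2, \\
(y_{n-1}y_{n-2}\ldots y_0) \oplus (q_{n-2}q_{n-3}\ldots q_0 0), & i=n-1.
\end{cases} \tag{8}$$

In addition, we define

$$\tau_i(X) = (x_{n-1}x_{n-2}\ldots x_0) \oplus (0\ldots 0\,p_i), \quad 0 \le i \le n-1,$$

and

$$\tau_i(Y) = (y_{n-1}y_{n-2}\ldots y_0) \oplus (0\ldots 0\,q_i), \quad 0 \le i \le n-1. \tag{9}$$

The equivalent function of $\tau_i$, as defined above, is to left-twist all of the
switching elements at stage i if $p_i=1$ and right-twist the switching elements if
$q_i=1$. With $\tau_i$ and $\mu_i$ above, there always exists an
equivalence relationship between an indirect n-cube network and its
network rearranged by $\xi_{PQ}$.


THEOREM 3. $\xi_{PQ}$, $0 \le P,Q \le 2^n-1$, is an automorphism onto an indirect n-cube
network of size $2^n$.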
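
The proof of Theorem 3 is likewise omitted, so the sketch below checks the claim exhaustively for small n. It assumes this reading of the indirect n-cube definitions: level 0 is the identity, level i for 1 ≤ i ≤ n-1 exchanges address bits 0 and i, and level n moves bit 0 to the most significant position; MOVE at stage i XORs the address with $p_{n-1}\ldots p_{i+1}\,q_{i-1}\ldots q_0$ followed by a 0; TWIST XORs the last bit with $p_i$ on the input side and $q_i$ on the output side. The function names are hypothetical.

```python
def bit(v: int, k: int) -> int:
    return (v >> k) & 1


def theta_cube(i: int, y: int, n: int) -> int:
    """Level-i permutation of the indirect n-cube network."""
    if i == 0:
        return y
    if i == n:                              # y_0 y_{n-1} ... y_1
        return (y >> 1) | ((y & 1) << (n - 1))
    if bit(y, 0) != bit(y, i):              # exchange bit 0 and bit i
        y ^= (1 << i) | 1
    return y


def mu_mask(i: int, P: int, Q: int, n: int) -> int:
    """XOR mask of MOVE at stage i: p_{n-1}..p_{i+1} q_{i-1}..q_0 0."""
    return ((P >> (i + 1)) << (i + 1)) | ((Q & ((1 << i) - 1)) << 1)


def out_mask(i: int, P: int, Q: int, n: int) -> int:
    """Mask on the outputs of stage i; stage -1 stands for the network inputs."""
    return P if i == -1 else mu_mask(i, P, Q, n) | bit(Q, i)


def in_mask(i: int, P: int, Q: int, n: int) -> int:
    """Mask on the inputs of stage i; stage n stands for the network outputs."""
    return Q if i == n else mu_mask(i, P, Q, n) | bit(P, i)


def preserves_cube(n: int, P: int, Q: int) -> bool:
    return all(
        theta_cube(i, y ^ out_mask(i - 1, P, Q, n), n)
        == theta_cube(i, y, n) ^ in_mask(i, P, Q, n)
        for i in range(n + 1)
        for y in range(1 << n)
    )


if __name__ == "__main__":
    n = 4
    print(all(preserves_cube(n, P, Q)
              for P in range(1 << n) for Q in range(1 << n)))
```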


V. CONCLUSION


We have presented a one-to-one correspondence, $\xi_{PQ}$, used to rename
network inputs and outputs for a class of multistage interconnection networks.
It is then proved that a renamed network is isomorphic to its parent network;
that is, $\xi_{PQ}$ is an automorphism. The class of multistage interconnection
networks we have shown includes baseline network, omega network and
indirect n-cube network. To illustrate that $\xi_{PQ}$ is an automorphism, schemes
of rearranging switching elements for each network are described and
demonstrated in detail.
ABSTRACT


In performance evaluation of interconnection networks
it is usually assumed that, in case of conflicts, a
request is accepted with equal probability. This paper
illustrates that this arbitration policy discriminates
against remote or less frequent requests because they are
rejected most of the time. The paper considers a
favorite memory environment as an example and
examines the network performance under various
arbitration policies. Equal acceptance, priority to favorite
requests, and priority to nonfavorite requests are examined by
defining three probabilities of acceptance. The
networks considered are crossbar, multiple-bus and
multistage interconnection networks.


1. INTRODUCTION 


A tightly coupled multiprocessor is usually
categorized depending on the interconnection network
(IN) that is used between the processors and memories
[1]. Such IN's can be broadly divided into: (a) Crossbar,
(b) Multiple-bus, and (c) Multistage interconnection
network (MIN). A crossbar interconnection allows all
possible connections between the processors and memories
[2]. When two or more processors try to access the
same memory module, only one of the requests is
accepted and the others are blocked or rejected. A
multiple-bus provides a fault-tolerant and cost effective
interconnection between the processors and memories
[3-5], but the number of simultaneous connections is
dependent on the number of buses. When the number
of buses is sufficient, a multiple-bus has the same
performance as that of a crossbar. An MIN, on the other
hand, is composed of several stages of switching
elements (SE's) [6]. A conflict occurs when two or more
requests contend for the same output of an SE. The
cost and performance of such an IN is a reasonable
balance between a shared bus and a crossbar. There is
an extensive literature on the performance evaluation
of these networks. All evaluations for synchronous
multiprocessors measure Bandwidth (BW), which is
defined as the number of memories remaining busy in a
cycle.


The BW does not give us an idea about which
requests are accepted and which are rejected. For the
equally likely case, when a processor generates a request
that is equiprobably directed to all the memories, it
may not be essential to know the above details because
all the processors perform similarly in the long term.
However, when the processors request different
memories with different probabilities, the equal
acceptance (EA) rule appears discriminating because a
processor with a remote request to a memory will be
rejected most of the time. Rather, it should be given a
priority over another processor which requests that
particular memory more often. Hence, the previous
analytical models of the BW do not reveal this
important information that is so practical for
multiprocessor operation. As we know, no closed form
solutions can be derived for general or arbitrary memory
references. We will therefore restrict our analysis to
favorite memory cases as depicted in Fig. 1. For
simplicity, we will restrict our analyses to NxN networks
that connect N processors to N memories. Also,
because of page restrictions some final equations will
be given without derivations.

This research is supported in part by NSF Grant


In Fig. 1, a processor $P_i$ requests its favorite
memory $MM_i$ with a probability of m provided it
generates a request. The processor requests all other (N-1)
memories with a probability of 1-m. Assuming that
those requests are uniformly distributed, the
probability that a processor requests any one of those memories
is $p_0(1-m)/(N-1)$, where $p_0$ is the probability of its request
generation. When $m > 1/N$, it is called a favorite
memory case, and by putting $m = 1/N$, the analysis
reduces to an equally likely case. The rate of request at
a memory $MM_i$ is then $p_0 m$ due to its favorite processor
$P_i$ and $p_0(1-m)/(N-1)$ due to any other processor, provided
there is no conflict in the IN.


2. ANALYSIS FOR CROSSBAR


The crossbar allows all processor requests to
reach the memory modules. If more than one request
reaches a particular memory, one of them is accepted
and the others are rejected. Which one is accepted
depends on the arbitration policy of the controller. The
rate of request ($p_1$) at a memory $MM_i$ in an NxN
crossbar is [7]:

$$p_1 = 1 - (1 - p_0 m)\left(1 - p_0\frac{1-m}{N-1}\right)^{N-1}. \tag{1}$$


The BW, given by $N p_1$, does not depend on the basis
upon which the selection is made. The probability of
favorite acceptance ($p_{fe}$) at a memory due to EA policy
is given by:

$$p_{fe} = \frac{m(N-1)}{N(1-m)}\left[1 - \left(1 - p_0\frac{1-m}{N-1}\right)^{N}\right]. \tag{2}$$

For large values of N, $\lim_{N\to\infty} p_{fe} = \frac{m}{1-m}\left(1-e^{-p_0(1-m)}\right)$. As an
example, with $p_0=1$ and $m=0.8$, $\lim_{N\to\infty} p_{fe} = 0.725$. The
probability of nonfavorite acceptance is

$$p_{ne} = p_1 - p_{fe}. \tag{3}$$

With $p_0=1$ and $m=0.8$, $\lim_{N\to\infty} p_{ne} = 0.111$.


Since most of the time a memory receives requests from its favorite
processor, the requests from the other processors will
be mostly unsuccessful in an EA policy. In a practical
design a selection policy is fixed and is based on some
priority structure. It is therefore reasonable to assign a
priority to the nonfavorite requests for higher values
of m. Hence, whenever there is a nonfavorite request
at $MM_i$, any request from $P_i$ will not be granted. We
will assume that when more than one nonfavorite
request contends for a memory module, a random
selection is made with an equal probability. A request
from $P_i$ is granted only when there is no nonfavorite
request. As a result the rate of request at a memory
module due to its favorite processor is:

$$p_{fn} = p_0 m\left(1 - p_0\frac{1-m}{N-1}\right)^{N-1}. \tag{4}$$

The subscript n stands for priority to nonfavorite
requests. With $p_0=1$ and $m=0.8$, $\lim_{N\to\infty} p_{fn} = 0.655$.


The rate of request at $MM_i$ due to other processors is

$$p_{nn} = p_1 - p_0 m\left(1 - p_0\frac{1-m}{N-1}\right)^{N-1}. \tag{5}$$

With $p_0=1$ and $m=0.8$, $\lim_{N\to\infty} p_{nn} = 0.181$. The Probability
of Acceptance (PA) of a request is the ratio of
the average number of requests accepted to the
number of requests generated per cycle. Thus

$$PA = \frac{BW}{p_0 \times \text{number of processors}} = \frac{p_1}{p_0}. \tag{6}$$
We will define the Probability of Favorite Acceptance
(PFA) as the ratio of the average number of favorite
requests accepted to the number of favorite requests
generated per cycle,

$$PFA = \frac{p_{fe} \text{ (or } p_{fn})}{p_0 m}. \tag{7}$$

For the same example with $p_0=1$ and $m=0.8$,
$\lim_{N\to\infty} PFA = 0.906$ for EA policy
and $0.819$ for priority to nonfavorite requests.
Thus the Probability of Non-favorite Acceptance is

$$PNFA = \frac{p_{ne} \text{ (or } p_{nn})}{p_0(1-m)}. \tag{8}$$

With $p_0=1$ and $m=0.8$,
$\lim_{N\to\infty} PNFA = 0.555$ for the EA case
and $0.905$ for priority to nonfavorite requests.


Hence by adopting priority to nonfavorite requests 
there is a remarkable improvement in PNFA compared 
to the corresponding degradation in PFA . The rela- 
tionship between these probabilities is given by 


PA =m. PFA +(1-m).PNFA . (9) 


It is noticed from the above examples that by 
allowing a priority to the nonfavorite requests, the 
degradation in PFA is not substantial. Since the PA 
remains the same, the PNFA is increased to a large 
extent and for this particular example it jumps from 
0.555 to 0.905 as calculated earlier. However, if m is 
low, there will be a substantial degradation in the PFA 
with priority to nonfavorite requests. Then EA policy 
should be adopted. If priority is assigned to the favor- 
ite request, PFA is unity for all values of N and hence, 
the case is not considered. 
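
The crossbar quantities of this section can be evaluated numerically with a short sketch. It assumes the following reading of equations (1) to (8): each nonfavorite processor addresses a given memory with probability $q = p_0(1-m)/(N-1)$, $p_1 = 1-(1-p_0 m)(1-q)^{N-1}$, $p_{fe} = \frac{m(N-1)}{N(1-m)}[1-(1-q)^N]$, and $p_{fn} = p_0 m(1-q)^{N-1}$. The closed form used for $p_{fe}$ is a reconstruction chosen to agree with the limiting values quoted in the text; the function name and dictionary keys are illustrative.

```python
def crossbar_probabilities(N: int, p0: float, m: float) -> dict:
    """Acceptance probabilities of an NxN crossbar in the favorite memory case.

    Assumes 0 < m < 1 so that both favorite and nonfavorite requests exist.
    """
    if N < 2 or not (0 < p0 <= 1) or not (0 < m < 1):
        raise ValueError("need N >= 2, 0 < p0 <= 1 and 0 < m < 1")
    q = p0 * (1 - m) / (N - 1)                 # one nonfavorite processor -> one memory
    p1 = 1 - (1 - p0 * m) * (1 - q) ** (N - 1)
    pfe = m * (N - 1) / (N * (1 - m)) * (1 - (1 - q) ** N)   # EA policy
    pfn = p0 * m * (1 - q) ** (N - 1)          # priority to nonfavorite requests
    pne, pnn = p1 - pfe, p1 - pfn
    return {
        "p1": p1,
        "BW": N * p1,
        "PA": p1 / p0,
        "EA": {"PFA": pfe / (p0 * m), "PNFA": pne / (p0 * (1 - m))},
        "NF": {"PFA": pfn / (p0 * m), "PNFA": pnn / (p0 * (1 - m))},
    }


if __name__ == "__main__":
    r = crossbar_probabilities(10**6, 1.0, 0.8)   # large N approximates the limit
    print(r["EA"], r["NF"])
    # PA = m * PFA + (1 - m) * PNFA holds for either policy.
    for policy in ("EA", "NF"):
        assert abs(0.8 * r[policy]["PFA"] + 0.2 * r[policy]["PNFA"] - r["PA"]) < 1e-12
```

For very large N the printed values approach the limits quoted in the text for $p_0=1$ and $m=0.8$, and the relationship between PA, PFA, and PNFA holds for both policies.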


3. ANALYSIS OF MULTIPLE-BUS 


The BW analysis of multiple-bus architecture for 
the favorite memory case is given in [5]. There are B 
buses in the system that connect M processors to N 
memories for B< min(M,N). A bus is connected to all 


the processors and memories. The arbiter (controller) 
cyclically allocates a bus to a memory that has an out- 
standing request. The BW of the multiple-bus struc- 
ture is clearly a function of the number of buses B. 
When B is equal to min(M, N), the architecture has the 
same BW as that of a crossbar [5]. When the buses are 
insufficient, there will be bus conflicts in addition to 
the memory access conflicts. Hence, the multiple-bus 
controller can be designed as a cascade of a crossbar 
controller for resolving memory access conflicts and a 
bus controller to allocate buses to the memories. As a
consequence, we can enumerate the following control
policies.
1) Equal Acceptance - Equal Acceptance (EE)
2) Equal Acceptance - Nonfavorite (EN)
3) Nonfavorite - Equal Acceptance (NE)
4) Nonfavorite - Nonfavorite (NN)


For example, the EN policy states that the EA rule is
adopted for solving the memory access conflicts
(crossbar controller) and nonfavorite requests are given
priority for the bus allocations. In case there are more
than B nonfavorite requests, B of them are accepted
with an equal probability. If there are fewer nonfavorite
requests, the extra buses are distributed to the favorite
requests on an equally likely basis. This information is
implicit in an arbiter policy and hence is not
explicitly mentioned in the above classification.


For simplicity, again, we will consider only an NxN
multiprocessor system. When the BW ($BW_c$) of such a
crossbar is less than or equal to B, the bus controller
allows a bus to each of the selected requests and hence
the system behaves as a crossbar. So for $BW_c \le B$, the
parameters derived in Section 2 are true. For example,
with N=16, $p_0=1$ and $m=0.8$, 14 buses are required
for the multiple-bus architecture to have the same BW
as that of a crossbar. When $BW_c > B$, the bus controller
plays a major role in deciding which of the requests
should be allocated the buses. However, the bandwidth
for the multiple-bus ($BW_b$) will remain equal to B
because of bus deficiency. The overall probability of
acceptance is:

$$PA = \frac{\min(B, BW_c)}{p_0 N}. \tag{10}$$

Hence, the following analyses are derived for
$BW_c > B$.
(1) Equal Acceptance - Equal Acceptance (EE) case

The rate of request at a memory due to favorite
requests is given by $p_{fe}$ in eq. (2). Hence $N p_{fe}$ is the
expected number of favorite requests out of the $N p_1$
requests accepted by the crossbar controller in total.
With an EA policy at the bus controller, the number of
favorite requests allocated buses is $\frac{p_{fe}}{p_1} B$. The total
number of favorite requests generated at the processor
side is $p_0 m N$. Hence,

$$PFA = \frac{p_{fe}}{p_1} \cdot \frac{B}{p_0 m N}. \tag{11}$$

And similarly

$$PNFA = \frac{p_{ne}}{p_1} \cdot \frac{B}{p_0 (1-m) N}, \tag{12}$$

where $p_1$, $p_{fe}$ and $p_{ne}$ are given by equations 1, 2 and 3
respectively.

(2) Equal Acceptance - Nonfavorite (EN) case


The total number of nonfavorite requests accepted by the
crossbar controller is $p_{ne} N$. They are all allocated buses
subject to availability. Hence, the number of successful
nonfavorite requests is $\min(p_{ne} N, B)$. The number of buses
available for favorite requests is $B - \min(p_{ne} N, B)$.
Then

$$PFA = \frac{B - \min(p_{ne} N, B)}{p_0 m N}, \tag{13}$$

$$PNFA = \frac{\min(p_{ne} N, B)}{p_0 (1-m) N}. \tag{14}$$

(3) Nonfavorite - Equal Acceptance (NE) case


All the nonfavorite requests are accepted by the
crossbar controller. The rate of favorite request in this
case is given by $p_{fn}$ in equation (4). With an equal
acceptance of $p_{fn} N$ favorite requests out of $p_1 N$
requests by the bus controller,

$$PFA = \frac{p_{fn}}{p_1} \cdot \frac{B}{p_0 m N}. \tag{15}$$

The rate of request due to nonfavorite requests as
selected by the crossbar controller is given by $p_{nn}$ in
equation (5). With an EA policy by the bus controller,

$$PNFA = \frac{p_{nn}}{p_1} \cdot \frac{B}{p_0 (1-m) N}. \tag{16}$$

(4) Nonfavorite - Nonfavorite (NN) case


Here both the crossbar and the bus controllers
give priority to nonfavorite requests while making a
selection. All the nonfavorite requests are allocated
buses subject to availability and their number is given
by $\min(p_{nn} N, B)$. The rest of the buses are allocated to
favorite requests. Hence,

$$PFA = \frac{B - \min(p_{nn} N, B)}{p_0 m N}, \tag{17}$$

$$PNFA = \frac{\min(p_{nn} N, B)}{p_0 (1-m) N}. \tag{18}$$

The PFA and PNFA for various selection policies,
obtained for a 16x16 system with $p_0=1$ and $m=0.8$,
are plotted against the number of buses in Fig. 2 and 3
respectively. It can be observed in Fig. 2 that the
degradation in PFA caused by giving a priority to
nonfavorite requests is not substantial. Moreover, the PFA
increases linearly until it is saturated (after 14 buses) by the
memory access conflicts. For EN and NN policies, buses
are first allocated to nonfavorite requests, so the PFA
remains zero until more buses are available. In Fig. 3,
the PNFA increases linearly for the EE and NE cases, but
reaches saturation values quickly for the other two policies.
There is a substantial increase in PNFA from 0.555 to
0.905 when priority is given to nonfavorite requests by
the crossbar controller.


4. MULTISTAGE INTERCONNECTION
NETWORKS (MIN's)


An MIN usually connects N processors to N
memories through $\log_2 N$ stages of 2x2 switching
elements (SE's). Each stage contains N/2 such SE's.
Examples of such networks are Banyan, Omega, Cube and
Baseline, etc. [6]. Although MIN's for MxN systems, with
M ≠ N, have been proposed [8], we limit our
discussion to NxN systems for simplicity. We
assume that each processor can access its favorite
memory through a straight connection of the switches
on its path [7]. A 2x2 SE in an MIN may have a
built-in priority structure to resolve conflicts, or it
may choose one of the contending requests based on an
EA policy. If priority is given to straight
connections, more favorite requests will go through.
However, nonfavorite requests, which should arguably
be given priority over the favorite requests, are then
rejected most of the time. We have analyzed the MIN
performance for three different cases, namely: equal
acceptance policy, priority to favorite requests and
priority to nonfavorite requests. The analysis is an
extension of our previous analysis for the EA policy [7]
and is not given here because of space restrictions.


The BW obtained with different priority assignments
in a switch is plotted in Fig. 4 for $p_0 = 1$ and
$m = 0.8$. Unlike the crossbar or multiple-bus, the BW of
an MIN is affected by the switch arbitration policy.
When priority is given to favorite requests, more
requests are accepted, giving rise to an increased
BW. With priority to nonfavorite requests, more
conflicts occur in the network and the BW reduces
considerably. With a decrease in the value of m the BW
is further reduced, but only down to its value in the
equally likely case. In the equally likely case, the priority
assignment has no effect on the BW. The probability
of acceptance (PA), probability of favorite acceptance
(PFA) and probability of nonfavorite acceptance
(PNFA) can be easily determined.


Unlike the crossbar or multiple-bus results, the
PA of an MIN is different for different arbitration
policies. This is evident from Fig. 4 because
$PA = BW / (p_0 N)$.
The PFA and PNFA for different policies are plotted in
Fig. 5 and 6 respectively against the size of the MIN.
As the size of the network increases, the probabilities
reduce because of more and more conflicts. One can
again observe that in a 1024x1024 network, with $p_0 = 1$
and $m = 0.8$, there is about 40% degradation in PFA
by adopting priority to nonfavorite requests (cross
connection) compared to an EA policy. However, the
corresponding PNFA increases by about 9 times in Fig.
6. The PNFA is almost zero for large networks when
priority is given to favorite (straight) connections.


5. CONCLUSION 


The paper described the effect of various selection
policies on the performance of crossbar, multiple-bus
and MIN. The BW of crossbar and multiple-bus
networks does not depend on the selection policy, but the
probabilities of individual request acceptance do. It
was shown that by giving priority to remote or less
frequent requests, the probability of their acceptance is
dramatically improved. Unlike the above two
networks, the selection policy in a 2x2 switch does affect
the overall BW of an MIN. Although the BW
decreases by giving priority to nonfavorite requests, it
is an advisable policy because of the large
increase in PNFA. Overall, the paper described some
phenomena that were neglected before and that are of
practical importance for IN designs.
ABSTRACT


This paper is concerned with demonstrating the topological
equivalence between two classes of computer architectures
that support parallel computation, viz. the Cube-Connected
Cycles network (CCC) and the Homogeneous Circular Shuffle
Network (HCSN). The latter is based on the Perfect Shuffle
Connection. By developing a suitable and common notation
for addressing processing elements and specifying
interconnections in the two networks, it is shown that these are
topologically equivalent. The implications of such an equivalence
are described. Known properties and algorithms for HCSN
networks, in respect of routing and fault tolerance, thereby
immediately become applicable to CCC networks. It is also
shown that a large class of algorithms that run on a CCC
network can also be implemented, with slight modification, on an
HCSN network.


1. INTRODUCTION 


Within the context of parallel computation a number of
computer architectures have been proposed. These include
systolic arrays, associative processors, vector processors, SIMD
machines with or without shared memory, and MIMD
machines. Our interest here is with SIMD architectures that
do not share memory. An SIMD machine consists of a number
of identical Processing Elements (PEs), each with its own local
memory. The PEs communicate with each other through an
interconnection network. A variety of such networks have
been proposed, and parallel algorithms to solve various
problems on them have been developed. Hypercube,
Mesh-Connected, Cube-Connected Cycles (CCC), and Perfect Shuffle
Connection (PSC) networks (see Figures 1, 2) are some of the
networks that have been extensively studied (see references
[1]-[3]). More recently, a Homogeneous Circular Shuffle
Network (HCSN) was proposed and studied (see reference [4]).
This network has PSC as its basis (see Figure 4).


This paper demonstrates topological equivalence between
the CCC network and the HCSN network. We develop a
notation for addressing PEs in each of the networks and for
specifying interconnections between PEs (see Section 3). This
topological equivalence has a number of implications for both
HCSN networks as well as CCC networks. These relate to
routing algorithms, fault tolerance and VLSI layout (see
Section 4). It is also shown that the equivalence is more than
simply topological, in that the algorithms that run on an
HCSN network also run on a CCC network. However,
algorithms that run on a CCC network may require minor
modifications when implemented on an HCSN network.


2. THE CCC AND HCSN NETWORKS 


A CCC is a network of identical PEs, where each PE has
three interconnection ports. Each interconnection linking two
PEs may be used for bi-directional transmission of operands.
The CCC network has $N = 2^k$ PEs, where $1 < k \le r + 2^r$, $r > 1$,
and r is the smallest such integer. Here each PE is addressed as m,
$0 \le m < N$. A PE with an address m is alternatively
represented as a tuple $(l,p)$, such that $m = l \cdot 2^r + p$, $0 \le p < 2^r$,
$0 \le l < 2^{k-r}$. l and p have (k-r)-bit and r-bit representations,
respectively. The interconnection between the PEs is
described as follows: each PE has three ports, namely F, B,
and L. Thus F(l,p) is the F port of a PE numbered (l,p),
etc. The PE with address (l,p) is connected to three other
PEs as follows:

$$(B(l,p),\ F(l,(p-1) \bmod 2^r)), \quad 0 \le l < 2^{k-r},\ 0 \le p < 2^r \tag{1(a)}$$

$$(L(l,p),\ L(l + e \cdot 2^p, p)), \tag{1(b)}$$

where $e = 1 - 2\,\mathrm{bit}_p(l)$, and $\mathrm{bit}_p(l)$ is the p-th bit of l. The
notation used is as follows: (P,Q) indicates a bi-directional
communication link between the ports P and Q of some
PEs, while <P,Q> indicates a uni-directional link from port
P to port Q. (Note: The connection from F(l,p) to
B(l,(p+1) mod $2^r$) is implied by (1(a)), above.) See Figure 2 for
an example of a CCC network where N = 32, k = 5, and r = 2.
Preparata, et al. (see reference [2]) have proposed a VLSI
layout for CCC networks. See Figure 3 for the layout of a CCC
network where N = 64, k = 6, and r = 2. Looking at the layout
it is reasonable to generalize the CCC network by dropping
the requirement that $N = 2^k$. Instead, we assume $N = n \log n$
where n is a power of 2, and n > 2. As a consequence, each PE
in a CCC is numbered (l,p), $0 \le l < n$, $0 \le p < \log n$, and the
interconnections are

$$(B(l,p),\ F(l,(p-1) \bmod \log n)) \tag{2(a)}$$

$$(L(l,p),\ L(l + e \cdot 2^p, p)) \tag{2(b)}$$

where $e = 1 - 2\,\mathrm{bit}_p(l)$.

The Perfect Shuffle Connection (PSC) is a network of n
PEs, where n is a power of 2 and n > 2. Each PE has three
ports O, I, and L. The PEs are numbered as l, $0 \le l < n$.
The interconnection is described as:

$$\langle O(l),\ I(2l \bmod (n-1)) \rangle, \quad 0 \le l < n-1, \tag{3(a)}$$

$$\langle O(n-1),\ I(n-1) \rangle \tag{3(b)}$$

$$(L(l),\ L(l+1)), \quad l \bmod 2 = 0,\ 0 \le l < n. \tag{3(c)}$$

Based upon the PSC network, Tripathi and Huang (see
reference [4]) have proposed a Homogeneous Circular Shuffle
Network (HCSN). An HCSN network, parameterized by two
numbers (b,r), has r columns with $b^{r-1}$ processors in each
column, for a total of $r \cdot b^{r-1}$ processors. We shall restrict
b = 2. Processors in an HCSN network are homogeneous, with
2 input ports and 2 output ports. The interconnection pattern
is a 2-shuffle connection described below in terms of addresses
associated with input (and output) ports.


For a given column, the column-wide address of an input
port (or an output port) is x, $0 \le x < 2^r$, and is given by its
representation $(x_{r-1},\ldots,x_0)$. The network-wide address of an
input port (or an output port) x in column i, $0 \le i < r$, is
$(a_{r-1},\ldots,|a_i,\ldots,a_0)$, obtained by circularly left-shifting the
column-address until the last tuple, $x_0$, is shifted into position
i. The symbol "|" is put before $a_i$ to indicate the column of
the input port (or the output port). Since, for various values
of $a_i$, the notation refers to the two input ports (or output
ports) of the same processor, the network-wide address of a
processor in column i is given by $(a_{r-1},\ldots,a_{i+1},|,a_{i-1},\ldots,a_0)$. As
an example, if the network-wide address of a processor S is
(0,|,1,1), then its two input ports have addresses I(0,|0,1,1)
and I(0,|1,1,1), while its output ports are addressed
O(0,|0,1,1) and O(0,|1,1,1). See Figure 4 for an HCSN
network where b = 2 and r = 4. The interconnection in an HCSN
network may now be specified:

$$\langle O(a_{r-1},\ldots,a_{i+1},|a_i,a_{i-1},\ldots,a_0),\ I(a_{r-1},\ldots,a_{i+1},a_i,|a_{i-1},\ldots,a_0) \rangle \tag{4}$$


3. TOPOLOGICAL EQUIVALENCE 


Before we demonstrate the topological equivalence
between the CCC network with $n \log n$ PEs and the HCSN
network, where $r = \log n$, the following comments are made:


(1) All interconnection links in a CCC network are
bi-directional, while in an HCSN network the links are
uni-directional; that is, data may flow along a link only
from an output port of a processor to an input port of some
processor.


(2) The notion of PEs in a CCC network and that of a pro- 
cessor in an HCSN network is somewhat different. In an 
HCSN network a processor may operate on two data 
objects that it receives at its input ports, and upon pro- 
cessing, produces two data objects at its two output 
ports. On the other hand, in a CCC network two PEs 
that are directly connected may communicate data with 
each other, and any processing on these data objects 
may be performed in either (or both) of the PEs. 


To reconcile the two notions, we may take either of the 
approaches given below: 


(A) Split each processor in an HCSN network into two PEs, 
each being associated with a pair of input port and out- 
put port. Further, the two PEs are connected to each 
other by a bi-directional link (see Figure 5(a)). 


(B) Combine two neighboring PEs in a CCC network into 
one processor capable of operating on the data objects 
within the PEs (see Figure 5(b)). 


We shall take the former approach. The resulting view
of the HCSN network of Figure 4 is given in Figure 6. As
such, an HCSN network has $r = \log n$ columns, each
containing $n = 2^r$ PEs. Thus an HCSN network has a total of $n \log n$
PEs. Further, it is easy to see that an HCSN network is an
unfolding of a PSC network with n PEs; the unfolding is to
the extent of $\log n$ stages. The last stage is circularly
connected to the first. The addresses associated with the PEs
thus obtained from splitting processors in an HCSN network
may now be obtained from the network-wide address of the
corresponding input port (or output port). To show
topological equivalence between a CCC network and an HCSN
network, we rewrite the address of PEs in the HCSN network as
$((a_{r-1},\ldots,a_i,\ldots,a_0), c)$, where c is the column number, and is
equal to the index where the symbol "|" appeared in the
original network-wide address. Each PE in an HCSN network has
three ports, viz. I, O, and L. I is an input port, O is an
output port, and L is bi-directional. With this the
interconnections in the HCSN network are described as:

$$\langle O(a,c),\ I(a,(c-1) \bmod \log n) \rangle, \tag{5(a)}$$

$$(L(a,c),\ L(a + e \cdot 2^c, c)), \tag{5(b)}$$

where $e = 1 - 2\,\mathrm{bit}_c(a)$.


The equivalence between the CCC network and the 
HCSN network is now evident. The CCC network and the 
HCSN network each have nlogn PEs. That is, in the HCSN 
network there are n PEs in each of the logn columns, while 
in a CCC network there is a hypercube of n cycles each con- 
taining logn PEs. The processors in them are numbered as 
(l,p ), O<1 <n, O<p <logn. The interconnections are identical 
(see interconnections (2) and (5), above). The only difference is 
that in an HCSN network the interconnection from an output 
port to an input port of a PE is uni-directional. 
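
The comparison of interconnections (2) and (5) can also be checked mechanically. The following Python sketch is an illustration only: the function names are hypothetical, PEs are represented as (l, p) or (a, c) integer pairs, and the identity map between the two address spaces is assumed to be the isomorphism. It builds both link sets and confirms that they coincide once the direction of the HCSN output-to-input links is ignored.

```python
def ccc_edges(n):
    """Undirected links of the generalized CCC, from interconnections (2(a)) and (2(b))."""
    k = n.bit_length() - 1          # log2(n); n is assumed to be a power of 2
    edges = set()
    for l in range(n):
        for p in range(k):
            # (2(a)): B port of (l, p) <-> F port of (l, (p - 1) mod log n)
            edges.add(frozenset({(l, p), (l, (p - 1) % k)}))
            # (2(b)): L port of (l, p) <-> L port of (l + e * 2^p, p)
            e = 1 - 2 * ((l >> p) & 1)
            edges.add(frozenset({(l, p), (l + e * (1 << p), p)}))
    return edges


def hcsn_edges(n):
    """Links of the split-processor HCSN, from interconnections (5(a)) and (5(b)).

    Returns (directed, lateral): the uni-directional O -> I links as ordered
    pairs and the bi-directional L links as unordered pairs.
    """
    k = n.bit_length() - 1
    directed, lateral = set(), set()
    for a in range(n):
        for c in range(k):
            directed.add(((a, c), (a, (c - 1) % k)))                 # (5(a))
            e = 1 - 2 * ((a >> c) & 1)
            lateral.add(frozenset({(a, c), (a + e * (1 << c), c)}))  # (5(b))
    return directed, lateral


def same_topology(n):
    """True if both networks have the same links once direction is ignored."""
    directed, lateral = hcsn_edges(n)
    undirected = {frozenset(pair) for pair in directed} | lateral
    return undirected == ccc_edges(n)


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, same_topology(n), len(ccc_edges(n)))
```

For n = 8 there are 24 PEs of degree three, hence 36 links in either network.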


4. IMPLICATIONS OF EQUIVALENCE 


There are a number of implications of the above topolog- 
ical equivalence. First, a VLSI layout for an HCSN network is 
now evident. In fact, the layout of Figure 3 is a layout for the 
HCSN network of Figure 6, or equivalently that of Figure 4. 
The layout has an area n*(n-1), or more precisely
$O(N^2/\log^2 N)$. If the known layout for CCC networks is of
minimum area, then the layout for HCSN networks is also 
optimal. 


From the layout of Figure 3 it is evident that an HCSN 
network, and similarly a CCC network, has a recursive 
definition for certain values of n. Further, an HCSN network 
is symmetric not only with respect to a column (because of 
circular connection), but is also symmetric with respect to a 
PE within a column. The latter is a consequence of the fact 
that in a CCC network any cycle of PEs may be assumed to 
be at the origin of the hypercube. This symmetry allows one 
to make statements about the HCSN network assuming that 
a PE has an address (0,0) "without loss of generality", and
enables one to provide new and simpler proofs for known 
results for HCSN networks in respect of routing algorithms 
and fault-tolerance analysis. 


Properties that have been developed for the HCSN
network now readily become applicable to CCC networks. One
set of results known for the HCSN network is in respect of
message routing using control tags (see reference [4]). A
control tag, which is part of the message header, is a sequence of
control numbers. The first remaining control number is
extracted from the control tag upon receipt by a PE, and is
used to select the outgoing route or link. The rest of the
control tag is used to guide the routing following this PE. When
the control tag is empty, the message is at its destination.
Algorithms can now be obtained for routing data through a
CCC network using the scheme discussed above. Further, it
can be shown that the length of a control tag is no more than
2*logn in a CCC network with nlogn PEs. Also, results in
respect of fault-tolerance of HCSN networks can be applied to
CCC networks.
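
The control-tag mechanism described above can be illustrated with a small Python sketch on the generalized CCC of interconnections (2(a)) and (2(b)). Everything specific here is an assumption made for illustration: the function name, the encoding of control numbers (0 for the F port, 1 for the B port, 2 for the L port), and the example tag are hypothetical and are not taken from reference [4].

```python
def follow_tag(start, tag, n):
    """Deliver a message in an n*log2(n)-PE CCC by consuming its control tag.

    start : (l, p) address of the source PE
    tag   : sequence of control numbers; the encoding 0 = F port, 1 = B port,
            2 = L port is a hypothetical choice made for this sketch
    Returns the list of PEs visited, ending at the destination.
    """
    k = n.bit_length() - 1                # log2(n); n assumed a power of 2
    l, p = start
    path = [(l, p)]
    remaining = list(tag)
    while remaining:                      # an empty tag means "arrived"
        control = remaining.pop(0)        # first remaining control number
        if control == 0:                  # F(l, p) links to B(l, (p + 1) mod log n)
            p = (p + 1) % k
        elif control == 1:                # B(l, p) links to F(l, (p - 1) mod log n)
            p = (p - 1) % k
        elif control == 2:                # the L link flips bit p of l
            l ^= 1 << p
        else:
            raise ValueError(f"unknown control number {control}")
        path.append((l, p))
    return path


if __name__ == "__main__":
    print(follow_tag((0, 0), [2, 0, 2], 8))
    # prints [(0, 0), (1, 0), (1, 1), (3, 1)]
```

Each PE only inspects the head of the tag, so no routing tables are needed; how to construct a tag within the stated length bound is left to the results of reference [4].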


In view of the topological equivalence, an HCSN network 
may be treated as an implementation of a CCC network (pro- 
vided bi-directional communication between B and F ports of 
a CCC network is not insisted upon), or vice versa. And, since 
a Perfect Shuffle Connection (PSC) network can always simu- 
late an HCSN network, it is reasonable to expect that most 
algorithms for CCC networks are similar to those for PSC 
networks or even for Hypercube networks (see references [1], 
[2], and [5]).

The equivalence between HCSN networks and CCC
networks is more than simply topological. Certainly, any
algorithm that runs on an HCSN network may be made to run on
a corresponding CCC network. The reverse is, however, not
true. The reason why such functional equivalence cannot be
established is that, in a CCC network, transmission of data
between PEs within a cycle is bi-directional. The
latter implies a capability of processing data objects contained
in neighboring PEs within a cycle.


Figure 1 The Hypercube and PSC networks.


Preparata, et al. (see reference [2]) have presented a
generic DESCENT algorithm that is useful in solving a large
class of problems, including Bitonic Merge, FFT, Sorting, and
Matrix operations. This version of the algorithm for CCC
networks is immediately applicable to HCSN networks. A
difficulty, however, does arise in implementing the operations
on data objects within PEs of a cycle, viz. LOOPOPER(.). An
alternative implementation of LOOPOPER for HCSN
networks can be obtained. This algorithm assumes that
transmission is uni-directional, but that each PE is capable of storing
logn data objects. The time complexity of the DESCENT
algorithm running on an HCSN network can be shown to be
the same as that when it runs on a CCC network. The
consequence of this is that many algorithms available for CCC
networks can now be suitably recoded for implementation on
HCSN networks.
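
To make the DESCENT scheme concrete, here is a sequential Python emulation with Bitonic Merge as the example operation. It is a sketch under an assumption: the schedule used (for j from k-1 down to 0, operate on every pair of items whose indices differ only in bit j) is the usual formulation of DESCENT-style algorithms and is not spelled out in the text above. The function names are hypothetical, and neither the CCC pipelining nor LOOPOPER is modeled.

```python
def descent(T, oper):
    """Sequential emulation of a DESCENT-style schedule on N = 2^k items.

    For j = k-1 down to 0, oper is applied to every pair (T[m], T[m + 2^j])
    with bit j of m equal to 0. On a parallel machine all pairs of one
    step would be processed at once; here they are visited one by one.
    """
    N = len(T)
    k = N.bit_length() - 1
    assert N == 1 << k, "length must be a power of 2"
    for j in range(k - 1, -1, -1):
        step = 1 << j
        for m in range(N):
            if not (m >> j) & 1:
                T[m], T[m + step] = oper(T[m], T[m + step])
    return T


def compare_exchange(a, b):
    """OPER for bitonic merge: the smaller value goes to the lower index."""
    return (a, b) if a <= b else (b, a)


if __name__ == "__main__":
    # A bitonic input (ascending, then descending) comes out sorted.
    print(descent([1, 3, 5, 7, 8, 6, 4, 2], compare_exchange))
    # prints [1, 2, 3, 4, 5, 6, 7, 8]
```

Only the choice of `oper` changes from problem to problem, which is why a single network-level implementation of the schedule serves the whole class of algorithms.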
Figure 4 An HCSN network.


Figure 6 The HCSN network of Figure 4 with each
processor split into two PEs.


THE DISTRIBUTION OF WAITING TIMES 
IN CLOCKED MULTISTAGE INTERCONNECTION NETWORKS 


Clyde P. Kruskal 
Department of Computer Science 
University of Maryland 
College Park, Maryland 20742 


ABSTRACT We analyze the random delay experienced by
a message traversing a buffered, multistage packet-switching
banyan network. We find the generating function for the
distribution of waiting time at the first stage of the network for
a very general class of traffic, assuming messages have discrete
sizes. For example, traffic can be uniform or nonuniform,
messages can have different sizes, and messages can arrive in
batches. For light to moderate loads, we conjecture that
delays experienced at the various stages of the network are
nearly the same and are nearly independent. This allows us
to approximate the total delay distribution. Better
approximations for the distribution of waiting times at later stages of
the network are attained by assuming that in the limit a sort
of spatial steady state is achieved. Extensive simulations
confirm the formulas and conjectures.


1. INTRODUCTION 


Buffered interconnection networks are receiving
increasing consideration for use in parallel computers. They are
integral components of several machines currently under
development, including the Cedar machine at the University
of Illinois [7], the NYU Ultracomputer at New York
University [9], and the RP3 machine [17] at IBM, where they are
used to interconnect processors to shared memory. In order
to study the multitude of options available in actually
building a machine, it is extremely useful to have formulas that
approximate the performance of an interconnection network.
In fact, formulas derived in a previous paper by two of the
authors [12] have been heavily used in designing both the
NYU Ultracomputer [9] and RP3 [15]. While simulation
results are often more accurate, they are time consuming and
expensive. In this paper, we analyze the random delay
experienced by a message traversing a buffered, multistage
packet-switching banyan network, for a very general class of traffic.
For example, traffic can be uniform or nonuniform, messages
can have different sizes, and messages can arrive in batches.


Interconnection networks connect processing elements to 
memory modules through stages of switches (Figure 1). Early 
work in describing these networks was done by Goke and 
Lipovski [8], Lawrie [14], and Patel [16], among others. For
fuller explanations of interconnection networks see [6] or
[18], for example. There have been a number of performance
analyses of interconnection networks (e.g. [4,5,12,13]). 


The basic building block of an interconnection network
is a k-input, s-output ($k \times s$) buffered switch (Figure 2).
Each input port can accept one packet per clock cycle, and
route it to the appropriate output port. Each output port
has a FIFO buffer. Conflicts between messages
simultaneously routed to the same output port are resolved by
queueing the messages. We idealize this structure by assuming that
the output buffers have infinite length. While this is clearly
infeasible in practice, it is well known that, for light to
moderate loads, moderate sized buffers provide approximately
the same performance as infinite buffers. We also assume
that each output port buffer can accept any number of
messages from the input ports in a clock cycle, and that arriving
messages do not interfere with departing messages. Each
output port can be viewed as a discrete queueing system.


Part of the work was done while the first author was at
the University of Illinois. The first author was supported in
part by a grant from IBM.


0190-3918/86/0000/0012 $01.00 © 1986 IEEE


Marc Snir
Institute of Mathematics and Computer Science
The Hebrew University of Jerusalem
Jerusalem, Israel


We make the following probabilistic assumptions
concerning traffic at the first stage of a network:

(1) The number of messages arriving at successive cycles
to an output port are independent, identically
distributed random variables. These random variables may
have a different distribution at different ports, and are
clearly dependent from port to port.

(2) The service requirements (the number of cycles
required to forward a packet) for successive messages
at an output port are independent, identically
distributed random variables. This distribution may vary
from port to port.


Constant service time is usually the appropriate assumption
for interconnection networks realized with synchronous logic.


Assuming that the traffic is uniform (e.g. each request is
equally likely to go to each output node) and that at each
cycle a packet arrives at an input node with a fixed
probability p, the expected delay has been computed [12]. This
analysis is based on Little's identity, and it is not obvious
that it can be extended to obtain more information about the
delay distribution, such as the variance. Such information is
quite important for two reasons: First, to obtain good
performance on a parallel machine, it is not sufficient to have a low
expected memory access time; high variance will impede
performance, as it is often the case that the speed of the slowest
processor dictates the system speed. Second, as we discuss in
Section 5, the variance can be used to obtain an approximate
formula for the waiting time distribution of a message
through the entire network.


Exact formulas for both the average and variance of the
waiting time at the first stage when all messages take a single
cycle to service were obtained in a previous paper [13]. This
was used to obtain approximate formulas for longer messages
of constant size. Using stronger methods we obtain in this 
paper exact formulas for the average and variance of the wait- 
ing time at the first stage for long messages of constant size; 
in fact, we obtain the entire distribution of waiting times for 
any discrete service time distribution. In the previous paper 
[13], we also suggested a method for analyzing the waiting 
time at later stages of the network, by assuming that the out- 
put of a queue can be modeled by a Markov process; the 
approximations were in practice hard to obtain and not very 
accurate. This paper gives an alternative method for approxi- 
mating the waiting time at later stages; it is easy to use and 
provides extremely good approximations, as evidenced by a 
comparison to simulation results. 


In Section 2, we analyze the performance of the first
stage of an interconnection network, by calculating the
z-transform of the distribution of waiting times. This enables
us to compute higher moments of that distribution. In
particular we present explicit formulas for the expected value and
variance. In Section 3, the formulas are used to find the
expected value and variance of the waiting time under various
standard assumptions. For light to moderate loads, we
conjecture that the waiting times experienced at the later stages
of the network are nearly the same as for the first stage. In
Section 4, we obtain better approximations for the waiting
time at the later stages. In Section 5, we discuss how to
analyze the total delay through the network. Section 6 gives
some concluding remarks.


2. ANALYSIS 


Our model for the first stage of switching comes under 
the general rubric of a discrete time queueing system. We 
compute in this section the z-transform for the waiting time, 
and use it to derive the expectation and variance of the wait- 
ing time, for general discrete service and arrival distributions. 
The solution method we use was indicated by Kobayashi and 
Konheim [11]. We are not, however, aware of a complete 
solution to this problem in the literature. As will be seen in 
the remainder of this paper, for the queueing systems we are 
interested in, it is useful to carry out the calculations in their 
entirety. 

We start with some definitions. Let $\lambda$ be the average
number of arrivals at any cycle and m be the average service
time of a message. The traffic intensity is then $\rho = m\lambda$.


Let $f_j$ be the probability that j messages arrive at any
cycle, and

$$R(z) = \sum_{j=0}^{\infty} f_j z^j .$$

Then

$$R'(1) = \lambda .$$

Let $g_j$ be the probability that a message requires j time
units to serve and

$$U(z) = \sum_{j=0}^{\infty} g_j z^j .$$

Then

$$U'(1) = m .$$


THEOREM 1: Let w be the steady state waiting time for a
message. The z-transform of the waiting time distribution of
an output queue at the first stage is

$$t(z) = E(z^w) = \frac{1 - m\lambda}{\lambda} \cdot \frac{1 - z}{R(U(z)) - z} \cdot \frac{1 - R(U(z))}{1 - U(z)} . \tag{1}$$


PROOF: Let $s_n$ be the unfinished work at the end of the
nth cycle, $a_n$ the number of messages arriving at the nth cycle,
and $c_n$ the total service time for messages arriving at the
nth cycle. Let s, a, and c be the steady state variables
corresponding to $s_n$, $a_n$, and $c_n$. Note that

$$E(z^{a_n}) = E(z^{a}) = R(z),$$

$$E(z^{c}) = \sum_{j=0}^{\infty} E(z^{c} \mid a = j)\, f_j = \sum_{j=0}^{\infty} (U(z))^j f_j = R(U(z)),$$

and

$$s_n = \max(0,\ s_{n-1} + c_n - 1).$$

Since $c_n$ is independent of $s_{n-1}$ we obtain the identity

$$\begin{aligned}
E(z^{s}) &= E(z^{s+c-1} \mid s > 0)\,P(s > 0) \\
&\quad + E(z^{c-1} \mid s = 0, c > 0)\,P(s = 0, c > 0) \\
&\quad + E(z^{0} \mid s = 0, c = 0)\,P(s = 0, c = 0) \\
&= E(z^{c})\,E(z^{s-1} \mid s > 0)\,P(s > 0) \\
&\quad + E(z^{c-1} \mid c > 0)\,P(s = 0)\,P(c > 0) \\
&\quad + P(s = 0)\,P(c = 0).
\end{aligned}$$


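Let

$$\Psi(z) = \sum_{k=0}^{\infty} P(s = k)\, z^k = E(z^{s}).$$

The previous identity implies

$$\Psi(z) = R(U(z))\,\frac{\Psi(z) - \Psi(0)}{z} + \frac{R(U(z)) - R(U(0))}{z}\,\Psi(0) + R(U(0))\,\Psi(0),$$

so that

$$\Psi(z) = \frac{\Psi(0)\,(1-z)\,R(U(0))}{R(U(z)) - z}.$$

We compute, using L'Hospital's rule,

$$\Psi(1) = 1 = \frac{\Psi(0)\,R(U(0))}{1 - m\lambda},$$

so that

$$\Psi(0) = \frac{1 - m\lambda}{R(U(0))},$$

and

$$\Psi(z) = \frac{(1-z)(1 - m\lambda)}{R(U(z)) - z}.$$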


Since the arrival process is memoryless, arriving batches
see the steady state unfinished work distribution s. Thus the
steady state waiting time for a message is $w = s + w'$, where
$w'$ is the steady state service time for messages arriving at
the same cycle, but served first. We have

$$E(z^{w}) = E(z^{s})\,E(z^{w'}) = \Psi(z)\,E(z^{w'}).$$

Let d be the steady state number of messages that arrive at
the same cycle with any message, but are served before it; let
$\phi(z) = E(z^{d})$. Then, using the same derivation as before, we
get

$$E(z^{w'}) = \phi(U(z)).$$


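We shall now compute $\phi(z)$. The probability that any
message arrives in a batch of j messages is equal to $j f_j / \lambda$.
Thus

$$P(d = j) = \sum_{k=j+1}^{\infty} P(d = j \mid \text{message arrives in a batch of } k \text{ messages})\,\frac{k f_k}{\lambda} = \frac{1}{\lambda} \sum_{k=j+1}^{\infty} f_k ,$$

so that

$$\phi(z) = \frac{1}{\lambda} \sum_{j=0}^{\infty} z^j \sum_{k=j+1}^{\infty} f_k = \frac{1}{\lambda} \sum_{k=1}^{\infty} f_k \sum_{j=0}^{k-1} z^j = \frac{1}{\lambda} \sum_{k=1}^{\infty} f_k\, \frac{z^k - 1}{z - 1} = \frac{R(z) - 1}{\lambda (z - 1)} .$$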


The z-transform of the waiting time distribution is

$$t(z) = E(z^{w}) = \Psi(z)\,\phi(U(z)) = \frac{1 - m\lambda}{\lambda} \cdot \frac{1 - z}{R(U(z)) - z} \cdot \frac{1 - R(U(z))}{1 - U(z)} . \tag{1}$$

□


In principle this gives the complete distribution of the waiting
time.
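
The recurrence used in the proof can be checked numerically. The Python sketch below is a minimal simulation under stated assumptions, not part of the analysis: it takes unit service times (U(z) = z), uniform traffic into one output of a k x s switch, a hypothetical function name, and hypothetical parameter values. It compares the simulated fraction of cycles that end with no unfinished work against the value $\Psi(0) = (1 - m\lambda)/R(U(0))$ obtained above.

```python
import random


def simulate_first_stage(k, s, p, cycles, seed=1):
    """Simulate one output queue of a k x s switch with unit service times.

    Each of the k inputs receives a message with probability p per cycle and
    sends it to this output with probability 1/s (uniform traffic, an
    assumption of this sketch). Returns (fraction of cycles ending with an
    empty queue, mean waiting time per message).
    """
    rng = random.Random(seed)
    q = p / s                    # chance that a given input feeds this output
    work = 0                     # unfinished work s_{n-1} seen by new arrivals
    empty_cycles = 0
    total_wait = 0
    messages = 0
    for _ in range(cycles):
        arrivals = sum(rng.random() < q for _ in range(k))
        # The j-th message of a batch waits for the backlog plus j predecessors.
        for j in range(arrivals):
            total_wait += work + j
        messages += arrivals
        work = max(0, work + arrivals - 1)   # s_n = max(0, s_{n-1} + c_n - 1)
        empty_cycles += (work == 0)
    return empty_cycles / cycles, total_wait / messages


if __name__ == "__main__":
    k, s, p = 2, 2, 0.5
    lam = k * p / s                          # lambda = R'(1)
    r_at_0 = (1 - p / s) ** k                # R(U(0)) = R(0) = P(no arrivals)
    p_empty, mean_wait = simulate_first_stage(k, s, p, cycles=200_000)
    print("P(s = 0): simulated", round(p_empty, 3),
          "formula", round((1 - lam) / r_at_0, 3))
    print("E(w):     simulated", round(mean_wait, 3))
```

For these parameters the formula gives 8/9 for the empty-queue probability, and the simulated mean waiting time should be close to 1/4.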


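COROLLARY 2:

$$E(w) = t'(1) = \frac{m R''(1) + \lambda^2 U''(1)}{2\lambda(1 - m\lambda)} \tag{2}$$


COROLLARY 3:

$$Var(w) = t''(1) + t'(1) - (t'(1))^2 = \frac{\bigl(6 m \lambda R''(1) + 4 m^2 \lambda R'''(1) + 6 \lambda^3 U''(1) + 4 \lambda^3 U'''(1)\bigr)(1 - m\lambda) - 3 m^2 R''(1)^2 (1 - 2 m \lambda) + 3 \lambda^4 U''(1)^2 + 6 \lambda R''(1) U''(1)}{12 \lambda^2 (1 - m\lambda)^2} \tag{3}$$


(The derivation of $t''(1)$ used six applications of L'Hospital's
rule, and took Macsyma all night on a minicomputer.)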


3. EXAMPLES

We now apply the above formulas to derive the expected 
value and the variance of the waiting time for messages in 
several standard and important queueing systems. Note that 
the expected value formulas only give the waiting time of a 
message. To obtain the delay of a message in a queue, one 
must add to these formulas the service time. For the queue- 
ing systems in this section, message arrivals are independent 
of queue length. Thus, the variance of the delay of a message 
in a queue is simply the sum of the variance of the waiting 
time and the variance of the service time. 


3.1. Service Time One


Suppose that all messages take exactly one unit of time
to be serviced. Then m = 1 and U(z) = z. Thus,

$$U'(z) = 1, \qquad U''(z) = U'''(z) = 0.$$

Substituting into Eq. (1), we obtain

$$t(z) = E(z^{w}) = \frac{1 - \lambda}{\lambda} \cdot \frac{1 - R(z)}{R(z) - z} .$$

Substituting into Eq. (2) we get

$$E(w) = \frac{R''(1)}{2\lambda(1 - \lambda)} \tag{4}$$

and substituting into Eq. (3) we get

$$Var(w) = \frac{2\,(3R''(1) + 2R'''(1))\,\lambda(1 - \lambda) - 3(1 - 2\lambda)(R''(1))^2}{12\lambda^2(1 - \lambda)^2} \tag{5}$$


We analyze some special cases of this for k-input, s-output
($k \times s$) switches.


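3.1.1. Uniform Traffic, Single Arrivals


Suppose that each input port has a probability p of
receiving one message at each unit of time, and that each
incoming message has an equal chance of going to any of the
output ports. Then

$$f_j = \binom{k}{j} \left(\frac{p}{s}\right)^{j} \left(1 - \frac{p}{s}\right)^{k-j} .$$

This quickly yields

$$R(z) = \left(1 - \frac{p}{s} + \frac{p}{s} z\right)^{k} .$$

Calculating $R'(1)$, $R''(1)$, and $R'''(1)$ and substituting into
Eqs. (4) and (5) yields

$$E(w) = \frac{\left(1 - \frac{1}{k}\right)\lambda}{2(1 - \lambda)} \tag{6}$$

and

$$Var(w) = \frac{\left(1 - \frac{1}{k}\right)\lambda \left[6 - 5\lambda\left(1 + \frac{1}{k}\right) + 2\lambda^2\left(1 + \frac{1}{k}\right)\right]}{12(1 - \lambda)^2} \tag{7}$$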
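
As a consistency check, the closed forms for uniform traffic with single arrivals can be compared against the general unit-service formulas in exact rational arithmetic. The Python sketch below is illustrative only and its function names are hypothetical. It evaluates $E(w) = R''(1)/(2\lambda(1-\lambda))$ and the corresponding variance expression of Eq. (5) for the binomial arrival distribution $R(z) = (1 - p/s + (p/s)z)^k$, and compares them with $E(w) = (1 - 1/k)\lambda / (2(1-\lambda))$ and the variance form $(1 - 1/k)\lambda[6 - 5\lambda(1 + 1/k) + 2\lambda^2(1 + 1/k)] / (12(1-\lambda)^2)$.

```python
from fractions import Fraction


def binomial_factorial_moments(k, s, p):
    """lambda, R''(1), R'''(1) for R(z) = (1 - p/s + (p/s) z)^k."""
    q = Fraction(p) / s
    return k * q, k * (k - 1) * q ** 2, k * (k - 1) * (k - 2) * q ** 3


def moments_unit_service(lam, r2, r3):
    """E(w) and Var(w) from Eqs. (4) and (5), given lambda, R''(1), R'''(1)."""
    mean = r2 / (2 * lam * (1 - lam))
    var = (2 * (3 * r2 + 2 * r3) * lam * (1 - lam)
           - 3 * (1 - 2 * lam) * r2 ** 2) / (12 * lam ** 2 * (1 - lam) ** 2)
    return mean, var


def moments_uniform(k, s, p):
    """Closed forms for uniform traffic with single arrivals."""
    lam = Fraction(k) * p / s
    a = 1 - Fraction(1, k)
    b = 1 + Fraction(1, k)
    mean = a * lam / (2 * (1 - lam))
    var = a * lam * (6 - 5 * lam * b + 2 * lam ** 2 * b) / (12 * (1 - lam) ** 2)
    return mean, var


if __name__ == "__main__":
    for k, s, p in [(2, 2, Fraction(1, 2)), (4, 4, Fraction(3, 5)), (8, 2, Fraction(1, 5))]:
        general = moments_unit_service(*binomial_factorial_moments(k, s, p))
        closed = moments_uniform(k, s, p)
        print(k, s, p, general == closed, closed)
```

Using `Fraction` avoids rounding, so agreement here is exact rather than approximate.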


3.1.2. Bulk Arrivals


In many systems the size of a message exceeds the size of
a transmission packet; a message is transmitted in several
packets. These packets arrive at the first stage of the
network in one bulk. This can be modeled as in the previous
example, except arrivals at input ports are in batches.
Assuming a constant batch size of b messages,

$$R(z) = \left(1 - \frac{p}{s} + \frac{p}{s} z^{b}\right)^{k} .$$

Using Eqs. (4) and (5), this gives

$$E(w) = \frac{(b - 1) + \left(1 - \frac{1}{k}\right)\lambda}{2(1 - \lambda)}$$

and

$$Var(w) = \frac{b^2 - 1 + 2\lambda\left(b^2 + 2 - \frac{3b}{k}\right) - 5\lambda^2\left(1 - \frac{1}{k^2}\right) + 2\lambda^3\left(1 - \frac{1}{k^2}\right)}{12(1 - \lambda)^2} .$$

These agree with our previous formulas for the case b = 1.
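
The step from the batch arrival transform to the expected waiting time can be written out as follows. This is a worked sketch of the omitted algebra, with $h(z)$ introduced here only as shorthand for the bracketed term of $R(z)$.

```latex
\begin{align*}
h(z) &= 1 - \tfrac{p}{s} + \tfrac{p}{s}\, z^{b}, \qquad R(z) = h(z)^{k},\\
\lambda = R'(1) &= k\,h'(1) = \frac{k b p}{s},\\
R''(1) &= k(k-1)\,h'(1)^{2} + k\,h''(1)
        = \Bigl(1 - \tfrac{1}{k}\Bigr)\lambda^{2} + (b-1)\,\lambda,\\
E(w) &= \frac{R''(1)}{2\lambda(1-\lambda)}
      = \frac{(b-1) + \bigl(1 - \tfrac{1}{k}\bigr)\lambda}{2(1-\lambda)} .
\end{align*}
```

The term $(b-1)$ is the contribution of a message's own batch: even on an otherwise idle port, a message waits on average for half of the other $b-1$ packets that arrived with it.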


3.1.3. Nonuniform Traffic 


In many practical situations each input is likely to have 
a distinct favorite output port (e.g. the output port connect- 
ing a processor to its private memory; see [1]). We assume 
that $k = s$. (It is not hard to generalize this for $k \neq s$, but 
the equations become quite lengthy.) We do assume bulk 
arrivals. Each input port sends arriving messages to its favor- 
ite output port with probability q, and sends them with pro- 
bability $(1-q)/k$ to each output port (including its favorite 
output). The distribution of messages at the output ports is 
the product of two terms: the first term accounts for normal 
messages and is essentially the same as given in 3.1.2, with p 
replaced by $p(1-q)$; the second term accounts for favored 
messages. We get 


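\[ R(z) = \left(1 - \frac{p(1-q)}{k} + \frac{p(1-q)}{k}\,z^{b}\right)^{k-1}\left(1 - p\left(q+\frac{1-q}{k}\right) + p\left(q+\frac{1-q}{k}\right)z^{b}\right). \]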


Substituting into Eq. (4), 

\[ E(w) = \frac{\lambda(1-q^{2})\left(1-\frac{1}{k}\right) + b - 1}{2(1-\lambda)}. \]

Note that, for $b = 1$ and $q = 1$ we get $E(w) = 0$, and for $b = 1$ and $q = 0$ we 
obtain the same formula as in 3.1.1 (with $k = s$), as it 
should be. The general formula for the variance is quite 
lengthy. 


3.2. Constant Service Times 


We now consider the situation when messages can have 
one of several constant service times. We will only consider 
uniform traffic and single arrivals. Thus, as in Section 3.1.1, 


\[ R(z) = \left(1 - \frac{p}{s} + \frac{p}{s}\,z\right)^{k}. \]


3.2.1. Single Size 


First, suppose that each message takes exactly m units 
of time to transmit. This will occur, for instance, when each 
message is composed of an equal number (m ) of packets, and 
the constituent packets of a message are transmitted at con- 
secutive cycles. Then 


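\[ U(z) = z^{m}. \]

The traffic intensity is now 

\[ \rho = m\lambda = \frac{mpk}{s}. \]

Substituting into Eqs. (2) and (3), we obtain 

\[ E(w) = \frac{\left(m-\frac{1}{k}\right)\rho}{2(1-\rho)} \tag{8} \]

and

\[ \mathrm{Var}(w) = \frac{\left(1-\frac{1}{k}\right)\rho\left[6m - 5\rho\left(1+\frac{1}{k}\right) + 2\rho^{2}\left(1+\frac{1}{k}\right)\right] + \rho\,(m-1)\left[2(2m-1) - \rho(m+1)\right]}{12(1-\rho)^{2}}. \tag{9} \]
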
These coincide, for m = 1, with the equations of 3.1.1. 
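
A short numerical sketch makes the dependence on the message size concrete. It evaluates Eqs. (8) and (9) with exact fractions, taking $\rho = mkp/s$ as stated above; the function name is hypothetical.

```python
from fractions import Fraction

def first_stage_constant_size(k, s, p, m):
    """Return (E(w), Var(w)) when every message takes exactly m cycles."""
    rho = Fraction(m * k) * Fraction(p) / s   # traffic intensity
    if not 0 <= rho < 1:
        raise ValueError("the queue is stable only for 0 <= rho < 1")
    c = 1 - Fraction(1, k)
    d = 1 + Fraction(1, k)
    mean = (m - Fraction(1, k)) * rho / (2 * (1 - rho))
    num = (c * rho * (6 * m - 5 * rho * d + 2 * rho**2 * d)
           + rho * (m - 1) * (2 * (2 * m - 1) - rho * (m + 1)))
    var = num / (12 * (1 - rho) ** 2)
    return mean, var

# Same traffic intensity (rho = 1/2) with unit messages and with m = 4.
print(first_stage_constant_size(2, 2, Fraction(1, 2), 1))
print(first_stage_constant_size(2, 2, Fraction(1, 8), 4))
```

At a fixed intensity of 1/2 the mean wait grows from 1/4 cycle for m = 1 to 7/4 cycles for m = 4.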


3.2.2. Multiple Sizes 
Now suppose there are n service times $m_1, \ldots, m_n$, and 
service time $m_i$ occurs with probability $g_i$. This will occur 
when there are different kinds of requests. For example, read 
requests are likely to have different sizes than write requests. 
We get 

\[ U(z) = \sum_{i=1}^{n} g_i\, z^{m_i}. \]

Thus 

\[ \rho = \frac{kp}{s}\sum_{i=1}^{n} m_i\, g_i. \]

Substituting into Eq. (2), we obtain 

\[ E(w) = \frac{\lambda \sum_{i=1}^{n} m_i\left(m_i - \frac{1}{k}\right) g_i}{2(1-\rho)} = \frac{\rho \sum_{i=1}^{n} m_i\left(m_i - \frac{1}{k}\right) g_i}{2(1-\rho)\sum_{i=1}^{n} m_i\, g_i}. \]

1 


The formula for the variance could also be obtained, but it is 
quite lengthy and not particularly enlightening. 
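
As a hedged illustration of the multiple-size formula for $E(w)$, the sketch below evaluates it for a list of sizes and probabilities. It assumes $\lambda = kp/s$; the function name and the example mix of sizes are hypothetical.

```python
from fractions import Fraction

def first_stage_mean_multi(k, s, p, sizes, probs):
    """E(w) at the first stage when size sizes[i] occurs with probability probs[i]."""
    lam = Fraction(k) * Fraction(p) / s
    mean_size = sum(Fraction(g) * m for m, g in zip(sizes, probs))
    rho = lam * mean_size
    if not 0 <= rho < 1:
        raise ValueError("the queue is stable only for 0 <= rho < 1")
    weighted = sum(Fraction(g) * m * (m - Fraction(1, k))
                   for m, g in zip(sizes, probs))
    return lam * weighted / (2 * (1 - rho))

half = Fraction(1, 2)
# Sizes 1 and 3 with equal probability (average size 2) versus a constant size 2,
# both at traffic intensity 1/2 on a 2x2 switch.
print(first_stage_mean_multi(2, 2, Fraction(1, 4), [1, 3], [half, half]))
print(first_stage_mean_multi(2, 2, Fraction(1, 4), [2], [1]))
```

The mixed-size case waits longer (1 cycle against 3/4 cycle) although the average size and the intensity are the same.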


4. LATER STAGES 


We do not know how to analyze the later stages exactly 
as the inputs at successive cycles are not independent. We 
have, however, developed some very useful approximate for- 
mulas for the average and variance of the waiting time. 
These are based on two observations: First, as we progress 
through the network, the waiting time statistics quickly 
approach a limiting distribution. Second, nearly every wait- 
ing time distribution in queueing theory has an average on 
the order of $1/(1-p)$ as p tends to one; that is, if $w_i(p)$ is the 
average waiting time at the ith stage, and $w_\infty(p)$ is the limit 
as i gets large, we expect $\lim_{p \to 1} (1-p)\,w_\infty(p)$ to exist. We calcu- 
lated $w_1(p)$ exactly in Section 3, and we expect $w_1(p)$ and 
$w_\infty(p)$ to have similar qualitative properties, i.e., they should 
depend on most parameters in roughly the same way. Hence, 
it seems reasonable to estimate 

\[ r(p) \stackrel{\mathrm{def}}{=} \frac{w_\infty(p)}{w_1(p)}. \]
It is clear that $r(0) = 1$. We use simulations to estimate 
$r(1/2)$, and then simply linearly interpolate to 
obtain a value a such that 

\[ r(p) \approx 1 + ap. \tag{10} \]

Then 

\[ w_\infty(p) \approx (1 + ap)\, w_1(p). \]


We will also generalize the formulas to take into account the 
dependence of $w_i(p)$ on the stage, the switch size, and the 
message size distribution. This method of interpolation was 
previously applied to queueing systems by Burman and Smith 
[3] using light and heavy traffic theory. The light traffic limit 
is obvious in our case. We do not have a heavy traffic 
analysis for our process, so we rely on simulation instead. 


Using the same ideas, we can obtain an approximation 
for the variance. Let $v_i$ be the variance of the waiting time 
at stage i, and $v_\infty$ be the variance in the limit. Since the 
formulas for variance have one higher power of p, we expect a 
good approximation of $v_\infty(p)/v_1(p)$ to contain (at least) one 
higher power of p, i.e. we obtain a quadratic interpolation for 
the variance. The variance after several stages can be 
approximated by 

\[ v_\infty \approx (1 + ap + bp^{2})\, v_1, \]

where a and b are constants to be determined. 


In the remainder of this section, we obtain our expres- 
sion for the waiting time and variance in step-by-step general- 
izations. We first estimate them for uniform traffic when 
messages have size one, then size m, and then general size. 
Finally, we consider nonuniform traffic. 


4.1. Service Time One 


Consider service time one, $2 \times 2$ switches, single arrivals, 
and uniform traffic. For p = .5, $w_1 = .25$ (Eq. (6)), and, 
from the simulations in Table I, $w_\infty$ seems to be about .3. 
Substituting into Eq. (10) and solving for a gives $a \approx 2/5$. 
Thus, we find $r(p) \approx 1 + 2p/5$, so the waiting time 

\[ w_\infty \approx \left(1 + \frac{2p}{5}\right)\frac{p}{4(1-p)}. \]

Table I compares the simulation results with our formulas. 
The waiting time values in the ANALYSIS row are from the 
exact formula for the first stage (Eqs. (6) and (7)), and the 
waiting time values in the ESTIMATE row are from the 
above approximation for the waiting time in the limit. Note 
that the approximation seems to be slightly low for p small 
and slightly high for p large. More complete simulation 
results (not included for brevity) show that r(p) is actually 
slightly concave. An even better estimate could be obtained 
by using a quadratic approximation. 
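
The fitting step for $2 \times 2$ switches can be reproduced in a few lines. This is a sketch of the linear interpolation through $r(0) = 1$ and $r(1/2)$ only, using the values .25 and .3 quoted above; the function names are hypothetical.

```python
def w_first(p, k=2):
    """Exact first-stage mean wait for unit messages on k x k switches, Eq. (6)."""
    return (1 - 1 / k) * p / (2 * (1 - p))

def fit_slope(r_half):
    """Slope a of r(p) ~ 1 + a*p passing through r(0) = 1 and r(1/2) = r_half."""
    return (r_half - 1.0) / 0.5

def w_limit(p, a, k=2):
    """Estimated limiting mean wait (1 + a*p) * w_1(p)."""
    return (1 + a * p) * w_first(p, k)

a = fit_slope(0.3 / w_first(0.5))   # simulated limit .3 over exact first stage .25
print(round(a, 6))
print(round(w_limit(0.8, a), 6))    # estimate at a heavier load
```

The fitted slope is 2/5, and the estimate at p = .5 returns the simulated value .3 by construction.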


Using the same technique for $4 \times 4$ and $8 \times 8$ switches 
gives a value of a a bit less than .2 and a bit less than .1, respec- 
tively (see Eq. (6) and Table II). This suggests that the above 
formula can be (crudely) extended to $k \times k$ switches by 
linearly including k as a parameter. This gives the waiting 
time 

\[ w_\infty \approx \left(1 + \frac{4p}{5k}\right)\frac{\left(1-\frac{1}{k}\right)p}{2(1-p)}. \tag{11} \]


Table II compares the simulation results with our formulas. 
In Tables I and II it looks as if $w_i$ approaches $w_\infty$ 
geometrically. This suggests a formula of the form 
$r_i(p) = 1 + (1-\alpha^{i-1})(r(p)-1)$ for some $\alpha < 1$, yielding 

\[ w_i \approx \left(1 + \frac{4p}{5k}\left(1-\alpha^{i-1}\right)\right)\frac{\left(1-\frac{1}{k}\right)p}{2(1-p)} \tag{12} \]

as the expected waiting time at the ith stage. Looking once 
again at the formula for k = 2 and p = .5 (Table I), gives 
$\alpha = 2/5$ as a good approximation. It turns out that this 
value of $\alpha$ works reasonably well for general k and p. For 
brevity, we do not explicitly compare this formula to the 
simulations (although the interested reader can easily do the 
calculations). It is by no means surprising that, for a given p 
and k, $w_i$ approaches $w_\infty$ geometrically; what is perhaps 
less obvious is that a single value of $\alpha$ works well for all p and k. 
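
The stage-by-stage approximation of Eq. (12) is easy to tabulate. The sketch below assumes a geometric ratio of 2/5 and unit-size messages on $k \times k$ switches; the function name is hypothetical.

```python
def w_stage(i, p, k, alpha=0.4):
    """Approximate mean wait at stage i (i = 1 is the first stage), Eq. (12)."""
    if i < 1:
        raise ValueError("stages are numbered from 1")
    w_first = (1 - 1 / k) * p / (2 * (1 - p))
    return (1 + (4 * p / (5 * k)) * (1 - alpha ** (i - 1))) * w_first

for i in (1, 2, 3, 10):
    print(i, round(w_stage(i, 0.5, 2), 4))
```

For k = 2 and p = .5 the values start at the exact first-stage wait .25 and approach the limit .3 within a few stages.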


Applying the same techniques to variance, we find that a 
reasonable formula for the variance after several stages is 


Vo © (1 feck 2p 
12(1-p )* 


(Since this is only an approximation and since the simulation 
results do not give exact answers, there is quite a bit of free- 
dom in choosing coefficients a and b for the p and $p^2$ terms. 
Other choices will surely work just as well or better.) We can 
also estimate the variance at stage i to be 


v, & [ + Cr + 21-0) 


(1-2) p (6 - 5p (142) + 2p (+2) 


12(1—p )? 


(14) 


where a = 2/5. 


4.2. Single Service Time 


Consider the case when messages have a single constant 
size. Our model of the first stage is not a particularly good 
model for the later stages: At the first stage a source after 
sending a message can send a new message on the next or any 
later cycle. At later stages, since sources are outputs from 
queues, once a source sends a message, that source will not 
send a message for at least m cycles. This will tend to reduce 
queueing delays at the later stages. 


Later stages can be better modeled by assuming that 
messages take one cycle to be processed, but the cycle time is 
m times as long. Following [12], we use the formula for ser- 
vice time one (Eq. (11)), and, for a fixed p, multiply the time 
to process a message by a factor m, and also multiply the 
average number of packets per cycle by m. In other words, 
for a fixed traffic intensity p, the cycle time is m times as 
large. This gives the average waiting time 


\[ w_\infty \approx \left(1 + \frac{4mp}{5k}\right)\frac{\left(1-\frac{1}{k}\right)m^{2}p}{2(1-mp)}. \tag{15} \]

For $m \geq 2$, this formula is a reasonable approximation at all 
stages after the first, and, of course, we have an exact formula 


for the first stage. Table III compares the simulation results 
with our formulas. 
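
To make the scaling argument concrete, the sketch below evaluates Eq. (15) and checks that it reduces to Eq. (11) for m = 1 and that, at a fixed intensity mp, the estimate grows linearly in m. The function name is hypothetical.

```python
def w_limit_size_m(p, k, m):
    """Approximate limiting mean wait for messages of constant size m, Eq. (15)."""
    rho = m * p   # traffic intensity for k x k switches
    if not 0 <= rho < 1:
        raise ValueError("need 0 <= m*p < 1")
    return (1 + 4 * rho / (5 * k)) * (1 - 1 / k) * m * rho / (2 * (1 - rho))

# Same intensity m*p = 0.5 with m = 1, 2 and 4 on 2x2 switches.
for m in (1, 2, 4):
    print(m, round(w_limit_size_m(0.5 / m, 2, m), 4))
```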


Let us examine the behavior of the interior stages in 
light traffic. If we allow m to increase and p to decrease 
with $mp = \rho$ constant, then in time scaled by m, the first 
stage output queues become M/D/1 queues with arrival rate $\rho$ 
and service time 1. (Actually, the well-known waiting time 
statistics of M/D/1 queues can be obtained as limits of (8) 
and (9).) Now the interior stages are not precisely M/D/1 
queues in this scaling, because the packets output from previ- 
ous stages must be spaced by at least m time units. 
Nevertheless, it is clear that in light traffic the interior stages 
will resemble M/D/1 queues, but the congestion will be lower 
than at the first stage, since packets will be very unlikely to 
collide with other packets from the same source. That is, the 
congestion will be as if the arrival rate were $(1-1/k)\rho$. Using 
the M/D/1 light traffic results, 

\[ E(w) = \frac{\left(1-\frac{1}{k}\right)\rho}{2} + O(\rho^{2}) \]

and

\[ \mathrm{Var}(w) = \frac{\left(1-\frac{1}{k}\right)\rho}{3} + O(\rho^{2}). \]


Our approximations should have these properties, too. Eq. 
(15) does satisfy this. 
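
The light traffic limits can be checked against the standard Pollaczek-Khinchin moments of an M/D/1 queue with unit service time. The sketch below is illustrative only; the function names are hypothetical and the thinned arrival rate $(1-1/k)\rho$ is the one argued for above.

```python
def md1_wait_moments(rate):
    """Mean and variance of the waiting time in an M/D/1 queue with service time 1."""
    if not 0 <= rate < 1:
        raise ValueError("need 0 <= rate < 1")
    mean = rate / (2 * (1 - rate))
    var = rate / (3 * (1 - rate)) + rate**2 / (4 * (1 - rate) ** 2)
    return mean, var

def interior_light_traffic(rho, k):
    """Leading light-traffic terms with the effective arrival rate (1 - 1/k)*rho."""
    eff = (1 - 1 / k) * rho
    return eff / 2, eff / 3

rho, k = 1e-3, 2
exact = md1_wait_moments((1 - 1 / k) * rho)
leading = interior_light_traffic(rho, k)
print(exact, leading)   # the two pairs differ only at order rho**2
```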


To obtain an approximation for the variance, we argue 
as before: start with the formula for the variance at the first stage 
for unit size messages (Eq. (7)), multiply by $m^2$, change p to 
mp, and then use the light traffic analysis and the simula- 
tions to interpolate. Our heuristic formula is (see Eq. (13)) 


a 3 k k 
(1-2) mp [6 - 5mp (1+) + Amp (142) 


’ 


12(1 — mp )* 
where 2/3 was obtained from light traffic analysis. Light 
traffic analysis is a limiting case for m large; in practice we 
found that 7/10 works better than 2/3 for small and 
moderate message sizes. We match the constants $C_1$ and $C_2$ 
to simulation results, giving 


ae i: Amp F 


Yong 


wt 


12(1 — mp )* 
This approximation is still slightly low for m small, as can be 
seen in Table III. Better approximations can be obtained for 
each individual value of m; in particular, Eq. (13) is a much 
better approximation for m = 1. As with waiting times, for 
$m \geq 2$, this formula can be used to approximate variances 
for all stages after the first. 


4.3. Multiple Service Times 


As in Section 3.2.2, suppose there are n service times 
$m_1, \ldots, m_n$, and service time $m_i$ occurs with probability $g_i$. 
The average service time is $\bar{m} = \sum_{i=1}^{n} g_i m_i$. To obtain an 
approximate formula for the average waiting time, replace the 
size of all messages by their average size $\bar{m}$ and use the 
approximate waiting time formula from the previous section 
(Eq. (15)). This gives the average waiting time 

\[ w_\infty \approx \left(1 + \frac{4\bar{m}p}{5k}\right)\frac{\left(1-\frac{1}{k}\right)\bar{m}^{2}p}{2(1-\bar{m}p)}. \]



The values obtained from this formula tend to be a bit 
low. The reason is that we are approximating multiple size 
messages by their average size. Since we are able to calculate 
everything at the first stage exactly, we know how far off 
such an assumption would be at the first stage: simply the 
ratio of the actual expected waiting time and the waiting time 
assuming all messages have their average size. Assuming this 
ratio is fairly constant at the different stages, multiplying the 
above formula by this ratio gives a very good approximation: 

\[ w_\infty \approx \left(1 + \frac{4\bar{m}p}{5k}\right)\frac{\left(1-\frac{1}{k}\right)\bar{m}\,p\,\sum_{i=1}^{n} g_i m_i\left(m_i - \frac{1}{k}\right)}{2(1-\bar{m}p)\left(\bar{m} - \frac{1}{k}\right)}. \]
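
The correction can be sketched as follows: compute the average-size estimate, then multiply by the first-stage ratio of the exact mean wait to the mean wait for a constant size $\bar{m}$. The function name and the example size mix are hypothetical.

```python
def w_limit_multi(p, k, sizes, probs):
    """Return (average-size estimate, ratio-corrected estimate) of the limiting wait."""
    mean_size = sum(g * m for m, g in zip(sizes, probs))
    rho = mean_size * p
    if not 0 <= rho < 1:
        raise ValueError("need traffic intensity below 1")
    base = (1 + 4 * rho / (5 * k)) * (1 - 1 / k) * mean_size * rho / (2 * (1 - rho))
    # First-stage ratio: exact mixed-size wait over the wait for constant size mean_size.
    ratio = (sum(g * m * (m - 1 / k) for m, g in zip(sizes, probs))
             / (mean_size * (mean_size - 1 / k)))
    return base, base * ratio

base, corrected = w_limit_multi(0.25, 2, [1, 3], [0.5, 0.5])
print(round(base, 4), round(corrected, 4))
```

With sizes 1 and 3 equally likely on 2×2 switches at p = .25, the ratio is 4/3, so the corrected estimate is a third larger than the average-size estimate.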


An approximate formula for the variance $v_\infty$ could be 
obtained similarly, but, as with the variance formula for the 
first stage, it is quite lengthy. We have, however, obtained 
numerical values from both variance formulas, i.e., for $v_1$ and 
$v_\infty$. Table IV compares the simulation results with our for- 
mulas. 


5. TOTAL DELAY 


Once we have formulas for the expected value and vari- 
ance of the waiting time at a stage, these can be used to 
obtain approximations for the total waiting time. The 
expected value of the total waiting time is simply the sum of 
the average waiting time at the different stages. In particular, 
for messages of size one, summing the $w_i$ in Eq. (12) approxi- 
mates the total waiting time for an n stage network as 

\[ \left[n + \frac{4p}{5k}\left(n - \frac{1-\alpha^{n}}{1-\alpha}\right)\right]\frac{\left(1-\frac{1}{k}\right)p}{2(1-p)}, \]

where $\alpha = 2/5$. For $m \geq 2$ the average total waiting time 
can be approximated as the average waiting time from the 
first stage (Eq. (8)) plus n-1 times the waiting time at the 
later stages (Eq. (15)), which is 

\[ \frac{\left(m-\frac{1}{k}\right)mp}{2(1-mp)} + (n-1)\left(1+\frac{4mp}{5k}\right)\frac{\left(1-\frac{1}{k}\right)m^{2}p}{2(1-mp)}. \]
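
The two total waiting time estimates can be sketched in code. For unit messages the closed form is compared with a direct stage-by-stage sum of Eq. (12); for larger messages the first-stage value of Eq. (8) is added to n-1 copies of Eq. (15). Function names are hypothetical and a geometric ratio of 2/5 is assumed.

```python
def total_wait_unit(n, p, k, alpha=0.4):
    """Closed-form sum of the per-stage waits of Eq. (12) over n stages."""
    w_first = (1 - 1 / k) * p / (2 * (1 - p))
    c = 4 * p / (5 * k)
    return (n + c * (n - (1 - alpha**n) / (1 - alpha))) * w_first

def total_wait_stagewise(n, p, k, alpha=0.4):
    """The same total obtained by adding the stages one at a time."""
    w_first = (1 - 1 / k) * p / (2 * (1 - p))
    c = 4 * p / (5 * k)
    return sum((1 + c * (1 - alpha ** (i - 1))) * w_first for i in range(1, n + 1))

def total_wait_size_m(n, p, k, m):
    """Exact first stage (Eq. (8)) plus n-1 later stages (Eq. (15))."""
    rho = m * p
    first = (m - 1 / k) * rho / (2 * (1 - rho))
    later = (1 + 4 * rho / (5 * k)) * (1 - 1 / k) * m * rho / (2 * (1 - rho))
    return first + (n - 1) * later

print(round(total_wait_unit(6, 0.5, 2), 6), round(total_wait_stagewise(6, 0.5, 2), 6))
print(round(total_wait_size_m(6, 0.125, 2, 4), 6))
```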


If the waiting times from stage to stage were indepen- 
dent, as is the case with Poisson arrivals and exponential ser- 
vice times, the variance of the total waiting time would sim- 
ply be the sum of the variances at the different stages. Simu- 
lations show that waiting times at neighboring stages have 
fairly low correlation, and the correlation seems to drop 
geometrically as stages become further apart. Thus summing 
the variances should be a good approximation. 


To obtain a better approximation, note that the total 
variance is actually the sum of the covariances between 
stages. Let $v_{ij}$ be the covariance between stage i and stage 
j. Covariances seem to drop geometrically as stages become 
further apart. In particular, the $v_{ij}$ can be approximated as 
follows: $v_{ii} = v_i$, $v_{i,i+1} \approx a v_i$, $v_{i,i+2} \approx ab\, v_i$, 
$v_{i,i+3} \approx ab^{2} v_i$, $v_{i,i+4} \approx ab^{3} v_i$, ..., where 
= (1- mp) ame and 6 = (1- rege) . Now summing 


all of the covariances approximates the total variance as 
\[ \sum_{i=1}^{n}\left[1 + 2a\,\frac{1-b^{\,n-i}}{1-b}\right] v_i. \]


For m = 1, we use the $v_i$ from Eq. (14). For m > 1, $v_1$ is the 
true variance for the first stage (Eq. (9)), and $v_i$, i > 1, can be 
approximated by the formula for $v_\infty$ (Eq. (16)). Tables VI 
and VII compare the simulation results with our formulas. 
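
The summation over covariances can be verified directly. The sketch below builds the approximate covariance between every ordered pair of stages from the geometric pattern $v_{i,i+d} \approx a\,b^{d-1} v_i$ and compares the double sum with the closed form. The constants a and b are treated as given inputs (the values used here are arbitrary placeholders, not the paper's), and the function names are hypothetical.

```python
def total_variance_closed(v, a, b):
    """Closed form: sum over stages of [1 + 2a(1 - b^(n-i))/(1 - b)] * v_i, for b != 1."""
    n = len(v)
    return sum((1 + 2 * a * (1 - b ** (n - i)) / (1 - b)) * v[i - 1]
               for i in range(1, n + 1))

def total_variance_pairs(v, a, b):
    """Direct sum of the approximate covariances over all ordered pairs of stages."""
    n = len(v)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                total += v[i]
            else:
                first, gap = min(i, j), abs(i - j)
                total += a * b ** (gap - 1) * v[first]
    return total

v = [1.0, 2.0, 3.0, 4.0]          # placeholder per-stage variances
print(total_variance_closed(v, 0.1, 0.3), total_variance_pairs(v, 0.1, 0.3))
```

Setting a = 0 recovers the plain sum of the per-stage variances, which is the independent-stages approximation.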

The distribution of waiting times seems to be about the 
same for all stages. If the distributions were independent, 
then by the central limit theorem, the total waiting times for 
a large number of stages could be approximated by a (truncated) 
normal distribution, whose mean is the sum of the 
expected values and whose variance is the sum of the variances. 
The central limit theorem actually holds under much weaker 
hypotheses than independence (see, for example, [2]), and we 
expect it essentially to apply here. Now, typically in queueing 
systems, the distribution of waiting times has an exponential 
or geometric tail. Thus, for few stages, we expect a gamma 
distribution with the proper expected value and variance to 
be an even better approximation. Figures 3 and 4 show an 
incredibly good match between the gamma and the observed 
distributions, especially at the tails. The gamma distributions 
were formed with the means and variances given by the esti- 
mates from Tables VI and VII. In practice, these moments 
and the tail of the waiting time distribution are the quantities 
of interest; we believe our formulas are accurate enough for all 
practical purposes. 
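
Fitting a gamma distribution to a given mean and variance is a matter of moment matching: shape = mean²/variance and scale = variance/mean. The sketch below shows this step and evaluates the resulting density; the input moments are placeholders rather than values from Tables VI and VII, and the function names are hypothetical.

```python
import math

def gamma_from_moments(mean, var):
    """Shape and scale of the gamma distribution with the given mean and variance."""
    if mean <= 0 or var <= 0:
        raise ValueError("mean and variance must be positive")
    return mean * mean / var, var / mean

def gamma_pdf(x, shape, scale):
    """Density of the gamma distribution at x, computed in log space for stability."""
    if x <= 0:
        return 0.0
    log_pdf = ((shape - 1) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

shape, scale = gamma_from_moments(3.0, 4.5)   # placeholder total-wait moments
print(shape, scale, round(gamma_pdf(1.5, shape, scale), 4))
```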


So far we have obtained formulas for the total waiting 
time. In order to obtain the total delay in the network one 
has to add to the total waiting time the total service time. If 
service time is one, then the total service time is simply the 
number of stages. In general it is the sum of service times at 
the successive stages. 


The waiting time at one stage may depend on service 
time at a previous stage. However, the correlation is weak, so 
that these random variables are stochastically nearly indepen- 
dent. Thus, the variance of the total delay is approximately 
the variance of the total waiting time plus the sum of the 
variances of the service times. If the service times are con- 
stant, then their variances are zero, so the variance of the 
total delay is exactly the variance of the total waiting time. 
In general, the distribution of the total delay can easily be 
approximated by looking at the distributions of individual 
service times and the distribution of the total waiting time. 


6. CONCLUSION 


We have analyzed the delay experienced by a message in 
a buffered, multistage, packet-switching banyan network. For 
the first stage, we were able to derive the complete distribu- 
tion of the delay for a very general class of distributions, 
assuming messages have discrete sizes. The result is quite 
general: for example, one can use it to derive the Pollaczek- 
Khinchin formulas for M/G/1 queues. The result was used to 
determine exactly the average and variance of the delay for 
several commonly considered distributions. Using the delay 
formulas for the first stage, we developed extremely good 
approximations for the average and variance of the delay at 
later stages. Finally, this allowed us to obtain good approxi- 
mations for the full distribution of the total delay of a 
message through the entire network. 


In order to approximate the delay after the first stage, it 
was essential to have good formulas for the delay at the first 
stage. It was only by building on them that we were able to 
make educated guesses as to the delays at later stages. 


One aspect of our results that is worth stressing is the 
dependency of waiting time on the message size m: For a 
fixed traffic intensity p, the average waiting time increases 
linearly in m (see Eqs. (8),(15)), and the variance increases 
quadratically in m (see Eqs. (9),(16)). Thus, while using 
larger messages may save the overhead of duplicating the 
same routing information over several packets, it may 
dramatically increase delays in all but very lightly loaded net- 
works. This point has already been made in [9,12,13], but 
does not seem to be widely appreciated. 


While our simulations seem to indicate clearly that the average 
waiting times at successive stages converge, it would be 
nice to be able to prove this result formally, i.e., to show that 
average delay at successive stages can be bounded, indepen- 
dent of the network size. 
Abstract: A new interconnection network, the Sneptree, 
is investigated. The Sneptree consists of $2^n - 1$ identical 
nodes and each node has four links. The links are connected 
to form an augmented complete binary tree where the outgo- 
ing links of the leaves are connected to all the nodes in the 
network. We prove that a complete binary tree with arbitrary 
size can be mapped onto a Sneptree optimally. Hence, the 
Sneptree is well suited for distributed computations with tree- 
structured computation graph, such as divide-and-conquer and 
backtracking. One type of Sneptree, which contains two dis- 
joint spanning cycles and is thus called the Cyclic Sneptree, is 
of particular interest since it can simulate a fully unbalanced 
tree optimally, such as a left/right skewed tree. 


A recursive method is given to generate the H-structure 
layout of the Cyclic Sneptree. The number of crossings and 
the length of the longest wires in the H-structure layout are 
analyzed. A message routing algorithm between any two leaf 
nodes is presented. The routing algorithm, which is of O(n) 
complexity, gives a good approximation to the shortest path. 
The traffic congestion in the nodes at the upper levels is also 
significantly reduced compared to the binary tree case. 


1. Introduction 


Due to the development of VLSI technology, it is now 
possible to construct powerful computers by connecting thou- 
sands of small identical processors into a so-called “processor 
network.” Each processor has independent control and local 
memory. Hence, each processor can run its own program in- 
dependently and asynchronously. Synchronization and com- 
munication are done by message passing between neighboring 
processors. The computation is distributed over the network. 
Hence, high concurrency can be achieved. 


Many different interconnection networks have been stud- 
ied, such as the binary tree [1,11], the mesh, the systolic array 
[4], the boolean n-cube [10], etc. Some machines are dedicated 
to some special applications; some are designed as 
general purpose computing engines. One interesting problem 
which hasn’t been investigated profoundly is the mapping of 
the computation graph onto the implementation network, and 
in particular the mapping of an over-sized problem onto a fixed 
size network to keep the load of each processor balanced. In [7], 
a double-twisted torus simulates an unbounded mesh perfectly. 
The torus introduced a homogeneous processor network which 
relieves the boundary problem from a regular mesh so that a 
bigger mesh can be mapped onto this network automatically 
and optimally. 


In this paper, another homogeneous processor network is 
presented. The Sneptree [12] is a class of augmented binary 
trees with homogeneous nodes. Each node, including the root 
node and the leaf node, has four links. The links are con- 
nected such that a complete binary tree of arbitrary size can 
be mapped onto the Sneptree optimally. 


t This research was sponsored by the Defense Advanced Research 
Projects Agency, ARPA Order No. 3771, and monitored by the 
Office of Naval Research under contract number N00014-79-C-0597 


t Present address: Ametek, Computer Research Division, 610 N. 
The binary tree has the property that the distance between 
any two nodes is at most $2\log_2 n$, where n is the size of the 
tree. Such a network is called “logarithmic.” The Sneptree is 
an expanded binary tree with more links in each node. Some 
connection patterns of the Sneptree are regular and symmetric 
and hence well suited for VLSI implementation. Furthermore, 
it can simulate an unbounded binary tree so that it is best for 
divide-and-conquer type applications. Some other 
augmented binary tree networks have been investigated, such as 
the X-tree [2], the Hypertree [3], and the De Bruijn network [9]. 
Those networks are different in their connection patterns and 
applications. The comparison between the Sneptree and other 
networks will be given in the conclusion. 


In section 2, the definitions of the Sneptree and the Cyclic 
Sneptree are given and different connection patterns are pre- 
sented. The mapping of a complete binary tree onto the Snep- 
tree is proven to be optimal in section 3. Like a binary tree, the 
Sneptree can be laid out into an H-structure plane nicely. Sec- 
tion 4 presents a recursive method to construct the H-structure 
Sneptree. In section 5, a routing algorithm that routes a mes- 
sage from a leaf node to another leaf node is presented. In the 
conclusion, the Sneptree is compared to other networks, and 
future research directions are discussed. 


2. Definition of the Sneptree 


Definition: An n-level Sneptree is a complete binary tree of 
$2^n - 1$ nodes, links directed from root to leaves, augmented 
with $2^n$ additional Sneplinks directed out of the leaves, such 
that each node has 4 incident links: 2 directed in and 2 directed 
out. Each node in the tree has an incoming Sneplink, except 
for the root, which has 2 incoming Sneplinks. 
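
The link counts in the definition can be checked with a small construction. The sketch below numbers the nodes in heap order (root 1, children 2v and 2v+1) and attaches the $2^n$ Sneplinks in an arbitrary pattern chosen only for this example; it is not the connection of Figure 1 or Figure 2, and for some n it even links a leaf to itself. The function names are hypothetical.

```python
def build_sneptree(n):
    """Return {node: [two successor nodes]} for one arbitrary n-level Sneptree."""
    if n < 2:
        raise ValueError("need at least two levels")
    size = 2**n - 1
    first_leaf = 2 ** (n - 1)
    out_links = {}
    for v in range(1, first_leaf):            # internal nodes: tree links
        out_links[v] = [2 * v, 2 * v + 1]
    # Sneplink targets: the root twice, every other node once (2^n targets in all).
    targets = [1, 1] + list(range(2, size + 1))
    for idx, leaf in enumerate(range(first_leaf, size + 1)):
        out_links[leaf] = targets[2 * idx: 2 * idx + 2]
    return out_links

def in_degrees(out_links):
    """Count incoming links per node."""
    deg = {v: 0 for v in out_links}
    for successors in out_links.values():
        for t in successors:
            deg[t] += 1
    return deg

tree = build_sneptree(3)
print(tree)
print(in_degrees(tree))
```

Every node ends up with two outgoing and two incoming links, which is the homogeneity property the definition asks for.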


Notice that the Sneptree is defined to be a directed graph 
here for easier understanding. In the real implementation, the 
links should be bidirectional. Furthermore, we call the outgo- 
ing link which points to the left descendant of a node the “left 
link” and the one pointing to the right descendant the “right 
link.” 


There are many possible ways to connect those $2^n$ 
Sneplinks. One example of a planar connection is shown in 
Figure 1. This connection is not of particular interest because 
it ends up with a very unbalanced mapping for a highly unbal- 
anced binary tree, such as a left skewed tree. Another type of 
Sneptree, whose Sneplinks are connected to form two spanning 
cycles (i.e., Hamiltonian cycles), renders an optimal mapping 
for a left (right) skewed tree of any size (i.e., a linear array). 
This special type of Sneptree is called the Cyclic Sneptree. 


Figure 1. A Three-level Sneptree 


Definition: A Cyclic Sneptree is a Sneptree containing two 
link-disjoint spanning cycles. The “left cycle” contains only 
left links and the “right cycle” contains only right links. 


Theorem 1. There are $[(2^{n-1}-1)!]^2$ connection patterns for 
the n-level Cyclic Sneptree. 

The proof is given in [6]. Notice that many of these 
$[(2^{n-1}-1)!]^2$ connection patterns are isomorphic because the 
left and the right links are indistinguishable in practice. 


Figure 2. A Three-level Cyclic Sneptree 


Figure 2 shows one connection pattern of the Cyclic Snep- 
tree. The numbers attached to the nodes show the node or- 
dering in the left spanning cycle. Symmetrically, the right 
spanning cycle of Figure 2 can be represented by node se- 
quences (1,5,7,6,2,4,3,1). Such a connection pattern is regular 
and symmetric and it can be generated recursively from the 
smaller structures. This connection pattern is not planar; the 
two crossing Sneplinks between two adjacent subtrees make 
it possible to extend one subtree to the other subtree. This 
particular Cyclic Sneptree is chosen due to its regularity and 
extensibility, which are crucial properties for VLSI implemen- 
tation. Another connection pattern with the same properties 
is compared in the conclusion. We’ll show later that the con- 
nection we choose here is better than the other one. 


3. Mapping of a Binary Tree onto a Sneptree 


The mapping from a complete binary tree onto a Sneptree 
is independent of the connection patterns of the Sneptree. In 
other words, no matter how the Sneplinks are connected in 
a Sneptree, the mapping of a complete binary tree is always 
optimal. This is not true for an incomplete binary tree. The 
performance of the mapping of an unbalanced tree is affected 
by the connection pattern of the Sneptree. 


Before describing the mapping performance, we shall de- 
fine the computation graph and the implementation graph first. 
The computation graph represents the structure of the dis- 
tributed computation and the implementation graph repre- 
sents the network topology of the parallel machine. Cell and 
node are the names for a vertex of the computation graph and 
the implementation graph, respectively. An optimal mapping 
is defined as a mapping such that (1) the adjacent cells are 
mapped onto the adjacent nodes, and (2) the number of cells 
in a single node differs by at most one. 


From now on, we assume that the computation graph is 
an m-level complete binary tree, the implementation graph is 
an n-level Sneptree and m > n. Again, a cell denotes a node 
in the binary tree and a node denotes a node in the Sneptree. 
For this particular mapping problem, we use two figures to 
measure the mapping performance. The first figure is the total 
number of cells mapped onto one single node, called the load 
factor, which indicates the total work load of each node. The 
second figure is the number of cells of the same height in the 
binary tree mapped onto one single node, which is a useful 
measure when the computation wavefront goes downward and 
upward in the tree so that only the nodes at one particular 
level are active at a time. In the following, we are going to 
show that both measures are minimal for the mapping of a 
complete binary tree onto a Sneptree. Therefore, the mapping 
is optimal. 


In an n-level complete binary tree, the root is at level 0 
and the leaves at level (n-1). The height of a node is defined 
to be the distance of that node to the leaves. The height of a 
binary tree is the distance of the root to the leaves, which is 
(n-1) for an n-level complete binary tree. The above definitions 
also apply to a Sneptree. 


The optimal mapping from a complete binary tree onto a 
Sneptree is to map the root of the binary tree onto the root of 
the Sneptree and the two children of a cell onto the two direct 
descendants of a node. 


Theorem 2. All the nodes in the Sneptree contain the same 
number of cells of one particular level, say k, except that each 
node at level (k mod n) contains one more cell, when mapping 
an m-level complete binary tree onto an n-level Sneptree, $m > n$ 
and $0 \leq k < m$. 


Proof: If $k < n$, one cell will be mapped onto one node at 
the k-th level in the Sneptree. For $k \geq n$, the theorem can be 
proven by observing the construction of the Sneptree. A node 
at the j-th level of the Sneptree has two direct ancestors; one 
is its father at the (j-1)-th level, and the other is a leaf, i.e., a node 
at the (n-1)-th level. Moreover, the number of cells of level 
k which are mapped onto this node is the sum of the cells of 
level (k-1) which are located in its direct ancestors. In other 
words, 

\[ T_k(j) = T_{k-1}(j-1) + T_{k-1}(n-1), \qquad j > 0, \tag{1} \]

where $T_k(j)$ is the total number of cells at the k-th level of the 
binary tree which are mapped onto one node located at the 
j-th level of the Sneptree, for $0 \leq k < m$ and $0 \leq j < n$. 

The root node, i.e., the node at the 0-th level, has no upper 
level and its two direct ancestors are both from the bottom 
level, i.e., the (n-1)-th level. Combined with Eq. (1), $T_k(j)$ can 
be recursively defined by 

\[ T_k(j) = T_{k-1}((j-1) \bmod n) + T_{k-1}(n-1), \tag{2} \]

where $n \leq k < m$ and $0 \leq j < n$. For $k < n$, 

\[ T_k(j) = \begin{cases} 0, & \text{if } 0 \leq k < n,\ 0 \leq j < n \text{ and } j \neq k; \\ 1, & \text{if } 0 \leq k < n,\ 0 \leq j < n \text{ and } j = k. \end{cases} \]

By induction, we assume that the theorem holds for k, 
$k \geq n$. Let $k = q \times n + r$, then $T_k(r) = T_k(j) + 1$ for all 
$j \neq r$. We now prove that the theorem holds for k+1. From 
Eq. (2), when $r \neq n-1$, 

\[ T_{k+1}((k+1) \bmod n) = T_{k+1}(r+1) = T_k(r) + T_k(n-1) = 2 \times T_k(n-1) + 1, \]

and

\[ T_{k+1}(j) = T_k((j-1) \bmod n) + T_k(n-1) = 2 \times T_k(n-1), \qquad j \neq (k+1) \bmod n. \]

If $r = n-1$, 

\[ T_{k+1}((k+1) \bmod n) = T_{k+1}(0) = T_k(n-1) + T_k(n-1) = 2 \times T_k(j) + 2, \]

and

\[ T_{k+1}(j) = T_k((j-1) \bmod n) + T_k(n-1) = 2 \times T_k(j) + 1, \qquad j \neq (k+1) \bmod n. \]

Therefore, $T_{k+1}((k+1) \bmod n) = T_{k+1}(j) + 1$, $j \neq (k+1) \bmod n$, 
holds for any k. 

By induction, the theorem holds. ∎
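
The recurrence of Eq. (2) can be run directly to observe the pattern the theorem describes. The sketch below tabulates the per-node cell counts level by level and checks both the "one extra cell at level k mod n" property and the conservation of the $2^k$ cells at each level; the function name is hypothetical.

```python
def cells_per_level(m, n):
    """T[k][j]: cells of binary-tree level k held by one Sneptree node at level j."""
    table = []
    for k in range(m):
        if k < n:
            row = [1 if j == k else 0 for j in range(n)]
        else:
            prev = table[k - 1]
            # Each node inherits from its tree parent (level j-1, or a leaf for the
            # root) and from one leaf through its incoming Sneplink.
            row = [prev[(j - 1) % n] + prev[n - 1] for j in range(n)]
        table.append(row)
    return table

table = cells_per_level(7, 3)
for k, row in enumerate(table):
    print(k, row)
```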


Theorem 3. All the nodes at the top (m mod n) levels of the 
Sneptree contain the same number of cells. Similarly, the rest 
of the nodes also contain the same number of cells, and that 
number is one less than that in the top level nodes, when map- 
ping an m-level complete binary tree onto an n-level Sneptree, 
$m > n$. 


Proof: Let $T(j)$ be the total number of cells mapped onto a 
node at the j-th level of the Sneptree, i.e., $T(j) = \sum_{k=0}^{m-1} T_k(j)$. 
Let $m = q \times n + r$ and consider a node at one particular level j. 
Such a node contains one more cell at the (n+j)-th, (2n+j)-th, 
..., and ($q \times n + j$)-th levels respectively than the nodes not 
in level j, when $j < r$. In other words, there are q such levels 
in the binary tree, in which one extra cell is assigned to the 
nodes at level j, for $j < r$. For $j \geq r$, there exist only q-1 
such levels. Therefore, we can conclude that all the nodes at 
the top r = (m mod n) levels of the Sneptree contain the same 
number of cells. Similarly, the rest of the nodes also contain 
the same number of cells, and that number is one less 
than that in the top level nodes. ∎


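Corollary 4. For $k = n, n+1, \ldots, m-1$, 

\[ T_k(j) = \begin{cases} \left\lfloor \dfrac{2^{k}}{2^{n}-1} \right\rfloor, & \text{if } 0 \leq j \leq n-1 \text{ and } j \neq k \bmod n; \\[2ex] \left\lfloor \dfrac{2^{k}}{2^{n}-1} \right\rfloor + 1, & \text{if } j = k \bmod n, \end{cases} \]

and

\[ T(j) = \begin{cases} \left\lfloor \dfrac{2^{m}-1}{2^{n}-1} \right\rfloor + 1, & \text{for } 0 \leq j < (m \bmod n); \\[2ex] \left\lfloor \dfrac{2^{m}-1}{2^{n}-1} \right\rfloor, & \text{for } (m \bmod n) \leq j < n. \end{cases} \]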


The above corollary can be derived immediately from The- 
orem 2 and Theorem 3. We now get to the conclusion which 
has been addressed at the beginning of this section. 
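
The closed forms of Corollary 4 can be cross-checked by summing the per-level counts. The sketch below uses the floor expression for every level; for $k < n$ and $n \geq 2$ the floor is zero, so the same expression also covers the first n levels. Function names are hypothetical.

```python
def level_load(k, j, n):
    """Cells of tree level k on one Sneptree node at level j (floor form of Corollary 4)."""
    return 2**k // (2**n - 1) + (1 if j == k % n else 0)

def total_load(j, m, n):
    """Load factor of a node at level j, summed over all m tree levels."""
    return sum(level_load(k, j, n) for k in range(m))

def corollary_total(j, m, n):
    """Closed form for the load factor stated in Corollary 4."""
    return (2**m - 1) // (2**n - 1) + (1 if j < m % n else 0)

m, n = 5, 2
print([total_load(j, m, n) for j in range(n)])
print([corollary_total(j, m, n) for j in range(n)])
```

For a 5-level tree on a 2-level Sneptree the root holds 11 cells and each leaf 10, so the loads differ by at most one.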


Theorem 5. The mapping of a complete binary tree onto a 
Sneptree is always optimal for any connection pattern of the 
Sneptree. 


From the above discussion, it is clear that an arbitrary size 
complete binary tree can be mapped onto a Sneptree optimally 
no matter how it is connected. On the contrary, the perfor- 
mance of mapping an unbalanced binary tree is dependent on 
the connection pattern of the Sneptree. 


Theorem 6. A left or right skewed tree of any size can be 
mapped onto the Cyclic Sneptree optimally. 


Proof: The theorem is true from the definition of the Cyclic 
Sneptree. ∎


Remark A linear array of any length can be mapped onto 
the Cyclic Sneptree optimally if we map the linear array onto 
the left cycle or the right cycle of the Cyclic Sneptree. 
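
The remark can be illustrated with the right spanning cycle quoted for Figure 2. The sketch below wraps a linear array around that cycle, so consecutive cells land on consecutive nodes of the cycle and the loads differ by at most one; the function name is hypothetical.

```python
from collections import Counter

# Right spanning cycle of the three-level Cyclic Sneptree, as listed in the text
# (the closing return to node 1 is implied).
RIGHT_CYCLE = [1, 5, 7, 6, 2, 4, 3]

def map_linear_array(length, cycle):
    """Place cell i of a linear array on the node at position i mod len(cycle)."""
    return [cycle[i % len(cycle)] for i in range(length)]

placement = map_linear_array(10, RIGHT_CYCLE)
loads = Counter(placement)
print(placement)
print(dict(loads))
```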


4. Layout of a Cyclic Sneptree 


In this section, we discuss how to lay out a Cyclic Sneptree
(shown in Figure 2) on a plane. From now on, we call the
Cyclic Sneptree in Figure 2 “Sneptree” since all the discussions
in Sections 4 and 5 are based on this particular connection
pattern. The H-structure layout for a binary tree [8] is modified
to lay out a Sneptree. The recursive rule to generate the layout
is described. Also, the number of crossings and the length of
the longest wires are analyzed. Finally, we present a way
to extend the size of the Sneptree by connecting two identical
smaller Sneptrees.


Recursive Generation of H-structure Layout 


Like the binary tree, the Sneptree can be laid out into an 
H-structure plane. Because of the Sneplinks in the Sneptree, it 
is not that straightforward to build the H-structure Sneptree. 
The major concern is to minimize the number of crossings in 
the layout and keep the length of the Sneplinks as short as pos- 
sible. With the two criteria in mind, a recursive construction 
algorithm is designed. 


Figure 3. Two basic three-level H-Sneptrees 


The n-level Cyclic Sneptree can be constructed recursively
into an H-structure layout from two given basic three-level
H-Sneptrees, $A_3$ and $B_3$ (Figure 3). In Figure 3, the node
numbering is compatible with that of Figure 2. The dangling
arrows out of node 3 and node 7 are the two links incident into
the root in a regular 3-level Cyclic Sneptree. These two links
are left dangling so that the layout can be extended to bigger structures.


Let’s define two basic operations on the layout G: 


(a) mirror along x axis : $G^x$
(b) mirror along y axis : $G^y$


The recursive rules are as follows: 


1. Construct two 3-level H-structure Sneptrees $A_3$ and $B_3$ as
shown in Figure 3. $A_3$ is the one we intend to construct
and $B_3$ is an auxiliary layout to be used in constructing
bigger Sneptrees. We now want to construct $A_n$ for all n > 3
and $B_n$ for n > 3 and n odd.


2. Given two k-level H-structure Sneptrees, $A_k$ and $B_k$,
for k ≥ 3 and k odd, the (k+1)-level and (k+2)-level
H-structure Sneptrees can be constructed as shown in Figure
4. A (k+1)-level Sneptree, $A_{k+1}$, can be constructed from
two k-level subtrees, namely $A_k$ and $B_k$. A (k+2)-level
Sneptree, $A_{k+2}$, can be constructed from four k-level
Sneptrees, two copies each of $A_k$ and $B_k$. The auxiliary
(k+2)-level Sneptree, $B_{k+2}$, can likewise be constructed from
two copies each of $A_k$ and $B_k$. In both cases the mirror
operations defined above are applied to the copies as shown in
Figure 4.


Figure 4. Construction of $A_{k+1}$, $A_{k+2}$ and $B_{k+2}$: (a) $A_{k+1}$, (b) $A_{k+2}$, (c) $B_{k+2}$


For example, a 4-level Sneptree is constructed by connecting
$A_3$ and $B_3$ to an extra node and by directing the Sneplinks
as shown in Figure 5.a. Notice that $A_3$ is planar and $B_3$ has
one crossing. $A_4$ has five crossings due to the introduction of
new links, including the two links incident to the root, shown
as dotted lines in Figure 5.a. The dotted arrows into the root
node should be connected to the two dangling links coming
out of the leaf nodes to make it a complete 4-level Sneptree.
A 5-level Sneptree can be constructed as in Figure 5.b. There
are in total eleven crossings in the layout: five in the left half of
the graph, which is exactly $A_4$ except for the two links of the
root coming out to the right; four in the right half as shown
in the figure; and two from the incoming links (dotted lines) of
the root in the middle of Figure 5.b.


Figure 5. Construction of $A_4$ and $A_5$ using $A_3$ and $B_3$: (a) $A_4$, (b) $A_5$


Since the Cyclic Sneptree is not planar and the Sneplinks 
are not of constant length, we would like to know the number 
of crossings and the maximum length of the Sneplinks. 


Theorem 7 gives the number of crossings in $A_n$ and $B_n$.
$B_n$ has one more crossing than $A_n$, and both figures are
approximately 3/8 of the total number of nodes in the Sneptree.
The two crossings introduced by the incoming links of the root
in $A_n$ are not counted. The proof of Theorem 7 is given
in [6].


Theorem 7. The number of crossings in $A_n$ is $3 \times (2^{n-3} - 1)$
and the number of crossings in $B_n$ is $3 \times (2^{n-3} - 1) + 1$.
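
As a quick check of Theorem 7 against the counts quoted for Figure 5, the following sketch evaluates both formulas and the ratio of crossings to nodes. The function names are illustrative. The theorem excludes the two crossings caused by the incoming root links, so they are added back when comparing with the five and eleven crossings mentioned above.

```python
def crossings_a(n: int) -> int:
    """Crossings in the n-level layout A_n (root links excluded), n >= 3."""
    return 3 * (2 ** (n - 3) - 1)


def crossings_b(n: int) -> int:
    """Crossings in the auxiliary layout B_n; the text builds B_n for odd n only."""
    return crossings_a(n) + 1


if __name__ == "__main__":
    for n in range(3, 11):
        nodes = 2**n - 1
        ratio = crossings_a(n) / nodes
        print(f"n={n:2d}  A_n={crossings_a(n):4d}  nodes={nodes:4d}  ratio={ratio:.3f}")
```

The ratio approaches 3/8 = 0.375 from below as n grows, in line with the remark preceding the theorem.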


Assume that any single node in the layout is a square with
area a, i.e., each side of the node is $\sqrt{a}$ in length and the
wire width is negligible compared to $\sqrt{a}$. The area of the
H-structure Sneptree is a function of the node area a and the
height of the Sneptree n. The zero wire width assumption is
reasonable because the number of wires passing between any
two adjacent nodes in the layout is bounded by a fixed number.


Furthermore, we assume that the four links of each node 
may be pulled out of any side of the node. Two or more links 
may come out of the same side. The length of the wire con- 
necting two nodes is the shortest distance from the center of 
one side of the source node to the center of the nearest side of 
the other node. The wire has to route around all the nodes in 
the way. 


Theorem 8 shows that the length of the longest internal
wire in an H-layout Sneptree is about 1/4 of the width of the
layout. It is only $(3/2)\sqrt{a}$ longer than the longest wire of the
H-layout binary tree of the same size. The proof of Theorem
8 is given in [6].


Theorem 8. The longest internal wires of an n-level H-
structure Sneptree are the two wires connecting the root and
two leaf nodes at the left and the right corners. The length is

$$
\begin{cases}
\sqrt{a}\,\left(3 \times 2^{n/2-3} + 1/2\right), & n > 3 \text{ and even} \\
\sqrt{a}\,\left(2^{(n-3)/2} + 1/2\right), & n > 3 \text{ and odd} \\
2\sqrt{a}, & n \le 3
\end{cases}
$$




Extension of the Sneptree 


From the above discussion, it is clear that we can lay out a
Sneptree of any size on a single chip as long as the chip capacity
is not exceeded. Here we present a method to extend the
Sneptree by connecting two identical H-structure Sneptrees.
It is adapted from the technique proposed in [5] for recursively
constructing a binary tree from a single chip.


Figure 6. Extension of the two (m-1)-level 
Sneptrees to an m-level Sneptree 


Let one chip consist of an (m-1)-level H-structure Sneptree
with four dangling links and a single processor with its four 
links. Two of the four dangling links in the Sneptree are out 
of the leftmost and the rightmost leaf nodes, respectively. The 
other two are the incoming links to the root node. There are 
eight connectors in a single chip as shown in the solid box in 
Figure 6. 


Figure 6 illustrates how to connect two such chips into 
an m-level Sneptree. The resulting layout contains one m-level 
Sneptree with four dangling links and a single processor, which 
is now able to extend to bigger structures recursively. 


5. Routing Algorithm for Leaf Nodes 


In this section, a leaf node routing algorithm for the Snep- 
tree is presented. It is motivated by the opportunities to map 
a linear array onto the leaf nodes of the Sneptree and to utilize 
the Sneplinks to shorten the communication distance between 
two arbitrary leaf nodes. 


The design of the routing algorithm is constrained by the 
following criteria: communication distance (to find a route as 
short as possible), congestion constraint (to use the extra links 
to avoid traffic jam at the upper level nodes), and time con- 
straint (to keep the routing time as low as possible). 


The time to route a message from a node x to another
node y is the sum of the message transmission time and the
processing time at the source node and each intermediate node.
Let $t_p$ and $t_c$ be the time of one processing step and the time
of one message transmission between adjacent nodes, respectively.
Suppose $\langle x = z_0, z_1, \ldots, z_{k-1}, z_k = y \rangle$ is the route
along which the message is sent and f(i) is the number of
processing steps necessary to compute the following route at
the intermediate node $z_i$. The total routing time is

$$\sum_{i=0}^{k-1} f(i)\, t_p + k \times t_c \qquad (3)$$

In an n-level binary tree, k is bounded by 2(n − 1) and f(i)
is constant for all i so that the routing time is O(n). Notice
that bitwise operations, such as detecting whether one node is in
the subtree of another node, are assumed to take constant time.
In a Sneptree, it is obvious that the shortest distance between any
two nodes is not longer than that in the binary tree due to
the Sneplinks of the Sneptree. Hence, the second term, k, of
Eq.(3) for the Sneptree is no larger than that for the binary tree.
To keep the total routing time for the Sneptree in the same
order of magnitude as for the binary tree, we have to keep
f(i) constant in the intermediate nodes. As a consequence,
the algorithm can't always find a shortest route. It finds a
route which is shorter and less congested than the one in a
pure binary tree in O(n) time.
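
Eq.(3) can be written out directly. The sketch below is illustrative only: the function names and the sample values of $t_p$ and $t_c$ are assumptions, and `steps_per_hop[i]` stands for f(i) at the i-th node of the route.

```python
from typing import Sequence


def routing_time(steps_per_hop: Sequence[int], t_p: float, t_c: float) -> float:
    """Total routing time of Eq.(3): sum of f(i) * t_p over i, plus k * t_c."""
    k = len(steps_per_hop)  # number of links traversed
    return sum(steps_per_hop) * t_p + k * t_c


def binary_tree_worst_case(n: int, f: int, t_p: float, t_c: float) -> float:
    """Worst case in an n-level binary tree: k = 2(n - 1) hops, constant f(i) = f."""
    k = 2 * (n - 1)
    return routing_time([f] * k, t_p, t_c)


if __name__ == "__main__":
    # Hypothetical timings: 2 units per processing step, 5 per transmission.
    print(binary_tree_worst_case(n=4, f=1, t_p=2.0, t_c=5.0))
```

With constant f(i) the total grows linearly in k, which is why keeping f(i) constant at the intermediate nodes preserves the O(n) bound.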


The leaf node routing algorithm is presented in the follow- 
ing subsection and the program is given in [6]. 


The Routing Algorithm 


Before describing the algorithm in detail, we shall first 
define the breadth-first normal ordering of the nodes. 


Definition. The breadth-first normal ordering is an addressing
method for a binary tree. The nodes in an n-level binary
tree are numbered from 1 to $2^n - 1$. The root node has address
1, and the left descendant and the right descendant of a given
node x have addresses 2x and 2x + 1, respectively.


Suppose that each address is represented by an n-bit bi- 
nary number; the addresses of the left and the right descen- 
dants of a nonleaf node are derived by shifting its address one 
bit to the left and adding 0 or 1 to it. 


With such an addressing scheme, the binary address of the 
lowest common ancestor of any two leaf nodes can be easily 
decided, which is an n-bit binary number with leading zeros 
followed by the common prefix of the binary addresses of the 
two leaf nodes. Furthermore, the binary addresses of the left 
and the right corner leaf nodes of a subtree with height h have 
h trailing 0’s and h trailing 1’s, respectively. 
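
The addressing properties above translate directly into bit operations. The following sketch assumes addresses are stored as plain integers in the breadth-first normal ordering; the function names are illustrative and do not come from the program in [6].

```python
from typing import Tuple


def children(x: int) -> Tuple[int, int]:
    """Left and right descendants: shift left by one bit, then add 0 or 1."""
    return (x << 1, (x << 1) | 1)


def lowest_common_ancestor(x: int, y: int) -> int:
    """Common binary prefix of two addresses (leading zeros ignored)."""
    # Bring both addresses to the same depth, then strip bits until they agree.
    while x.bit_length() > y.bit_length():
        x >>= 1
    while y.bit_length() > x.bit_length():
        y >>= 1
    while x != y:
        x >>= 1
        y >>= 1
    return x


def corner_leaves(root: int, h: int) -> Tuple[int, int]:
    """Left and right corner leaves of the subtree of height h rooted at `root`."""
    left = root << h                   # h trailing 0's
    right = left | ((1 << h) - 1)      # h trailing 1's
    return (left, right)


if __name__ == "__main__":
    # 4-level tree: leaves are 8..15 (binary 1000..1111).
    print(bin(lowest_common_ancestor(0b1000, 0b1011)))  # common prefix 10
    print([bin(v) for v in corner_leaves(0b10, 2)])
```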


In our routing algorithm, the source node computes and
selects the shortest route between the source node and the
destination node. Then, a four-variable message carries all the
routing information necessary for a receiving node to determine
the next node on the route. The intermediate nodes need not
recompute the shortest routes.


Suppose a message is sent from a leaf node x to a leaf 
node y in a Sneptree of height n. Without loss of generality, 
we assume x<y; i.e., node x is to the left of y. Let A be the 
lowest common ancestor of x and y, B and E be two direct 
descendants of A, and triangles BCD and EFG be the two 
subtrees containing x and y (see Figure 7). In the sequel, UV 
denotes the shortest path between two nodes U and V, and 
|UV| represents the length of this path. Four possible routes 
between x and y are (xB,BAE,Ey), (xD,DE,Ey), (xB,BF,Fy) 
and (xD,DEABF,Fy). The lengths of these four routes are 
|xB|+ |Ey|+2, |xD|+|Ey|+1, |xB|+|Fy|+1, and |xD|+|Fy|+4, 
respectively. In order to find the shortest route among the four 
candidates, we need to compute |xB|, |Ey|, |xD|, and |Fy|. Notice
that |xB| and |Ey| are bounded by the height of the triangle
BCD (or EFG), and |xD| and |Fy| are bounded by twice the
height of the minimal subtree containing x and D (or y and F).
The routing algorithm takes advantage of the Sneplinks within 
the triangles to find the shortest paths xB,xD,yE, and yF. 

Figure 7. Four Possible Routes Between x and y
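
Once the four distances |xB|, |xD|, |Ey| and |Fy| are known, choosing among the four candidates is a constant-time comparison. A minimal sketch, with illustrative names and with ties broken in favour of the candidate listed first (the text does not specify a tie-breaking rule):

```python
from typing import Tuple


def best_route(xB: int, xD: int, Ey: int, Fy: int) -> Tuple[str, int]:
    """Pick the shortest of the four candidate routes between leaves x and y."""
    candidates = {
        "xB,BAE,Ey": xB + Ey + 2,    # up to B, over the common ancestor A, down from E
        "xD,DE,Ey": xD + Ey + 1,     # Sneplink from corner leaf D to E
        "xB,BF,Fy": xB + Fy + 1,     # Sneplink between B and corner leaf F
        "xD,DEABF,Fy": xD + Fy + 4,  # D to E, up to A, down to B, then over to F
    }
    name = min(candidates, key=candidates.get)
    return name, candidates[name]


if __name__ == "__main__":
    # Hypothetical distances, chosen only to exercise the comparison.
    print(best_route(xB=6, xD=5, Ey=3, Fy=4))
```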


The length of xB (or yE) can be computed recursively. The
shortest distance between B and a leaf node is 2 regardless of
the height of B. The leaf nodes that have distance 2 from B are
the two inner corner leaf nodes of the two subtrees of node B,
and the shortest paths take one of the treelinks to a descendant
of B and then take the Sneplink to the leaf (see Figure 8.a).
Let a and b be the two direct descendants of B and let c and d
be the two corner leaf nodes that have distance 2 from B. Then
the leaf nodes that are at distance 3 from B are the nodes at
distance 2 from nodes a or b, as well as the nodes at distance
1 from nodes c or d. There are six such nodes: two (a1 and
a2 in Figure 8.b) are at distance 2 from a, two (b1 and b2)
are at distance 2 from b, and c1 and d1 are at distance 1 from
c and d, respectively. Applying this technique recursively, we
can find the shortest path between any leaf node and node B.
The shortest path xB for an arbitrary leaf node x is a path
starting from node B, following the treelinks down to a certain
level of the tree, then taking the Sneplink to a leaf node and
following the shortest route from this leaf node to node x (see
Figure 9.a).


To find xB, the algorithm finds a sequence of leaf nodes
whose binary addresses differ from x only in trailing bits, and
whose trailing bits are all 1's or all 0's. Those leaf nodes can
be routed to node B through a Sneplink so that the distance
to B is shorter than the height of B. The length of a route
from x to B via one such leaf node, say $z_i$, is the sum of
the shortest distance from x to $z_i$ and the distance from $z_i$ to B.

Figure 8. The Leaf Nodes with Distance 2 or 3 from Node B

Figure 9. The Shortest Paths xB and xD


After all such routes have been computed, the shortest one
is the candidate for the shortest route xB. If the shortest
distance is longer than the height of B, then the direct route
from x to B (going upwards through treelinks) is the shortest
route. For instance, let the binary address of x be 00111010,
where we ignore the leading bits that are the common prefix
of the binary addresses of x and y because they are irrelevant
in computing xB. The height of B is 7, so the shortest
distance between x and B won't exceed 7. The first leaf node which
can take advantage of one of the Sneplinks is $z_1$ = 00111011 ($z_1$
is derived by changing the LSB of x to 1), which is at distance
1 from x and 6 from B by taking the Sneplink. Hence, the
distance is 7 by routing through $z_1$. The second leaf node
is $z_2$ = 00111111, which happens to be node c in Figure 8.a
and has distance 2 from B. The distance between x and $z_2$ can be
computed recursively and it turns out to be 4. Therefore, the
distance of xB by routing through $z_2$ is 6, which is shorter
than 7. Since there are no other leaf nodes which can take
advantage of the Sneplinks, we can conclude that the shortest
route xB is from x to $z_2$, taking the Sneplink to the right
descendant of B, and then up to B. The length of the shortest
route is 6.


The distance of xD can be derived during the computation 
of xB. Let the lowest common ancestor of x and D be t, the 
right descendant of t be s, and the leaf node at the other end 
of the Sneplink out of s be u. Then, the route (Bs,su,ux) is one 
of the candidates for the shortest path Bx. Hence, the distance 
of ux is computed while computing xB and the shortest path 
between x and D is (Ds,su,ux) whose distance can be derived 
immediately (see Figure 9.b). 


In the computation described above, we need to find the
shortest route from x to another leaf which is a corner node of
a subtree containing x. Again, this distance can be computed
recursively. For instance, to route xu in Figure 9.a, we try to
find a sequence of leaf nodes starting with x and ending with
u. Each pair of adjacent nodes in the sequence has Hamming
distance 1. We now route the message through the nodes in
the sequence. The distance from x to any intermediate node
can be computed recursively from the previous value. Let u be
a right corner leaf node of a subtree and let the node sequence
be $(x = z_0, z_1, \ldots, z_k = u)$, where $z_i$ is derived from $z_{i-1}$
by changing the least significant 0 of $z_{i-1}$ to 1, and k is the
Hamming distance of x and u. Notice that the address of u has
m trailing 1's, where m is the height of the minimal subtree
containing x and u. The recurrence relation is

$$|z_0 z_i| = \min\left(|z_0 z_{i-1}| + j,\; 2 \times j\right) \qquad (4)$$

where j is the position of the bit that differs in $z_{i-1}$ and $z_i$.
All the bits to the right of the j-th bit of $z_{i-1}$ and $z_i$ are 1's
(Figure 10). The shortest route from $z_{i-1}$ to $z_i$ takes the left
Sneplink of $z_{i-1}$ to an ancestor node of $z_i$ and then takes the
right treelinks down to $z_i$. The distance is j, i.e., the height
of the lowest common ancestor of $z_{i-1}$ and $z_i$. Furthermore,
the shortest route between $z_0$ and $z_i$ could be either the route
$(z_0 z_{i-1}, z_{i-1} z_i)$ or the route containing only treelinks and
passing through the lowest common ancestor of $z_0$ and $z_i$ (the
distance of this route is 2 × j). In the previous example, the
shortest route from x = 00111010 to $z_2$ = 00111111 takes the
Sneplink to leaf node $z_1$ = 00111011, then a Sneplink again to the
ancestor of $z_2$, and then the right treelinks down to $z_2$. The
distance is 1 (x to $z_1$) + 3 ($z_1$ to $z_2$) = 4.


Figure 10. The Shortest Path Between $z_{i-1}$ and $z_i$
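
The recurrence of Eq.(4) and the worked example for x = 00111010 can be reproduced in a few lines. This is a sketch under stated assumptions, not the program of [6]. Bit positions are counted from 1 at the least significant bit. The distance from a right corner leaf with t trailing 1's to B is taken to be 1 + (height of B − t), i.e., one Sneplink to a node t levels above the leaves followed by treelinks upward; that rule is read off the example in the text rather than stated there in general. All names are illustrative.

```python
from typing import List, Tuple


def trailing_ones(z: int) -> int:
    """Number of consecutive 1 bits at the low end of z."""
    return (~z & (z + 1)).bit_length() - 1


def corner_distances(x: int, u: int) -> List[Tuple[int, int, int]]:
    """Apply Eq.(4) along x = z_0, z_1, ..., z_k = u.

    Returns one (z_i, j, |z_0 z_i|) triple per step, where u must be x with
    its low-order bits all set to 1 (a right corner leaf).
    """
    m = trailing_ones(u)
    assert x >> m == u >> m, "u must be a right corner of a subtree containing x"
    z, dist, steps = x, 0, []
    while z != u:
        low_zero = ~z & (z + 1)        # isolates the least significant 0 bit
        j = low_zero.bit_length()      # its position, counted from 1
        z |= low_zero                  # change that 0 to 1
        dist = min(dist + j, 2 * j)    # Eq.(4)
        steps.append((z, j, dist))
    return steps


def shortest_xB(x: int, u: int, height_b: int) -> int:
    """Best of the direct treelink route and the routes via each z_i (assumed rule)."""
    best = height_b                    # direct route upward through treelinks
    for z, _, dist in corner_distances(x, u):
        via = dist + 1 + (height_b - trailing_ones(z))
        best = min(best, via)
    return best


if __name__ == "__main__":
    for z, j, d in corner_distances(0b00111010, 0b00111111):
        print(f"z={z:08b}  j={j}  distance from x={d}")
    print("shortest |xB| =", shortest_xB(0b00111010, 0b00111111, height_b=7))
```

Only the right-corner (trailing 1's) sequence is covered; the left-corner case is symmetric, with the least significant nonzero bits changed to 0 instead.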


Similarly, if u is a left corner leaf node (with trailing 0’s), 
|xu| can be derived by computing the distance of x and a series 
of intermediate nodes whose addresses are derived by changing 
the least significant nonzero bits of x to 0 until it reaches u. 


Now, we can select the shortest path among the four
candidates, (xB,BAE,Ey), (xD,DE,Ey), (xB,BF,Fy) and
(xD,DEABF,Fy). So far, all the computations and decisions
have been made at the source node x. To achieve
O(n) time performance, we don't want to repeat the computation
in any other intermediate nodes along the route. Hence,
the routing information should be sent to the intermediate
nodes to guide them to select the proper next node along the
route. It turns out that a four-variable message is enough to
carry the route information and avoid extra computation.


When the shortest route goes through xB, two figures are
needed to guide the route. Suppose the message is routed from
x to another leaf node u, then takes the Sneplink to a nonleaf
node v, and is sent up to node B (Figure 8.b). We first need to
know the route information of xu. If the route xu follows the
treelinks up and down to some intermediate leaf node (i.e.,
when 2 × j is the smaller one in Eq.(4)), we record the highest
point of this route; otherwise it is zero. The second figure we
need to know is the distance between B and v. Let's call these two
figures m1 and h1. When xB contains treelinks only, m1 is
zero and h1 is the height of node B. Similarly, m2 and h2 are
the corresponding information needed to describe the route yE.
Furthermore, we need a direction flag to guide the message to
either the left descendant, the right descendant or the father
node.


When route xD is in the shortest path, the only information
we need to know is the highest point that the route
reaches through the treelinks (i.e., when 2 × j is the smaller one
in Eq.(4)). Let's call it l1; when the route contains no
upward treelinks, l1 becomes zero. Similarly, l2 is the
corresponding information needed to describe route yF.


Let a four-variable message be (level1, dir, level2, dest). In 
general, the routing information for the route in triangle BCD 
is carried in the first two variables of the message. The third 
variable carries the information for the route in triangle EFG. 
The last variable is always the destination. More specifically, 
level1 carries the value of l1 or m1 depending on which route
is selected. Variable dir is usually a three-value variable used
to select the next node in the route when a leaf node or the
highest nonleaf node is reached. When xB is selected, h1 is
also carried by dir. Variable level2 carries either the value of
h2 or l2 depending on whether yE or yF is selected in triangle
EFG. The value of m2 has to be reproduced by a specific node 
in route Ey when Ey is selected, since there is no room to carry 
the value in a message. That specific node is the lowest nonleaf 
node traveling down from E through treelinks, from where the 
route takes the Sneplink to a leaf node and then takes the 
shortest route to y. Such a node corresponds to node v in 
route xB (see Figure 9.b). The second variable is also used to 
select one of the four possible routes. When dir is used to carry 
the direction information of the route in the triangle BCD, the 
route information for node A,B and E is carried by the fourth 
variable instead. For instance, the value of the fourth variable 
is negative when yF is selected. The routing information for 
triangle EFG will be resumed at node A,B or E by moving the 
information carried by the third and the fourth variables back 
to the first two variables. 


Performance 


The computation time in the source node x is O(n). When
Ey is selected, one of the intermediate nodes along route Ey 
needs to reproduce the value of m2 in k steps, where k is the 
height of this specific node. When xB is selected, a few nonleaf 
nodes along the route need to compute the height of the lowest 
common ancestor of themselves and the destination node. Such 
bitwise operation is again assumed to take constant time. In 
conclusion, only the source node and at most one intermediate 
node need to do some computation in O(n) time. From Eq.(3), 
we can conclude that the routing algorithm takes O(n) time to 
route the message from the source to the destination. 


The result of the routing algorithm gives a good approximation
to the shortest path xy. Furthermore, the routing
algorithm always finds the shortest path within the triangle ACG.
This routing algorithm uses only the links local to the minimal
subtree containing the source and the destination nodes. The
Sneplinks external to this subtree are never considered. As a
consequence, the two Sneplinks of the root node are never used
for routing. Because of this restriction, the routing algorithm
does not always compute the shortest path. (For example, the
route from the left corner leaf to the right corner leaf has
distance 2, whereas our algorithm chooses a route of length
twice the height of the tree.) However, this restriction has
many advantages. The algorithm is simple and yet computes
nearly optimal routes, and the traffic at the upper level nodes
is reduced.


In a binary tree, the nodes at the upper levels are the
most congested nodes because half of the leaf nodes have to
route through the root node to communicate with the other
half of the leaf nodes. When every leaf node communicates
with all the other leaf nodes, the root node has to transmit
about half of the messages and the nodes one level below the
root have to transmit 5/8 of the messages. Then, the traffic at
each node decreases level by level: 13/32, 29/128, ....
In a Sneptree, four routes may be chosen to route a message
between two arbitrary leaf nodes and only two of them pass
through the lowest common ancestor of the two leaf nodes.
Assuming the four alternatives are equally likely, the traffic
at the common ancestor is reduced to a half of the binary tree
case and the traffic at the nodes one level below the common
ancestor is reduced to three quarters since three of the four
routes pass through that node (see Figure 7). When every leaf
node communicates with all the other leaf nodes, the traffic
at the top level nodes becomes 1/4, 7/16, 19/64, 43/256, ...
of the total amount of messages. The figures show that the
traffic at the top level nodes is reduced to about a half of the
binary tree case. The actual figures depend on the height of
the Sneptree. The traffic at the nodes of the same height is no
longer the same and the exact figures need more analysis.
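
The fractions quoted above can be reproduced with exact arithmetic. The sketch below rests on a simplified model that is an assumption of this example, not a derivation from the paper: source and destination leaves are drawn uniformly and independently, a node at depth d (root at depth 0) covers a fraction $2^{-d}$ of the leaves, and its traffic is split into messages that turn around at the node (the node is the lowest common ancestor) and messages that merely pass through. For the Sneptree, scaling the first share by 1/2 and the second by 3/4 is one reading of the text that happens to reproduce the stated sequence.

```python
from fractions import Fraction
from typing import Tuple


def shares(depth: int) -> Tuple[Fraction, Fraction]:
    """(pass-through, turn-around) message fractions at a node of given depth."""
    p = Fraction(1, 2**depth)      # fraction of leaves below the node
    through = 2 * p * (1 - p)      # exactly one endpoint below the node
    turning = p * p / 2            # endpoints in different child subtrees
    return through, turning


def binary_tree_traffic(depth: int) -> Fraction:
    through, turning = shares(depth)
    return through + turning


def sneptree_traffic(depth: int) -> Fraction:
    """Assumed scaling: 3/4 of pass-through traffic, 1/2 of turn-around traffic."""
    through, turning = shares(depth)
    return Fraction(3, 4) * through + Fraction(1, 2) * turning


if __name__ == "__main__":
    for d in range(4):
        print(d, binary_tree_traffic(d), sneptree_traffic(d))
```

Under this model the ratio of Sneptree to binary tree traffic is exactly 1/2 at the root and tends to 3/4 for deeper nodes, so the quoted "about a half" is most accurate near the top of the tree.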


The simulation results show that the average routing
distance between any two leaf nodes gets closer to the optimal
average distance as the Sneptree gets bigger. The simulation
also shows that for some specific communication patterns the
routing result is almost optimal, such as shift by $2^k$ operations,
i.e., routing one leaf to another leaf at distance $2^k$ apart.
Figure 11 shows the optimal results and our routing algorithm
results for the average distance between any two leaf nodes and the
average distance of a perfect shuffle operation. Figure 12 shows
the average distance of shift by $2^k$ operations; the curves for
routing results and optimal results overlap in Figure 12.


6. Conclusion 


The Sneptree is a versatile interconnection network for 
distributed computation. The boundary problem of a binary 
tree is eliminated in the Sneptree so that the mapping of an 
over-sized computation tree is done automatically. Moreover, 
a complete binary tree of arbitrary size can be mapped onto 
a Sneptree optimally. And a left/right skewed tree can be 
mapped onto a Cyclic Sneptree optimally. 


The Sneptree is also suitable for VLSI implementation. 
It is possible to build a Sneptree of any size in a single chip 
with area proportional to the total number of processors. The 
H-structure layout of the Sneptree is regular and can be re- 
cursively constructed. The number of crossings due to the 
extra links is proportional to the number of nodes of the Snep- 
tree. The longest wire length is about the same as that in 
an H-structure binary tree. Furthermore, the Sneptree can be 
expanded easily by connecting two or more chips together. 


The leaf node routing algorithm allows us to take advan- 
tage of the extra links in the Sneptree. Hence, a shorter and 
less congested route between any two leaf nodes can be found 
in O(n) time. The routing algorithm gives a good approximation
to the optimal solution. For some special communication
patterns, such as shifting by $2^k$, the average routing result is
almost optimal. Besides, the traffic at the upper-level nodes is
reduced to about a half of the traffic in a pure binary tree.


Comparison with Other Networks 


Like the Sneptrees, the X-tree [2] is an augmented binary
tree with identical nodes. Three ports per node, four ports
per node and five ports per node are considered. The degree
of each node is not fixed but the maximal degree is limited
by the number of ports per node. Besides the binary tree
connection, the extra ports can be connected arbitrarily. The
main purpose of the X-tree is to provide fault-tolerance and
uniform message traffic.

Figure 11. The average routing distances

Figure 12. The average routing distance of shift by $2^k$ operations


The Hypertree [3] is a binary tree with extra horizontal 
links (i.e., the links connecting the nodes located at the same 
level). The horizontal links provide a set of n-cube connections. 
Four ports per node and five ports per node are considered.
Similarly, the main concern of the Hypertree is to provide fault- 
tolerance and shorten the distance between two arbitrary leaf 
nodes. 


De Bruijn Networks [9] are a class of fixed degree logarithmic
networks with an arbitrary number of nodes and degree. A
De Bruijn Network with $(2^n - 1)$ nodes and degree 4 happens
to be a Sneptree. Such networks are good for a communication
network since the optimal routing path can be decided
with local information and fault-tolerance is easily provided.


Compared with the other similar networks, the Sneptree
is the only network which can simulate an over-sized binary
tree. The X-tree and the Hypertree contain extra links between
sibling nodes so that they can simulate a ring connection or an
n-cube connection. They cannot handle the mapping of an
over-sized problem well. The de Bruijn network with degree 4 is one
type of Sneptree. Its connection pattern is neither cyclic,
symmetric nor regular. There is no way to lay out the network
or extend the network.


Different Connection Patterns 


From Theorem 1, we know there are many different con- 
nection patterns for the Cyclic Sneptree. It will be interesting 
to compare the performance of different connection patterns 
of the Cyclic Sneptree in terms of the communication distance 
and the mapping performance of an unbalanced tree. 


Figure 13. A Planar Cyclic Sneptree: panels (a) and (b)


Figure 13.a shows another Cyclic Sneptree. The numbers
attached to the nodes show the node ordering in the left
spanning cycle. Symmetrically, the right spanning cycle can be
represented by the node sequence (1,5,4,3,2,7,6,1). Such a
connection pattern also has a regular structure and hence can be
generated recursively. It is interesting to observe that this
connection is planar if we switch the position of the leaf node pairs
(3,7) and (6,4) as shown in Figure 13.b. Comparing the Cyclic
Sneptree shown in Figure 2 with this one, the latter (Figure
13.b) contains four duplicate links, i.e., (2,3), (4,5), (3,4)
and (6,7), while the former (Figure 2) has only two duplicate
links, i.e., (3,4) and (6,7). The duplicate links prevent
the Sneptree from connecting more nodes together. Hence, the
second connection pattern doesn't perform as well as the first
one in terms of communication.


There are many other connection patterns for the Cyclic
Sneptree; some of them may perform better than the one we
chose in terms of the communication distance between two
arbitrary nodes and the mapping performance of an unbalanced
tree. But only the two connection patterns discussed
above can be constructed recursively from the smaller Sneptrees
without breaking the internal Sneplinks in the smaller
structures. This property is important for VLSI implementation.


Figure 14. A three-level Exchange Sneptree

The Exchange Sneptree is a different type of Sneptree. In an
Exchange Sneptree, the outgoing Sneplinks of the leaves in the
left half of the Sneptree are directed to the incoming Sneplinks
of the nodes in the right half plus one incoming link of the
root, and similarly for the other half of the Sneplinks.


One example of the Exchange Sneptree is shown in Figure
14. This connection is symmetric but neither extensible nor
cyclic. This connection has a very nice property: no matter
which node the root is mapped onto, it results in a nearly
optimal mapping for a complete binary tree. As a consequence, we
found that the mapping performance of an unbalanced binary tree
onto such an Exchange Sneptree is better than the same mapping
onto a Cyclic Sneptree. The properties of the Exchange
Sneptree need further investigation.


Abstract: Non-uniform traffic distributions in a multistage
network characterized by “hot spots” (destinations
getting more than their share of traffic) can cause
dramatic reductions in the maximum throughput of the
network. In this paper we develop an analytical model
predicting how long a “hot spot” must persist in
shuffle/exchange networks before its full effect is felt.
The model predicts that hot spots will disrupt network
traffic severely in a very short time: 10 to 50 instruction
execution times in a shared-memory machine. This result,
verified by simulation, leads to the conclusion that if
stringent measures are not taken to ensure uniformity, the
performance of large multistage networks will be
substantially worse than has been previously predicted.


1.0 Introduction 

Multistage interconnection networks with distributed 
routing have often been proposed as a means of connect- 
ing large parallel or distributed computing systems. 
However, for such networks it has been shown that sta- 
tistically non-uniform traffic patterns—patterns contain- 
ing a hot spot [6] that gets more than its share of the 
traffic—can cause severe performance degradation for all 
network traffic, not just traffic to the hot spot. For ex- 
ample, as little as 0.125% imbalance in a 1000-way net- 
work can limit network throughput to less than 50% of 
its maximum value. This is independent of network 
topology, redundant paths, or mode of use of the network 
(e.g.: message passing, shared memory, circuit vs. packet 
switching, etc.). This effect was discovered in the IBM 
RP3 project [5,1,4], and first reported in [6]. It was also 
reported there that the technique of “combining” messages
in the interconnection network could solve the
problem for some cases of interest. 

However, the analysis and simulation reported in [6] 
does not address a crucial issue: Over what time interval 
must a non-uniform pattern be sustained in order to reach 
tree saturation? This is important, because it is a critical 
measure of how uniform the traffic must be to avoid hot 
spot problems. Statistical uniformity is much more easily 
achieved when averaged over hours (for example) than 
when averaged over microseconds. 

We address that issue here by developing a model,
verified by simulation, of how long a hot spot must persist 
before its effects are fully felt. This provides a lower 
bound on the interval over which uniformity must be 
measured. 

Unfortunately, the result is that the required interval 
is quite short indeed. For example, with a 1024-way net- 
work of 4-way switches containing 4-element queues, a 
0.125% hot spot non-uniformity will have its full effect 
within (approximately) 10 to 50 times the minimum time
to traverse the switch. (See Figure 4.) 

We can only derive a crude lower bound for the time
for a network to recover from a hot spot; that appears to
be a more complex process. However, both that lower
bound and our simulations demonstrate that the recovery
time is much longer than the onset time.

Before deriving these results and comparing them 
with simulation, a brief overview of the hot spot effect 
will be given. A discussion of possible remedies, and of 
our conclusions, ends the paper. 


2.0 Hot Spot Contention and Tree Saturation

Here we summarize the results presented in [6], with a 
slight addition. 

Consider a two-sided packet-switched multistage 
network, with p ports on each side, connected to message 
sources on one side and message sinks on the other, such 
as the Omega network [3] illustrated in Figure 1. 

Suppose the traffic pattern is initially uniform, with 
messages emitted from each source at a rate r 
(0 < r < 1). Then, at some time after a steady state has 
been achieved, the traffic pattern is altered so that a
fraction h, 0 < h < 1, of all references is aimed at a spe-
cific sink: the hot sink. I.e., each source emits r(1 - h)
messages uniformly distributed, and rh messages to the
hot sink; h is the hot spot rate. As a result, the hot sink
receives two components of traffic: r(1 - h) from the
uniform background, and rhp from the hot spot.

If h is large enough, the rate into the hot sink will be
unity due to the rhp term. If this happens, the queues in
the network switch closest to that sink will fill. This
causes the preceding switches’ queues to fill; then the next 
preceding; etc. Finally, a tree of switches rooted at the 
sink and extending to every source is saturated. This is 
called tree saturation, and is illustrated by the marked 
switch queues in Figure 1. 

Once tree saturation is in effect, every message from 
any source to any sink must cross the saturated tree and 
so is delayed. In effect, all the network traffic is gated by 
the speed at which the single hot sink can dispose of its 
messages. In the steady state, this occurs when the total 
traffic rate into the hot sink (r(1 - h) + rhp) equals unity.
In [6] it was shown that this has a dramatic effect as the
system is scaled up in size, as noted in the examples cited 
in the present paper’s introduction. In the steady state, 
hot spot effects are independent of network topology, fi- 
nite queue size, etc. (However, the timing analysis pre- 
sented here does depend on these factors.) 
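
As a rough check on the steady-state condition above, setting r(1 - h) + rhp = 1 and solving for r gives the largest per-source rate that can be sustained once tree saturation is in effect. The Python sketch below evaluates that bound; the function name is illustrative and does not come from [6].

```python
def max_sustainable_rate(h: float, p: int) -> float:
    """Largest per-source rate r that satisfies r(1 - h) + r*h*p <= 1.

    h is the hot spot rate (fraction of references aimed at the hot sink)
    and p is the number of network ports.
    """
    return 1.0 / (1.0 - h + h * p)


if __name__ == "__main__":
    # 0.125% hot spot in a 1000-port network, the example quoted
    # in the introduction.
    print(max_sustainable_rate(0.00125, 1000))
```

For h = 0.125% and p = 1000 the bound falls below 0.5, which agrees with the "less than 50% of its maximum value" figure quoted in the introduction.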

Beyond what was presented in [6], we note here that 
tree saturation is a finite queue effect. But in order to 
eliminate it, the queues in the final stage of the network 
must be large enough to accommodate the maximum hot 
spot traffic of the entire system. In other words, their size 
must be equal to M x the number of network ports, where 
M is the maximum number of messages that can be si- 
multaneously outstanding from each source. Thus if 
queue sizes are taken into account, the total network size 
has another factor equal to the number of network ports.
This raises the network size to O(M x N^2 log N), rather
than the usually-cited O(N log N). The factor of N^2 ne-
gates any size advantage over a full crosspoint switch.


3.0 Modelled Behavior 

The network behavior modelled here assumes that the 
network is initially in a steady state with uniformly dis- 
tributed traffic flowing through it at a rate r. At time 0, 
all the sources simultaneously change their traffic patterns 
to include a hot spot. r does not change, but a fraction h 
of r sufficient to cause tree saturation is now directed at 
a hot sink. The throughput of the network now declines 
until at a time T it reaches a steady-state minimum caused
by tree saturation. We wish to estimate T as a function
of r, h, network size, etc. 

Packet switching is assumed, with one packet per 
message. In the analysis, one time unit is required for a 
packet to move from one switch to the next. For reasons 
explained later, the results shown in the figures are scaled
to be in units equal to the minimum time to traverse the
switch.


4.0 Model of Onset 


It is convenient to imagine the messages sent to the hot 
sink after time 0 as being colored red, and all other mes-
sages colored white. The total rate at which each source
emits red messages is

$$R = rh + \frac{r(1 - h)}{p}$$


R is not simply rh because 1/p of the messages from the 
uniform background are sent to the hot sink, where p is 
the number of network ports. 

Rather than dealing directly with the complex dy- 
namics of message flow through the network, we will 
count the red messages in the network. On the one hand, 
the number N of red messages is a function of time and of 
their total arrival and departure rates from the network 
as a whole. On the other hand, when tree saturation is 
reached there is a steady-state number N of red messages 
in the network that is a function of the input traffic and 
the amount of buffering available. So equating N in both 
formulations can tell us how long it takes to reach satu- 
ration. 


4.1 Arrival and Departure 


To estimate the arrival and departure rates, we make a set 
of assumptions that, overall, amount to the general as- 
sumption that the onset of tree saturation happens fast 
and suddenly — too fast and suddenly for the internal 
dynamics of the network to have much effect on arrival 
and departure rates until the point of saturation is 
reached. 

We assume that the total arrival rate of red messages 
is constant (i.e., is Rp) until tree saturation is reached; and
then it drops instantly to the tree saturation value. Under
this assumption, the number of red messages generated by
time T is simply RpT. Our simulations do not verify a
constant arrival rate: the input rate does decline with 
time. Nevertheless the final results are adequate. 

For the departure rate, we assume the following:

At T = 0, the first red messages enter the first switch
stage. They make their way through the tree of switches
that will be saturated, gradually becoming more and more
concentrated, until they reach the hot sink. Then:

1. Until the hot sink is reached, there is no effect on the
rate at which messages are transported. The first red
messages reach the hot sink at a time D(r) that equals
the average delay through the network at a total input
rate of r.

2. At D(r), the concentration of red messages is imme-
diately sufficient to saturate the hot output port; i.e.,
after D(r) that port emits messages at a rate of unity.

3. After D(r), all the messages emitted by the hot output
port are red.

While somewhat unrealistic, these assumptions are 
conservative. They overestimate the departure rate, and 
thus indicate that the saturated tree fills with red mes- 
sages sooner than it actually will. 

With those assumptions, the number of messages
that have left the network at time T is simply T - D(r).
Then, since the number of red messages arriving by time
T is RpT, N = RpT - (T - D(r)). Solving this for T yields

$$T = \frac{N - D(r)}{Rp - 1}$$

The denominator becomes zero as the combination of R
and p reaches a point inadequate to sustain an output rate
of unity. I.e., as that point is approached the time to sat-
uration approaches infinity, which is expected.

To estimate D(r) we use the well-known formula for
the average queue length in a switch stage [2]:

$$B(r) = \frac{r\left(1 - \frac{1}{K}\right)}{2(1 - r)}$$

where r is the steady state rate and K is the number of
ports of each switch in the network. Then
D(r) = S(1 + B(r)), where S is the number of switch stages
(the depth of the network). Because it assumes infinite queues,
B(r) will again produce conservative results (faster trans-
portation than reality) for high total rates r.


4.2 Steady-State Population


In the steady state of tree saturation, N is the sum of the
number of red messages n_i at each network stage (we
count stages from 0, starting at the stage closest to the
sinks):

$$N = \sum_{i=0}^{S-1} K^i \times (K + q) \times mix_i$$

K is the size of each individual switch, i.e., number of in-
put and output ports, so K^i is the number of switches at
each stage that lie within the saturated tree. q is the size
of each queue; (K + q) is the total storage available for
messages aimed at the hot spot in each switch stage. (The
additional K is due to an additional buffer on each input port
used in the simulation; this is discussed later.)

For each switch at stage i in the saturated tree, mix_i
is the fraction of red messages in its queue during steady-
state tree saturation:

$$mix_i = \frac{AR_i}{AR_i + AW_i}$$

AR_i and AW_i are respectively the arrival rates of red and
white messages at stage i:

$$AR_i = R \times \frac{p}{K^i}$$

because p/K^i is the number of sources in the subtree
leading to each switch in stage i.

$$AW_i = r(1 - h) \times \frac{K^i - 1}{K^i}$$

Recall that r(1 - h) is the total rate of uniform back-
ground traffic, part of which is directed at the hot sink.
Since there are K^i possible destinations for the cool traffic
at stage i, and one of them is the hot sink, the fraction
above follows.

If we substitute back into the expression for mix_i,
substitute the original expression for R, and simplify, we
obtain

$$mix_i = \frac{Hp + 1}{Hp + K^i}$$

where H = h/(1 - h), the ratio of hot to cool packet
generation.

Finally substituting back into T, with slight simplifi-
cation, we get

$$T \approx \frac{(K + q)\sum_{i=0}^{S-1} K^i \, mix_i - S(1 + B(r))}{Rp - 1}$$


5.0 Simulation 

To verify the above model, we ran simulations of the sit- 
uation described above for a number of cases. These re- 
sults are plotted with the analytical predictions in 
Figure 2 through Figure 4. 

The switches simulated had two non-standard char- 
acteristics that match those of [6], and serve to make the 
simulated network act more like the ideal network mod- 
elled: 

1. Each output queue can simultaneously accept K mes- 
sages in one time unit. While fairly realistic for 
K =2, this is undoubtedly unrealistic for larger 
switches. 

2. Each input port to a switch has an additional one- 
message lookaside buffer that is not counted in the 
queue size. This allows the queues to be more fully 
utilized, since without it there must be K empty posi- 
tions in every queue of a switch for any of the switch’s 
predecessors to be enabled to send messages. This is
the source of the additional K buffers per switch that 
was included in the prior analysis. 

A complication arose in deciding what to measure as 
the time at which the switch reaches saturation. Our for- 
mula effectively assumes that all the queues in the satu- 
rated tree fill up simultaneously, and this is clearly not the 
case. What we did was find, from the simulations, the 
average red message occupancy of the queues in steady- 
state tree saturation. Then the time to saturate was taken 
to be the time at which 80% of that steady-state value 
was reached. 80% was chosen because at approximately 
that value there is a single message slot unused in each 
queue. 
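
The measurement rule described above can be sketched as a small Python helper. It assumes the simulation yields a per-time-step series of average red-message occupancy and a separately measured steady-state value; both inputs and the function name are hypothetical, since the original simulator is not described in that detail.

```python
from typing import Optional, Sequence


def time_to_saturation(
    occupancy: Sequence[float],
    steady_state: float,
    fraction: float = 0.8,
) -> Optional[int]:
    """Return the first time step at which the average red-message
    occupancy reaches `fraction` of its steady-state value, or None
    if the threshold is never reached in the recorded series."""
    threshold = fraction * steady_state
    for t, value in enumerate(occupancy):
        if value >= threshold:
            return t
    return None


if __name__ == "__main__":
    trace = [0.0, 1.0, 2.0, 3.0, 4.0, 4.0]
    print(time_to_saturation(trace, steady_state=4.0))
```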

All plotted points are the means of 200 simulation 
runs each. 


6.0 Results 


A surprisingly short amount of time is required to reach 
tree saturation.

Figure 2 shows the time to tree saturation as a 
function of the initial uniform background rate r for val-
ues of h ranging from 0.125% to 16% in factors of two.
It assumes a 64-way Omega network where each 2-way
switch has queues of size 4. The time plotted is not T as de-
rived above, but rather T/S: T expressed in units of the
minimum time required for a message to traverse the switch
in one direction, which in our formulation equals the depth
of the network.
This unit was chosen for two reasons: First, it allows
meaningful comparison across different network and
switch sizes; it turns out that, when expressed in this unit,
the time to saturate is relatively constant across network
and switch sizes (10-50, for 4 element queues). Second,
in a shared memory system, it is typically comparable to
the time required to execute a single instruction in a
processor. (It is not identical to the time required to per-
form a complete memory reference; that time includes
two trips across the network (request and reply) as well
as memory access time and other delays.)

The dotted lines at the top of the figure mark the 
minimum background rate below which each plotted hot 
spot rate will not cause tree saturation; this is equivalent 
to the maximum rate sustainable with the associated hot 
spot rate. Thus the graphs can be interpreted as follows: 
Pick a given initial background rate (point on the lower 
axis). Proceed vertically to the curve corresponding to 
the hot spot rate of interest. The time that curve indicates 
(shown on the vertical axis) is the amount of time re- 
quired for that hot spot rate to drive the network 
throughput from the initial rate to the asymptote associ- 
ated with the hot spot rate (dotted line). 

Figure 3 shows the same information, but in this 
case for a 256-port network with 4-way switches. 
Figure 4 shows the predicted results for a large (1024 
port) network; this was not simulated. As can be seen, the 
onset time is very short. 

As can be seen, our predicted results match the sim- 
ulated results reasonably well except for two situations: 
very high r in the 256-port network; and very low r in the
64-port network. At high r in the 256-port network we
are pushing the actual maximum capacity of the network 
and would expect all the approximations we are using to 
break down. At low r in the 64-port network, there is a 
breakdown in our assumptions about total arrival and de- 
parture rates: Onset is a more diffuse process with more 
time for complex internal feedback effects. However, in 
this case our estimates are on the optimistic side. 


7.0 Recovery Time 

Figure 5 shows the throughput of a 64-way network as a 
hot spot of 16% is applied and then removed within a 
total background rate of 0.4. As can be seen, the recovery 
time is substantially longer than the onset time. A ra- 
tionale for this follows. 

Intuitively, many sources “cooperate” to saturate
the tree; but only one sink (the hot one) operates to elim-
inate saturation. The time to recover “normal” traffic
flow should be related to the time to remove all red mes- 
sages from the switch after the sources stop generating 
them. If the hot output port runs at the maximum rate 
(unity), this time is simply equal to the steady-state count 
of red messages N during tree saturation, derived previ- 
ously. This is longer than the time to saturation, since the 
onset time T is (N - D(r))/(Rp - 1). One might imagine
that after a time 2 mix_{S-1} N/K, for K-way switches, all the
switches closest to the sources would be clear; so after
that time, 1 - 1/K of the traffic would resume its normal
rate. Then after an additional time equal to 2 mix_{S-1} N/K,
1 - 1/K^2 of the traffic will resume its normal rate; etc.
The slow rise of throughput shown in Figure 5 tends to 
indicate that something like this is occurring. However, 
we have not yet compared this to simulation results, and 
consider it unlikely to be correct, for the following reason: 
Until all the red packets are gone, they should tend to delay
uniformly distributed messages that happen to be directed
to the hot sink; and by filling queues, this will
affect other messages. This may cause continued con- 
gestion even after all the red messages have left the net- 
work. 
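
The comparison between the drain-time lower bound N and the onset time T can be illustrated numerically. The Python sketch below uses the parameters of Figure 5 (64 ports, 2-way switches, queue size 4, h = 16%, r = 0.4). It assumes the reconstructed Section 4 formulas N = (K + q) sum_i K^i (Hp + 1)/(Hp + K^i) and T = (N - D(r))/(Rp - 1); the function names are illustrative, and only the crude lower bound is computed, not the staged recovery conjectured in the text.

```python
def num_stages(p: int, K: int) -> int:
    """Number of switch stages in a p-port network of K-way switches."""
    stages, ports = 0, 1
    while ports < p:
        ports *= K
        stages += 1
    return stages


def red_count(h: float, p: int, K: int, q: int) -> float:
    """Steady-state red message count N during tree saturation
    (assumed reconstruction of the Section 4.2 sum)."""
    H = h / (1.0 - h)
    S = num_stages(p, K)
    return (K + q) * sum(
        K ** i * (H * p + 1.0) / (H * p + K ** i) for i in range(S)
    )


def onset_time(r: float, h: float, p: int, K: int, q: int) -> float:
    """Onset time T = (N - D(r)) / (Rp - 1); requires Rp > 1."""
    Rp = r * h * p + r * (1.0 - h)
    if Rp <= 1.0:
        raise ValueError("hot spot too weak to cause tree saturation")
    delay = num_stages(p, K) * (1.0 + r * (1.0 - 1.0 / K) / (2.0 * (1.0 - r)))
    return (red_count(h, p, K, q) - delay) / (Rp - 1.0)


def recovery_lower_bound(h: float, p: int, K: int, q: int) -> float:
    """Time to drain N red messages through the hot port at unit rate."""
    return red_count(h, p, K, q)


if __name__ == "__main__":
    print(onset_time(0.4, 0.16, 64, 2, 4))
    print(recovery_lower_bound(0.16, 64, 2, 4))
```

With these parameters the drain-time bound exceeds the onset estimate, in line with the asymmetry visible in Figure 5.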


8.0 Discussion 


The time required to reach tree saturation is distressingly 
short. What points of leverage can be used to improve the 
situation? 

Increasing the switch size (K) actually makes the 
situation worse: Even when, as we have done, the units 
are the depth of the network—giving larger switches a 
large advantage—smaller switches saturate more slowly. 

Increasing the queue size also helps. But the queue 
size is a linear factor in the total saturation time, so very 
large queues are necessary to make a substantial differ- 
ence. E.g., to get saturation time up to the range of 100 
switch traversals, queues on the order of 40 elements are 
needed. This is unreasonable with present technology. 

The addition of redundant paths will also help, but 
only because the total queue storage available rises with 
the number of paths. So this decrease is also linear. 

One thing that certainly can help is over-design, in 
the sense of using the network only at traffic rates less
than the rates where the expected hot spot activity will 
cause tree saturation. This adds significantly to the ex- 
pense of the network. Furthermore, at present there is 
little experience available to define exactly what degree 
of non-uniformity to expect in general. 

As discussed in [6], “combining” of identical mes-
sages within the switch nodes themselves can eliminate
the problem completely. However, combining only works
when the hot spot is caused by references to identical
entities at the sinks (e.g., identical memory locations).
When the hot spot occurs because many sources are ac-
cessing many different entities that happen to occupy the
same sink, combining cannot help. What may help in that 
case are techniques to ensure that non-identical references 
are scattered uniformly among the destinations, such as 
the combination of interleaving and randomization used 
by RP3 [5]. How well this will work in practice is not yet 
known; if it does, simultaneously using both this tech-
nique and “combining” (also present in RP3) may solve
the problem, at least for shared memory systems. It is not 
obvious at this time how to solve the problem for systems 
based on other computational models. 

Global control over routing can avoid the problem 
completely, of course, as discussed in [6]. 

To summarize: 


1. Very little perturbation, over a very short time, can 
drastically reduce network throughput. 


2. Recovery after the perturbation takes much longer 
than the onset of the problem.

This leads to the conclusion that multistage networks 
with distributed routing are unstable under non-uniform 
traffic loads, in the sense that they tend to “fall into” tree 
saturation easily. For large networks in particular, e.g., 
networks of size 512 or greater, if stringent measures are 
not taken to maintain a uniform traffic pattern, swift on-
set and slow recovery make it very probable that at least
partial tree saturation will always be present; and thus
large multistage networks may not perform anywhere near
as well as has previously been predicted.

[Figure 2 plot. Vertical axis: Time to Saturation (unit = time to traverse switch)]

Figure 2. Time required to saturate a 64-port Omega network: The switches have
queue size 4, and 2 inputs and outputs. The solid curves are the predicted
values for h ranging from 0.125% to 16% by factors of 2. The dots connected
by dashed lines show simulation results. The dotted lines show the sustainable
throughput after tree saturation for the values of h used.


[Figure 3 plot. Vertical axis: Time to Saturation (unit = time to traverse switch)]

Figure 3. Time to saturate a 256-port Delta network: The switches have queue size 4;
other elements are the same as Figure 2.

[Figure 4 plot. Vertical axis: Time to Saturation (unit = time to traverse switch)]

Figure 4. Time to saturate a 1024-port Omega network: Information and are the same
as Figure 2.


[Figure 5 plot. Horizontal axis: Time]


Figure 5. Onset and Recovery from a Hot Spot: Hot spot percentage, throughput, and 
delay as a function of time. The network is the same as Figure 2’s, and the hot 
spot is 16%. The arrows indicate the time for onset and recovery. 

Abstract


Concurrent requests to a shared variable by many processors 
on a shared memory machine can create contention that will 
be serious enough to stall large machines. This idea has been 
formalized in the “hot spot” traffic model [PfNo85], where a
fixed fraction of memory requests is for a single shared vari-
able. “Combining,” in which several requests for the same
variable can be combined into a single request, has been sug-
gested as an effective method of alleviating this contention.
The NYU Ultracomputer [GGKM83] and the IBM RP3
[PBGH85] machine use “pairwise” combining, in which only
two requests for the same variable can be combined at a 
switch. We study the effectiveness of combining. In particu- 
lar, it turns out that pairwise combining cannot handle hot 
spots if the machine size is large enough. We suggest ways 
to overcome this weakness. 


1. Introduction 


The popularity of shared memory parallel computers, 
where processors and memory modules are interconnected 
through a multistage network, can be seen in several current 
projects, including the University of Illinois Cedar machine 
[GKLS83] [KDLS86], the NYU Ultracomputer [GGKM83]
[EGKM85], and the IBM RP3 machine [PBGH85]. Sharing 
the memory in a parallel computer suggests that there is a 
possibility of many processors requesting the same variable 
at the same time (concurrent requests). This can create 
congestion in a machine, and the congestion becomes more 
serious as the number of processors in the machine (machine 
size) increases. To reduce congestion, when several requests 
directed at the same shared variable meet at a switch, they 
can be combined into a single request, which is forwarded 
toward the shared memory. When the response from the 
memory returns, the switch satisfies all of the requests, one 
at a time. The idea of reducing the congestion in this way, 
known as “combining,” has been suggested as an effective
way of allowing concurrent requests to a common location; 
combining can be found in the Columbia CHoPP [SuBK77], 
the NYU Ultracomputer, and most recently the IBM RP3 
machine. (See [KrRS86] for a general discussion of what 
types of memory requests can be combined.) 

Pfister and Norton [PfNo85] suggested that the
effectiveness of combining could be studied with the “hot
spot” traffic model: a fixed fraction of the total memory 
traffic is concurrent requests to a single shared variable. Hot 
spots capture the effect of all of the processors continually 
accessing a common variable. Pfister and Norton argue that 
hot spots will seriously degrade the performance of any 
machine that lacks combining, and that this effect is quite 
general. They also discuss how well the hot spot model cap- 
tures reality. 


In this paper, we study the effectiveness of several 
different combining schemes. In particular, we will see that 
the pairwise combining scheme used in the NYU Ultracom- 
puter and in the IBM RP3 machine is not powerful enough 
to handle hot spots. We suggest ways to modify their 
designs in order to overcome this weakness. 


2. The Model 


There have been many studies on the performance of 
multistage interconnection networks for processor-memory 
connection (see [Sieg85] and the references therein). One 
common traffic model for these studies is: A stream of 
memory requests from each processor is an independent,
identically distributed random process; each processor’s
requests are uniformly distributed to all of the memory 
modules. This uniform traffic model does not capture the 
effect of traffic with requests to a single shared variable. To 
represent such traffic patterns, we use the hot spot model 
[PfNo85]: each request has a (finite) probability q of being
headed to the same shared variable. The hot spot model is 
nonuniform in the sense that the requests are not uniformly 
distributed onto the memory modules. There are two types 
of request streams: the noncombinables, which are uniformly 
distributed to the memory modules as in the (usual) uniform 
model, and the combinables, which are headed to the same 
shared variable (and hence the same memory module). 


We consider a buffered square banyan network [GoLi73] 
as the multistage network for interconnecting the processors 
and the memory modules. Square banyan networks include 
Omega networks [Lawr75] and Delta networks [Pate81]. 
(For details and general characteristics of multistage net-
works, see, for example, [Feng81], [KrSn82], [Sieg85].)


A network is composed of n stages of 2×2 (crossbar)
switches with FIFO queues (i.e. buffers) at each output port.
We assume that the network is packet-switched and synchro-
nous, so that packets can be sent only at times t_c, 2t_c, ...,
where t_c is the network cycle time. Without loss of general-
ity, we assume t_c = 1. We make the following further
assumptions:

• Each request is a single packet.

• Each queue can accept at each cycle up to two distinct
requests, one from each input port. If at some cycle a
queue has only one free location and two requests are
directed to it, the queue randomly accepts one of the
two (the other request remains on the queue of the pre-
vious stage).

• The enqueuing process of a request and the dequeuing
process are overlapped (i.e. while the request in front of
the queue, if there is one, is being removed, other
requests can be inserted onto the queue).

• The service time of a request in a queue is the same as
the cycle time. So, the delay of a request at a switch is
the number of requests ahead of it in the queue.

• Each processor has an infinite queue for requests. If a
request is blocked from entering the first stage it is
placed on the queue, and the processor continues issuing
requests.
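
The queue assumptions listed above can be sketched as a single-cycle update for one output queue. This Python sketch is illustrative only: it assumes the head request can always be forwarded (the next stage is never blocked) and that free locations are counted after the head has departed, which is one reading of the overlapped enqueue/dequeue assumption. The function name and signature are hypothetical.

```python
import random
from collections import deque


def queue_cycle(queue, arrivals, size, rng=random):
    """Advance one output queue of a 2x2 switch by one network cycle.

    queue    -- deque of waiting requests, head at the left
    arrivals -- at most two requests, one per input port
    size     -- queue size (number of requests the queue can store)
    Returns (forwarded, blocked): the request sent on this cycle (or
    None) and the arrivals that must stay in the previous stage.
    """
    if len(arrivals) > 2:
        raise ValueError("a 2x2 switch delivers at most two requests")
    # Service time equals the cycle time: the head leaves every cycle.
    forwarded = queue.popleft() if queue else None
    free = size - len(queue)
    candidates = list(arrivals)
    if len(candidates) > free:
        # Not enough room: choose at random which request is accepted.
        rng.shuffle(candidates)
    accepted, blocked = candidates[:free], candidates[free:]
    queue.extend(accepted)
    return forwarded, blocked


if __name__ == "__main__":
    q = deque(["a", "b", "c", "d"])
    print(queue_cycle(q, ["x", "y"], size=4), list(q))
```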


A square banyan network has a complete tree leading 
from the processors to each memory module (Figure 1). The 
tree that combinable requests traverse will be called the fan- 
in tree. Our main concern is with the average queuing delay 
in the fan-in tree. 


Combining works as follows: When several combinable 
requests meet at a switch they are combined into a single 
request, which is forwarded toward the shared memory. A 
record of this is kept at the wait buffer. When the response
from the memory returns, the switch satisfies all of the 
requests, one at a time (and the record is removed from the 
wait buffer). To concentrate our attention on queuing 
delays, we assume in Sections 2-6 that wait buffers have 
infinite size. Also, we will consider the delay of a request 
only from the processors to the memory modules, tem-
porarily ignoring the delay on the return trip. Section 7 con-
siders finite wait buffers, and their effect on the delay of a
request in both directions of the network. 


We will distinguish queue size and queue length. Queue 
size is the number of requests a queue can store at one time. 
We use infinite queue to mean that the queue size is infinite, 
and finite queue to mean that the queue size is finite. 
Queue length is the number of requests stored on a queue at 
some particular time. We will use equivalent definitions for 
wait buffer size and wait buffer length. 


We consider several different combining schemes. In 
each case, we will consider what happens both with finite 
queues and with infinite queues. Infinite queues provide a 
nice yardstick to compare the more practical finite queue 
schemes. For finite queues, unless otherwise specified, we 
will always consider queue size four. This is large enough so 
that for the traffic loads considered, the performance under 
uniform traffic is almost as good as with infinite queues. 


A network is stable if in steady state average delays in 
the network are uniformly bounded. This is an important 
property of a network. Under the uniform traffic model,
buffered multistage interconnection networks are generally
believed to be stable for “light” traffic (see
[KrSn83], [KrSW86]).


We assume that at each cycle each processor issues a
request with probability r, i.e. r is the rate of requests.
Each request has probability q of being a combinable
request. Let r_c be the rate of combinables (i.e. hot spot
requests), and r_n be the rate of noncombinables. Then

$$r_c = qr \quad \text{and} \quad r_n = (1 - q)r.$$

Let r^i be the rate of requests at stage i of the fan-in tree
(r^0 = r). Let r_c^i and r_n^i be the rates of combinable requests
and noncombinable requests, respectively, at stage i of
the fan-in tree (r_c^0 = r_c and r_n^0 = r_n).


3. No Combining 


In this section we will consider the performance of sys- 
tems without combining. It is obvious that the combinables 
will create congestion in the network. The question is, how 
much will this degrade performance? 


3.1. Infinite Queues 


Assume the queues have infinite size and there is no 
combining. Recall that the rate of requests at the first stage
is r_n = r(1 - q) and r_c = rq. Since there is no combining, the
rate of combinable requests keeps doubling at each stage
approaching the root of the fan-in tree. In particular, the
rate of requests at stage i will be

$$r^i = r_n + 2^i r_c.$$



For any finite value of r_c, after several stages, the requests
will be arriving at each queue at a greater rate than the
queue can forward them. Networks large enough to see this
effect will be unstable. For example, consider the case of
r = 0.25 and q = 0.01. Even with q so small, by the ninth
stage the arrival rate of the combinables alone will be
2^9 r_c = 1.28, so the queuing delay will be unbounded.
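
The doubling argument above is easy to check numerically. The Python sketch below assumes the stage rate reads r^i = r_n + 2^i r_c with r_c = qr and r_n = (1 - q)r, which is a reconstruction of the damaged formula; the function names are illustrative.

```python
def combinable_rate(r: float, q: float, i: int) -> float:
    """Rate of combinable (hot spot) requests at stage i without combining."""
    return 2 ** i * q * r


def request_rate(r: float, q: float, i: int) -> float:
    """Total request rate at stage i of the fan-in tree without combining."""
    return (1.0 - q) * r + combinable_rate(r, q, i)


def first_overloaded_stage(r: float, q: float) -> int:
    """Smallest stage index at which requests arrive faster than one per cycle."""
    if r <= 0.0 or q <= 0.0:
        raise ValueError("r and q must be positive")
    i = 0
    while request_rate(r, q, i) <= 1.0:
        i += 1
    return i


if __name__ == "__main__":
    print(combinable_rate(0.25, 0.01, 9))
    print(first_overloaded_stage(0.25, 0.01))
```

For r = 0.25 and q = 0.01 the combinable rate at stage 9 is 2^9 x 0.0025 = 1.28, matching the example in the text.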


In practice one expects short intensive periods of “hot
spot” contention. If there are not too many stages of the
fan-in tree in which the rate of requests is greater than one, 
the system may still provide acceptable performance. Only 
the combinable requests will suffer extraordinary delays, 
along with the relatively few noncombinable requests travers- 
ing the fan-in tree near its root. 


3.2. Finite Queues 


With finite queues the situation is worse. Pfister and
Norton [PfNo85] noticed a very interesting phenomenon they
call tree saturation. When the queue at the root of the fan-in
tree becomes full, the two queues feeding it can no longer
send requests to it. They too will become full and stop the
four queues feeding them from sending requests. Eventually
the entire fan-in tree will consist of full queues. All of the
queues at the same level of the fan-in tree can together
satisfy combinables only at the same rate as the root satisfies
them. In other words, at the ith level from the root, each
queue can satisfy combinables only at a rate 1/2^i as fast as
the root does. So, although each queue at this level has on
average only 1/2^i as many combinables as the root, with
respect to combinables the queues are not progressing any
faster. Thus, progress of the whole system is governed by
the service rate at the “hot spot”; noncombinables will suffer
delay proportional to the queue size on each stage of the
fan-in tree traversed.


Kumar and Pfister [KuPf86] have observed that a relatively
short period of hot spot contention will produce tree
saturation. Furthermore, after the processors stop issuing
hot spot requests, the network takes a long time to return to
normal.


4. Pairwise Combining 


Ideally, one would like to combine all of the combin- 
ables that reside concurrently on a queue. This, however, 
makes the combining process complicated, and also creates 
congestion at a wait buffer when the response returns from 
memory. To simplify the combining process and to avoid 
contention at the wait buffer, the NYU Ultracomputer and 
IBM’s RP3 machine support combining only a pair of 
requests at a switch. This section studies the effectiveness of 
such pairwise (or two-way) combining. 


4.1. Infinite Queues 


We did simulations to check the effectiveness of pair- 
wise combining with infinite queues. Our concern is whether 
congestion at the hot spot still occurs. (Recall that $r_c^i$ is the
traffic load of combinables from each input port of a switch
at stage $i$ of the fan-in tree.) In our experiments, $r_c^i$
increased rapidly until $r_c^i + r_n^i$ reached 1.0 (see Figure 2).
This shows that with pairwise combining and infinite queues 
large networks are unstable. 


The reason for congestion even with combining is that a 
combinable request does not always encounter another com- 
binable request to combine with. Whenever a combinable 
request does not combine, it will be added to the traffic of 
the combinables coming out of the queue. Thus, the rate of 
combinables will necessarily increase towards the root of the 
fan-in tree. It is conceivable that this rate approaches some 
limit less than $1 - r_n$, in which case the network would be
stable. However, our experiments show this simply does not 
happen: the rate of combinables increases without bound. 


4.2. Finite Queues 


It may seem a priori that finite queues will always pro- 
vide worse performance than infinite queues, since infinite 
queues have more storage capacity. However, this is not 
necessarily so. Suppose that at stage $i$ of the fan-in tree a queue
becomes full. Then the two queues at stage $i-1$ of the fan-in
tree feeding this queue will become blocked (at least for
requests destined to the full queue). This will increase the
chances of these two queues becoming full, thereby blocking
the queues at stage $i-2$ that feed them, and so on. Thus, if
the rate of requests is large enough to create congestion at
the root of the fan-in tree, the whole fan-in tree will tend to
become congested. The overall effect on a message traversing
the fan-in tree will be that its total delay will be fairly
large at every stage, which contrasts with infinite queues 
where the delay is large only near the root. This means that 
combinables will spend more time near the leaves of the fan- 
in tree, and therefore have more chance of combining near 
the leaves. This will reduce the traffic rate of combinables 
which in turn will improve the overall performance of the 
network. (Recall that with pairwise combining, if a combin- 
able traverses a stage without combining, it increases the 
rate of combinables for all later stages.) 


We performed simulations on networks of nine stages
with finite queues and pairwise combining. The queue size
was assumed to be four. For traffic of $r = 0.6$ and
$q = 0.1$, we observed tree saturation: the average waiting
time at each stage of the fan-in tree was approximately
equal to the queue size. Waiting times of requests at each
processor's queue seemed to increase without bound as the
number of network cycles simulated increased. Although we
did not observe tree saturation for lower traffic loads, we
expect that it would occur in larger machines. (See Figure
3.)


Since the probability of combining increases as a com- 
binable request stays in a queue longer, larger sized queues 
should help combining, which in turn can help avoid tree 
saturation. One might think that the tree saturation 
reported here conflicts with the results of Pfister and Norton 
[PfNo85], where pairwise combining was effective in handling 
hot spots with queue sizes of only four. Although there were 
some minor differences in our two models, which could 
account for the different results, the main difference was that 
they were simulating a network with only six stages. We 
believe that adding a few more stages to their network would 
produce tree saturation and make their network unstable. 
Minor changes in switch design cannot overcome the inherent 
weakness of pairwise combining, at least not without making 
the delays at each stage of the fan-in tree unacceptably long. 


5. Unbounded Combining 


Unbounded combining allows any number of combin- 
ables to be combined into a single request at a queue. 


Although the combining of the Columbia CHoPP is very
similar to unbounded combining, our study is not directly
applicable to CHoPP because of its "repetition filter memory",
which allows the combining of incoming requests with
requests already in the wait buffer.


We have done extensive simulations of networks with 
infinite queues and unbounded combining. The networks 
seem to be stable and provide reasonable delay irrespective 
of the machine size and the traffic load. The traffic of the 
combinables adds only slightly to the average queuing delay 
of the noncombinables alone. It seems that unbounded com- 
bining eliminates the contention on the fan-in tree because 
there can be at most only a single combinable request wait- 
ing in a queue at any given time. 


Simulations show that with unbounded combining, 
finite queues provide only slightly larger delay than do 
infinite queues. When compared to infinite queues, delays 
are just about the same at the first few stages and slightly 
larger at all the later stages. 


6. Bounded Combining 


We have so far considered two extreme combining 
schemes: unbounded combining and pairwise combining. 
Unbounded combining provides good performance, but seems 
to be expensive (even to approximate); pairwise combining 
suffers from tree saturation, but is relatively easy to imple- 
ment. We suggest a compromise scheme, bounded combin- 
ing, where more than two, but at most a predetermined con- 
stant number of, combinables can be combined into a single 
request at a queue; in k-way combining the bound is k. 
Bounded combining is easier to implement than unbounded 
combining; the hope is that it will provide approximately the 
same performance. The question is, how large does k have 
to be? 


In the experiments with unbounded combining, we 
observed that a combinable request coming out of the 
switches at the later stages represents on average only 
slightly more than two combinables. This suggests that 
three-way combining, i.e. at most three combinables can be 
combined into a single request at a queue, will be effective. 
Simulations show that three-way combining performs almost 
as well as unbounded combining for both finite and infinite 
queues (see Figure 4). This indicates that pairwise combin- 
ing may be slightly too restrictive with respect to the 
number of combinables it supports. 

7. Wait Buffers and Return Queues 


Up until now we have considered the delay of a request 
only from the processors to the memory modules. For the 
return trip, there must be two return queues exiting each 
switch passing responses from the memory modules towards 
the processors. The performance of a network will be sensi- 
tive to the size of these return queues. We have assumed 
that the size of wait buffers is infinite. This is unrealistic in 
practice. The wait buffer size is an important factor for 
good performance, because combining cannot take place if 
the wait buffer is full. 


A combining of k requests is represented as k-1 pairwise
combinings, i.e. it uses k-1 wait buffer locations. When
the response returns from memory, all k-1 locations are
immediately freed and the k response messages are placed on
the return queue.
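
The bookkeeping just described can be sketched as a simple counter. This is only an illustration under the stated rule (a combining of k requests occupies k-1 locations, all freed together when the response returns); the class and method names are hypothetical, and a real switch would add requests to a combining one at a time rather than all at once.

```python
class WaitBuffer:
    """Counts occupied wait buffer locations in one switch (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of wait buffer locations
        self.used = 0             # locations currently occupied

    def record_combining(self, k):
        """Try to record a combining of k requests; it needs k - 1 locations."""
        if k < 2:
            raise ValueError("a combining involves at least two requests")
        if self.used + (k - 1) > self.capacity:
            return False  # buffer too full: this combining cannot take place
        self.used += k - 1
        return True

    def response_returned(self, k):
        """Free the k - 1 locations; return how many responses are emitted."""
        if k < 2 or k - 1 > self.used:
            raise ValueError("no matching combining of this size is recorded")
        self.used -= k - 1
        return k  # k response messages go onto the return queue
```

With a capacity of six, for example, three three-way combinings fill the buffer, after which even a pairwise combining is refused until a response returns.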


This section considers the effect of wait buffer size and 
return queue size on queuing delay. Our main concern is to 
determine the proper size of wait buffers and return queues 
for three-way combining to obtain performance close to that 
of unbounded combining with infinite return queues and 
infinite wait buffers. 


7.1. Infinite Queues, Return Queues, and Wait Buffers


To get an idea of the appropriate size of wait buffers, 
we measured the average length of the buffers assuming 
infinite queues, return queues, and wait buffers. Although 
the unbounded and three-way combining schemes avoid 
congestion by inserting more combinables into the buffer at a 
time than pairwise combining, our simulations show that the 
average length of the buffers with pairwise combining is 
actually unbounded while it is quite moderate with three-way
combining (see Figure 5). The reason for this is that the
average length of the buffers is proportional to queuing
delays.


Suppose combining takes place in a switch at stage $i$ of
an n-stage network. Then, a record of the combining will
remain in the wait buffer until a response from the memory
arrives at the switch some time later. So, the average length
of the wait buffer is determined by the average number of
combinings and the average number of cycles until the
memory responds. Let $c_i$ be the average number of combinings
per cycle at the switch and $t_i$ be the average response
time from memory (to the switch) for a combinable request.
Then, the wait buffer is a queuing system with arrival rate $c_i$
and service rate $1/t_i$. The arrival rate $c_i$ is determined by
the traffic load and the position ($i$) of the switch in the network.
The service rate is determined by the queuing delays
at stage $i$ and later stages. Given fixed traffic load and fixed
network size, the average length of the wait buffer will be
unbounded if there is severe enough congestion at later
stages for $c_i > 1/t_i$.


Since the service rate $1/t_i$ is smaller for switches closer
to processors, one may worry about the average length of the
wait buffers at earlier stages. However, this is counterbalanced
to some extent by the fact that there is less contention
in the earlier stages so that fewer combinings take place.
Notice that the wait buffer lengths become unbounded as the
network size increases, for any fixed arrival rate $c_i$ at the
wait buffer, irrespective of the combining scheme. The wait
buffer size needs to grow with the network size.


7.2. Finite Queues, Return Queues, and Wait Buffers


To see the effect of small wait buffer sizes, we did simulations
with queues of size four, infinite return queues, and
wait buffers of size six. As can be seen in Figure 6, three-way
combining with "small" wait buffers performs as badly
as pairwise combining does. The reason is that the buffers at
the later stages are almost always nearly full, and three-way
combining effectively changes to pairwise combining.


To see the effect of small return queue sizes, we did
simulations with queues of size four and infinite wait buffers.
It turns out that, for three-way combining, return queues of
size four are not large enough to provide good performance.
This may seem surprising, since (forward) queues of size four
are sufficient, and the responses are just returning along the
same path that the original request traversed. The reason is
that on the return path combinables are returning in bursts,
since a combinable response can split into two or three
responses. Thus, each return queue in a switch is effectively
a queuing system with the same traffic intensity as the (forward)
queue in the same switch, but with fewer, larger-sized
packets. Such a system will provide worse performance
and require larger queue sizes (see [KrSW86]). With three-way
combining, return queues of size ten obtained approximately
the same performance as infinite return queues.


In our experiments for moderate traffic loads, return
queues of size ten and wait buffers of size fifteen seem to be
large enough to obtain performance close to that of
unbounded combining with infinite sized queues and wait
buffers (see Figure 7). Neither return queues of size eight
and wait buffers of size fifteen nor return queues of size ten
and wait buffers of size ten produced good performance.


8. Conclusion 


Shared memory machines have the potential of conges- 
tion due to concurrent requests to a shared variable. Since 
hot spot contention becomes more serious as the machine 
size grows, congestion can severely degrade the performance 
of "large" machines. To avoid potentially serious congestion,
pairwise combining was suggested in the NYU Ultracomputer
and IBM RP3 machine as an effective way of eliminating
congestion.


We studied the hot spot traffic model, where a fixed 
fraction of the total memory traffic is for a single shared 
variable. As observed by Pfister and Norton [PfNo85], large 
networks with finite queues and no combining suffer from 
tree saturation. With finite queues, even pairwise combining
has the potential of tree saturation creating unbounded delay
no matter how "light" the traffic load is, for large enough
machines. If hot spots are a real-life phenomenon, pairwise
combining as suggested for the NYU Ultracomputer and the 
IBM RP3 machine is too restrictive. Three-way combining 
resolves the congestion. It remains to be seen whether 
three-way combining can be realized efficiently in hardware. 


A combining network must be carefully balanced. 
There are many parameters: the network size, the bounded- 
ness of the combining, the queue size, the wait buffer size, 
the queue size on the return path, etc. It is not obvious how 
any particular choice of these parameters will behave. For 
example, we have seen that changing finite queues to infinite 
queues, which one might expect would improve performance, 
can actually degrade performance. 


One must be very careful in interpreting our results. 
We do not believe that processors are likely to concurrently 
access the same shared (synchronization) variable for 
extended periods of time. If hot spots are only transient, i.e.
if there are short, intensive periods of hot spot contention,
pairwise combining may very well combine enough to provide
acceptable performance. One might consider the
(steady state) hot spot model suggested by Pfister and Nor- 
ton to be a conservative worst case scenario. 


We have restricted our attention to square banyan networks
composed of 2×2 switches, and to messages of length
one. We believe our results generalize to other network
topologies, other switch sizes, and other message size distributions.
Any interconnection network will have to have a
tree, maybe implicitly, leading from every processor to any
given memory module. This will create the possibility of
tree saturation when there are hot spots, but also the opportunity
for combining. With k×k switches, it seems that k-way
combining is not enough; some fraction slightly higher
than that will be necessary. Longer messages seem to 
increase the amount of combining, but not enough to avoid 
tree saturation with only pairwise combining.


Figure 1. Fan-in Tree on a 3-stage Square Banyan Network
Figure 2. Traffic Load of the Combinables at Later Stages
Figure 3.
Figure 4. Delays with Bounded Combining


Abstract


This paper examines the issue of shared memory emulation for 
performance prediction of applications running on multiproces- 
sors. It emulates memory contentions of different target mul- 
tiprocessors on a given multiprocessor. This approach provides 
not only the average performance measures, but also the instan- 
taneous values of memory contentions which are essential to 
find bottlenecks of user’s application programs. 


An algorithm producing an exact emulation result is presented 
and its implementation trade-off is discussed. To solve these 
problems, a heuristic approximation approach is introduced. 
Experimental results on a uniprocessor system show that the
approach gives reasonable accuracy. In order to alleviate the
emulation overhead, parallel implementation of the algorithms is
investigated.


1. Introduction 
With the advent of commercial parallel processors we enter a
new era of proliferation of parallel computation models for
problem solving. One of the known characteristics of such
endeavors is the nonlinear performance behavior of parallel
programs: the speed-up ratio is not a linear function of
the number of processors. As it is practically and economically
unattractive to build a new parallel processor each time we
want to evaluate a parallel application on a new architecture
with a different number of processors, emulation techniques
are employed. Such techniques are likely to incur an overhead
ratio of 50,000 to 100,000 per multiprocessor instruction
step. Hence, the emulation process itself has to be executed in
parallel on a multiprocessor.


As we look closer at the nature of this emulation, we see
that multiprocessor systems share some of their resources, such
as memories, buses and I/O devices, to facilitate parallel access.
Sometimes an access to a shared resource has to be serialized
because the shared resource may be occupied by another access.
This situation is called contention and applies to all of
the shared resources. The more processors a multiprocessor
system has, the more accesses it requires per unit time, so the
probability of contention increases accordingly. In general,
due to the contention phenomenon mentioned above, the
performance of an application will not improve linearly with the
number of processors.


A practical approach to this issue is an emulation facility run- 
ning on a host multiprocessor and emulating bus and memory 
contention of another multiprocessor. When contrasted with 
the performance modeling methods, the emulation method 
provides, aside from better accuracy, instantaneous perfor- 
mance profiles. The performance modeling provides, in 
general, statistical average performance profiles which are in- 
sufficient to accurately indicate detailed user application perfor- 
mance bottlenecks. 


In this paper we explore the problem of emulating shared
memory multiprocessors on shared memory multiprocessors, and
we present and evaluate algorithms for achieving this
goal.


2. Basic Structures 


2.1. Memory Access Model 

First, we define the memory access mechanism of the target 
multiprocessor. The multiprocessors we want to emulate are 
shared memory multiprocessors. In this system, global 
memories and processors are connected by a shared bus. 
These memories are divided into several banks, called memory 
modules. Access requests are transmitted from each processor 
to the destination memory module through the shared bus. 
Then the read/write operation takes place, with results returned
from/to the memory module.


Because there is more than one processor and each
bus/memory can accept only one request in each cycle, an
access request may have to wait if the bus or the destination
memory is servicing another request.


2.2. The Emulation Problem 

Suppose we have a host multiprocessor system with P processors.
A user of this system may want to know how much performance
improvement the user's program would gain if
there were more processors in this multiprocessor system. The
purpose of the emulation facility is to provide such a virtual
multiprocessor system within the host multiprocessor system. More
specifically, the goal and the problem can be defined as follows:


The goal:
Performance prediction for applications running
on a target multiprocessor when the number of
processors is larger than that of the host multiprocessor.
We are interested in the instantaneous
values of the performance vector.


The problem:


1. Emulation of bus and memory contention for a
shared-memory, shared-bus multiprocessor.


2. Do step 1 using a multiprocessor with a fixed number
of processors.


If we consider N processes (N > P), each of which is to be
assigned to one of the N processors of the target multiprocessor
system, they cannot all be executed at once on the P-processor
multiprocessor system. Therefore we adopt a divide-composite
approach to this problem. The emulation algorithm proposed
here consists of the following three steps:


1. Divide the N processes $\mathrm{PROC}_1, \ldots, \mathrm{PROC}_N$, which are
to be assigned to the N processors of the emulated multiprocessor,
into K groups $Q_1, \ldots, Q_K$, each group
having P processes. That is:

$\mathrm{PROC}_1, \ldots, \mathrm{PROC}_P \to Q_1$
$\mathrm{PROC}_{P+1}, \ldots, \mathrm{PROC}_{2P} \to Q_2$
$\vdots$
$\mathrm{PROC}_{(K-1)P+1}, \ldots, \mathrm{PROC}_N \to Q_K$

2. Execute the P processes in each group on the P
processors for a certain fixed time span T. During
execution, construct the memory access profile consisting
of time stamp, process id and memory
module id (see Fig. A). The data collection
mechanism will be discussed in Section 3.


3. Compose the memory access profile for the multiprocessor
of N processors from the K individual profiles of
$Q_1, \ldots, Q_K$ (also see Fig. A). This is done by
eliminating bus/memory contentions among
groups. Details of this part will be explained in Section 4.


Steps 2 and 3 are iterated until all processes have terminated.
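
Step 1 above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, processes are identified by the integers 1..N, and, as an added assumption, the last group is allowed to be smaller when N is not a multiple of P.

```python
def divide_into_groups(n_processes, p_processors):
    """Split process ids 1..N into consecutive groups of at most P ids.

    Group k (0-based) holds PROC_{kP+1} .. PROC_{(k+1)P}, as in step 1.
    """
    if n_processes <= 0 or p_processors <= 0:
        raise ValueError("N and P must be positive")
    ids = list(range(1, n_processes + 1))
    return [ids[start:start + p_processors]
            for start in range(0, n_processes, p_processors)]


if __name__ == "__main__":
    groups = divide_into_groups(200, 100)
    print(len(groups), groups[0][0], groups[0][-1], groups[1][0], groups[1][-1])
```

Each group would then be run for one time span T on the host (step 2) before the K resulting profiles are composed (step 3).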


2.3. Related Works 

Several performance prediction methods based on statistical
models have been discussed elsewhere [1]-[3]. In these
methods, the distribution of memory requests is assumed to follow
a statistical distribution that is fixed over the whole period. Statistical
models are quite useful for general case analysis of multiprocessor
systems and when performance averages are sufficient.
However, application programmers may want to know the
performance of their own programs rather than of a generalized
program, as well as the instantaneous values of the performance
measures, in order to deal with performance bottlenecks.


The purpose of our research is to provide application
programmers with a way of predicting the performance of their
programs. The memory access distribution of such programs will
vary from time to time and from process to process, and these
irregularities will cause adaptability problems for statistical models.
As an alternative, we present an emulation mechanism for memory
contention. The novelty of this work lies in the availability of
instantaneous values of contention for a specific user application,
as well as in the use of a multiprocessor for emulating a
multiprocessor.


3. Data Collection Mechanism

Since the memory access profiles have to be constructed in
parallel with the execution of a user's application program, the
data collection procedure should be performed without affecting
the application program. Toward this end, a hardware sensor
that is transparent to the user's program is preferable to
its software counterpart. If memory access data is collected by a
software sensor, the overhead time for the sensor may disturb
the execution of the application program, and the emulation
result will differ from the exact solution.


4. The Composition Algorithm 

In this section we present two versions of the algorithm. The
first is simple and produces an exact emulation, but it requires
an infinite memory space. The second is a modified
version of the first algorithm which can be implemented within a
bounded memory space, but it provides only an approximate
solution of the problem.


4.1. The First Algorithm 

Of the bus cycle and the memory access cycle, the bus cycle
is the shorter. Accordingly, we will consider the bus
cycle as the unit of time; we call it a time slot. Intuitively, the
algorithm proceeds as follows (see also Fig. B as an example):


First, read the K profiles into memory. These profiles are
constructed during the execution phase and stored somewhere.


At time slot 1:


• Accesses of time 1 (in Fig. B, "a", "e" and "i") in
those profiles are considered as candidates.


• Select one ("a") from the candidates and have the
unselected accesses ("e" and "i") wait for the next
slot.


In general, at time slot t:


• Accesses which were not selected in the previous
stages and accesses which have time stamp t are
considered as candidates (for example, at time 3,
"i", "f", "b" and "j" are candidates).


• For each candidate, check whether it is ready to be
requested by the process. This check is done by
looking at the time slot in which the latest access of the
process was accepted.


• Also check whether the destination memory module
is ready to accept it by comparing the memory cycle
and the access interval.


• Select one of the candidates that pass the
above feasibility check and have the others wait for the
next time slot.


• Each candidate has a priority for selection. Priority is
based on the group to which it belongs and the
original access time stamp. Priority among groups is
dynamically changed to emulate the round robin
strategy of a multiprocessor system.


After time slot T (end of cycle):


• We have a composite profile of length T and a set of
accesses which are not yet selected, called
overflows. To proceed to the next time slot (T+1),
memory access profiles for the next cycle are
needed because some accesses in the next cycle
can be performed in the next time slot. So,
overflows are combined with the memory access
profiles of the next cycle and considered as the
input for the next cycle. In Fig. B, accesses "l" and "m"
are overflow accesses of the first cycle (t = 1, ..., T)
and should be considered as candidate accesses at
the top of the next cycle (t = T+1).


Continue the above procedure for the next cycle
(t = T+1, ..., 2T) and so on.
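
The procedure above can be sketched for a single cycle as follows. This is a simplified illustration under several assumptions that the text does not fix: a process is taken to be ready when its oldest pending access, shifted by the delay its previous access suffered, has come due; a memory module is ready when at least `memory_cycle` slots have passed since it last accepted an access; and round robin is emulated by giving first choice to the group after the one served last. All names are hypothetical.

```python
from collections import namedtuple

# One recorded access: original time stamp, group index, process id, module id.
Access = namedtuple("Access", ["stamp", "group", "process", "module"])


def compose_cycle(candidates, start, length, num_groups,
                  memory_cycle=2, state=None):
    """Compose `length` bus slots starting at `start`.

    `candidates` holds this cycle's accesses plus overflows carried over.
    Returns (composite, overflows, state).
    """
    if state is None:
        state = {"delay": {}, "module_last": {}, "first_group": 0}
    pending = sorted(candidates, key=lambda a: a.stamp)
    composite = []
    for t in range(start, start + length):
        feasible, seen = [], set()
        for a in pending:
            if a.stamp > t:
                break  # pending is sorted: later accesses are not yet due
            if a.process in seen:
                continue  # only a process's oldest pending access may issue
            seen.add(a.process)
            # Assumed process check: shift the stamp by the previous delay.
            process_ready = a.stamp + state["delay"].get(a.process, 0) <= t
            last = state["module_last"].get(a.module)
            module_ready = last is None or t - last >= memory_cycle
            if process_ready and module_ready:
                feasible.append(a)
        if not feasible:
            continue  # the bus stays idle in this slot
        first = state["first_group"]
        # Priority: rotating group order first, then original time stamp.
        chosen = min(feasible,
                     key=lambda a: ((a.group - first) % num_groups, a.stamp))
        pending.remove(chosen)
        composite.append((t, chosen))
        state["delay"][chosen.process] = t - chosen.stamp
        state["module_last"][chosen.module] = t
        state["first_group"] = (chosen.group + 1) % num_groups
    return composite, pending, state
```

The returned `pending` list plays the role of the overflows: it would be merged with the next cycle's profiles and passed back in, together with `state`, for the cycle starting at slot T+1.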


4.2. The Second Algorithm 

Although the algorithm described in the previous section is
straightforward and produces an exact emulation result, it must
be modified before it can be implemented on a
computer.


At the end of every cycle (t = T, 2T, ...), we have a set of
overflows. The amount of overflows can be estimated by the
difference between the number of access requests (Req) and the
capacity of the shared bus (Cap) when Req > Cap. The
overflows should be added to the requests for the next cycle. When
Req and Cap are constant over cycles and Req > Cap, we have
the following relation:


$$|\mathit{Overflow}_j| = |\mathit{Overflow}_{j-1}| + \mathit{Req} - \mathit{Cap}$$


where $|\mathit{Overflow}_j|$ denotes the amount of overflows after the j-th
cycle. Thus,


$$|\mathit{Overflow}_j| = j\,(\mathit{Req} - \mathit{Cap}).$$


Since the above relation shows that the amount of overflows grows
monotonically and without bound, the overflows will eventually
become too large to keep in memory.
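
The recurrence can be iterated directly to see the linear growth. The sketch below assumes no overflows before the first cycle; the clamp at zero for the case Req ≤ Cap is an addition not discussed in the text, and the function name is illustrative.

```python
def overflow_history(cycles, req, cap, initial=0):
    """Amount of overflows after each cycle for constant Req and Cap."""
    overflow = initial
    history = []
    for _ in range(cycles):
        # Overflow_j = Overflow_{j-1} + Req - Cap, never below zero.
        overflow = max(0, overflow + req - cap)
        history.append(overflow)
    return history


if __name__ == "__main__":
    print(overflow_history(4, req=120, cap=100))
```

With Req = 120 and Cap = 100, for instance, the amount after the j-th cycle is 20j, matching the closed form j(Req − Cap).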


To overcome this problem, we modified the algorithm so that it 
suspends reading the next profiles until the amount of overflows 
becomes smaller than the limit instead of reading them at the 
end of each cycle. 


The modified version of the algorithm may overlook access
requests because some of the access requests are not read into
memory. This will cause overestimation. However, the degree
of overestimation is considered to be fairly small for the following
reasons. First, overflow accesses have higher priorities
than accesses in the next profiles because of the first-in first-out
nature of the selection strategy. Second, the larger the set of
overflows becomes, the less likely it is that the next access cannot be
selected from the overflows. This means that if we have a moderate
amount of overflows, the composition procedure can produce
good approximation results without reading the next profiles.


5. Experimental Results 
In order to evaluate both versions of the algorithm, we have
carried out experiments. Since the multiprocessor
system on which the emulator will run is currently
under construction [4], we implemented the algorithms on a
VAX 11/780 and simulated the behavior of the emulation using
various simulated access profiles at different request rates
generated from a random distribution. The access rates of the
profiles are arranged to vary from time to time.


Experimental results for the case of P = 100 and N = 200 (i.e.,
emulating 200 processors on a multiprocessor system with 100
processors) are shown in Fig. C. In this figure, the results of the
first (exact) version and the modified (approximate) algorithm are
indicated by circles and boxes, respectively. In the approximate
version of the algorithm, the limit on overflows is set to 2T (T: time
span).


Fig. C shows the relation between the required bus access
frequency, which is the total of the frequencies of all groups, and
the resulting (composed) bus access frequency. In the ideal case,
the relation between them follows the dashed line in Fig. C.
In general, however, the actual relation is the dotted line below
the ideal line. The maximum difference in composed access
frequency between the first and the modified algorithm is about
1.5%. This shows that the modified (approximate) algorithm
gives a good approximation of the first (exact) algorithm.


6. Outline of Parallel Implementation 

In this section, we discuss the implementation of the emulation
algorithm on a multiprocessor. In Fig. D, the algorithm is
described in a pseudo programming language. It has a triply
nested loop over time slots, groups and candidate accesses in
each group. Within the inner double loop each candidate access
is examined as to whether the processor and memory are
ready for the access. If more than one access is feasible at a time
slot t, one access is selected based on the priority mentioned
earlier.


Notice that feasibility check procedures for candidate ac- 
cesses are mutually independent, allowing these procedures to 
be invoked simultaneously. The number of accesses to be 
checked (= the number of procedures invoked) becomes large 
and increases in proportion to the number of emulated proces- 
sors. However, if candidates are allocated to the processors in 
the descending order of their priorities, unnecessary check 
procedures can be suppressed. Simultaneous feasibility check 
and priority-based processor allocation will make the check pro- 
cedure work within a small number of iterations and also will 
make the algorithm work nearly in proportion to the length of 
time regardless of the number of emulated processors. 


The experimental program on a uniprocessor (VAX 11/780)
takes about 20 sec. to compose two memory access profiles of 1
ms each; that is, the ratio of emulation time to execution time is
20 sec. / (2 x 1 ms) = 10,000. From the above discussion, however,
we can expect a large performance gain when the algorithm is
implemented on a multiprocessor. Basically, search processes can
run concurrently if they belong to the same time slot, and, even
if they belong to different time slots, they can run in parallel
under the control of shared variables. If, for example, we imple-
ment the algorithm on a multiprocessor with 100 processors and
achieve a 50-fold speedup, then the emulation overhead
will be reduced to 10,000 / 50 = 200.
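
The overhead arithmetic above can be written out as a small C sketch. The function names are chosen here for illustration, and the 50-fold speedup is the paper's own hypothetical figure rather than a measured result.

```c
/* Emulation overhead: seconds of emulation per second of emulated
 * execution, for `profiles` access profiles of `profile_sec` each. */
double emulation_overhead(double emulation_sec, int profiles,
                          double profile_sec)
{
    return emulation_sec / (profiles * profile_sec);
}

/* Overhead that remains after an assumed parallel speedup. */
double overhead_after_speedup(double overhead, double speedup)
{
    return overhead / speedup;
}
```

With the figures quoted in the text, emulation_overhead(20.0, 2, 0.001) gives about 10,000, and dividing that by a speedup of 50 gives 200.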


7. Conclusion

In this paper, we have presented two versions of a shared
memory emulation algorithm for multiprocessor systems. The
first algorithm can emulate exactly but requires an infinite
amount of memory. Then, we modified this algorithm to
eliminate this implementation problem. The modified version re-
quires a finite amount of memory, and experimental results show
that the modified algorithm can be a good approximation of
the first algorithm. Also, the algorithm proposed here is suitable
for parallel processing.


/* composition for one cycle */
procedure composition
    if bufferspace is enough then read nextprofile;
    /* loop for time slots */
    for t := 1 to T do begin
        /* loop for groups */
        for all groups i do begin
            /* loop for candidate accesses */
            for all candidate access j do begin
                /* check process and memory */
                /* access intervals */
                feasiblecheck(access[i,j]);
            end
        end
        if more than one feasible accesses found
            then selectone based on priority;
        remove the selected access from the buffer;
        if bufferspace becomes enough
            /* perform suspended read */
            then read nextprofile;
    end
end composition;


Figure D: Description of the Composition Algorithm


Abstract: This paper describes a set of
experiments designed to measure the behavior of the
Butterfly Parallel Processor in the presence of memory
"hot spots". The experiments were motivated by a
paper by Pfister and Norton [3] that reported results
from simulation studies on multistage switching
networks for shared memory parallel processors. Their
results indicated that, for machines with a large
number of processors, very slight non-uniformities in
memory reference patterns can lead to severely
degraded performance for the entire machine, including
processors that avoid referencing the hot memories.
The results were explained in terms of a phenomenon
called "tree saturation" where traffic to the hot
memories backs up into the switch and interferes with
other switch traffic. The experiments reported here
show that those results do not generalize to the
Butterfly Parallel Processor. The access time for a
memory that contains a hot spot is degraded, but the
presence of the hot spot has little effect on the
performance of programs that avoid the hot memory.
Furthermore, tree saturation does not occur in the
Butterfly Switch.


INTRODUCTION 


This paper describes a set of experiments that 
measure the behavior of the Butterfly Parallel 
Processor [2] in the presence of memory “hot spots”. 
The experiments were motivated by a paper on memory 
hot spots by Pfister and Norton [3] that presented 
results of simulation studies of the switching network 
for RP3, a research parallel processor being developed
at IBM Yorktown Heights. 


The simulation results showed that non- 
uniformities in memory reference patterns, which make 
certain memories “hot”, can have a devastating effect 
on the performance of an entire machine, including 
processors that avoid referencing the hot memories. 
Pfister and Norton explained their results in terms of a 
phenomenon called "tree saturation", where traffic to
the hot memories backs up into the switch and
interferes with other traffic, including that to non-hot
memories. Their results indicated that for machines 
with a large number of processors (>=100) even slight 
non-uniformities in reference patterns can lead to 
tree saturation and severely degraded performance for 
the entire machine. 


Pfister and Norton claim generality for their 
results, stating that they apply to all multistage 
blocking networks. Furthermore, their paper claims 
that attempts to avoid the problem, such as providing 
multiple paths through the network, do not really help. 
Finally, the results are used to motivate the use of a 
second, combining switch in the RP3 architecture. 


The switching networks studied by Pfister and 
Norton were multistage shuffle exchange switches 
similar in topology to the switch used in the Butterfly 
Parallel Processor. However, there is one key
difference in switch operation: the switches studied
were "blocking", whereas the Butterfly Switch is not.
In a blocking switch, when contention for an output 
port of a switching element occurs, the path from the 
source to that switching element is held until the 
desired output port can be obtained. When contention 
for an output port occurs in a non-blocking switch, 
the message encountering the contention is rejected 
(to be retransmitted later) and switch resources 
associated with it (i.e., the path to the point of 
contention) are released. When the message is 
retransmitted, it again competes with other messages 
for switch resources. 


Thus, like the switches studied by Pfister and 
Norton, the Butterfly Switch is multistage. However, 
unlike them, it is non-blocking. Because the Butterfly 
Switch is non-blocking, the behavior of a program on a 
Butterfly system can be expected to be less severely 
affected by non-uniformities in memory reference 
patterns (caused either by the program itself or by 
other programs on the machine). 


Nonetheless, obvious questions to ask are: how 
does the Butterfly Parallel Processor perform in the 
presence of memory hot spots? Does it exhibit tree 
saturation? Does the architecture break down in large
configurations for programs whose memory reference
patterns exhibit moderate or even very slight non-
uniformities?


The experiments described below show that the 
results presented by Norton and Pfister do not 
generalize to the Butterfly Parallel Processor. The 
access time for a memory that contains a hot spot is 
degraded, but the effect of switch contention is very 
small, even when severe non-uniformities in memory 
reference patterns are present. The experiments 
indicate that tree saturation does not occur in the 
Butterfly Switch. 


THE BUTTERFLY PARALLEL PROCESSOR 


This section presents enough information about 
the Butterfly Parallel Processor to understand the 
experiments described in this paper. More information 
about the Butterfly machine can be found in [2]. 


The Butterfly Parallel Processor is composed of 
processors with memory and a multistage switch that 
interconnects the processors. A Butterfly system can 
be configured with from 1 to 256 processors. One 
processor and memory are located on a single board 
called a Processor Node. All Butterfly Processor Nodes 
are identical. Collectively, the memory of the 
Processor Nodes forms the shared memory of the 
machine. All memory is local to some Processor Node; 
however each processor can access any of the memory 
in the machine, using the Butterfly Switch to make 
remote references. From the point of view of an 
application program, the only difference between 
references to memory on its local Processor Node and 
memory on other Processor Nodes is that remote
references take a little longer to complete. (The
typical memory referencing instruction takes about 6 
microseconds when the data referenced is remote and 
about 2 microseconds when it is local.) The speeds of 
the processors, memories, and switch are balanced to 
permit the system to work efficiently in a wide range 
of configurations. 


Each Butterfly Processor Node contains a
Motorola MC68000 microprocessor (or a MC68020 with a
MC68881 floating point co-processor), at least 1 MByte
of main memory, a co-processor called the Processor
Node Controller, memory management hardware, an I/O
bus, and an interface to the Butterfly Switch. I/O
connections can be made to each Processor Node,
making I/O configuration very flexible.


The Butterfly machine supports a very efficient 
operation for transferring blocks of data from one 
Processor Node to another. The block transfer 
operation is implemented by Processor Node Controller 
microcode. Once initiated, a block transfer occurs at 
the full 32 MBit/second bandwidth of a path through 
the Butterfly Switch. 


THE EXPERIMENTS 


Two experiments were conducted to measure the 
performance of the Butterfly Parallel Processor in the 
presence of hot spots. The objective of the first 
experiment was to time execution of a typical program,
first in an environment without any hot spots, and
then in one where N processors were used to generate
a hot spot. A matrix multiplication benchmark program
[1] was chosen. The objective of the second
experiment was to determine the effect hot spots have 
on typical memory references by systematically 
measuring the behavior of the machine under non- 
uniform memory reference patterns. This was done by 
timing remote read, write, and block transfer 
operations for various memories, first in an 
environment without any hot spots, and then in an 
environment where N processors were used to generate 
a hot spot. 


Hot spots were generated in two different ways: 


1. Via read and write references. N processors 
were used to make a given memory hot by 
reading and writing the same location in that 
memory. This was accomplished by having 
each processor execute the tight loop: 


    for (i = 0; i < count; i++)
        *hotmemp = *hotmemp;


where hotmemp is a pointer (short *) to a 
location in the hot memory. 


2. Via block transfer. N processors were used 
to make a given memory hot by using the 
block transfer operation to copy data from 
that memory to their local memories. This 
was accomplished by having each processor
execute the tight loop:


    for (i = 0; i < count; i++)
        Do_bt (hotmemp, localp, numbytes);


where Do_bt initiates a block transfer that
moves numbytes bytes from the location
beginning at hotmemp in the hot memory to
the location beginning at localp in the
processor’s local memory.


The difference between these two methods is in the 
duration of the switch messages they generate. Simple 
read and write references use the switch in 2 
microsecond bursts. Each iteration of the loop 
generates 3 messages, 2 for the read and 1 for the 
write. Block transfers are broken into 256 byte 
packets, each of which uses the switch in 64 
microsecond bursts. Each iteration of the block 
transfer loop generates 2 messages for each packet, a 
short request message and a 64 microsecond response 
message. 
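
The message counts described above can be tallied with a few lines of C. This is a minimal sketch: the function names are invented here, and rounding a partial packet up to a whole packet is an assumption, since the text only discusses transfers that are multiples of 256 bytes. The last function simply checks that 256 bytes at the 32 MBit/second path bandwidth quoted earlier occupy the switch for 64 microseconds.

```c
enum { PACKET_BYTES = 256 };  /* block transfers are split into 256-byte packets */

/* Packets needed for one block transfer (partial packets rounded up: assumption). */
unsigned bt_packets(unsigned numbytes)
{
    return (numbytes + PACKET_BYTES - 1) / PACKET_BYTES;
}

/* Each packet costs 2 switch messages: a short request and a response. */
unsigned bt_messages(unsigned numbytes)
{
    return 2 * bt_packets(numbytes);
}

/* Each iteration of the read/write loop costs 3 messages:
 * 2 for the read and 1 for the write. */
unsigned long rw_messages(unsigned long count)
{
    return 3 * count;
}

/* Time in microseconds to move `bytes` over a path of `mbit_per_sec` MBit/second. */
double packet_burst_us(unsigned bytes, double mbit_per_sec)
{
    return bytes * 8.0 / mbit_per_sec;
}
```

For the 768 byte transfers used later in the experiments, bt_packets(768) is 3 and bt_messages(768) is 6.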


Although all Processor Nodes in a Butterfly 
system are functionally equivalent, there is a 
distinguished King Node that is special in two ways: it 
is the node to which the console terminal is connected; 
and it controls the machine while the operating system 
is being booted. Because a terminal handler and 
window manager run on the King Node, it appears 
about 8%-10% slower than the other nodes to 
application programs. To ensure that the 
measurements were not affected by the processing 
requirements of the terminal handler and window 
manager, the King Node was avoided in both 
experiments. 


The experiments were run on a 128 processor 
Butterfly system. When the experiments were run, 16 
processors had been temporarily removed to configure 
several smaller systems, leaving 112 processors in the 
system. Since the King Node was not used, 111 
processors were available for the experiments. The 
switch for this system has 4 columns (stages) of 4-input
4-output switching elements, and is configured
to contain 2 paths between each pair of Processor 
Nodes. 


EXPERIMENT #1: MATRIX MULTIPLICATION 


The matrix multiplication program was timed in a 
number of environments: 


1. Without any hot spots. 


2. With a hot spot generated by read and write 
references, using only cool memories for the 
matrices. That is, both the hot memory and 
the memories of processors used to generate 
the hot spot were avoided. 


3. With a hot spot generated by read and write 
references, using both the hot memory and 
the cool memories for the matrices.


4. With a hot spot generated by block transfers, 
using only cool memories for the matrices. 
As in (2) above, both the hot memory and the 
memories of processors used to generate the 
hot spot were avoided. 


5. With a hot spot generated by block transfers, 
using both the hot memory and the cool 
memories for the matrices. 


Data 


For runs involving a hot spot, 100 processors 
were used to generate the hot spot. This left 11 
processors with cool memories. 


Matrix Size = 192x192                               Time (seconds)
                                                 Number of processors

                                              1        2        4        8       11
No hot memory                             65.73    32.73    16.37     8.22  [illegible]

Hot memory - 100 processors doing
simple read/write references
  Avoid hot memory
  (11 cool memories)                      66.02    32.97    16.67     8.47     6.27
  Use hot memory
  (11 cool memories + 1 hot memory)       67.55    33.72    17.18     8.67    6.359

Hot memory - 100 processors doing
768 byte block transfers
  Avoid hot memory
  (11 cool memories)                      66.07    33.15    16.62     6.50     6.25
  Use hot memory
  (11 cool memories + 1 hot memory)       92.01    46.51   235.42    12.05     6.90


Table 1: Data from matrix multiplication benchmark program.


All runs used square matrices of size 192x192. 
This size was chosen because: 


1. The run time for the matrix multiplication is 
long enough to give statistically interesting 
results, and short enough to run a series of 
experiments. 


2. The matrix multiplication benchmark is 
written in a way that makes analysis of the
results simpler when the matrix dimensions 
are multiples of 6 (see below). 


The data obtained by timing the matrix
multiplication benchmark on successively larger
processor configurations for each set of experimental
conditions is shown in Table 1.


Discussion 


When the matrix multiplication program avoids the 
hot memory, the presence of the hot spot has 
negligible impact on the program's performance: there 
is less than 1% increase in execution time. When the 
program uses the hot memory, the impact depends 
upon the way the hot spot is generated. There is a 
small increase in run time when the hot spot is 
generated by read and write references (2.76% in the 
single processor case) and a substantial increase when 
the hot spot is generated by block transfers (40% in 
the single processor case). Since block transfer 
operations keep the memory busy longer than single
read and write references, this result is not surprising.
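
The percentages quoted in this paragraph follow directly from the single-processor run times. The helper below is a sketch written for this purpose (its name is not from the paper); it uses the single-processor baseline of 65.73 seconds that the later analysis quotes.

```c
/* Percentage increase of a measured run time over a baseline run time. */
double percent_increase(double baseline, double measured)
{
    return 100.0 * (measured - baseline) / baseline;
}
```

For example, percent_increase(65.73, 67.55) is about 2.77 (hot spot generated by reads and writes, hot memory used), percent_increase(65.73, 92.01) is about 40 (hot spot generated by block transfers, hot memory used), and the two single-processor runs that avoid the hot memory (66.02 and 66.07 seconds) both come out below 1.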


Switches for larger Butterfly machines are 
typically configured with alternate paths to make the 
machine resilient to failures in switching elements 
(which almost never occur) and to reduce contention 
within the switch. For example, as mentioned in the 
previous section, the switch for the 128 processor 
machine used in these experiments has one alternate 
path (for a total of two paths) between each pair of 
nodes. The data presented above was collected with 
the alternate switch paths enabled. Measurements
were also made to determine the sensitivity of the
timing data to alternate paths by repeating the 
experiment with the alternate paths disabled. 


Use of alternate paths within the switch makes a 
small difference. When the hot spot is generated by 
read and write references and the hot memory is used, 
the program runs about 1% slower when the alternate 
paths are disabled. When the hot spot is generated by
block transfers and the hot memory is used, the
program runs about 2 1/2% slower when the alternate
paths are disabled. 


The following is an analysis of the program's 
behavior when running on a single processor in the 
presence of a hot spot generated by block transfers. 
It shows that the increase in execution time is due 
almost entirely to the increase in time required to 
access data in the hot memory. 


The matrix multiplication program uses the 
block transfer operation to make local copies 
of matrix rows and columns before accessing 
the individual elements to multiply and add. 


To multiply matrices of size 192x192, 36864 
dot products must be computed. The program 
is written to compute dot products in groups 
of 36. This involves 12 block transfer 
operations to obtain 6 rows and 6 columns. 
Thus, 12 block transfers yield 36 results, each 
result requiring 1/3 block transfer. 

Therefore, the program performs 12288 block 
transfer operations.
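
The count of block transfers, and the delay estimate that the rest of this analysis builds from it, can be reproduced with the C sketch below. The function names and parameters are introduced here for illustration only. The fraction 1/12 and the two 768 byte transfer times (322.18 and 25885.81 microseconds) are the values the paper reports for this experiment.

```c
/* Block transfers needed to multiply two n x n matrices when dot products
 * are computed in groups of g*g, each group fetching g rows and g columns.
 * Assumes n is a multiple of g. */
long block_transfers(long n, long g)
{
    long dot_products = n * n;
    long groups = dot_products / (g * g);
    return groups * 2 * g;
}

/* Extra run time in seconds when a fraction of the transfers is served
 * by a hot memory instead of a cool one (times in microseconds). */
double extra_seconds(long transfers, double hot_fraction,
                     double hot_us, double cool_us)
{
    return transfers * hot_fraction * (hot_us - cool_us) / 1e6;
}
```

Here block_transfers(192, 6) is 12288, and extra_seconds(12288, 1.0 / 12.0, 25885.81, 322.18) is about 26.18 seconds, which is the figure the analysis compares with the measured increase.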


Twelve memories were used to hold the 
matrices, one of which was hot. Therefore, 
1/12 of the block transfers can be expected 
to be delayed due to the hot spot. The block 
transfer delay from a hot memory was 
measured separately by timing a 768 byte 
block transfer from a cool memory, and then 
timing it again when the memory was made hot 
by 100 processors doing block transfers from 
it: 


                        Time to block transfer
                        768 bytes (microseconds)
    No hot memory              322.18
    Hot memory               25885.81


Therefore, the additional time for the matrix
multiplication program to perform block
transfers from the hot memory should be
about:

    (1/12) * 12288 * (25885.81 - 322.18) microseconds = 26.18 seconds

The measured increase in the execution time
for the matrix multiplication program for a
single processor was

    92.01 - 65.73 = 26.28 seconds


Thus, the performance degradation resulting from the
hot memory is due almost entirely to contention at
that memory. The effect of switch contention on
program performance is negligible, even with severely
non-uniform memory reference patterns.


Note that communication (accessing remote
memory) accounts for about 6% of the execution time
of the matrix multiplication program (footnote 1). Our experience
with the Butterfly Parallel Processor is that
communication typically accounts for 4%-10% of the
execution time for an application. Because a relatively
small part of total program execution time is due to
communication, remote memory reference times must be
severely degraded before memory hot spots can have a
significant effect on overall program performance. The
purpose of the second experiment was to measure the
effect memory hot spots have on remote memory
references as opposed to overall program performance.

Footnote 1: 100% * (12288 blk xfers * 322.18 microsec/blk xfer) / (66.02 sec).


EXPERIMENT #2: REMOTE REFERENCES


The second experiment timed references made
from a given processor node to memory on every other
processor node. Four types of references were timed:


1. Single word (4 byte) read references;

       t = *p;

   where t is a variable in local memory and p
   is a pointer (int *) to the word to be read.


2. Single word (4 byte) write references;

       *p = t;

   where t is a variable in local memory and p
   is a pointer (int *) to the word to be
   written.


3. Block transfer of data from the remote
   memory;

       Do_bt (remotep, localp, numbytes)

   where remotep is a pointer to a block of data
   on a remote node to be copied, localp is a
   pointer to an area in local memory and
   numbytes is the number of bytes to be copied
   to local memory.


4. Block transfer of data to the remote memory;

       Do_bt (localp, remotep, numbytes)

   where localp is a pointer to a block of data
   in local memory to be copied, remotep is a
   pointer to an area on a remote node, and
   numbytes is the number of bytes to be copied
   from local to remote memory.


The measurements for a given reference type were
made by timing a tight loop that included the memory
reference:

    Start_timer;
    for (i = 0; i < loopcount; i++)
        Make_reference;
    Stop_timer;


In addition, the empty loop was timed to measure loop
overhead:

    Start_timer;
    for (i = 0; i < loopcount; i++) ;
    Stop_timer;


                         Reference times (microseconds)

                      read     write       bt-from               bt-to
                                           256       768       256       768
                                         bytes     bytes     bytes     bytes
No hot memory
  remote             15.41      7.87    111.38    317.17    112.08    339.26
Hot memory - 100 processors doing simple read/write references
  cool memory        16.70      8.75    112.35    316.28    113.94    348.19
  hot memory        701.93    306.80    473.99   1393.59    276.61    478.09
Hot memory - 100 processors doing 768 byte block transfers
  cool memory        15.95      9.02    112.88    315.97    113.26    335.85
  hot memory      17410.04    153.30   8178.84  25820.95    254.55    827.14

Table 2: Data from remote reference experiment.


Data 


Runs that involved hot spots used 100 processors 
to generate the hot spot. Therefore, in those runs 
there was 1 (remote) hot memory, 10 (remote) cool 
memories, 1 (local) cool memory, and 99 (remote) 
memories for processors generating the hot spot. 


The timing data in Table 2 shows average times 
for one iteration of the memory referencing loop for 
the various memory reference types under the 
conditions indicated. For the first set of data, which 
was collected without any hot spots, the "remote"
reference times were computed by averaging the loop
times measured for each of the 110 remote memories
and dividing by loopcount. Data from the hot memory
measurements was treated similarly. For example, the
"hot memory" reference times were computed by
dividing the measured times through the reference loop
by loopcount; and the "cool memory" reference times
were computed by averaging the loop times for the 10 
cool memories and dividing by loopcount. Loopcount 
for this data was 10000. The loop overheads for each 
of the conditions were measured as described above, 
and factored out of the data. That is, the times 
presented exclude the measured loop overheads. 
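
The data reduction described in this paragraph amounts to an average, a subtraction, and a division. The C sketch below shows one way to write it. The function name and argument layout are assumptions, as is the choice to subtract the total empty-loop time before dividing by loopcount; the paper says only that the loop overhead was "factored out".

```c
/* Average time per reference in microseconds.
 *   loop_us        total time of the reference loop for each memory measured
 *   nmem           number of memories averaged over (e.g. 110 remote memories)
 *   empty_loop_us  total time of the empty loop (loop overhead)
 *   loopcount      iterations per loop (10000 for the data in Table 2)
 */
double per_reference_us(const double *loop_us, int nmem,
                        double empty_loop_us, long loopcount)
{
    double sum = 0.0;
    for (int k = 0; k < nmem; k++)
        sum += loop_us[k];
    return (sum / nmem - empty_loop_us) / loopcount;
}
```

As a made-up example (not measured data), two loop times of 160000 and 170000 microseconds with an empty-loop time of 10000 microseconds and a loopcount of 10000 reduce to 15.5 microseconds per reference.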


Discussion 


When there is a hot memory, references to cool 
memory are slowed down slightly. This is probably due 
to contention within the switch; switch messages used 
to reference cool memory collide with the switch 
messages used to make the memory hot. 


When there is a hot memory, simple references to 
cool memory are slowed down about the same amount 
as block transfer references to cool memory. For 
example, remote reads from a cool memory when the
hot spot is generated by read and write references are
slowed by 1.29 microseconds (16.70 versus 15.41), and
256 byte block transfers from a cool remote memory
are slowed by 0.97 microseconds (112.35 versus 111.38).
This is not surprising since the slow down is due to 
the increased time for initiating successful message 
transmission through the switch, and the increase is 
independent of message size. 


References to the hot memory are substantially 
slower. For most types of references a memory made 
hot by block transfers is slower than one made hot by 
read and write references. The major exception is that
simple writes are slower when the memory is made hot
by read and write references than when it is made hot
by block transfers (306.80 versus 153.30). This is due
to the buffering strategy in the Processor Node switch
interface which, in effect, gives preference to simple 
writes: when the memory is hot due to read and write 
references, the write being timed must compete with 
the writes making the memory hot; whereas when the 
memory is hot due to block transfers, there are no 
other writes to compete with. 


CONCLUSIONS 


The principal conclusion to be drawn from these 
experiments is that the results reported by Pfister and 
Norton do not generalize to the Butterfly Parallel 
Processor. While memory contention has an important 
effect on program performance in a Butterfly system, 
switch contention does not. 


The matrix multiplication experiment showed that 
non-uniformities in memory reference patterns have 
very little effect on the behavior of a program that
avoids the hot memory. When the hot memory is
avoided, its presence has virtually no effect on a
program’s performance, even if the non-uniformities 
are large. 


If a program uses a hot memory, the performance 
degradation due to the hot memory depends on the 
extent to which the hot memory is used by the 
program. That is, the program is appreciably slowed 
only when it references the hot memory. Although the
memory reference experiment showed a slight slowdown
in references to the cool memories, the matrix
multiplication experiment showed that this slight
slowdown has negligible impact on overall program
performance. 


There is no evidence that the tree saturation 
phenomenon described by Pfister and Norton occurs in 
the Butterfly Switch. Severe non-uniformities can lead 
to a small increase in contention within the switch, but 
the saturation effect simply does not occur.


ABSTRACT


When a large number of processors try to access a 
common variable, referred to as hot-spot accesses in [6], 
not only can the resulting memory contention seriously 
degrade performance, but it may also cause tree satura- 
tion in the interconnection network which blocks both 
hot-spot and regular requests alike. It is shown in [6] 
that even if only a small percentage of all requests are to 
a hot-spot location, these requests can cause very serious 
performance problems, and networks that do the neces- 
sary combining of requests are suggested to keep the 
interconnection network and memory contention from 
becoming a bottleneck. 


Instead we propose a software combining tree con-
cept and show that it is effective in decreasing memory
contention and preventing tree saturation because it dis-
tributes hot-spot accesses over a software tree whose
nodes can be dispersed among many memory modules.
Thus it is an inexpensive alternative to expensive com-
bining networks.


1. INTRODUCTION 


A large, shared-memory multiprocessor system like 
Cedar [1], the Ultracomputer of NYU [2], or the RP3 of 
IBM [3], may contain hundreds or even thousands of pro- 
cessors and memory modules. Multistage interconnection 
networks such as the Omega network [4] or its variations 
[5] are usually employed to provide communication 
between these processors and memory modules. 


In these systems, any variable shared by these pro- 
cessors will create memory contention at some memory 
modules. Those shared variables could be locks for pro- 
cess synchronization [15], loop index variables for parallel 
loops [12], etc. Even though accesses to these shared 
variables (called hot-spot accesses in [3,6]) may account 
for a very small percentage of the total data accesses to 
the shared memory (typically less than 10% are observed
in most applications), this memory contention can create
a phenomenon called tree saturation [6], and can cause 
severe congestion in the interconnection network. It is 
shown [6,14] that tree saturation due to hot-spot conten- 
tion can seriously degrade the effective bandwidth of the 
shared memory system. 


Various schemes, like the combining networks used in
the IBM RP3 [3] and NYU Ultracomputer [2], or the
repetition filter memory in the Columbia CHoPP [7], have
been proposed to eliminate such memory contention.
The basic idea of these schemes is to incorporate some 
hardware in the interconnection networks to trap and 
combine data accesses when they are fanning in to the 
particular memory module that contains the shared vari- 
able. Because data accesses can be combined in the 
interconnection network, it is hoped that memory con- 
tention at that memory module can be eliminated. 


However, the hardware required for such schemes is 
extremely expensive. It is estimated [6] that the extra 
hardware increases the switch size and/or cost by a factor
between 6 and 32, and this is only for combining networks
consisting of 2×2 switches. With k×k switches (k
hardware also tends to add extra network delay which 
will penalize most of the regular data accesses that do 
not need these facilities, unless the combining network is 
built separately as in RP3 [6]. 


Furthermore, the effectiveness of the combining net- 
work depends very much on the extent to which such 
combining can be done. If such combining is restricted 
as described in [8], i.e., if the number of requests that can 
be combined is restricted to k in a k×k switch, then the
effectiveness of the combining network can be limited. 
Unless this combining is unrestricted, tree saturation can 
still occur even in a combining network [8]. 


In this paper, we study this problem from a
different perspective. We assume a shared memory mul-
tiprocessor system like Cedar [1] with a standard,
buffered Omega network providing interconnection [9],
and without expensive combining hardware. In addition
we use a hardware facility in the shared memory modules
to handle necessary indivisible synchronization opera-
tions for the shared variables [10]. Regular memory
accesses bypass this hardware without delay and, hence,
will not be penalized. Each memory module will handle
memory accesses, including those memory accesses to
shared variables, one at a time.


To eliminate memory contention due to the hot-spot 
variable, a software tree is used to do the combining. 
This idea is similar to the concept of a combining net- 
work, but it is implemented in software instead of 
hardware. We will show that this scheme can achieve 
quite satisfactory results as compared to more expensive 
hardware combining. 


2. HOT-SPOTS AND TREE SATURATION 


The phenomenon of how hot-spot accesses can cause 
tree saturation is briefly described here. For a more 
detailed analysis and discussion, please refer to [6]. 


Assume N is the number of processors in the sys-
tem, and there are also N memory modules in the shared
memory system. Each processor issues r requests to the
shared memory per network cycle (0 < r < 1). Among
those requests, a fraction h are hot-spot
requests. Thus, in each network cycle, there are Nrh
hot-spot requests and r(1-h) normal requests directed to
the "hot" memory module, for a total of Nrh + r(1-h) =
r(1 + h(N-1)). If each memory module can accept 1 request
per network cycle (i.e., the maximum rate), this total cannot
exceed 1, so the maximum network throughput per processor is


H=1/(1+h(N-1)) (1) 


and the total effective memory bandwidth for the shared 
memory system is 


B=N/(1+h(N-1)). (2) 


Fig. 1 shows B as a function of N for various h. 
This clearly shows that in a system with 1000 processors,
hot-spot traffic of only 1% can limit the total memory
bandwidth B to less than 10% of its maximum value.
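
Equations (1) and (2) are easy to evaluate directly. The following C sketch does so; the function names are chosen here for illustration and h is passed as a fraction (0.01 for 1%).

```c
/* Equation (1): maximum request rate per processor, H = 1/(1 + h(N-1)). */
double max_rate_per_processor(int n, double h)
{
    return 1.0 / (1.0 + h * (n - 1));
}

/* Equation (2): total effective memory bandwidth, B = N/(1 + h(N-1)). */
double total_bandwidth(int n, double h)
{
    return n * max_rate_per_processor(n, h);
}
```

For N = 1000 and h = 0.01 this gives B of roughly 91 requests per network cycle, against an upper bound of 1000 when there is no hot spot (h = 0).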


Notice that this discussion assumes that hot-spot 
requests can continue to be issued from a processor even 
if that processor still has an unsatisfied hot-spot request 
pending in the network. In many applications this is not 
true, because hot-spot requests are usually related to 
some kind of synchronization operation: A processor usu- 
ally has to wait for the outcome of the synchronization 
operation before it can issue another request to the syn- 
chronization variable. So, the issuing rate from a proces- 
sor is inherently limited. We will address these issues in 
more detail in later sections.


3. SOFTWARE COMBINING TREES 


To illustrate the principle of a software combining
tree, let us assume that we have a variable whose value is
N and that we want each processor to decrement this
variable so that when all processors are finished, the
value will be zero. This is a common way of making sure
all processors are finished with a given task before
proceeding with a new task, for example, and is one
cause of hot-spot accesses. Now suppose that instead of
one single variable, we build a tree of variables, assigning
each to a different memory module, as shown in Fig. 2.
If N = 1000 and assuming a fan-in of 10, we have 111
variables each with value 10. We partition the processors
into 100 groups of ten, with each group sharing one
of the variables at the bottom of the tree. When the last
processor in each group decrements its variable to zero, it
then decrements the value in the parent node. Thus, we
have increased the total number of accesses from 1000 to
1110, but instead of having one hot spot with 1000
accesses, we have 111 hot spots with only 10 accesses
each. It should be clear that this will result in a
significant improvement in throughput rate and
bandwidth, and the simulations we describe later verify
that even if we account for the increase in total accesses,
the improvement is still quite significant. It should also
be clear that a three-level tree with fan-in equal to 10 is
not necessarily optimal, but that the optimal point
depends on access times and on other factors.


Fig. 1. Asymptotically maximum total network bandwidth as a function of the
number of processors for various fractions of the network traffic aimed at a
single hot spot. (Vertical axis: asymptotic total memory bandwidth;
horizontal axis: number of processors and memories.)


Fig. 2. A tree with fan-in = 10 to the hot-spot location.
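The access counts quoted above can be reproduced with a small sequential simulation. This is only a sketch under simplifying assumptions: processors are visited one at a time (a real machine would need an indivisible decrement), N is an exact power of the fan-in, and the function name is hypothetical.

```python
def simulate_counter_tree(n_procs, fan_in):
    """Simulate n_procs processors decrementing a tree of counters.

    Returns (number of counters, total accesses,
             largest access count at any one counter, final root value).
    """
    if n_procs < fan_in:
        raise ValueError("need at least fan_in processors")

    # Widths of the tree levels, from the bottom level up to the root.
    widths = []
    width = n_procs
    while width > 1:
        if width % fan_in:
            raise ValueError("n_procs must be a power of fan_in")
        width //= fan_in
        widths.append(width)

    # Every counter starts at fan_in; accesses records how often it is touched.
    counters = [[fan_in] * w for w in widths]
    accesses = [[0] * w for w in widths]

    for proc in range(n_procs):
        index = proc
        for level in range(len(widths)):
            index //= fan_in                 # counter shared by this group
            counters[level][index] -= 1
            accesses[level][index] += 1
            if counters[level][index] > 0:
                break                        # not the last one: stop here
            # Last processor of the group continues to the parent node.

    n_nodes = sum(widths)
    total = sum(sum(level) for level in accesses)
    busiest = max(max(level) for level in accesses)
    return n_nodes, total, busiest, counters[-1][0]


if __name__ == "__main__":
    print(simulate_counter_tree(1000, 10))
```

With N = 1000 and a fan-in of 10, the simulation touches 111 counters a total of 1110 times, and no counter is accessed more than 10 times.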


Another basic operation that can be implemented 
with a software combining tree is busy-wait. Here it is
assumed that processors are waiting for a shared variable 
to change in some way. Presumably some other proces- 
sor will cause this change. We build a combining tree as 
before, this time assigning one processor to each node in 
the tree. Each processor monitors the state of its node 
by continually reading the node. When the processor 
monitoring the root node detects the change in its node, 
it in turn changes the state of its children’s nodes, and so 
on until all processors have detected the change and are 
able to proceed with the next task. 


This idea, in a sense, is not very different from a 
hardware combining tree built from 10×10 switches,
except that the combining buffer that would be inside 
each switch now resides in a shared memory module in a 
software combining tree. One distinct advantage for a 
software combining tree is that we can tune our perfor- 
mance by changing the fan-in of each node without 
incurring any hardware cost. 


3.1. Modeling of Software Combining Tree 


We will classify hot-spot accessing in two ways.
First, accesses will be limited or unlimited depending on
whether a given processor can have only one or more
than one hot-spot request outstanding. We let η denote
the number of outstanding hot-spot requests allowed per
processor. Second, the number of accesses will be fixed
or variable depending on whether the total number of
accesses is fixed, or whether the total number varies
depending on the number of conflicts or some other factor.
For example, assume we are adding a vector of numbers to
form a sum. Then each processor can have more than one
outstanding request to add an element to the shared sum,
but since we assume the addition is done indivisibly by
logic in the memory, the total number of requests
generated by all the processors is fixed. This case is
unlimited-fixed. A case like that described earlier where
processors are decrementing a counter to see who is the
last processor is limited-fixed. A third example is
illustrated by busy waiting where the processors may all be
waiting for one processor to complete some task. Each
processor continually reads the value of a shared variable
until the value changes, for example from zero to one.
Thus the number of requests to the hot spot depends on
how soon the variable gets reset, and this case is
limited-variable. Notice that a barrier synchronization [11] can
be implemented by a counter decrement (limited-fixed),
followed by a busy wait (limited-variable) triggered by
the final processor which decrements the counter.
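The barrier described above (a counter decrement followed by a busy wait) can be sketched with threads standing in for processors. This is an illustration only: it uses a single counter rather than a combining tree, a lock stands in for the indivisible memory operation, and the class and function names are hypothetical.

```python
import threading
import time


class CounterBarrier:
    """One-shot barrier: counter decrement, then busy-wait on a flag."""

    def __init__(self, n):
        self._count = n
        self._lock = threading.Lock()   # stands in for an indivisible decrement
        self._released = False

    def wait(self):
        # Phase 1 (limited-fixed): each participant decrements exactly once.
        with self._lock:
            self._count -= 1
            last = self._count == 0
        if last:
            # The final participant triggers the release.
            self._released = True
        else:
            # Phase 2 (limited-variable): poll until the flag changes.
            while not self._released:
                time.sleep(0)           # yield so other threads can run


def run_barrier(n):
    """Start n workers; return how many had arrived when each one passed."""
    barrier = CounterBarrier(n)
    arrived, seen = [], []

    def worker(i):
        arrived.append(i)
        barrier.wait()
        seen.append(len(arrived))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_barrier(8))
```

The number of polls in the busy-wait phase depends on how long the last participant takes to arrive, which is exactly why this phase is classified as variable.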


When we implement combining trees for hot-spot
accesses, it is important to minimize the possible memory
contention, and so it is preferable that all shared
variables in a software combining tree (i.e., the nodes of the
tree) reside in separate memory modules. The largest
combining tree we can construct for a hot spot is a tree
with minimum fan-in, i.e., a fan-in of 2. The total
number of nodes in a combining tree with N leaves is
N/2 + N/4 + ... + 2 + 1 = N-1. Hence, it is always
possible to spread those nodes across N separate memory
modules. Our simulations in this study assume all of the
nodes in a software tree to be in separate memory
modules.


Fig. 3. Lower bound of network bandwidth under
various hot-spot rates (η unlimited).


We also assume the following system configuration: 
(1) There are two identical, back-to-back, uni-directional
Omega networks: one is for traffic from processors to the
shared memory; the other is for traffic from memory
returning to the processors. Both networks are
packet-switching, pipelined networks.

(2) Each network consists of 2×2 switching elements with
an output buffer of finite size at each output port of a
switching element. The fan-in capability of each output
port is 2, i.e., it can accept two simultaneous requests
from its two input ports. One request is forwarded to
the next stage and the other is stored in the output
buffer. If the output buffer is full, no more requests are
accepted by the output port. In our simulations, we
assume the size of the output buffer to be 4.

(3) When a software combining tree is used, accesses to a 
shared variable are distributed over the nodes of a tree 
instead of a single shared variable, and additional 
memory accesses are needed to access these nodes in the 
tree. This occurs both in a counting operation where the 
last processor to decrement a node must also decrement 
the parent node, and in the busy-waiting case where each 
processor except at the leaves must propagate the state 
change to the node’s children. These extra accesses are 
accounted for in our simulations. 

(4) All requests are of the same length. In our simula- 
tions, we assume each request consists of only one packet. 
(5) The access time of a memory module is 1 network 
cycle, i.e., the time for a request to go through a switch- 
ing element when no conflict exists. 


Fig. 4. Upper bound of network bandwidth under
various hot-spot rates (η = 1).


3.2. Possible Overhead in a Software Combining 
Tree 


As mentioned earlier, constructing a software com- 
bining tree creates many shared variables. Therefore, 
more hot-spot traffic is created even though that traffic 
generates less memory contention. 


As before, let us assume that the hot-spot rate from
a processor is r × h, and the software combining tree has
a fan-in of k for each node. For fixed-type access
patterns, the fractional increase in hot-spot traffic will be


$$\sum_{l=1}^{\log_k N - 1} \frac{rh}{k^l} = rh\,\frac{1 - k/N}{k - 1}$$


When k = 2, the increased hot-spot traffic is
rh(1-2/N), which approximates the original hot-spot
traffic for large N. This means that the hot-spot traffic
cannot be more than doubled after all of the extra
hot-spot traffic is included. As we will see later in our
simulations, the decreased memory contention will more than
offset the increased hot-spot traffic if h is less than 30%.
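The closed form for the extra traffic can be checked against the level-by-level sum with exact rational arithmetic. The sketch below works per unit of the original hot-spot rate rh, assumes N is an exact power of k, and uses hypothetical function names.

```python
from fractions import Fraction


def extra_traffic_sum(n, k):
    """Sum of 1/k^l for l = 1 .. log_k(N) - 1, per unit of r*h."""
    levels, size = 0, 1
    while size < n:
        size *= k
        levels += 1
    if size != n:
        raise ValueError("n must be a power of k")
    return sum(Fraction(1, k ** l) for l in range(1, levels))


def extra_traffic_closed_form(n, k):
    """Closed form (1 - k/N) / (k - 1), per unit of r*h."""
    return (1 - Fraction(k, n)) / (k - 1)


if __name__ == "__main__":
    for n, k in ((1000, 10), (256, 2), (256, 4), (256, 16)):
        s = extra_traffic_sum(n, k)
        c = extra_traffic_closed_form(n, k)
        print(f"N = {n:4d}, k = {k:2d}: sum = {s}, closed form = {c}")
```

For N = 1000 and k = 10 the extra traffic is 11/100 of the original, which matches the 110 additional accesses in the counter example of Section 3.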


For variable access patterns, the additional accesses
caused by the combining tree are difficult to quantify
because the number of accesses is not fixed to begin with.
In practice, since busy-waiting is often the cause of
variable access patterns (with η = 1), and the number of
accesses for a busy-wait operation depends on how
quickly the state change is propagated to the children in
the tree, the total number of accesses could even be less
than that required by a single shared variable because
the state change can be propagated more quickly by the
combining tree than by N accesses to a single shared
variable.


Fig. 5. Average delay versus bandwidth for a size 256 network.
(h varies from 0 to 32%)


4. BOUNDS ON BANDWIDTH 


4.1. Unlimited Hot-Spot Requests per Processor. 


In a packet-switching Omega network, with finite 
buffers in each switching element and with hot-spot rate 
h = 0, we still cannot achieve 100% memory bandwidth 
because of conflicts in the network [9]. These conflicts 
are also possible if a crossbar switch is used. If we 
assume R to be the maximum request rate reaching a 
memory module when no hot-spot exists, then in a steady 
state, R is also the maximum request rate allowed for a 
processor. Therefore, we can consider R to be an abso- 
lute upper bound on the bandwidth per processor. 


The value of R depends on the network buffer size,
the length of a request, the network switch size, etc.
[13]. However, as h increases, the request rate to the hot
memory module, i.e., r(1-h) + rhk, will increase from R
to 1. Tree saturation will occur when the request rate to
the hot memory module approaches 1, and the maximum
processor request rate r will decrease. Hence, we have


R ≤ r(1-h) + rhk ≤ 1.


By rearranging the above equation, we have the following:


R/(1+h(k-1)) ≤ r ≤ 1/(1+h(k-1)).


1/(1+h(k-1)) is equal to 1 when h is 0. Since the
absolute upper bound is R (R < 1) as discussed before,
we can have a tighter upper bound by using R, i.e.,


R/(1+h(k-1)) ≤ r ≤ R. (3)


Notice that Eq. (3) also shows a lower bound for the
maximum processor request rate r when a software
combining tree is used with a fan-in of k and η is unlimited,
i.e., even when η is unlimited, the maximum bandwidth
cannot be worse than NR/(1+h(k-1)).


We obtained R from simulations, and in Fig. 3 we 
plot lower bounds for various system sizes with h varying 
from 0% to 32%. Notice that those curves are in a very 
narrow range, i.e., the lower bound in Eq. (3) seems to be 
tight at least for systems up to size 1024. The top dotted 
line in Fig. 3 shows R, the maximum bandwidth we can 
get when there are no hot-spot requests. 


The degradation factor in Eq. (3) is 1+h(k-1). This 
degradation factor is independent of the system size and 
reaches a minimum when k = 2. Given unlimited
hot-spot requests, i.e., η > 1, the optimal software combining
tree for maximum memory bandwidth has a minimum
fan-in of 2.
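The bounds of Eq. (3) are easy to tabulate. The following is a minimal sketch with hypothetical function names; the value R = 0.63 is taken from the h = 0 simulation result reported in Section 5 and is used here purely as an example input.

```python
def degradation_factor(h, k):
    """Degradation factor 1 + h(k - 1) from Eq. (3)."""
    return 1.0 + h * (k - 1)


def rate_bounds(r_max, h, k):
    """Lower and upper bounds on the maximum request rate r, Eq. (3)."""
    return r_max / degradation_factor(h, k), r_max


def min_total_bandwidth(n, r_max, h, k):
    """Lower bound N R / (1 + h(k - 1)) on the total bandwidth."""
    return n * r_max / degradation_factor(h, k)


if __name__ == "__main__":
    R = 0.63  # example value: maximum rate with no hot spot (Section 5)
    for k in (2, 4, 16):
        low, high = rate_bounds(R, 0.16, k)
        print(f"k = {k:2d}: {low:.3f} <= r <= {high:.3f}")
```

Because the degradation factor grows linearly with k, the lower bound is largest for k = 2, in line with the conclusion drawn in the text.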


4.2. Single Hot-Spot Request per Processor 


If the hot-spot request rate is limited (η = 1), then
there cannot be more than N hot-spot requests in the 
system at any time. For systems with instruction look- 
ahead or with data prefetching capability, regular 
requests still may be issued while a hot-spot request is 
pending. However, this case is not different from that of 
unlimited hot spot requests with a very small h: when h 
is very small, it is unlikely that there will be more than 
one hot-spot request pending at any time. 


Hence, when η = 1, we will only consider the case
where no additional requests, hot or regular, are issued
by the processor when there is a pending hot-spot
request. Thus, the bandwidth depends on the delay of
the hot-spot requests. The request rate from the processor
is further restricted by any increased delay. If a
software combining tree is used to eliminate the memory
contention caused by the hot-spot requests, the only limiting
factor for the memory bandwidth will be η, i.e., the
inherent nature of the hot spot that prohibits further
processor requests.


During a long period of time T, there will be rT
requests generated from a processor, among which rhT 
requests are hot-spot requests. The processor will be 
barred from issuing any request for a total period of 
rhTC, where C is the average round-trip delay for a 
hot-spot request. The processor can issue a request only 
for a total period of T-rhTC. Within that period, 
r(1-h)T regular requests are issued. Hence, the real 
issuing rate for regular requests is r(1-h)T/(T-rhTC). 
This rate cannot be greater than 1, i.e.,


r(1-h)T/(T-rhTC) ≤ 1.


This equation can be rearranged to obtain an upper
bound for r:


r ≤ 1/(1-h+hC) (4)

As expected, the maximum rate of r is greatly
dependent on the hot-spot delay C. This bound gets
tighter as the hot-spot rate h gets larger. When h = 1,
the equality in Eq. (4) will hold. Fig. 4 shows this upper
bound for various hot-spot rates h with a minimum
hot-spot delay of C = 2 log_2 N. For N = 1000 and h = 8%,
the upper bound will be around 40% of the total
bandwidth. Notice that the upper bound in Eq. (4) is
valid even for a hardware combining network because it is
a bound imposed by the inherent nature of the hot-spot
request (i.e., η = 1).
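The figure of about 40% quoted for N = 1000 and h = 8% follows directly from Eq. (4) with the minimum delay C = 2 log_2 N. A minimal sketch, with hypothetical function names:

```python
import math


def min_hot_spot_delay(n):
    """Minimum round-trip delay C = 2 log2(N) used for Fig. 4."""
    return 2.0 * math.log2(n)


def limited_rate_upper_bound(h, c):
    """Eq. (4): r <= 1 / (1 - h + hC)."""
    return 1.0 / (1.0 - h + h * c)


if __name__ == "__main__":
    n = 1000
    c = min_hot_spot_delay(n)
    for h in (0.01, 0.02, 0.04, 0.08, 0.16, 0.32):
        print(f"h = {h:.2f}: r <= {limited_rate_upper_bound(h, c):.3f}")
```

Any additional delay from contention or from traversing a combining tree increases C and therefore lowers this bound further.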


5. SIMULATION RESULTS 


To study the effectiveness of a software combining 
tree, we performed several simulations for N = 256, with 
h varying from 0% to 32%. Fig. 5 shows the delay and 
maximum bandwidth when neither a software combining 
tree nor a hardware combining network is used. Follow- 
ing each curve from left to right, each point represents a 
larger value of r. As shown in [6], as r increases,
bandwidth increases while delay stays relatively constant
up to a point of saturation. After the saturation point,
bandwidth ceases to increase while delay gets worse.
The curves clearly show that hot-spot traffic results in
low bandwidth and increased average network delay.
The maximum bandwidth of 0.63N is achieved when h = 0.


Fig. 6 represents fixed-type access patterns with
unlimited η, and shows the use of a software combining
tree to reduce hot-spot contention. The fan-ins for the
software combining trees are varied from k = 16 to 2.
The improvement is quite significant compared to the 
result in Fig. 5(a). According to Eq. (3), the minimum 
degradation factor for the bandwidth can be obtained 
when the software combining tree has the minimum fan- 
in. In Fig. 6 we can see that when k = 2, the degrada- 
tion is indeed the smallest. 


As presented in Section 3.2, the hot-spot traffic can
be nearly doubled by the extra hot-spot traffic created by
the software combining tree with the minimum fan-in k
= 2. In Fig. 6, h is indicated as the original hot-spot
request rate; the results shown there already include all
extra hot-spot traffic. This shows that with an original
hot-spot request rate of 16%, the degradation remains
small. The elimination of the hot-spot contention,
indeed, more than offsets the effect of the increased traffic.


Fig. 6. Average delay versus bandwidth for unlimited-fixed
access patterns. (N = 256, h varies from 0 to 32%)


Fig. 7. Average delay versus bandwidth for limited-fixed
access patterns. (N = 256, h varies from 0 to 32%)


Fig. 8 represents limited-variable access patterns,
wherein no additional requests are issued by a processor
while it has a hot-spot request pending, but the total
number of requests allowed over time is not fixed. The
upper bound on the bandwidth given in Eq. (4) will
depend on C, the average delay of the hot-spot requests.
The value of delay C includes the overhead from traversing
the software tree, busy waiting in the intermediate
nodes, and the possible memory contention. From these
figures, we can see that the optimal fan-in k for the
software tree is no longer k = 2, but rather around k
= 4. The increased fan-in k allows for fewer
levels of nodes in the tree, thus reducing the time
required for requests to traverse the tree.


Furthermore, when h increases, the upper bound in 
Eq. (4) becomes tighter. There is less traffic in the net- 
work due to the restriction that no more requests will be 
issued when a hot-spot request is pending. In this case, 
the turnaround time for a request can actually be 
improved as Fig. 9 shows. 


We also simulate some cases for fixed-type access 
patterns with η = 1 (Fig. 7). If we take into account the
fact that busy waiting is not required in this kind of 
access pattern, we can see that the results are quite simi- 
lar to those from our simulations of variable-type access 
patterns discussed above. In fact, the average hot-spot 
request delay, i.e., C in Eq. (4), is smaller in this case. 
Also, as shown in Eq. (4), we can expect an improved 
maximum rate r. 


The lower dotted lines in Fig. 5 through Fig. 9 are 
the average delay of a request through the network 
assuming the buffer size in each switching element is 
infinite. These values are calculated based on the analyt- 
ical model in [16]. 


6. DISCUSSION 


Our simulations show that the software combining 
tree effectively eliminates tree saturation caused by hot- 
spot contention. However, the main purpose of the 
software combining tree differs slightly from the original 
purpose of the hardware combining networks [2,6]. 


Hardware combining networks were originally pro- 
posed to speed up hot-spot requests by combining those 
requests in the interconnection network and in this way 
eliminate memory contention at the hot memory module. 
Because such memory contention creates the serious side 
effect of tree saturation that can adversely affect even 
regular requests [6], such requests must also be processed 
through the hardware combining network. Although it 
alleviates the problem of tree saturation, hardware com- 
bining can cause extra delay in processing regular 
requests. 


Fig. 8. Average delay versus bandwidth for limited-variable
access patterns. (N = 256, h varies from 0 to 32%)


Fig. 9(b). Average delay of hot-spot requests versus
bandwidth for limited-variable access patterns.
(N = 256, h varies from 0 to 32%)


Software combining trees seem to effectively relieve
regular requests from the side effect of tree saturation,
without the expense of hardware combining networks.
The beneficial side effect of this scheme is that the
service time of hot-spot requests decreases. Theoretically,
this improvement cannot be as good as that of a hardware
combining network with unrestricted combining capability:
in a software combining tree, a hot-spot request must
traverse the interconnection network log_k N times,
whereas in a hardware combining network the request
must traverse the network only once. However, in a real
implementation, unrestricted combining in a switch is
impossible due to the complexity of the switches in a
hardware combining network. This will inevitably
hamper the effectiveness of a combining network [8], and
also introduces increased delay due to the extra
hardware. It is difficult to determine the system size
at which the hardware combining network becomes the
optimal method of speeding up hot-spot requests. The
effect that the somewhat slower hot-spot requests will
have on total system performance, if the rate is very
small, also remains to be seen.


Abstract


The method of simultaneous iteration with shift 
is extended to extraction of m-eigenpairs of a 
general eigenvalue problem of large order n, 
n>>m, in a parallel processing environment. The 
algorithm combines the power method and the 
Jacobi technique and reduces to performing four 
basic operations: decomposing a banded matrix 
of order n into upper and lower triangular 
factors; forming a product of matrices that may 
be banded, rectangular or square; extracting all 
eigenpairs of m-order problems by a generalized
Jacobi technique; and forward/backward
substitution.

Parallel implementation of the algorithm is 
discussed in detail. The analysis accounts for 
computation and communication costs, and 
utilizes a parallel processing architecture of 
the ensemble type. Expressions for the
computational efficiency and speedup are defined
as a function of the problem and hardware 
parameters. Selected representative problems 
exhibit efficiencies ranging from 60% to 98%. 


1. Introduction 


Parallel processing is particularly attractive 
for large computationally intensive problems 
where hours of sequential computing might be 
reduced to minutes of parallel computing. 
Unlike vector computers such as the Cray or
Cyber 205, parallel computers of the ensemble
type, such as the Cosmic Cube[1] and the
CHIP[2], require the use of computational
strategies that permit decomposing the solution
of a problem into concurrently executable
computational tasks (or processes). Each task
is executed on one processor which is a member
of a network of communicating processors that
operate in parallel. Communication between
tasks is inevitable, since all tasks contribute
to the solution of a single problem. However,
such communications must be minimized in order
to achieve the full potential of these
machines. This offers the challenge of
restructuring existing algorithms and
discovering others so that they are best suited
for the parallel processing environment.


To this end, the present work addresses a
strategy for the parallel extraction of the
eigenpairs of the general eigenvalue problem:


$$A x = \lambda B x \qquad (1)$$


where A and B are Hermitian banded matrices of
large order n and half bandwidth b << n. As is
the case in many technologically important
problems, the interest here is in computing only
the set of eigenpairs $(\lambda_\ell, x_\ell)$, $\ell = 1, \dots, m$,
residing within a prescribed range of $\lambda$,
where m << n. For this class of problems, a
mixture of power methods and the Jacobi
technique appears to be most suitable, and is
therefore explored herein.


2. Power Methods 

When B is positive definite, the general
eigenvalue problem in (1) may be reduced to the
standard form:


$$\bar{A} x = \lambda x; \qquad \bar{A} = B^{-1} A \qquad (2)$$


Furthermore, if $Q(\bar{A})$ and $R(\bar{A})$ denote any
polynomials of $\bar{A}$, it can be shown[3] that
$(Q(\bar{A}) R(\bar{A})^{-1})^r$, r = integer, has the same
eigenvectors as $\bar{A}$, with eigenvalues
$(Q(\lambda_i)/R(\lambda_i))^r$, provided that $\det R(\bar{A}) \neq 0$. By
starting with any vector $u_0$, iterations of the
type


$$Q(\bar{A}) R(\bar{A})^{-1} u_{k-1} = v_k \qquad (3)$$

$$u_k = v_k / \max(v_k), \qquad k = 1, 2, \dots$$


lead to


$$\lim_{k \to \infty} u_k = x_p \qquad (4)$$


where $x_p$ is the pth eigenvector of (1) such that


$$|Q(\lambda_p)/R(\lambda_p)| = \max_i |Q(\lambda_i)/R(\lambda_i)| \qquad (5)$$


The algorithm implied by (3) is a power method
which enables one to extract the eigenpairs of
(1) selectively. For example, by taking


$$R(\bar{A}) = (\bar{A} - \lambda_0 I) \qquad (6)$$


where $\lambda_0$ is a prescribed shift, (3) and (5)
reduce to the method of inverse iteration with
shift:


$$Q(\bar{A}) = I; \qquad (\bar{A} - \lambda_0 I)\, v_k = u_{k-1}$$

$$u_k = v_k / \max(v_k), \qquad k = 1, 2, \dots \qquad (7)$$

and

$$|1/(\lambda_i - \lambda_0)|_{\max} = 1/|\lambda_p - \lambda_0| \qquad (8)$$


The iterations will converge to the eigenvector
with eigenvalue closest to the shift $\lambda_0$.
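Inverse iteration with shift can be illustrated on a small standard eigenvalue problem. The sketch below is not the paper's parallel implementation: it assumes B = I, uses a dense Gaussian elimination in place of a banded factorization, and estimates the eigenvalue with a Rayleigh quotient, which the text does not prescribe. All names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def inverse_iteration(a, shift, u0, iterations=50):
    """Iterate (A - shift*I) v_k = u_{k-1}, u_k = v_k / max(v_k)."""
    n = len(a)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    u = list(u0)
    for _ in range(iterations):
        v = solve(shifted, u)
        biggest = max(v, key=abs)        # entry of largest magnitude
        u = [x / biggest for x in v]
    # Rayleigh quotient u^T A u / u^T u as the eigenvalue estimate.
    au = [sum(a[i][j] * u[j] for j in range(n)) for i in range(n)]
    lam = sum(x * y for x, y in zip(u, au)) / sum(x * x for x in u)
    return lam, u


if __name__ == "__main__":
    A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
    lam, vec = inverse_iteration(A, shift=3.2, u0=[1.0, 1.0, 1.0])
    print(lam, vec)
```

The eigenvalues of this test matrix are 2 - √2, 2 and 2 + √2, so a shift of 3.2 selects the largest one. The starting vector must have a component along the wanted eigenvector; otherwise the iteration converges to a different eigenpair.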


It is known that when A and B are Hermitian
(i.e., $A = A^H$, $B = B^H$), all eigenvalues $\lambda_i$,
i = 1, 2, ..., n, are real; and if A and B are also
positive definite then $\lambda_i > 0$ for i = 1, 2, ..., n.
When A and B are Hermitian, if the labels of the
eigenvalues are such that $\lambda_1 < \lambda_2 < \dots < \lambda_n$,
then by Sylvester's law of inertia of quadratic
forms[3], it may be shown that the index p of
the converged eigenpair is such that:


$$p = n' \ \text{if}\ \lambda_p < \lambda_0; \qquad p = n' + 1 \ \text{if}\ \lambda_p > \lambda_0 \qquad (9)$$


where n' is the number of negative diagonals $d_i$
in the factorization:


$$(A - \lambda_0 B) = U^H \operatorname{diag}(d_i)\, U \qquad (10)$$


Equivalently, n' is the number of pure imaginary
diagonals of the upper triangular matrix U in
the Cholesky factorization of:


$$(A - \lambda_0 B) = U^H U \qquad (11)$$


The algorithm given by (6), (7) and (8) can also
be carried out with m iteration vectors, $V_k$,
simultaneously. In this case, instead of the
normalization in (7) one may orthonormalize the
set of m vectors, $V_k$, with respect to A and B.
This can be done either by Gram-Schmidt, Givens
or Householder orthogonalization; or by
replacing $V_k$ with an orthogonal set of m basis
vectors $U_k$ which span the same subspace, $S_k$, as
$V_k$. In the latter scheme, the orthogonality
property is derived from satisfaction of the
relation between the projected operators A and B
onto the $S_k$ subspace. This is accomplished by
the following recursive relationship:


$$(A - \lambda_0 B)\, V_k = B\, U_{k-1} \qquad (12)$$


and the transformation


$$U_k = V_k Q_k, \qquad k = 0, 1, 2, \dots, k_{\max} \qquad (13)$$


Initially, for k = 0, $V_0$ is any set of m
linearly independent vectors. For succeeding
steps, $Q_k$ is determined so as to satisfy the
eigenvalue problem of order m << n:


$$A_k Q_k = B_k Q_k \operatorname{diag}(\Omega_k) \qquad (14)$$


in which

$$A_k = V_k^H (A - \lambda_0 B) V_k, \qquad B_k = V_k^H B V_k$$


Normalization of $Q_k$ with respect to $B_k$, i.e.,
$Q_k^H B_k Q_k = I$, is enforced such that


$$U_k^H B\, U_k = I \qquad (15)$$


With Equations (12) to (15), the resulting
eigenpairs are assured to converge to:


$$\lim_{k \to \infty} U_k = [x_{1\ell}, x_{2\ell}, \dots, x_{m\ell}] \qquad (16)$$

$$\lim_{k \to \infty} \operatorname{diag}(\Omega_k) = \operatorname{diag}(\lambda_{1\ell} - \lambda_0,\ \lambda_{2\ell} - \lambda_0,\ \dots,\ \lambda_{m\ell} - \lambda_0) \qquad (17)$$


where the indices 1ℓ, 2ℓ, ..., mℓ are those of $\lambda_i$
of (1) which reside nearest the shift $\lambda_0$.


3. Generalized Jacobi Technique 


The generalized Jacobi technique[4] is most
suitable for obtaining all eigenpairs of the
reduced general eigenvalue problem in (14). In
the generalized Jacobi, both $A_k$ and $B_k$ are
diagonalized simultaneously by a sequence of
orthogonal transformations of the form:


$$A_{k,r+1} = P_r^T \cdots P_2^T P_1^T\, A_k\, P_1 P_2 \cdots P_r$$

$$B_{k,r+1} = P_r^T \cdots P_2^T P_1^T\, B_k\, P_1 P_2 \cdots P_r \qquad (18)$$


where $P_r$ is a rotation matrix:
1 
L ‘ 
‘ 1 e 2. % are edi 
1 . 
19 
B- : (19) 
i 1 
8, a: ee 
l 


Each transformation in (18) involves one pre-
and post-multiplication, and will simultaneously
reduce to zero the symmetric
off-diagonal terms (ij) and (ji) of $A_{k,r+1}$ and
$B_{k,r+1}$ if:


w rR) 
_3 _3 
> + (sign w,) (; 


(A.. B.. - B.. 
1 jj ij jj 


-w,/6 


where = 


> 


A Dear 
(20) 


> = (Ai, Bes - Byy AG) 


og Eg ay? 


A single transformation as defined above will
result in changing only the row and column pairs
of $A_{k,r}$ and $B_{k,r}$ corresponding to the ij pair of
Equation (19) in the manner listed in Table (1).
All remaining terms of $A_{k,r}$ and $B_{k,r}$ remain
unchanged, until subsequently operated upon by a
transformation involving their row or column
pair. In the "cyclic" generalized Jacobi
technique used here, no attempt is made to
determine which off-diagonal term is largest so

TABLE (1) 


Changes in the i,j rows and columns of $A_{k,r+1}$ and $B_{k,r+1}$ due to a
single transformation using $P_r$ of Equation (19)


T 
Element rr Ar Lm f. Beir LP 
2 

ii (i, ¢ Bie By 2854 Meese (Pan * Ba By * 2 8G Bear 

2 2 
ji (As; +4; Ai, + 2 a5 at (Bs +45; Be, +2 ass Bear 
ij = ji 0 0 
6 = 1,2,.-6,m : & éi e 6 ¢ j 


i6 = 6i (Aig + Bi 4 ieee (Bi « ¢ Bs Boke 


j8 = 6j (Aye + 5 Aree (Big + 415 Fiske 




as to reduce it to zero first. Instead, all
off-diagonal terms are operated upon in row
major order, sweep after sweep. As such,
diagonalizing the m x m matrices $A_k$ and $B_k$ of
(14) will require several sweeps, $J_{\max}$; each
sweep consisting of m(m-1)/2 orthogonal
transformations of the type in (19). As the
number of sweeps increases, both $A_{k,r+1}$ and
$B_{k,r+1}$ will approach their diagonal form, in
which case the eigenpairs are simply:


$$\operatorname{diag}(\Omega_k) = \operatorname{diag}(A_{k,r+1} / B_{k,r+1})$$

$$Q_k = P_1 P_2 \cdots P_r \operatorname{diag}\!\left(1/\sqrt{B_{k,r+1}}\right) \qquad (21)$$


4. Parallel Implementation of Simultaneous 
Iteration 


Preliminaries: 


The steps of the algorithm described in Section 
2 are summarized in Table (2). The algorithm 
consists of four essential operations: (i) 
decomposing a banded Hermitian matrix of large 
order n into upper and lower triangular factors; 
(ii) forming the product of matrices that may be 
banded, rectangular or square; (iii) extracting 
all eigenpairs of the m-order problem by a 
generalized Jacobi and (iv) solving for the 
unknowns of an algebraic set of equations by a
forward pass, and then by a backward pass. 


The objective of this section is to construct
parallel implementation strategies for the above
steps, and to derive expressions for the speedup
and efficiency of the individual steps and of
the complete algorithm. The speedup and
efficiency are the most common measures for
evaluating the performance of algorithms on
N identical processors operating in parallel.
Both measures are related. The speedup is the
ratio of the time required to execute a given
algorithm on a single-processor computer to the
time required to execute the same algorithm on a
parallel processing computer composed of N
processors of the same type. The efficiency is
defined as the speedup per processor.
Equivalently, the efficiency is the ratio of
time spent in computation to the total time
required for computation, interprocessor
communication and idling, if any.
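The two definitions of efficiency given above agree when the serial work divides evenly among the processors. The following minimal sketch, with hypothetical function names and made-up timings, illustrates this:

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-processor time to N-processor time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Speedup per processor."""
    return speedup(t_serial, t_parallel) / n_procs


def efficiency_from_breakdown(t_compute, t_comm, t_idle):
    """Computation time over total time, measured on one processor."""
    return t_compute / (t_compute + t_comm + t_idle)


if __name__ == "__main__":
    # Assumed example: 4 processors, each computing for 9 time units and
    # spending 0.5 units communicating and 0.5 units idle.
    n, t_compute, t_comm, t_idle = 4, 9.0, 0.5, 0.5
    t_serial = n * t_compute                 # same work on one processor
    t_parallel = t_compute + t_comm + t_idle
    print(efficiency(t_serial, t_parallel, n))
    print(efficiency_from_breakdown(t_compute, t_comm, t_idle))
```

Both expressions evaluate to the same number here only because the example assumes the serial time equals N times the per-processor computation time.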


TABLE (2)


Simultaneous Iteration Algorithm


1. Obtain the triangular factors of $(A - \lambda_0 B) = U^H U$.

2. Choose m independent vectors $V_0 = [v_1, v_2, \dots, v_m]_0$
and set $\operatorname{diag}(\Omega)_{-1} = 0$.

3. Construct the product $Z_{k-1} = (A - \lambda_0 B) V_k$, k = 0.

4. Compute $A_k = V_k^H Z_{k-1}$.

5. Construct the product $W_k = B V_k$.

6. Compute $B_k = V_k^H W_k$.

7. Extract m orthonormal eigenpairs of
$A_k Q_k = B_k Q_k \operatorname{diag}(\Omega_k)$.

8. Compute $U_k = V_k Q_k$.

9. Convergence check: if $|(\Omega_k - \Omega_{k-1})/\Omega_k| \le \epsilon$,
set $U_k = [x_1, x_2, \dots, x_m]_k$ and
$\operatorname{diag}(\Omega_k + \lambda_0) = \operatorname{diag}(\lambda_1, \dots, \lambda_m)$, and exit;
otherwise continue.

10. Form $Z_k = B U_k$.

11. With a forward pass compute $Y_{k+1}$ from $U^H Y_{k+1} = Z_k$.

12. With a backward pass compute $V_{k+1}$ from $U V_{k+1} = Y_{k+1}$.

13. Set k <-- k+1, and go to 4.

To quantify the complete processing costs, two
architectural parameters, q and μ, are used.
The computational cost is characterized by q,
which is defined as the time required to perform
a floating point operation. One floating point
operation is assumed, on the average, to consist
of a multiplication followed by an addition.
The cost of interprocessor communication is
characterized by μ, which is defined as the
ratio of the time required to transmit one
floating point number from one processor to its
nearest neighbor, to the time required to
perform a floating point operation. In the
current state-of-the-art computers, μ is of
order unity. The cost of propagating data
through a network of N processors depends on the
transmission speed μq, communication topology
among processors, the number of data items to be
transmitted, and whether all processors should
receive the same data or each should receive
different data. In computers such as the Cosmic
Cube, the time required for transmitting ν
floating point numbers to N processors is[5]:


$$T = \nu\, (\log_2 N)\, \mu q \qquad (22)$$

if the same ν data items are sent to all N
processors, and

$$T = (\nu + N - 2)\, \mu q \qquad (23)$$

if different processors are sent different sets
of ν data items each.
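The two transmission-time expressions can be compared directly. The sketch below simply evaluates Eqs. (22) and (23); the function names and the sample parameter values are illustrative assumptions.

```python
import math


def broadcast_time(v, n_procs, mu, q):
    """Eq. (22): the same v items are sent to all N processors."""
    return v * math.log2(n_procs) * mu * q


def scatter_time(v, n_procs, mu, q):
    """Eq. (23): each processor receives its own set of v items."""
    return (v + n_procs - 2) * mu * q


if __name__ == "__main__":
    mu, q = 1.0, 1.0   # assumed: communication ratio of order unity, unit flop time
    for n in (8, 64):
        print(n, broadcast_time(10, n, mu, q), scatter_time(10, n, mu, q))
```

Both times scale linearly with μq, so the comparison between them depends only on ν and N.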


In the following, operations on matrices of
large order are reduced to operations on smaller
partitions of size s. It will be assumed that
1 < s << b. Further, by proper padding with zeros:


Mod(n,s) = 0
Mod(b,N) = 0
Mod(b,s) = 0
and
b = 2Ns << n (24)
n_s = n/s
b_s = b/s = 2N
n_N = 1 + Int((n_s - 1)/N)

i. Parallel Matrix Decomposition 


Two strategies were given in Reference [6] for
the parallel implementation of decomposition of
a banded Hermitian matrix of large order, n,
into its triangular upper and lower factors
$U, U^H$. These strategies use two different
interpretations of the Cholesky factorization.
Their performance is nearly equal. With the
definitions and notations outlined above, the
time required to factor an n x n matrix with
half bandwidth b using N processors is [6]:
‘5 1), B 1 
T, = qnsN [2(1 + 5) + © (10 + 4)] (25) 
where the coefficient of $\mu$ is due to
interprocessor transmission of data. Expression
(25) as well as others to be developed in
subsequent sections are required for estimating
the efficiency of the entire algorithm.


ii. Matrix Product 

Three schemes for performing the matrix product
C = AB are given in this section, depending on
whether A and B are banded, square or
rectangular.


a. Inner Product of Two Rectangular Matrices 


Let A and B be n x m matrices, each of which is
partitioned into blocks $A_i$, $B_i$; $i = 1,\ldots,n_s$
with block size s x m. Assuming that each of
the N processors can store $(2ms + m^2)$ floating
point numbers, the inner product may be
expressed by:

$$C = \sum_{\ell=1}^{N} C_\ell \qquad (26)$$

$$C_\ell = \sum_{k=1}^{n_N} A^H_{(k-1)N+\ell}\, B_{(k-1)N+\ell} \qquad (27)$$

or

$$C_\ell = \sum_{k=1}^{n_N} A^H_{(\ell-1)n_N+k}\, B_{(\ell-1)n_N+k} \qquad (28)$$

The choice between placing the operands $A_i$ and
$B_i$ in the N processors according to
$i = (k-1)N + \ell$ of Equation (27), or
$i = (\ell-1)n_N + k$ of Equation (28) may be
dictated by the step previous to the matrix
multiplication. With N processors, N values of
$C_\ell$, $\ell = 1,\ldots,N$ can be computed in parallel. Each
processor performs $n_N$ serial floating point
operations on its operands. No interprocessor
communication is needed, and all $C_\ell$ are
completed in a time of order $n_N m^2 q (1+s)$. The
cascade sum of all $C_\ell$ is performed next
according to Equation (26) in a time
proportional to $m^2 q (1+\mu) \log_2 N$, during which the
final results are collected in the host
computer. The total cost of the inner matrix
product is, therefore:

$$T_{ii}^{a} = m^2 q\,[\,n_N (1+s) + (1+\mu) \log_2 N\,] \qquad (29)$$
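
The two block-to-processor assignments of Equations (27) and (28) differ only in the index map. The sketch below lists which blocks i each processor holds under either rule and evaluates the cost estimate of Equation (29). The function names and the small sample sizes are assumptions for illustration only.

```python
import math


def cyclic_blocks(l, n_proc, n_n):
    """Eq. (27): processor l holds blocks i = (k-1)N + l, k = 1..n_N."""
    return [(k - 1) * n_proc + l for k in range(1, n_n + 1)]


def contiguous_blocks(l, n_proc, n_n):
    """Eq. (28): processor l holds blocks i = (l-1)n_N + k, k = 1..n_N."""
    return [(l - 1) * n_n + k for k in range(1, n_n + 1)]


def inner_product_time(m, s, n_n, n_proc, mu, q):
    """Eq. (29): local block products plus the cascade sum."""
    return m * m * q * (n_n * (1 + s) + (1 + mu) * math.log2(n_proc))


if __name__ == "__main__":
    # Hypothetical case: N = 4 processors, n_N = 3 blocks per processor.
    for l in range(1, 5):
        print(l, cyclic_blocks(l, 4, 3), contiguous_blocks(l, 4, 3))
```

Either rule assigns every block exactly once; the cyclic rule interleaves blocks across processors, while the contiguous rule gives each processor a consecutive run.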


b. Product of a Rectangular Matrix by Square 
Matrix 


Let A and B denote n x m and m x m matrices,
respectively. Again, assuming that each
processor can hold $(2ms + m^2)$ floating point
numbers, and that only A is partitioned into
s x m blocks $A_i$; $i = 1,\ldots,n_s$, the product may be
written as:

$$C_i = A_i B \qquad (30)$$

where the resulting product C is conformably
partitioned as A into $C_i$ blocks. Since $n_s > N$,
the parallel implementation of Equation (30) can
be accomplished by assigning each processor the
task of computing $n_N$ of the $C_i$ partitions
according to either

$$C_\ell = A_{(k-1)N+\ell}\, B \quad \text{or} \quad C_\ell = A_{(\ell-1)n_N+k}\, B; \qquad k = 1,2,\ldots,n_N \qquad (31)$$

To begin with, the same B and the appropriate
partition of A as defined by Equation (31) must
reside in each processor. Then all processors
perform their products simultaneously in a time
of order $(n_N m^2 s q)$. The final results, now
available in N partitions in the
N processors, may be transmitted to the host
computer in a time proportional to $\mu q (sm + N - 2)$.
Therefore, the total cost is:

$$T_{ii}^{b} = q\,[\,n_N m^2 s + \mu (sm + N - 2)\,] \qquad (32)$$


c. Product of Banded by Rectangular Matrix 


Here A is an n x n banded matrix with half
bandwidth b, partitioned into $n_s \times n_s$ square
blocks according to Equation (24). On the other
hand, B is rectangular n x m, partitioned
row-wise into $n_s$ rectangular blocks, each
s x m. For parallel implementation, the product
C = AB may be written as:

$$C_i = \sum_{\ell=1}^{N} \sum_{k=1}^{4} A_{ij} B_j \qquad (33)$$

$$i = 1,\ldots,n_s; \qquad j = i - 2N + 4(\ell-1) + k; \qquad \ell = 1,\ldots,N$$

For a given row partition i, the appropriate
s x s partition $A_{ij}$ and s x m partition $B_j$
corresponding to k=1 must reside in the
appropriate processor $\ell = 1,\ldots,N$ according to the
indices above. The product and partial sum
defined by Equation (33) are then computed in
parallel using N processors. When this is
repeated sequentially for k=2, 3 and 4, one
obtains the results of the ith row partition of
C. Other row partitions are processed in
sequence for all $i = 1,\ldots,n_s$ and the final sum is
continually accumulated in one processor. The
completed results are obtained in time

$$T_{ii}^{c} = nmq\,(4s + \mu \log_2 N) \qquad (34)$$
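
A quick check of the index rule j = i - 2N + 4(l-1) + k helps to see how the band is shared: each of the N processors handles four consecutive column blocks, and together they cover the 4N blocks from i-2N+1 to i+2N exactly once. The sketch below is an illustration only; the function names are hypothetical, and column indices that fall outside 1..n_s near the matrix edges are assumed to correspond to zero blocks.

```python
import math


def band_columns(i, n_proc):
    """Map (processor l, step k) to the column block j used for row i."""
    return {
        (l, k): i - 2 * n_proc + 4 * (l - 1) + k
        for l in range(1, n_proc + 1)
        for k in range(1, 5)
    }


def band_product_time(n, m, s, n_proc, mu, q):
    """Eq. (34): cost estimate for the banded-by-rectangular product."""
    return n * m * q * (4 * s + mu * math.log2(n_proc))


if __name__ == "__main__":
    cols = band_columns(20, 4)          # hypothetical row i = 20, N = 4
    print(sorted(cols.values()))
```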
iii. Parallel Generalized Jacobi for Extracting the Eigenpairs


The fact that each of the (m/2)(m-1) transformations
of (18) and (19) systematically
annihilates the ij, ji terms of $A_{k,r+1}$ and
$B_{k,r+1}$, and thereby causes only their respective
ith and jth rows and columns to change, suggests
considerable concurrency in a parallel implementation
of the generalized Jacobi. By
selecting a set of m/2 non-overlapping pairs of
i,j indices, it is possible to perform m/2
parallel transformations on $A_{k,r}$ and m/2 similar
transformations on $B_{k,r}$. The process is
repeated (m-1) times, each time using a
different set of non-overlapping indices.
Completion of all (m/2)(m-1) transformations
constitutes a typical Jacobi sweep that may be
repeated until diagonalization is achieved.
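
One way to see that (m-1) sets of m/2 non-overlapping pairs cover every (i,j) combination exactly once is the standard round-robin ("circle") ordering. The sketch below uses that textbook ordering as a stand-in; it is not the exchange scheme of Appendix (A), and the function name is hypothetical.

```python
def round_robin_pairs(m):
    """Return m-1 sets, each holding m/2 disjoint index pairs (i, j).

    Index 1 stays fixed while the remaining indices rotate one position
    per set, so every pair i < j appears in exactly one set.
    """
    if m % 2 != 0:
        raise ValueError("m must be even")
    idx = list(range(1, m + 1))
    rounds = []
    for _ in range(m - 1):
        pairs = [tuple(sorted((idx[t], idx[m - 1 - t]))) for t in range(m // 2)]
        rounds.append(pairs)
        idx = [idx[0], idx[-1]] + idx[1:-1]   # rotate all but the first index
    return rounds


if __name__ == "__main__":
    for pairs in round_robin_pairs(8):
        print(pairs)
```

Each printed line is one set of m/2 transformations that could be applied concurrently; the m-1 lines together form one sweep.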


For example, the first set of m/2 transformations
may be $P_1(1,2), P_2(3,4), \ldots, P_{m/2}(m-1,m)$.
Their execution will result in:

$$A_{k,r+m/2} = P^H_{m/2}(m-1,m) \cdots P^H_2(3,4)\, P^H_1(1,2)\; A_{k,r}\; P_1(1,2)\, P_2(3,4) \cdots P_{m/2}(m-1,m)$$

$$B_{k,r+m/2} = P^H_{m/2}(m-1,m) \cdots P^H_2(3,4)\, P^H_1(1,2)\; B_{k,r}\; P_1(1,2)\, P_2(3,4) \cdots P_{m/2}(m-1,m) \qquad (35)$$
Suppose that N=m/2 processors are used to
perform Equation (35). Each processor may be
assigned one pair of the (1,2),(3,4),...
indices, for which it needs to store only the
corresponding row and column pairs of $A_{k,r}$ and
$B_{k,r}$, i.e., a total of 4m values. Considering
any two processors $\eta$ and $\xi$, the interaction
between any two transformations $P_\eta(i,j)$ and
$P_\xi(p,q)$ of the m/2 transformations in (35) is
visualized schematically by the dotted (i,j) and
dashed (p,q) rows and columns in (36):


[Schematic (36): an m x m array in which rows and columns i and j are drawn dotted and rows and columns p and q are drawn dashed, marking their ip, iq, jp and jq intersections.] (36)


As indicated by Table (1), the ii, jj, ij, and ji
terms of rows i and j residing in processor $\eta$
are affected only by transformation with
$P_\eta(i,j)$. However, all other $i\delta$ and $j\delta$ terms
($\delta = 1,\ldots,m$; $\delta \neq i \neq j$) are affected by column
changes stemming from the remaining [(m/2)-1]
transformations. For example, the ip, iq, jp,
jq terms are changed not only by $P_\eta(i,j)$, but
also by $P_\xi(p,q)$, thus becoming:

$$(A_{ip})_{k,r+m/2} = \tilde{A}_{ip} + \beta_{qp}\,\tilde{A}_{iq}$$

$$(A_{iq})_{k,r+m/2} = \tilde{A}_{iq} + \alpha_{pq}\,\tilde{A}_{ip}$$

$$(A_{jp})_{k,r+m/2} = \tilde{A}_{jp} + \beta_{qp}\,\tilde{A}_{jq} \qquad (37)$$

$$(A_{jq})_{k,r+m/2} = \tilde{A}_{jq} + \alpha_{pq}\,\tilde{A}_{jp}$$

where from Table (1),

$$\tilde{A}_{i\delta} = (A_{i\delta} + \beta_{ji} A_{j\delta})_{k,r}$$

$$\tilde{A}_{j\delta} = (A_{j\delta} + \alpha_{ij} A_{i\delta})_{k,r}$$

and $\delta$ assumes p or q.
However, because the m/2 indices do not overlap,
each of the [(m/2)-1] transformations affects
different terms of rows i and j.

Instead of computing the eigenvectors at the end
of the last sweep according to (21), data transmission
can be eliminated if the eigenvectors
are continuously updated at the same time that
$A_k$ and $B_k$ are transformed by the P's. Only the
simple scaling by $\mathrm{diag}(1/\sqrt{B_{ii}})_{k,r+1}$ is
deferred to the end. Similar to (35), an
intermediate matrix of eigenvectors $Q_k$ is
updated by m/2 transformations $P_1, \ldots, P_{m/2}$:

$$Q_{k,r+m/2} = Q_{k,r}\, P_1 P_2 \cdots P_{m/2} \qquad (38)$$


In a parallel implementation of Equation (38),
using m/2 processors each storing a row pair
$Q_{i\delta}$, $Q_{j\delta}$; $\delta = 1,\ldots,m$ with non-overlapping i and j
indices, each processor updates four terms
associated with each of the m/2 transformations.
For example, the following four
terms of the i,j row pair residing in any of the
m/2 processors will be changed by a typical
transformation $P_\xi(p,q)$:

$$(Q_{ip})_{k,r+m/2} = (Q_{ip} + \beta_{qp}\,Q_{iq})_{k,r}$$

$$(Q_{iq})_{k,r+m/2} = (Q_{iq} + \alpha_{pq}\,Q_{ip})_{k,r} \qquad (39)$$

$$(Q_{jp})_{k,r+m/2} = (Q_{jp} + \beta_{qp}\,Q_{jq})_{k,r}$$

$$(Q_{jq})_{k,r+m/2} = (Q_{jq} + \alpha_{pq}\,Q_{jp})_{k,r}$$
Again, since the m/2 indices do not overlap, the
transformations in (39) affect different terms
of rows i and j of Q. All processors can
perform Equation (39) simultaneously on
different rows. Thus, with N=m/2 processors,
each storing two rows of each of $A_k$, $B_k$ and $Q_k$
corresponding to a pair of non-overlapping
indices, the steps of the parallel generalized
Jacobi technique are summarized as follows:


1. Each of the m/2 processors computes one pair
of rotation coefficients $\alpha_{ij}$ and $\beta_{ji}$ according
to (20). This requires 12q computational
resources.


2. Each processor sends the two values of $\alpha_{ij}$,
$\beta_{ji}$ of Step (1) to all other [(m/2)-1]
processors in the network. In doing so, $(m/2)\mu q$
resources are consumed.


3. Each processor updates: (a) its ii and jj
values of the current rows of $A_k$ and $B_k$
according to the first two expressions of Table
(1) using its own $\alpha_{ij}$ and $\beta_{ji}$; (b) the
remaining terms of its current rows of $A_k$ and $B_k$
having column indices pq that match the rotation
coefficients supplied by other processors in
Step (2). Equations (37) are used for this
purpose; (c) all terms of the stored pair of
rows of $Q_k$ having column indices which match the
m/2 pairs of rotation coefficients. This is done
according to Equation (39).

This 3-part step requires (12+12m+2m)q
resources.


4. A new set of m/2 pairs of rows with unique
non-overlapping indices is acquired by the m/2
processors. This may be done by a systematic
row exchange according to the scheme described
in Appendix (A). According to this scheme,
corresponding rows of $A_k$ and $B_k$ are exchanged
among processors, while the $Q_k$ rows remain
stationary in their original processors. This
requires $2m\mu q$ resources.


5. One Jacobi sweep is completed when Steps (1)
to (4) are repeated (m-1) times, at which time
the current values of off-diagonals in the
current rows of $A_k$ and $B_k$ are tested for
smallness according to:


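$$\left(\frac{A_{i\delta}^2}{A_{ii} A_{\delta\delta}}\right)_k < 10^{-\epsilon}; \qquad \left(\frac{B_{i\delta}^2}{B_{ii} B_{\delta\delta}}\right)_k < 10^{-\epsilon}$$

$$\delta = 1,2,\ldots,m; \qquad \delta \neq i \neq j \qquad (40)$$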
A "pass" is reported to halt the sweeps when all 
off-diagonals are small enough, and when all
eigenvalues computed according to Equation (21) 
pass the convergence test 

- - —-€ 
tod. - ast yas « 10 

ii’; 


ii ii (41) 


A total of 12mq resources are required each time 
the above convergence tests are performed. 


6. Final collection of the converged eigenvalues
and vectors from all N=m/2 processors
requires $2.5\,m\mu q$ resources.


Assuming that on the average $J_{max}$ sweeps are
needed for convergence, computation of the m
eigenvalues and vectors by the parallel
generalized Jacobi requires total computer time
of the order:

$$T_{iii} = m^2 q\,(14 + 2.5\mu)\,J_{max} \qquad (42)$$
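
The per-step costs listed above can be tallied to check the leading-order estimate of Equation (42). The sketch below is illustrative only: the function names are hypothetical, and the cost of Step 2 is an assumption, taken from Equation (23) with v = 2 values sent among N = m/2 processors.

```python
def jacobi_sweep_cost(m, mu, q=1.0):
    """Sum the itemized step costs for one Jacobi sweep."""
    n_proc = m // 2
    compute_per_set = 12 + (12 + 12 * m + 2 * m)   # Steps 1 and 3
    step2_comm = 2 + n_proc - 2                    # assumed: Eq. (23) with v = 2
    step4_comm = 2 * m                             # row exchange of Step 4
    sets = m - 1                                   # Steps 1-4 repeated (m-1) times
    convergence = 12 * m                           # tests of Step 5
    compute = sets * compute_per_set + convergence
    comm = sets * (step2_comm + step4_comm)
    return q * (compute + mu * comm)


def jacobi_sweep_model(m, mu, q=1.0):
    """Leading-order estimate per sweep, Eq. (42) with J_max = 1."""
    return m * m * q * (14 + 2.5 * mu)


if __name__ == "__main__":
    m, mu = 1000, 1.0   # hypothetical problem size and communication ratio
    print(jacobi_sweep_cost(m, mu) / jacobi_sweep_model(m, mu))
```

For large m the ratio approaches one, which is consistent with reading Equation (42) as the dominant m-squared terms of the itemized costs.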
iv. Forward and Backward Passes

When solving for the n x m unknowns X in

$$U U^H X = C \qquad (43)$$

a forward pass is first required to find Y,
then a backward pass is performed to compute X:

$$U Y = C; \qquad U^H X = Y \qquad (44)$$
All of A, U, $U^H$ are n x n with half bandwidth b,
partitioned into $n_s$ rows. Each row is
partitioned into $b_s$ column blocks of size
s x s. Similarly, C, Y and X are n x m
rectangular matrices, also partitioned into $n_s$
rows $C_i$, $Y_i$ and $X_i$; $i = 1,\ldots,n_s$, each of size
s x m. From Equations (44):

$$Y_i = U_{ii}^{-1} \left[ C_i - \sum_{k=i-b_s+1}^{i-1} U_{ik} Y_k \right] \qquad (45)$$

and

$$X_i = U_{ii}^{-H} \left[ Y_i - \sum_{k=i+1}^{i+b_s-1} U_{ki}^{H} X_k \right] \qquad (46)$$


With $N = b_s/2$ processors, Equation (45) may be
executed in parallel for a typical row partition
$i = 1,\ldots,n_s$ in five substeps. To begin with, as
a result of performing the forward pass of Equation
(45) on the (i-1)th row partition, two sets
of Y and U partitions will have been stored in
each processor (except the first). For example,
processor #1 will have stored $C_i$, $U_{ii}$, $Y_{i-1}$, and
$U_{i,i-1}$; processor #2 will have stored
$Y_{i-2}$, $U_{i,i-2}$, $Y_{i-3}$ and $U_{i,i-3}$; and processor #N
will have stored $Y_{i-b_s}$, $U_{i,i-b_s}$, $Y_{i-b_s+1}$, and
$U_{i,i-b_s+1}$. During the first substep, all
processors simultaneously multiply and add their
contents. Thus, a typical processor such as #2
computes $-(U_{i,i-2} Y_{i-2} + U_{i,i-3} Y_{i-3})$, and #N
computes $-(U_{i,i-b_s} Y_{i-b_s} + U_{i,i-b_s+1} Y_{i-b_s+1})$.
The exception is processor #1 which performs
$-(-C_i + U_{i,i-1} Y_{i-1})$. The time required for the
first substep is of order $2ms^2 q$, and the result
is a partial sum residing in each processor. In
the second substep, the cascade sum of the
partial sums in all N processors is formed by
adding the m x s contents of pairs of processors
in the lower-numbered processor of each pair.
This requires simultaneous passing of one m x s
data block between the next neighbor processors
of each pair. As a result, the full cascade sum
of the quantity in parenthesis in Equation (45)
is readily accumulated in processor #1 in total
time of the order $(1+\mu)(msq)\log_2 N$.


In the third substep, processor #1 computes $Y_i$
by multiplying $U_{ii}^{-1}$ by the total sum obtained in
the second substep in time of order $ms^2 q/2$. All
other processors remain idle during this
substep. The purpose of the last two substeps
is to store the needed data in the appropriate
processors in preparation for performing the
forward pass on the (i+1)th row partition.


Thus in the fourth substep, all $Y_i$ blocks are
shifted one storage location toward the
higher-numbered neighboring processor. For
example, $Y_{i-1}$ is transferred from processor #1
to processor #2; $Y_{i-2}$ replaces $Y_{i-3}$ in the same
#2 processor, while $Y_{i-3}$ is transferred to
processor #3 to replace $Y_{i-4}$; simultaneously
$Y_{i-4}$ replaces $Y_{i-5}$ in processor #3 while $Y_{i-5}$ is
transferred to processor #4 to replace
$Y_{i-6}$, ..., etc. This is done in parallel by all
processors on s x m data blocks in time
proportional to $msq\mu$. In the last substep, two
s x s blocks ($U_{i,k}$ and $U_{i,k-1}$) are downloaded
from the host computer to each of the
N processors, where $k = i, \ldots, i-b_s+1$. This
requires communication time proportional to
$(2s^2 + N - 2)\,q\mu$. The total time required for
performing the forward pass of (45) on $n_s$ row
partitions consists of the above five substeps
repeated $n_s$ times, and is denoted by $T_{iv}$.


$$T_{iv} = mnq\,[\,2s + \log_2 N + \mu (2 + \log_2 N)\,] \qquad (47)$$


As can be seen from Equations (45) and (46), the
time required for the backward pass is
essentially the same as that of the forward pass
given by (47).

5. Algorithm Efficiency and Speedup


By summing the results of Equations (25, 29, 32,
34, 42, and 47), the total time, T, required to
complete the algorithm is found to be of order:

$$T = mnq\,[\,(\alpha_0 + \alpha_1 K_{max}) + \mu (\beta_0 + \beta_1 K_{max})\,] \qquad (48)$$


where $\alpha_0 = 2s\,(2 + s/m)$


= m ,3n , 
a, = [13s + 2 log, N + G + 14 ieee + 2 log, N)] 
$\beta_0 = (\log_2 N + 10s/m)$

$\beta_1 = [\,4(1 + \log_2 N) + \frac{m}{n}(2.5\,J_{max} + 2 \log_2 N)\,]$


The coefficients $\alpha_0$ and $\beta_0$, respectively, are
computation and communication costs associated
with Steps 1 through 3 of Table (2). These are
fixed and do not depend on the convergence rate.
On the other hand, $\alpha_1$ and $\beta_1$ are, respectively,
computation and communication cost per
simultaneous iteration $k = 1,2,\ldots,K_{max}$. Both $\alpha_1$
and $\beta_1$ are functions of the parameters b,m,n,s,N
as well as the maximum number of Jacobi sweeps,
$J_{max}$.

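According to our previous definitions, the
algorithm efficiency $\epsilon^*$ and speedup $g^*$ are:

$$\epsilon^* = \frac{\alpha_0 + \alpha_1 K_{max}}{(\alpha_0 + \alpha_1 K_{max}) + \mu (\beta_0 + \beta_1 K_{max})} \qquad (49)$$

and $g^* = \epsilon^* N$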
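
To see how the efficiency behaves as the number of simultaneous iterations grows, Equation (49) can be evaluated numerically, reading it as the ratio of the computation terms of Equation (48) to the total. The coefficient values in the sketch below are hypothetical placeholders, not the values of Table (3), and the function names are assumptions.

```python
def efficiency(alpha0, alpha1, beta0, beta1, mu, k_max):
    """Eq. (49): computation cost divided by computation plus communication."""
    compute = alpha0 + alpha1 * k_max
    communicate = mu * (beta0 + beta1 * k_max)
    return compute / (compute + communicate)


def speedup(eff, n_proc):
    """Speedup is the efficiency multiplied by the number of processors."""
    return eff * n_proc


if __name__ == "__main__":
    # Hypothetical coefficients chosen only to show the trend with K_max.
    for k_max in (0, 1, 2, 5, 10):
        e = efficiency(10.0, 90.0, 5.0, 15.0, 1.0, k_max)
        print(k_max, round(e, 4), round(speedup(e, 128), 1))
```

As K_max grows, the value tends to alpha1 / (alpha1 + mu * beta1), which is why the efficiency becomes insensitive to K_max in its upper range.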


To illustrate the behavior of the efficiency $\epsilon^*$
in Equation (49), the representative values of
problem parameters in Table (3) are considered.

Because the contribution of the maximum number of
required Jacobi sweeps, $J_{max}$, to the
coefficients $\alpha_1$ and $\beta_1$ (and consequently to $\epsilon^*$)
is made negligibly small by the small ratio of
m/n, $J_{max}$ is arbitrarily taken equal to ten in
subsequent discussions of $\epsilon^*$ and $g^*$. However,
the contribution of the maximum number of
simultaneous iterations, $K_{max}$, to $\epsilon^*$ is not as
obvious. In Figure (1), the efficiency $\epsilon^*$ is
depicted for the three sets of parameters
listed in Table (3) for the range of
$K_{max} = 0,1,2,5,10$. As is observed from Figure
(1), the efficiency of the parallel algorithm
becomes rather insensitive to values of $K_{max}$ in
their upper range. The total time, T, of
Equation (48) behaves linearly with $K_{max}$ beyond
the initial Steps 1 through 3 of Table (2).

TABLE (3). Selected Problem Parameters


In Figure (2), $K_{max} = 10$ is used to construct
Equation (49) graphically for cases A, B and C
of Table (3). As illustrated, the algorithm is
more efficient for larger problem order n and
bandwidth b. The speedup can readily be
obtained from Figure (3) by the simple
multiplication $g^* = N\epsilon^*$. For example, using 128
processors in parallel, Case A can be solved
more than 100 times faster than with a single
processor.


6. Conclusions 


A parallel algorithm is described for extracting
m eigenpairs of generalized eigenvalue problems
of large order n >> m. The algorithm combines
power methods with a generalized Jacobi
technique, both of which are known to have good
convergence properties. Concurrency is
introduced by partitioning the computational
work load among N processors during an initial
one-time factorization step and during
successive iterations. As Equation (48)
indicates, the first is functionally dependent
upon $\alpha_0$ and $\beta_0$, while the second is dependent
upon $\alpha_1$ and $\beta_1$. As the number of required
iterations $K_{max}$ increases, the iterative part of
the calculations dominates the required
computational resources as well as the
efficiency and speedup of the algorithm. This
is illustrated by Figure (1). In terms of
efficiency and speedup, the algorithm performs
best for problems with larger order n and
bandwidth b.

APPENDIX A


Five basic operations are defined in the following table:

Description (example parameters: m=8, $r^2 = m/2$, x=1)

1. Shift j indices to the right.

2. Shift j indices up:
$P_{\lambda,\nu} \to P_{\lambda-x,\nu}$ if $\lambda - x \geq 1$;
$P_{\lambda,\nu} \to P_{r+\lambda-x,\nu}$ if $\lambda - x < 1$.

3. Shift j indices down:
$P_{\lambda,\nu} \to P_{\lambda+x,\nu}$ if $\lambda + x \leq r$;
$P_{\lambda,\nu} \to P_{\lambda+x-r,\nu}$ if $\lambda + x > r$.

4. Exchange j of $P_{\lambda,\nu}$ with i of $P_{\lambda,\nu+x}$.

5. Diagonally exchange i of $P_{\lambda,\nu}$ with j of $P_{\lambda+x,\nu}$.

FIG 2. EFFICIENCY FOR SELECTED PROBLEM
PARAMETERS IN TABLE (3)


Start with a set of m/2 non-overlapping
i,j indices, one pair in each processor $P_{\lambda,\nu}$,
e.g., m=32, $N = m/2 = r^2$, r=4, $\lambda = 1,2,\ldots,r$,
$\nu = 1,2,\ldots,r$, so that $P_{11}(1,2), P_{12}(3,4), \ldots,
P_{44}(31,32)$. This is the first of the (m-1)
combinations.

Generate $(r^2-1)$ combinations of (i,j) as
follows:
1. Do (r-1) times
   a. apply operation #1, (r-1) times
      with x=1
   b. apply operation #3, once with x=1
2. Repeat operation #1, (r-1) times with
   x=1

Generate $(r^2-1)$ combinations as follows:
1. Apply operation #4, once with x=1.
2. Do (r-2) times
   a. apply operation #2, (r-1) times
   b. apply operation #4 with x=(r-2),
      or #1 with x=(r-2) (alternate
      between the two).
3. Apply operation #2, (r-1) times with
   x=1.
4. Apply operation #5 with x=1,2,...,
   (r-2).
5. Apply operation #2 with x=(r-2).

Abstract -- This paper deals with the development
of multiprocessor simulations from a
serial set of ordinary differential equations 
describing a physical system. The identification 
of computational parallelism within the model 
equations is discussed. A technique is presented 
for identifying this parallelism and for parti- 
tioning the equations for parallel solution on a 
multiprocessor. Next, an algorithm which packs 
the equations into a minimum number of processors 
is described. The results of applying the pack- 
ing algorithm to a turboshaft engine model are 
presented. 


Introduction 


Multiple processors, operating together to 
solve a single problem, can, in many cases, 
decrease the time of calculation. This is impor- 
tant in time-critical applications, such as real- 
time simulation, where this technique can provide 
computational rates unachievable on a single 
processor or allow the use of lower cost hardware 
to provide the necessary computational capabili- 
ties. For certain classes of problems it is 
possible to configure a network of microcomputers 
to achieve the same throughput rate as a large 
mainframe computer at a lower initial and ongoing 
maintenance cost. 

The parallel processing concept has opened 
new areas of research and development in hard- 
ware, software and theory. Some efforts spon- 
sored by NASA Lewis are described in Refs. [1] 
to [6]. Techniques for developing mathematical
models that can be solved efficiently on parallel
processors are a key area of study. The first
step in developing these multiprocessor models 
is to identify parallelism within the mathemati- 
cal formulation of the problem. This requires a 
data flow analysis of the problem's equations and 
will identify the "critical path" and the minimum 
achievable calculation time. The next step is 
to arrange, or "pack" the noncritical path com- 
putations on the minimum number of processors so 
as to make maximum use of the available computing 
resources. 

This paper presents a method of partitioning
and packing equations for multiprocessor solu- 
tion. Reference [6] gives a more detailed dis- 
cussion of these techniques, including a 
comprehensive example of applying them to a 
turboshaft engine model. 


Computational Parallelism 


A mathematical model of a physical system
consists of a set of equations which describe,
to some degree of accuracy, the response of that
system to external influences (driving functions)
over a limited range of operation. This range is
defined in terms of the maximum and minimum values
of the driving functions and, if time dependent,
the maximum frequency or maximum rate of change
of these functions. Generally, the object of this
modeling effort is to provide a simulation of the
physical system.

U.S. Government Work. Not protected by

The prerequisite to developing parallel
processor simulations is to be able to identify
the parallel computational paths contained in the
model. In general, a dynamic model can be programmed
on a digital computer as a set of N
computationally sequential equations of the form

$$X_K(ih) = f_K[X_m(ih), X_n((i-1)h), \ldots, u(ih)]$$

where $X_K(ih)$ is the result of the Kth equation
at time ih. Here, h denotes the simulation
time step or update interval of the model calculations.
The arguments $X_m(ih)$ are the current
values of the results of preceding equations in
the model (i.e., m = 1 to K - 1), and $X_n((i-1)h)$
are the past values of the results of all
equations in the model. The argument u represents
values obtained from sources external to the
model which are always available at the start of
the model computation sequence. The functional
relationship between $X_K$ and its arguments is
represented by $f_K$. Assuming an equation is an
indivisible computational unit, the parallelism
in the model is determined by the arguments
of each equation. That is, two equations, or sets
of equations, can be computed in parallel within
an update interval if their arguments are independent
of the results of the others computed in
that interval.

For example, a model of the form

$$X_1(ih) = f_1[X_3((i-1)h)]$$
$$X_2(ih) = f_2[X_1(ih)]$$
$$X_3(ih) = f_3[X_2(ih), X_3((i-1)h)]$$

contains no parallelism since $X_3(ih)$ requires
$X_2(ih)$ and $X_2(ih)$ requires $X_1(ih)$. These
calculations must be done serially. However, the
model

$$X_1(ih) = f_1[X_3((i-1)h)]$$
$$X_2(ih) = f_2[u(ih)]$$
$$X_3(ih) = f_3[X_1(ih), X_2(ih), X_3((i-1)h)]$$

does contain parallelism since $X_1$ can be
computed at the same time as $X_2$.
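
The rule illustrated by these two models can be written as a small dependency check: two equations can run in parallel within an update interval if neither needs, directly or through other equations, the current-interval result of the other. The sketch below is an illustration only; the function names and the dictionary layout are assumptions, and only current-value (dependent) arguments are listed, since past values and external inputs are always available.

```python
def depends_on(deps, a, b):
    """True if equation a needs the current value of b, directly or indirectly."""
    stack, seen = list(deps.get(a, ())), set()
    while stack:
        x = stack.pop()
        if x == b:
            return True
        if x not in seen:
            seen.add(x)
            stack.extend(deps.get(x, ()))
    return False


def can_run_in_parallel(deps, a, b):
    """Two equations are parallel if neither depends on the other."""
    return not depends_on(deps, a, b) and not depends_on(deps, b, a)


# Current-interval arguments of the two example models above.
serial_model = {"X1": [], "X2": ["X1"], "X3": ["X2"]}
parallel_model = {"X1": [], "X2": [], "X3": ["X1", "X2"]}

if __name__ == "__main__":
    print(can_run_in_parallel(serial_model, "X1", "X2"))
    print(can_run_in_parallel(parallel_model, "X1", "X2"))
```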
Parallelism due to decoupled or loosely
coupled equation sets is easily identified from
the physical nature of the model. A more difficult
task is the identification of parallelism
in a set of closely coupled equations, where the
process dynamics dictate the use of current arguments
in solving for equation results. For
instance, suppose a model contains the following
set of equations:

$$X_1(ih) = f_1[X_5((i-1)h), u(ih)]$$
$$X_2(ih) = f_2[X_5((i-1)h)]$$
$$X_3(ih) = f_3[X_2(ih)]$$
$$X_4(ih) = f_4[X_1(ih), X_2(ih)]$$
$$X_5(ih) = f_5[X_3(ih), X_4(ih), X_5((i-1)h)]$$

The variable $X_1$ can be computed at the start
of the calculation interval, since it is a function
of the past value of $X_5$ and the external
variable u. $X_2$ may also be computed at the
start of the interval. However, the calculation
of $X_3$ must be delayed until $X_2$ has been
determined, the calculation of $X_4$ must be
delayed until both $X_1$ and $X_2$ are determined,
and the calculation of $X_5$ must be delayed until
both $X_3$ and $X_4$ have been determined. As
shown in Fig. 1(a), three computational paths can
be identified which can be assigned to three different
computers in the simulation. Note that
"wait states" have been inserted to ensure the
currency of the arguments. That is, equation
calculation is delayed until current argument
values become available. The $X_3$ calculation
is shown to be delayed slightly for the transfer
of $X_2$. The shaded areas (time slots) in
Fig. 1(b) indicate the time available for result
transfer to computer number 1. The calculation
of $X_1$ and $X_3$ can take place anywhere in
the time slot.

The detection of this type of computational
parallelism can become burdensome when the equation
set becomes large. The technique, however,
can be automated. Related to this problem of
partitioning is the problem of allocation (i.e.,
packing these paths into a minimum number of computers
without extending the update time). Figure
1(b) demonstrates packing of the paths defined
in Fig. 1(a). Arbitrary calculation times of
$TX_1$, $TX_2$, $TX_3$, $TX_4$, and $TX_5$ have been assigned
to the equations producing results $X_1$ through
$X_5$, respectively. The time $TX_1$ includes the
time required to obtain the value of u. Note in
Fig. 1(b) that, because of the calculation times,
the $X_2 - X_4 - X_5$ path is critical in that it contains
no idle states. This path, therefore, dictates
the minimum possible update time ($\Delta T = TX_2
+ TX_4 + TX_5$). The paths $X_1$ and $X_3$ are assigned
to separate computers. Packing in this example
is a trivial task, since the $X_3$ calculation
can be moved onto computer number 2 to be calculated
during the idle period, as shown in the
figure.

In many cases, efficient packing requires
shifting equations in their time slots. This
causes a ripple effect on the time slots of other
equations which can complicate the packing
problem. Because of the nature of the packing
problem, a unique solution to the development of
a packing algorithm does not exist. There are
many ways to pack most parallel models.

In the following sections partitioning and 
packing algorithms that have been developed at 
NASA Lewis are discussed. These algorithms were 
tested with a model of a turboshaft engine and 
detailed results are presented in Ref. [6]. 


Partitioning 


To begin the discussion of the partitioning 
algorithm, certain terms should be defined. A 
mathematical model is a set of equations, written 
to define the characteristics of a physical system 
to some desired degree of accuracy. A program is 
a sequential set of digital equations and supporting
information (e.g., variable and constant definition)
which define the mathematical model
within the constructs of a programming language. 
A path is a subset of these equations which, 
because of interrelationships between arguments 
and results, contains no parallelism. Partition- 
ing is the transformation of the program equations 
into a number of paths which may be calculated in 
parallel. Packing is the combination of paths 
into a minimum number of processors (computers) 
which provide computation of the model within a 
prescribed update interval. The critical path is 
the longest path and the prescribed update inter- 
val must be greater than or equal to the calcula- 
tion time of the critical path. 

In this discussion of partitioning it is
assumed that a program is given. That is, these
equations, when executed serially, provide the
required results. No assumptions are made concerning
the parallelism of computational operations
contained in the program equations.
For example, the equation

x = a*y + b*z

contains parallelism (i.e., a*y can be calculated
in parallel with b*z) which will be ignored since
we are concerned with partitioning at the equation
level and not parsing. For purposes of this discussion,
the above equation will be considered as

x = f(a, y, b, z)

where f is some single operation. Therefore,
equations will be assigned to paths in their
entirety and not broken up into more primitive
result-argument relationships.

As indicated in the last section, partitioning
requires the establishment of result-argument
relationships for the serial set of equations in
order to develop computational paths. It is also
necessary to know the calculation time of each
equation. The program must be processed to provide
this information. For this effort, the
result-argument relationships and the calculation
time information are outputs of the multiprocessor
programming utility RIMPL [2], [3]. The primary
function of this utility is to translate a structured
program of the mathematical model into
assembly language for the simulation processor(s).
As an option, the utility also provides
information on the result and arguments of each
equation, the processor operations necessary to
obtain the result, and the processor calculation
time for each operation. For the equation

X = y + z

the processor operations to compute the equation
are: load register R1 with z (requires 8 time
units), add variable y to R1 (16 time units) and
store R1 as the value of variable X (8 time
units). This type of information is generated for
each equation in the program.

Consider the close-coupled example in the
previous section. The first step in the partitioning
process is to convert utility generated
information into the form needed for partitioning.
Dependent arguments are those which are the
results of previous program equation calculations
in the update interval (e.g., $X_1$ is a dependent
argument of $X_4$). These are the drivers for
partitioning since their current values are
required before the computation sequence can continue.
The independent arguments u and $X_5$
do not affect partitioning since only past values
are used. The calculation time for each equation
is determined by adding the calculation times of
the given processor operations.

The time at which an equation can start is 
determined by the arguments and calculation time 
of each equation. The first equation of a set 
only has independent arguments and thus, can 
always start at time 0 (measured from the begin- 
ning of the calculation update interval). It can 
never require results from calculations in the 
current update interval since none are yet availa- 
ble. An equation can end at the time obtained by 
adding its calculation time to the time it can 
start. The general formula for obtaining this 
time is 


CANSTART(RESULT) = MAX(CANEND(ARG1),
CANEND(ARG2), ..., 0)


CANEND(RESULT) = CANSTART(RESULT) +
CALCTIME(RESULT)


where ARG1 is the first dependent argument, etc.
This formula is applied sequentially to each equation
in the program.
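
The CANSTART/CANEND recurrence can be sketched as a single pass over the equations in program order. In the example below the dependent arguments follow the five-equation close-coupled example, with $X_3$ taken to depend on $X_2$; the calculation times are hypothetical values chosen for illustration, and the function and variable names are assumptions.

```python
def schedule(equations):
    """Apply the CANSTART/CANEND formula in program order.

    equations: list of (name, calc_time, dependent_args) tuples, where
    dependent_args names results computed earlier in the same interval.
    """
    can_start, can_end = {}, {}
    for name, calc_time, deps in equations:
        # An equation can start once all its dependent arguments have ended.
        can_start[name] = max([can_end[d] for d in deps] + [0])
        can_end[name] = can_start[name] + calc_time
    return can_start, can_end


# Hypothetical calculation times (arbitrary time units).
EXAMPLE = [
    ("X1", 30, []),
    ("X2", 40, []),
    ("X3", 20, ["X2"]),
    ("X4", 50, ["X1", "X2"]),
    ("X5", 30, ["X3", "X4"]),
]

if __name__ == "__main__":
    starts, ends = schedule(EXAMPLE)
    for name, _, _ in EXAMPLE:
        print(name, starts[name], ends[name])
```

With these assumed times, the largest CANEND belongs to X5 and equals the sum of the times along X2, X4 and X5, which is the critical path of the example.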

Once these attributes have been established
for each program equation, the identification of
computational paths contained in the program can
begin. The algorithm used for path identification
is shown in Fig. 2. Its purpose is to identify
all sequences of equations which contain no
parallelism and which must be computed serially.
These paths are organized into a linked list
called PATHLIST. The paths in PATHLIST are
ordered in terms of decreasing path calculation
time. Therefore, the first path in PATHLIST is
the critical path. To form a path, the algorithm
selects the program equation which has the maximum
CANEND time and which has not already been
assigned to a path. This is the result equation
of the path. The next equation selected is the
one which produces a result used as a dependent
argument of the result equation. If more than one
equation result is used as a dependent argument,
then the one with maximum CANEND time is selected.
The selected equation is then inserted in front of
the result equation in the path. The path formation
process continues until an equation is
inserted which has no dependent argument equations
that are not already assigned to a path. Paths
are formed until all program equations have been
assigned.
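
The path-forming loop described above can be sketched as follows. This is an illustration that follows the prose description only; the function name, the dictionary-based inputs, and the four-equation example are assumptions, and ties are broken by name purely to keep the sketch deterministic.

```python
def identify_paths(calctime, dep_args, canend):
    """Group equations into serial paths, longest path first.

    calctime: {equation: calculation time}
    dep_args: {equation: list of equations it depends on}
    canend:   {equation: CANEND time}
    """
    unassigned = set(calctime)
    pathlist = []
    while unassigned:
        # Result equation: the unassigned equation with maximum CANEND.
        result = max(sorted(unassigned), key=lambda e: canend[e])
        unassigned.remove(result)
        path = [result]
        while True:
            # Unassigned dependent arguments of the current front equation.
            candidates = [a for a in sorted(dep_args[path[0]]) if a in unassigned]
            if not candidates:
                break
            front = max(candidates, key=lambda e: canend[e])
            unassigned.remove(front)
            path.insert(0, front)
        pathlist.append(path)
    # Order by decreasing path calculation time; the first is the critical path.
    pathlist.sort(key=lambda p: sum(calctime[e] for e in p), reverse=True)
    return pathlist


# Hypothetical example data.
calctime = {"X1": 8, "X2": 5, "X3": 4, "X4": 6}
dep_args = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"], "X4": ["X1"]}
canend = {"X1": 8, "X2": 13, "X3": 17, "X4": 14}
paths = identify_paths(calctime, dep_args, canend)
print(paths)
```

In this example X4 depends on X1, but X1 is already assigned to the critical path, so X4 forms a path of its own.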

Partitioning has been discussed in terms of 
equations that produce values of variables. 
Often, mathematical models contain statements that 
do not produce values. Two common examples are 
conditional statements (e.g., IF ... THEN ... 
ELSE) and command statements (e.g., I/O operations).
The calculation time of such a statement
must be combined with a preceding or following 
equation. This could impose limitations on pro- 
gram structure and is a subject for future study. 


Packing 


The partitioning process produces a number
of paths consisting of equations which must be
computed serially and a table of information on
each equation. The final task in the process of
formulating a multiprocessor model is to pack the
paths for assignment to a minimum number of
processors. Because PATHLIST is ordered by the
partitioning algorithm, the first path in the list
has the largest calculation time. This is called
the critical path and its calculation time is the
minimum time in which the model can be computed no
matter how many processors are used. The number of
paths identified through partitioning is usually
greater than the minimum number of processors
necessary.

The minimum number of processors necessary
to implement a multiprocessor simulation depends
on how fast the simulation must be computed. This
update time must be specified prior to packing.
The simulation time step h is usually based on
stability and dynamic accuracy requirements. For
real-time applications, the update time ΔT
must be equal to h. The update time also
specifies when the computations must end. The
first step in packing is to determine when each
equation must end using the specified update time.

To determine when an equation must end, we 
begin with the state variables (defined here as 
those variables whose current values are not used 
as arguments in the model, but appear as results 
of model equations). The state variable computa- 
tions will be the last computations performed, and 
thus must end at the prescribed update intervals. 
The calculation of equations which are dependent 
arguments of these variables must end no later 
than the time at which the state variable calcula- 
tions must start. The times when subsequent equa- 
tions in the result/argument string must end are 
similarly determined. Since a variable can be 
used as a dependent argument in more than one 
equation, care must be taken that the earliest 
time, arrived at after all paths are analyzed, is 
used to specify when that equation must end. 
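
The backward pass that assigns MUSTEND times can be sketched as below. It is a hedged illustration, not the authors' code: the function name, the input dictionaries, and the update time of 20 units are hypothetical, and state variables are identified simply as equations whose results no other equation uses as a dependent argument.

```python
def must_end_times(calctime, dep_args, update_time):
    """Backward pass computing MUSTEND for every equation."""
    # users[e] lists the equations that take e as a dependent argument.
    users = {e: [] for e in calctime}
    for eq, args in dep_args.items():
        for a in args:
            users[a].append(eq)

    mustend = {}

    def visit(eq):
        if eq not in mustend:
            if not users[eq]:
                # State variable: computed last, must end at the update time.
                mustend[eq] = update_time
            else:
                # Must end before the earliest start required by any user.
                mustend[eq] = min(visit(u) - calctime[u] for u in users[eq])
        return mustend[eq]

    for eq in calctime:
        visit(eq)
    return mustend


# Hypothetical example data.
calctime = {"X1": 8, "X2": 5, "X3": 4, "X4": 6}
dep_args = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"], "X4": ["X1"]}
mustend = must_end_times(calctime, dep_args, update_time=20)
print(mustend)
```

Taking the minimum over all users implements the requirement that the earliest time found across all paths is the one retained.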

We now have determined when an equation can
start, can end, and must end. These are termed
equation attributes. Since the paths are serial,
they can also be assigned these attributes: a
path can start when its first equation can start,
a path can end when its last equation can end, a
path must end when its last equation must end, and
additionally, the calculation time of a path is
the summation of the calculation times of its
equations. This is sufficient information to pack
the paths.

The solution to the packing problem is not
unique in that many arrangements of paths in
processors can result in a satisfactory solution.

The packing algorithm, shown in Fig. 3, was 
designed to achieve the minimum number of proc- 
essors. Other requirements which may be imposed, 
such as memory size limitations and inter- 
processor data transfer limitations were not 
imposed on the algorithm. 

As input, the algorithm requires: (1) that
all paths be specified in a linked list called
PATHLIST in order of decreasing calculation time;
(2) that the required update time of the
simulation, ΔT, be specified; and (3) that the
attributes of each equation and path (CANSTART,
CANEND, MUSTEND, CALCTIME) have been determined as
described above.

The packing algorithm creates processors as 
needed and inserts paths from PATHLIST according 
to a hierarchy of relationships between existing 
equations in a processor and the equations in the 
unpacked paths. When a processor is created, the 
path with the longest calculation time in PATHLIST 
is inserted. Next, the paths which are related
to paths already in a processor are tested to see
if they fit (see discussion of TESTFIT algorithm
below). If so, they are inserted; if not, they
are placed in a carry-over list.

Then, paths in PATHLIST which are unrelated
to the equations in the processor are tested. If
one of these is inserted, testing of unrelated
paths ends and relational testing begins again. When
no other paths can be inserted into a processor, 
another processor is created. This process con- 
tinues until all paths in PATHLIST are inserted 
into a processor. 

Relational testing is prioritized. All
unpacked paths which provide critical arguments
are tested first. (A path is considered to provide
a critical argument if the result of the last
equation in the path (EL) is an argument of a
processor equation (EP) and

MUSTEND(EL) = MUSTEND(EP) - CALCTIME(EP).)

Next, other related paths are tested. Then paths
in the carry-over list (which was formed from
paths which were related to equations packed into
previously formed processors, but not yet packed)
are tested.

Paths are tested for insertion on an equation
by equation basis using the test fit algorithm
shown in Fig. 4. First the attributes (CANSTART,
CANEND, MUSTEND, and CALCTIME) of all program
equations are saved. This is necessary because
inserting an equation into a processor can cause
a ripple effect on the attributes of other
equations. If the whole path does not fit, any
equation of the path already inserted into the
processor must be removed and the attributes of
affected equations restored.

The ripple effect is illustrated in Fig. 5.
Assume a processor contains two equations (A and 
B) and that it has been determined that equation 
(C) can be inserted between them. The calculation 
time of each equation is shown as the shaded 
areas. For packing purposes, the calculation of 
each equation can take place anytime between its 
CANSTART time and its MUSTEND time. 

The space available for C is the difference 
between the time at which B must end and A can end 
minus the calculation time of B. Equation C will 
be inserted to start directly after A can end. 
The calculation of B will be delayed until C can 
end. Note that the differences between the MUSTEND
and CANEND times of A and C have been reduced to
zero by the positioning of C, and that the time
difference for B has been reduced. The primary
impact of these changes is to reduce the space in 
the processor available for packing other paths. 
There is also a secondary impact of equal impor- 
tance. By increasing the times at which C and B 
can end, any unpacked equations which use these 
equations as arguments have their starting times 
delayed. Similarly, by reducing the MUSTEND times 
of A and C, the MUSTEND times of any unpacked 
arguments of these equations are moved up. These 
effects tend to reduce the slot sizes of unpacked 
equations restricting the range of time into which 
they can be packed into a processor. Also, these 
ripple effects may introduce computational gaps 
within unpacked paths. 

After the attributes are saved (Fig. 4), the 
equations within a path are ordered in terms of 
decreasing CANEND times for insertion testing. 
That is, the latest equation will be tested first
and the earliest last.

The processor equations are arranged in 
sequential order where EP(1) is the earliest 
equation and EP(n) is the latest equation. 
Testing to determine if a path equation (E(i)) 
can be inserted into the processor involves the 
identification of all slots between any two proc- 
essor equations (EP(j - 1), EP(j)) where the 
equation fits. The processor end points (i.e., 
EP(j) = EP(1) and EP(j - 1) = EP(n)) must also 
be considered. Because of the argument and result 
relationships between E(i) and the processor 
equations it is required that the range of proc- 
essor equations be limited for testing purposes. 
Let the end points of the range be designated by
EPE and EPL (the earliest and latest processor
equations, respectively, before which E(i) may
be inserted). This range is established as follows:
If E(i) is an argument of a processor
equation, then EPL is the earliest processor 
equation of which E(i) is an argument (EPE = 
EP(1)); if any processor equation is an argument 
of E(i), then EPE is the one following the 
latest of these and EPL is the last processor 
equation plus one (end point); if E(i) is 
unrelated to any processor equation, then EPE = 
EP(1) and EPL is the last processor equation 
plus one. 

Once the range of testing has been estab- 
lished, all slots within that range are tested to 
determine if E(i) fits. The fit criterion is 
as follows: 


1. If EP(j) = EP(1) then CANEND*(E(i)) =
CALCTIME(E(i)) else CANEND*(E(i)) =
CANEND(EP(j - 1)) + CALCTIME(E(i));

2. If CANEND*(E(i)) < CANEND(E(i)) then
CANEND*(E(i)) = CANEND(E(i));

3. If EP(j - 1) = EP(n) then MUSTEND*(E(i)) = ΔT
else MUSTEND*(E(i)) = MUSTEND(EP(j)) -
CALCTIME(EP(j));

4. If MUSTEND*(E(i)) > MUSTEND(E(i))
then MUSTEND*(E(i)) = MUSTEND(E(i));

5. If [[CANEND*(E(i)) <= MUSTEND(E(i))]
and [CANEND*(E(i)) <= MUSTEND*(E(i))]], then
E(i) fits.

The asterisk indicates the attributes of E(i) if 
it were inserted into the slot between EP(j - 1) 
and EP(j). 

If it is established that E(i) can fit in 
more than one slot, the testfit algorithm proceeds 
to select the slot into which E(i) best fits. 
The best fit criterion is as follows: 

If a slot exists such that

CANEND*(E(i)) - CALCTIME(E(i)) - CANEND(EP(j - 1)) = 0

then this slot is selected. Otherwise, the latest
slot which maximizes

MUSTEND*(E(i)) - CANEND*(E(i))

is selected.

This criterion provides for efficient packing 
by eliminating processor idle time if possible, 
and if not, then the ripple effect from insertion 
is minimized. 
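
A sketch of the best-fit selection follows. It assumes the fitting slots have already been found and are supplied earliest to latest; the tuple layout and example numbers are hypothetical, and where the text does not say which zero-idle slot wins when several exist, the sketch simply takes the first.

```python
def best_slot(candidates, calctime):
    """Choose among fitting slots.

    candidates: list of (slot_index, canend_star, mustend_star, prev_canend)
    ordered from earliest to latest slot, where prev_canend is
    CANEND(EP(j - 1)) for that slot.
    """
    if not candidates:
        return None
    # Prefer a slot that leaves no idle time after the preceding equation.
    for slot, canend_star, mustend_star, prev_canend in candidates:
        if canend_star - calctime - prev_canend == 0:
            return slot
    # Otherwise take the latest slot with the largest remaining slack.
    best_index, best_slack = None, None
    for slot, canend_star, mustend_star, prev_canend in candidates:
        slack = mustend_star - canend_star
        if best_slack is None or slack >= best_slack:
            best_index, best_slack = slot, slack
    return best_index


# Hypothetical candidates for an equation with calculation time 5.
no_idle = best_slot([(1, 13, 22, 8), (3, 30, 40, 20)], calctime=5)
max_slack = best_slot([(1, 14, 22, 8), (3, 30, 38, 20), (4, 33, 41, 26)], calctime=5)
print(no_idle, max_slack)
```

Using `>=` in the slack comparison is what makes a later slot win when slacks are equal.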

Once a slot is selected, the equation is 
inserted and the attributes of all program equa- 
tions are updated to reflect the insertion. If 
any path equation cannot be inserted into the 
processor, path equations which have already been 
inserted are removed from the processor, the 
original attributes are restored to the program 
equations and the Test Fit algorithm ends. 


Results 


The packing algorithm was programmed in
Pascal, along with the partitioning algorithm.
It was then tested on a turboshaft engine model.
The results, in terms of percent processor
utilization, are presented in Table I. The first
column is the update time ΔT specified prior
to packing. It is given in terms of machine
cycles. The second column gives the number of
processors required for packing. The remaining
columns show the percent utilization of the update
time in calculating each processor's assigned
equations. The first specified update time (5666
cycles) was the minimum possible time and
corresponds to the critical path calculation time.
In this case, four processors were required.
The second update time selection (10 000 cycles)
required two processors. Note that the percent
utilization of the last processor in each case
exceeds the summation of time available on the
other processor(s), so its equations could not
have been absorbed by the other processors. The
algorithm, therefore, functioned satisfactorily.

Since the packing algorithm does not account
for data transfer times, it is possible that the 
available time between when a variable is computed 
on one processor and when its value is required 
for computation on another processor will be less 
than the time required to transfer the variable 
between the processors. The effect of this will 
be to increase the effective calculation time of 
the packed simulation, and therefore, to increase 
the minimum achievable update time. Data transfer 
effects may be significant for multiprocessor sys- 
tems with inefficient data transfer mechanisms or 
for simulations that require large volumes of data 
transfer between processors. Future work in the 
development of a packing algorithm should include 
a study of the effects of data transfer. While 
these effects will increase the critical path time, 
and therefore, the minimum update time, proper 
consideration of data transfer will minimize this 
increase and provide far more efficient packing. 


Concluding Remarks 


The algorithms and considerations presented 
for partitioning and packing mathematical models 
for calculation on parallel processors have sim- 
plified the development of multiprocessor simula- 
tions at Lewis. Evaluation of the packing 
algorithm will continue as other multiprocessor 
simulations are developed. Work on a completely 
automated programming package is in progress, 
which will take a structured serial statement of 
a mathematical model, detect this parallelism, and 
provide load modules for the required number of 
processors. 

The authors welcome discussions of the tech- 
niques presented in the paper, related techniques, 
and developments in the many other aspects of 
multiprocessor simulation.

TABLE I. -- PACKING ALGORITHM RESULTS FOR TURBOSHAFT ENGINE MODEL

Update time,   Processors   Processor percent
cycles         required     utilization

 5 666         4            100
10 000         2             99
19 568         1            100

Other utilization entries (percent): 97, 53, 98, 98

Figure 1. - Partitioning and packing closely coupled equations.
(a) Closely coupled paths. (b) Packing.

Figure 2. - Path identification algorithm. Flowchart steps:

  - Are there any equations in the program not
    assigned to a path? (No: End)
  - Find unassigned equation with maximum CANEND
    time (Ei).
  - Create a new empty path.
  - Mark program equation, Ei, as assigned to a
    path. Insert it into the path as the first
    equation.
  - Create the set of equations whose results are
    dependent arguments of Ei and are still
    unassigned and have CANEND = CANSTART(Ei).
  - Does the set contain any equations?
  - Find the set equation which maximizes path
    execution time.
  - Add path to PATHLIST.


Figure 3. - Packing algorithm. Flowchart steps:

  - Are there any paths in PATHLIST? (No: End)
  - Find PATHLIST path with longest calculation
    time (Pi).
  - Create a new processor and insert Pi. Delete
    Pi from PATHLIST.
  - Transfer carryover set to working set, R.
    Carryover set is empty.
  - Copy PATHLIST to working list, WL.
  - Create the set, S, of all paths in PATHLIST
    which are related* to equations in the
    processor. Delete them from WL, if there.
  - Create the set, C, of all paths in S which
    provide critical arguments* of any processor
    equation. Delete them from S.
  - Does the set, C, contain any paths?
  - Does the set, S, contain any paths?
  - Edit R set to remove any paths already packed.
    Transfer R set to C set.
  - Transfer S set to C set. S set is empty.
  - Find C path with largest calculation time (Pj).
  - Do test fit algorithm.
  - Did path fit in processor?
  - Place Pj in carryover set if not already there.
    Delete Pj from C set.
  - Delete Pj from PATHLIST.
  - Does WL contain any paths?
  - Find path in WL with longest calculation
    time (Pi). Delete Pi from WL.
  - Do test fit algorithm.
  - Did path fit in processor? (Yes: delete the
    path from PATHLIST.)

  * See explanation in text.

Figure 4. - Test fit algorithm. Flowchart steps:

  - Save the attributes of all program equations.
  - Copy Pi equations to working list, EL, ordered
    in terms of decreasing can end times.
  - Does EL contain any equations? (No: path fits
    in processor; End)
  - Ei is first equation in EL. Delete Ei from EL.
  - Is Ei an argument of any processor equation?
  - ... (EPL). EPE is first equation in processor.
  - EPL is last equation in ... is first equation
    in processor.
  - Is any processor equation an argument of Ei?
  - Find latest processor equation which is an
    argument of Ei (EPE).
  - Find all slots* from EPL down to EPE into
    which Ei would fit.
  - Are there any slots? (No: remove any Pi
    equations from processor. Restore program
    equation attributes. Path does not fit in
    processor.)
  - Apply best fit criteria* to select slot.
    Insert Ei into processor. Revise all program
    equation attributes.

  * See explanation in text.

Figure 5. - The effect of insertion on equation attributes.
Panels: processor before inserting equation; equation to
be inserted; processor after insertion. The panels mark
the CANSTART, CANEND, and MUSTEND times of equations
A, B, and C.


Abstract


The specification of arithmetic expressions by 
means of (i) ambiguous context free grammars and 
(ii) context sensitive grammars is considered. A 
CSG is developed which selects the balanced tree 
most suitable for parallel execution. The pos- 
sibilities of parsing an arithmetic expression by 
using the grammar and constructing a replacement
syntax analyser module for a parallel compiler are
discussed.


Introduction 


In recent years much work has been done on the
problem of "balancing arithmetic trees" (Ashoke [4],
Evans and Williams [8], and Evans and Abdollahzadeh
[7]). The principle behind this process is to
divide an arithmetic expression into two parts of
comparable computational 'size' which can then be
evaluated at the same time on different processors
and thus reduce the time required to evaluate the
entire expression. The smaller subexpressions
formed as a result of this dissection can then
be treated in a similar fashion until we reach a
level at which such processing is uneconomical or
unnecessary.


Until now the algorithms have relied on 
explicit knowledge of the special (algebraic) 
properties of the arithmetic operators involved. 
Because we would like to be able to apply our tree 
manipulations to parts of a program other than 
arithmetic expressions we restrict the properties 
used to (i) associativity of certain like operators
and (ii) precedence of multiplicative
operations over additive ones. These properties
not only prove to be the most useful, but also 
appear in numerous other constructions within 
high-level languages and lend themselves to 
specification by formal grammars. 


After preliminary discussion of relevant
segments of language theory and of the structure
of arithmetic expressions we derive suitable
grammars for two classes of arithmetic expressions.
We conclude the paper with a discussion of how
this work may be extended to more general
situations and the problems associated with parsing
the languages generated by these grammars.


Some Elements of Language Theory 


The grammars presented later in this paper are
related to a new class of operator precedence
grammars. In order to facilitate their description
some terminology is required. This is
fairly standard and can be found elsewhere (e.g., Aho
and Ullman [3]) but is included here for
completeness. In these definitions we assume
familiarity with basic set theory.


A phrase-structure grammar (PSG) is an
algebraic structure consisting of an ordered
4-tuple $G = (N,T,P,S)$ where $N$ and $T$ are non-empty
finite alphabets of non-terminal symbols and terminal
symbols respectively such that $N \cap T = \emptyset$
and $N \cup T = V$, where $V$ is called the vocabulary
of $G$. $P$ is a set of productions,
$P \subseteq V^+ \times V^*$ and, assuming the symbol
$\to$ not to be in $V$, $(\alpha,\beta) \in P$ is
usually written as

    $\alpha \to \beta$

Finally $S \in N$ and $S$ is called the start symbol.


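Given $w, y \in V^*$, then $w$ directly derives $y$ if
$w = z_1 u z_2$ and $y = z_1 v z_2$ where
$z_1, z_2, v \in V^*$, $u \in V^+$ and $u \to v \in P$,
and this is written as $w \Rightarrow y$.


If now $w$ and $y$ are words over $V$ and there is
a finite sequence $w_0, w_1, w_2, \ldots, w_r$ where
$w_0 = w$, $w_r = y$ and $w_{i-1} \Rightarrow w_i$
$(i = 1, \ldots, r)$ then we say that $w$ derives $y$,
written as $w \Rightarrow^* y$. Moreover if
$\alpha \in T^*$ and $S \Rightarrow^* \alpha$ then
$\alpha$ is a sentence generated by $G$. Thus
the language $L(G)$ generated by $G$ is
$\{x : x \in T^* \text{ and } S \Rightarrow^* x\}$.
Where $G$ is understood we can define
$L(X) = \{x : x \in T^*, X \in N \text{ and } X \Rightarrow^* x\}$.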


Let $G = (N,T,P,S)$ be a PSG as described above.
Such a grammar is called Chomsky Type 0. If each
element of $P$ is of the form
$z_1 u z_2 \to z_1 v z_2$ where $z_1, z_2 \in V^*$,
$u \in N$ and $v \in V^+$ then $G$ is said to be
context sensitive (a CSG), or Chomsky Type 1. An
alternative restriction is that if $w \to y \in P$
then $w$ and $y$ should be such that
$1 \le |w| \le |y|$.


If the replacements may be carried out regardless
of context then we may replace 'contexts' $z_1$
and $z_2$ by the empty string, $\Lambda$, and obtain the
weaker restriction that if $w \to y \in P$ then
$w \in N$ and $y \in V^+$.
This restriction is satisfied by context-free or
Chomsky Type 2 grammars (CFGs).


An important subset of CFGs is the so-called
operator grammars. These are grammars in which
all productions are such that no two non-terminals
are adjacent in any right-hand side of a
production. For the definition of precedence relations
and operator precedence grammars, see [3]. The
symbols $\lessdot$, $\doteq$ and $\gtrdot$ denote
precedence relations and provided that at most one
such relation holds between any two elements of
$T \cup \{\$\}$ then the associated operator grammar
is called an operator precedence grammar.
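
The operator-grammar condition (no two adjacent non-terminals on any right-hand side) is easy to check mechanically. The sketch below is illustrative only: the dictionary encoding of productions and the function name are assumptions, and the example is the classical expression grammar with non-terminals E, T, P.

```python
def is_operator_grammar(productions, nonterminals):
    """Return True if no right-hand side has two adjacent non-terminals.

    productions: {left-hand non-terminal: list of right-hand sides},
    each right-hand side being a list of symbols.
    """
    for alternatives in productions.values():
        for rhs in alternatives:
            for left, right in zip(rhs, rhs[1:]):
                if left in nonterminals and right in nonterminals:
                    return False
    return True


# Classical expression grammar: E -> E+T | T, T -> T*P | P, P -> (E) | I.
expr_grammar = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "P"], ["P"]],
    "P": [["(", "E", ")"], ["I"]],
}
# A grammar with implicit juxtaposition violates the condition.
juxtaposition = {"E": [["E", "T"], ["T"]], "T": [["I"]]}

print(is_operator_grammar(expr_grammar, {"E", "T", "P"}))
print(is_operator_grammar(juxtaposition, {"E", "T"}))
```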


Structure of Arithmetic Expressions 


For any expression we need to consider all
admissible tree structures and from these select
one most suitable for parallel execution. We say
a tree is admissible if, by evaluating the
subexpressions in the order defined by corresponding
subtrees, evaluation of the expression associated
with the entire tree is arithmetically correct for
all legal assignments of numerical values within
the expression.


Example: For the expression 7 - 3 + 2 the
tree in Figure 1 (a) corresponds to the valuation
(7 - 3) + 2 = 6 and is admissible, but the tree in
1 (b) implies the calculation 7 - (3 + 2) = 2 and is
inadmissible.
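
The example can be checked by evaluating both trees directly. In this minimal sketch the nested-tuple encoding `(operator, left, right)` and the function name are assumptions made for illustration.

```python
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(tree):
    """Evaluate a binary expression tree given as (op, left, right) or a number."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))


tree_a = ("+", ("-", 7, 3), 2)   # (7 - 3) + 2, the admissible grouping
tree_b = ("-", 7, ("+", 3, 2))   # 7 - (3 + 2), the inadmissible grouping
value_a = evaluate(tree_a)
value_b = evaluate(tree_b)
print(value_a, value_b)
```

The two groupings give different values, which is exactly why the second tree cannot be used for 7 - 3 + 2.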


Eventually it will be necessary to specify the 
processing of arbitrary well-formed arithmetic 
expressions consisting of identifiers a, b, c, ... 
etc., each associated with a numeric quantity, the 
operators +,-,* and /, and the usual parentheses. 


However, mindful of the desire to be able to
extend this methodology to manipulation of other
aspects of programming languages we shall
concentrate on expressions most amenable to the
process being developed here. Identification of the
relevant algebraic properties is readily achieved by
examination of the rules of elementary arithmetic.


Our concern is with expressions involving
quantities of type real. Mathematically, the
reals are an infinite set of numbers on which the
two operations of addition (+) and multiplication
(*) are defined. Computationally, any
implementation of real arithmetic uses only a
(comparatively very small) finite set of numbers and
often encounters situations where the result of an
addition or multiplication is not defined, resulting
in overflow conditions(a) and, hopefully, a halt in
the proceedings.


Using parentheses to indicate the desired
order of evaluation, the field of real numbers $R$
is subject to the following 9 axioms:


1. Addition is commutative
       $a+b = b+a$ for all $a,b \in R$
2. Addition is associative
       $(a+b)+c = a+(b+c)$ for all $a,b,c \in R$
3. There is a specific $x \in R$ such that
       $x+a = a$ for all $a \in R$
   $x$ is called zero and usually written 0.
4. For each $a \in R$ there corresponds an element
   $b \in R$ such that
       $a+b = 0$
   $b$ is the additive inverse of $a$, usually
   written $-a$ or $(-a)$.
5. Multiplication is commutative
       $a*b = b*a$ for all $a,b \in R$
6. Multiplication is associative
       $(a*b)*c = a*(b*c)$ for all $a,b,c \in R$
7. There is a specific $y \in R \setminus \{0\}$ such that
       $y*a = a$ for all $a \in R$
   $y$ is the multiplicative identity and is
   usually denoted by 1.
8. For each $a \in R \setminus \{0\}$ there is an
   element $c \in R$ such that $a*c = 1$
9. Multiplication distributes over addition
       $a*(b+c) = (a*b)+(a*c)$ for all $a,b,c \in R$.


(a) This is in contrast to calculations suffering
    loss of accuracy due to round-off error. When
    specific orders of evaluation are required in
    order to minimize this effect, parentheses can
    be used as noted below.


Subject to the calculations being carried out
within the limits determined by an implementation,
the arithmetic is fully and strictly as defined
by these nine rules. However, in order to make
the expressions more easily understood various
conventions are adopted. First, subtraction and
division are introduced and defined in terms of
additive and multiplicative inverses, viz.

    $a-b = a+(-b)$ and $a/b = a*(b^{-1})$

Second, in order to reduce the necessity for
some parentheses, it is assumed that multiplication
(and hence division) takes precedence over
addition (and subtraction),

    i.e., $a*b+c$ means $(a*b)+c$ and $a+b*c$ means $a+(b*c)$

Following the concepts of admissibility and 
generalization we choose not to allow tree trans- 
formations based on commutativity or distributi- 
vity. This is because these properties are very 
special and hence do not easily generalize. They 
are also of questionable benefit in reducing 
execution time of expressions, (see Abdollahzadeh 
[1] for details). In terms of tree transforma- 
tions the manipulations are as in Figures 2 and 
3 respectively; commutativity requires reordering 
of leaves in the trees and distributivity involves 
deletion/creation of terminal and non-terminal 
nodes. 


Axioms 3, 4, 7 and 8, involving identities and
inverses, serve to define the auxiliary operations
of subtraction and division. If these operations
are not implemented directly but via the process
of constructing inverses then they need not be
considered further. Notwithstanding this fact,
most computers have explicit subtraction (and,
getting away from most micro-processors, division)
instructions which are more awkward to manipulate.
The subtraction operation is subject to
derived rules as follows. By definition

    $x-y-z = x+(-y)+(-z)$

This second expression can then be evaluated
either as

    $(x+(-y))+(-z) = (x-y)-z$

or as

    $x+((-y)+(-z)) = x+((-1)*y+(-1)*z) = x+(-1)*(y+z) = x-(y+z)$

It therefore follows that both the trees in Figure
4 are admissible for this expression and, to be
consistent with our earlier remarks on tree
transformations, we reject the form in Figure
4 (b). This emphasises the left associativity of
subtraction. Similar arguments apply to division.


This leaves us with the associativity of addition 
and multiplication, and the precedence of 
multiplication over addition. 


We now turn our attention to the grammatical 
specification of admissible trees for arithmetic 
expressions that allow associative transformations 
and assist in the selection of the admissible tree 
most suitable for parallel execution. 


Expressions Involving a Single 
Associative Operator 


The classical grammar central to our
considerations is:

    $E \to E+T \mid T$,  $T \to T*P \mid P$,  $P \to (E) \mid I$    (Grammar $G_1$)

$G_1$ involves two operators and hence we shall
initially only use one layer of it, namely the
sub-grammar $G_2$.

    $E \to E+T \mid T$    (Grammar $G_2$)

The sentences derived from $E$ in $G_2$ are of the form
$T$, $T+T$, $T+T+T$, ... etc.


Moreover a typical derivation in $G_2$ is

    $E \Rightarrow E+T \Rightarrow E+T+T \Rightarrow \cdots \Rightarrow T+T+\cdots+T+T$

with the derivation tree in Figure 5 (a).


The corresponding arithmetic tree is given in 
Figure 5 (b). 


Symbolically (i.e., syntactically rather than
arithmetically), $L(G_2) = \{T(+T)^n : n \ge 0\}$. Now this
language can also be written as $\{(T+)^n T : n \ge 0\}$ and
this gives rise to a syntactically equivalent
grammar $G_3$,

    $E \to T+E \mid T$    (Grammar $G_3$)

However the underlying syntactic structure,
and therefore any 'natural' semantic ordering,
associated with $G_3$ is as shown in Figure 6.
Figure 6 depicts the $G_3$ derivation of the same
sentence as in Figure 


In grammar $G_2$ the left recursion [3] indicates
the left associativity of addition and gives rise
to the precedence relation $+ \gtrdot +$. Similarly the
right recursion in $G_3$ yields right associativity
and $+ \lessdot +$. Notice also that both $G_2$ and $G_3$ are
unambiguous and hence there is only one admissible
semantic structure for any sentence in each of the
grammars. However, since addition is associative
(rather than just left- or right-associative),
there are, in non-trivial examples, very many
admissible trees. To enable syntactic generation
of these trees we combine $G_2$ and $G_3$ into $G_4$

    $E \to E+T \mid T+E \mid T$    (Grammar $G_4$)

Now $G_4$ is ambiguous and gives rise to, amongst
others, the five semantic trees depicted in Figure
7.


$G_4$ incorporates the precedence relations $+ \lessdot +$
and $+ \gtrdot +$ and leads to the definition of an
extended precedence relation, namely $\lessdot\gtrdot$. Formally:
within an operator grammar, given terminal (operator)
symbols $a$ and $b$, we say that $a$ and $b$ are of
comparable precedence (written $a \lessdot\gtrdot b$) iff
$a \lessdot b$ and $a \gtrdot b$. Thus, from $G_4$ we may
obtain the precedence table in Figure 8.


In fact $G_4$ can generate all admissible trees
for sentences in $L(G_2)$ but by the deterministic
nature of context-free parsing algorithms, even
when acting on ambiguous grammars, such algorithms
will always give a tree similar to those in Figure
5 (b) or 6 (b). It is argued elsewhere (Evans and
Abdollahzadeh [7]) that the most suitable tree for
parallel execution of the expression $T+T+T+T+T$
is that given in Figure 7 (b) and we now set about
the problem of isolating this particular tree from
all admissible trees for the expression.


Limiting consideration to expressions built 
from the addition of terms (T) we can always over- 
come the selection problem by the inclusion of 
syntactic sugar in the form of parentheses. Of
course, this is cheating, since it expects the
programmer to analyse the expression for himself;
however, it helps to motivate the grammatical
constructions that will eventually lead to
automatic balancing.


Explicitly, using parentheses to indicate the
tree structure, the expressions $E_n$ involving $n$
terms ($T_i$) and $n-1$ operators are:


    $E_1 = T_1$, $E_2 = T_1+T_2$, $E_3 = (T_1+T_2)+T_3$,
    $E_4 = (T_1+T_2)+(T_3+T_4)$,
    ...


Obviously, when the value of n is not a power 
of 2, we cannot represent the structure of the 
expression E, by a tree which is balanced and in 
which all subtrees are balanced. In such cases 
we balance the subtrees from left to right
(the rationale for this is given in Evans and
Abdollahzadeh [7]). To particularize our use of
the word 'balanced' we define its use in two 
technical terms. Both of these are recursive and 
they recurse with respect to the height of a tree. 


The height, h(t), of the binary tree t is such
that,
(i) h(t) = 0 if t is a tree associated with
an expression involving no operators,
(ii) h(t) = max(h(t_1), h(t_2))+1 if t is
composed of two subtrees t_1 and t_2, joined
by the binary operator op, ie t = t_1 op t_2.


A completely balanced binary tree (CBBT) of
height n is either (i) any tree of height n=0, or
(ii) a tree corresponding to the expression
t_1 op t_2 where t_1 and t_2 are CBBTs of height n-1.


It follows immediately that the CBBTs
associated with the language {T(+T)^n : n≥0} are
those derived from expressions E_n where n=2^m for
some m, ie trees of height m involving n operands.


A (left-to-right) balanced binary tree (BBT)
is either (i) a trivial tree of height 0 or (ii)
t_1 op t_2 where t_1 is a CBBT and t_2 is a BBT and
h(t_1) ≥ h(t_2).


A consequence of the definition of BBTs is that
the BBT involving n operands is of height p where
2^{p-1} < n ≤ 2^p; moreover, by isolating non-trivial
CBBTs in parentheses as soon as such a structure
is recognized, we can deduce the form of any BBT
while processing it from left to right. Using
multiplicative notation to indicate duplicity of
structure the process can be written explicitly
as:


E_1 = T,  E_2 = T+T = (E_1+E_1) = 2E_1,  E_3 = 2E_1+E_1 = (E_2)+E_1
E_4 = (E_2)+E_1+E_1 = (E_2)+(E_1+E_1) = (E_2+E_2) = 2E_2
E_5 = (E_4)+E_1,  E_6 = (E_4)+(E_1+E_1) = (E_4)+2E_1 = (E_4)+(E_2)
E_7 = (E_4)+(E_2)+E_1 {=(E_4)+E_3},  E_8 = (E_4)+(E_2)+(E_1+E_1)
    = (E_4)+(E_2)+(E_2) = (E_4)+2E_2 = (E_4)+(E_4) = 2E_4

and E_n = 2E_{2^{p-1}}         when n = 2^p,
    E_n = E_{2^{p-1}} + E_q    where 2^{p-1} < n < 2^p and q < 2^{p-1}.


Diagrammatically the balanced structure of E_6, E_7
and E_8 is indicated in Figure 9.
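
The left-to-right balancing rule above can be illustrated by a minimal Python sketch. The function names `balanced` and `height` are ours and do not appear in the paper; the sketch assumes the decomposition E_n = E_{2^{p-1}} + E_q described above, and it brackets every non-trivial subtree, including the outermost one.

```python
def balanced(n: int) -> str:
    """Fully bracketed, left-to-right balanced form of T+T+...+T with n terms."""
    if n < 1:
        raise ValueError("an expression needs at least one term")
    if n == 1:
        return "T"
    # Size of the complete left subtree: the largest power of two strictly below n.
    left = 1
    while left * 2 < n:
        left *= 2
    return f"({balanced(left)}+{balanced(n - left)})"


def height(expr: str) -> int:
    """Tree height, read off as the deepest bracket nesting of the string."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


if __name__ == "__main__":
    for n in (6, 7, 8):
        e = balanced(n)
        print(n, e, height(e))
```

For n = 7 this gives the structure quoted for E_7, and the reported height p satisfies 2^{p-1} < n ≤ 2^p.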


As will be seen from the right subtrees of the
diagrams in Figure 9, if the tree is complete (E_2
in Figure 9 (a)) but of a smaller height than the left
subtree (E_4 in Figure 9 (a)), then the height of
the right tree is increased by one (E_3 in Figure 9
(b)) and the previous subtree inserted as the left
part (E_2 in Figure 9 (b)). Otherwise we step down
the right subtree to find its largest complete
component and perform the same process at a lower
level. It therefore follows that a 'balancing'
grammar needs to keep track of the heights (or at
least the difference between heights) of adjacent
subtrees.


Recall that the height of the balanced tree
for E_n is p where 2^{p-1} < n ≤ 2^p and its structure is
given by E_{2^{p-1}}+E_q where q ≤ 2^{p-1}. Hence the
(2^{p-1})th operator is at the top of the tree, the
(2^{p-2})th operator is one level lower, the
(2^{p-3})th operator is two levels lower, and so on.


The grammar G5 mimics our structuring process
in so far as whenever a BBT is completed it is
surrounded by '(' and ')'. Also, when it is
required to increase the height of a tree to
accommodate more operands (as in Figure 9 (a) and
(b)) this is effected by joining two trees thus:


(CBBT) + (BBT)


Even trivial trees, ie terms, are bracketed by
our grammar; grammatically we reduce ')+(' to +
by a suitable production. Before giving the
grammar and describing its use, we note that in order
to clarify the difference between actual
terminal symbols and synthetic structure-indicators
we have used O and C (open and close)
in place of '(' and ')'. To aid parsing, $ is
used as an end marker.


G5 = ({X,S,E,P,O,C}, {+,T,$}, Q, X)


where Q = {(1) E→T, (2) P→+, (3) OEC→EPE,
(4) P→CPO, (5) $→C$, (6) PE$→CPE$,
(7) S$→OE$, (8) S$→OS$, (9) X→S$ }
Now we apply G5 to the reduction and structuring
of E_7. Recall E_7 = T+T+T+T+T+T+T $
and that it should be structured (see Figure 9
(b)) as E_7 = ((T+T)+(T+T))+((T+T)+T)$
in which the left subtree is a CBBT of height 2
and the right subtree is a BBT of height 2,
resulting in h(E_7)=3. The expression can be analysed
(the reverse process to derivation) as follows:


(1) T+T+T+T+T+T+T$      (9)  OOEC P ECPE$
(2) EPE+T+T+T+T+T$      (10) OOEC P EPE$
(3) OEC+T+T+T+T+T$      (11) OOEC P OEC$
(4) OECPOEC+T+T+T$      (12) OOE P EC$
(5) OEPEC+T+T+T$        (13) OO OEC C$
(6) OOECC PEPE+T$       (14) OO OE $
(7) OOECC POEC+T$       (15) S$
(8) OOEC P EC+T$        (16) X


Productions (1) and (2) of Q (which we shall
refer to as Q1 and Q2 ...) simply encode Ts and
'+' symbols and are used freely throughout the
reduction. Stages 1, 2 and 3 transform 'T+T'
into 'OEC', representing a CBBT of height 1, by
means of Q3. The next 'T+T' is similarly treated
giving (No. 4) 'OECPOEC'. Production Q4 now
removes the reverse brackets ')+(' represented by
'CPO' and then using Q3 again we obtain (No. 6)
'OOECC' which corresponds to a CBBT of height 2.


Now the string generated by E_2 in Figure 9 (b)
is transformed into 'EPE' and then 'OEC' in stages
6 and 7. The central plus, now P, then causes
partial combination of these CBBTs. However, the
two E's in No. 8 do not combine because they are
not of the same height, the second being too low.


The last term comes into play at No. 9.
Because E_7 cannot be completely balanced we need
to apply special productions to cater for the
incomplete subtree E_3. Production Q6 erases the
superfluous 'C', enabling the right-hand subtree to
be incorporated with the preceding E (No. 11). Q3
then performs the last proper reduction with two
trees of height 2 (No. 12). The final stages,
using Q5, Q7, Q8 and Q9, remove peripheral bracket
tokens and achieve the start symbol X.


Pictorially the parse is set out in Figure 10. 


We can now extract the desired admissible tree 
structure from this parse by translating from Q 
to a related set of context-free productions Q'. 
This is simply done by ignoring all structuring 
symbols and trivial productions. Hence we have: 

Q                  Q'

E → T              E → T
P → +              P → +
OEC → EPE          E → EPE
P → CPO
$ → C$
PE$ → CPE$
S$ → OE$
S$ → OS$
X → S$


Applying this translation to Figure 10 gives
the context-free parse in Figure 11 and further
simplification trivially gives us G6 defined by the
productions

E→E+E|T (Grammar G6).

Consequently the sequence of applications
of reductions Q3 and Q1 in grammar G5 gives a
balanced parse from the grammar G6. Figure 11
is, as required, a syntactic equivalent of the
semantic structure in Figure 9 (b).


Expressions Involving Two Associative Operators 


The construction of the context-sensitive
grammar G5 from the context-free grammar G4 can
be mirrored for any language based on a single
associative operation. Hence, we could for
instance take G1

E→E+E|T , T→T*P|P , P→(E)|I (Grammar G1)

and slice off the second, multiplicative, layer

T→T*P|P

and then add extra productions to allow both left
and right-associativity. Thus, to generate L(T)
we have T→T*P | P*T | P
or alternatively T→T*T|P.


To deal with more than one such operator is 
more complicated and methods for the construction 
of grammars to cope with the general situation are 
currently under review. Of more immediate concern 
is the case noted above. We know that, by con- 
vention, multiplication has a higher precedence 
than addition and hence whenever these operators 
are adjacent in an expression we need to ensure 
that any tree balancing takes account of these 
precedence relations. Of course for any sub- 
expression involving only one kind of operator 
the method of section 3 can be used. 


To illustrate how these operators inter-relate
with each other and with the tree-balancing
processes we give an example. To aid description
of multiplicative sub-structure we shall use T_n
to represent a sub-expression involving n operands
combined multiplicatively, and to aid analysis
we use '$' as a delimiter rather than just a
terminator for the entire expression.


Example: Consider the expression I_1+I_2+I_3*I_4*I_5
which should be parenthesised thus:


(I_1+I_2) + ((I_3*I_4)*I_5)


As before the reduction starts by forming I_1+I_2
into a (C)BBT of height 1. However, the '*'
following I_3 prevents I_3 from being immediately
included in an additive BBT of height 2, and
causes a multiplicative tree to be formed from I_3
and I_4; subsequently I_5 is also included. In
order that the correct combination should be
achieved we need to check both operators adjacent
to non-peripheral operands. In this particular
situation we could use productions such as

TI : (E)+ 2T+T+ 
to prevent incorrect manipulation of I_3.

Because '*' has the highest precedence of the
(arithmetic) operators present in the expression
we can invoke the balancing of multiplicative
trees without reference to '+' and hence we use

(T) → T*T


Indeed, when an operator of highest precedence
is used, no checks on the types of adjacent
operators are required except in the case when the
operator in question is not associative. In
parsing terms the delimited expression


$I_1+I_2+I_3*I_4*I_5$
can now be reduced to $(E_2)+(T_2)*I_5$
and then to $(E_2)+(T_2)*T_1$


The (incomplete) multiplicative tree must now be
reduced. This can be done in several ways.
Following the strategy used in the additive
grammar, G5, we can enforce left to right balancing by
removing the right-hand bracket from '(T)' and
allowing the replacement bracket to be created at
the extreme right of the expression. In
consequence we have the productions
*T → )*T and )$ → $
However, this gives rise to the abortive reduction 
sequence 


(1) $(E_2)+(T_2)*T_1$     (4) $(E_2)+((T))$
(2) $(E_2)+(T_2*T_1$      (5) $(E_2+(T))$
(3) $(E_2)+(T_2*T_1)$     (6) $(E_2+E))$


Reference to the desired admissible tree for
this expression (Figure 12) reveals the
requirement for the right subtree to be higher than the
left subtree. Dealing with this phenomenon as
before, by using the production and completely
destroys the left-to-right balancing of additive
trees and hence an alternative method must be
found.


Such a method, based on the strict use of
brackets to preserve tree height information and
hence requiring no bracket imbalance at any stage,
is described in four stages.


(i) First, we return to the (single operator)
additive grammar and generate


L(X) = {T(+T)^n : n≥0} by the CS productions
(1) X→$S$, (2) S→{S}, (3) S→E, (4) +E}→}+E,
(5) +→}+{, (6) {E}→E+E, (7) E→T (Grammar 7)


Now any reduction sequence associated with
grammar 7 has equal numbers of open and close
brackets at every stage and, before any application
of rule 2, the number of balanced bracket pairs
indicates the height of the structure tree.
However, rule 4 is too general for our purposes; it
allows any combination of trees E_i and E_j (in that
order) where h(E_i)>h(E_j). In particular it gives
rise to the following reduction of E_7.


(1) $T+T+T+T+T+T+T$     (6) ${{E+E}}+E$
(2) $E+E+E+E+E+E+E$     (7) ${{{E}}}+E$
(3) ${E}+{E}+{E}+E$     (8) ${{{E+E}}}$
(4) ${E+E}+{E}+E$       (9) ${{{{E}}}}$
(5) ${{E}}+{E}+E$


This implies a tree of height 4 and relates to
the tree in Figure 13 corresponding to the
analysis E_7=(E_4+E_2)+E_1 instead of E_7=E_4+(E_2+E_1). Both
are possible analyses from grammar 7.
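
The over-general behaviour of rule 4 can be replayed mechanically. The following Python sketch is ours and is only illustrative: it assumes that the productions of grammar 7 read X→$S$, S→{S}, S→E, +E}→}+E, +→}+{, {E}→E+E and E→T, and it applies them in reverse (right-hand side to left-hand side), always at the leftmost match. The names `RULES`, `SCHEDULE`, `reduce_once` and `replay` are hypothetical.

```python
# Productions of grammar 7 as assumed above, written as (lhs, rhs).
RULES = {
    1: ("X", "$S$"),
    2: ("S", "{S}"),
    3: ("S", "E"),
    4: ("+E}", "}+E"),
    5: ("+", "}+{"),
    6: ("{E}", "E+E"),
    7: ("E", "T"),
}

# (rule number, number of consecutive leftmost applications) for E_7.
SCHEDULE = [(7, 7), (6, 3), (5, 1), (6, 1), (5, 1), (4, 1),
            (6, 1), (4, 3), (6, 1), (3, 1), (2, 4), (1, 1)]


def reduce_once(form: str, rule: int) -> str:
    """Replace the leftmost occurrence of the rule's right-hand side by its left-hand side."""
    lhs, rhs = RULES[rule]
    if rhs not in form:
        raise ValueError(f"rule {rule} does not apply to {form}")
    return form.replace(rhs, lhs, 1)


def replay(form: str, schedule) -> list:
    """Return the sentential forms reached after each schedule entry."""
    trace = [form]
    for rule, times in schedule:
        for _ in range(times):
            form = reduce_once(form, rule)
        trace.append(form)
    return trace


if __name__ == "__main__":
    for step in replay("$T+T+T+T+T+T+T$", SCHEDULE):
        print(step)
```

The trace passes through ${{{{E}}}}$, that is, four nested bracket pairs, before reaching X, which is one level more than the balanced tree for E_7 requires.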


So there is a requirement for some indicator
to stimulate the accepting of incomplete trees
from right to left. This is easily achieved by a
simple modification given as grammar 8 in which γ
and δ are used as markers; each γ permitting a
possible imbalance at each level and the δs
preventing such an imbalance from occurring to the
left of the central '+' at the corresponding level
of the right sub-tree.


(1) X→$S$, (2) S→{S}γ, (3) S→E, (4) {E}→E+δE,
(5) +→}+{, (6) E→T, (7) γ→Λ, (8) δ→Λ,
(9) +E}γ→}+δE (Grammar 8)


(Notice that grammar 8 is not context-sensitive.
However, its form enables the balancing methodology
to be clearly seen and can be modified later to
meet CS criteria.) The skeletal derivation of E_7
from grammar 8 is as follows, the underlying
semantic tree being indicated by the arrowed
progression of '+' symbols.


[Skeletal derivation of E_7 from grammar 8, running from X through ${{{E}γ}γ}γ$ down to $E+E+E+E+E+E+E$, with the underlying semantic tree traced by arrowed '+' symbols.]


A context-sensitive equivalent of grammar 8 is
given by the productions


(1) X+$S$, 
(5) {E}+E+E, 
(9) +FIs}t€, 


(ii) The second stage in our grammar 
modifications is to introduce different kinds of 
brackets at different levels of the syntax thus 
enabling the balancing of multiplicative subtrees 
to be performed without regard to, and chrono- 
logically before, manipulation of any surrounding 
additive structure. This could be done syntacti- 
cally but use of a distinct bracketing system 
allows processing of the entire multiplicative 
substructure to be analysed and reduced to an 
atomic additive term (of a suitable height) which 
can then not be dissected, by subsequent additive 
manipulation. In particular it can be included at 
the left- or right-hand side of a higher tree at 
will without creating an inadmissible tree or a 
tree which is higher than necessary. 


(2) S+{S}, 
(6) {E}*E+F, 
(10) +FI>}+F 


(3) S+{ST, (4) SE, 
(7) +7}+{, (8) ET, 
(Grammar 9) 


Proceeding in this direction we can take the
terminal symbol T of grammar 9, reclassify it as
a non-terminal, and define it so as to generate
multiplicative subexpressions, balanced from left
to right. Accordingly we require the productions
(1) T{T], (2) T*{TQ, (3) TP, 
(4) [P]>P*P, (5) [P]+P*P, (6) *>]*[, 
(7) *RI>]*P, (8) *RI>}*R, (9) Pol, 


(Productions 9a) 
Using the sets of productions 9 and 9a together
to generate $I_1+I_2+I_3*I_4*I_5$ we obtain
X => ${{SI}T$ => ${E}+E$ 
=> $S$ => ${{E}r$ => $E+E+T$ 
=> ${ST$ => ${E+F I$ etc 


This reduction implies that h(T)=0, ie the term
is atomic, whereas the true position would be
indicated by [[T]], say, having height 2 and
corresponding (additively) to an expression such
as E_4. Using the same productions, the required
information is available (at stage *) in the
subsequent derivation, namely


ue > {(P)i => *pepeP 
=> => [P*RI => [*]*] 
ai [tra (*) me “LP IAP 


Utilization of this information is dealt with 
in the next two stages of the balancing process 
for the class of expressions in question. 


(iii)-(iv) We now need to convert multiplicative
brackets to additive ones and also allow
the expression rather than the grammar to dictate
left or right associativity of the resulting
subexpression within the surrounding expression.
Consequently additive brackets created by this
process behave in more general ways than those
occurring elsewhere and hence there is a
requirement to deal with both of these processes together.


Explicitly, recall that indication of the height
of the multiplicative subtree in $I_1+I_2+I_3*I_4*I_5$
is 2 and could be given by replacing ff rn by 
{{E}}. If done directly this yields the inter- 
mediate form ${E}+{{E}}$ which fails because it 
balances to the right and not to the left. In 
order to be able to accept such a form we need to 
know that the right subexpression is derived from 
multiplication. 


We return to the form: $E J+{(T J M$ 
To allow such a composition we need to achieve the 
progression 


=> ${E +L (T ]0$ => ${{E+T}}$ 
=> ${E+(T] I$ > ${{E+E}}$ 
- indicating a tree of height 3. 


> ${{f{E}}}$ 


This is done by incorporating the replacement 
of the leftmost ‘[' by '{' with the absorption of 
‘}+{' into ‘+’ and with the conversion of the 
corresponding close bracket, here 'I', to '}'. 

To perform the required transformation we use a 
as a marker in the following productions. 


(1) +a+H[, (2) Lavo, (3) Taral, 
(4) Jara], (5) Ha>all, (6) }}elat, 
(7) }$+Hag (8) }}+Ja}, (9) }$+]a$, 


Next, we must migrate any subsequent '[' 
to the left and convert that bracket and its 


corresponding ']' into curly brackets. We use 
a thus 
{EtarE+L , Ta?alT . Jara] 
Na+all etc. as before. 


(This sequence of productions again includes a 
non context-sensitive rule. At the risk of intro- 
ducing parsing difficulties, the offending rule 
could be replaced by 

Y+ar£+[ 


with corresponding changes, such as 


+E>}+Y , S*Y} , S7YT 


elsewhere in the grammar. ) 


The manipulation of high multiplicative subtrees
to the left of an addition operator can be
dealt with by straightforward bracket conversion
and the additive balancing processing provided
that extra conversion initiation rules are
included, namely


$ {a>$[, 


and corresponding conversion rules 


{{a>{[ 


{+>]at, {+>Ila 

Yet again the modification would introduce
non-context-sensitive rules. However, their
removal necessitates the inclusion of more markers;
hence the rules will be left in their present
form. Justification for this decision will be
given in section 6 of the paper.


We are now in a position to expand the sets
of rules 9 and 9a into a full grammar. The only
significant change is the omission of the first
two rules in 9a, thus ensuring that multiplicative
brackets are only obtained by conversion from
additive ones.


The productions of the grammar G10 are
displayed in two columns denoting the rules used
in CBBTs and (additionally) in BBTs respectively.


Gig = (IX SE ta a Pe Leg ly a,T,F,R, Os, 


{1,+,*5$3,P19,%) 
where Piq = | where 
1) (1X2$8$ 
2) S+{S} S+{ST 10) Pel 
3) S7£ 
4) {E}7E+E {E }oE+F +a>}+[ 
5) +H 11) [Larol 
12) Taal 

6) ET 13) Jaa] 

+F T>}+E Tla>all 

+F T>}4F 14) }}eJa} }}>Ila} 
7) TT? 15) }$>Ja$ }$ >Ta$ 
8) [P]2P*P  [P]>P*R 16) }+>]a+ } +Ilat 
9) *e)}*L {E+ ar£+[ 

*RI>]*P 17) ${a$L 

*RI>]*R 18) {f{ar}L } 


» 
We conclude this section with outline
derivations of three non-trivial expressions
using G10.

Example 1: $I_1+I_2*I_3+I_4*I_5$


(for structure see Figure 14)


>scceis | = $e} tE)$ | > SEHCTIFCTIS 
> ${{E+FI}$ > S{E+E MH[T]$ | > $I+I*I+1*1$ 
=> ${{E}+{E}$ => $ {E+aT ]+[T]$ 


(T indicates RH subtree of 1 unit 
lower than LH subtree) 


Example 2: $I+I+I*I*I$ (Figure 15)

X 
S${{E}}}$ [> $ter(T]$ |S $te+(TI]n3 
> ${{E+TJa}$ > S$fE+[T] MoS j> $I+I+1*I1*I$ 
S${t{E+aT]}$ (> ${Eto[T] 


(I indicates incomplete multiplicative 
right subtree) 
Example 3: $I+I+I*I*I+I$ (Figure 16)
Here we have the expression in example 2
included as an admissible subexpression.
Hence,
=> ${{{{E}rirs = S{{{E}}HES 
=> ${{{E+FIIT$ [> $1+I+1*I*I1+I1$ 
> ${{{EHFITs$ 


Further Extensions to The Grammar 

So far we have dealt only with expressions
constructed from simple identifiers, I, and the
arithmetic operators '+' and '*'. Extending the
language to handle more complex arithmetic atoms
(such as numeric constants) and, without regard
to their semantic properties, the operator symbols
'-' and '/', is trivial. Similar methods applied
to Boolean expressions are also obviously
possible. Moreover, because our grammatical structure
now incorporates information about the height of
the respective semantic tree, the inclusion of
the usual parentheses, '(' and ')', causes no
difficulty, although introduction of the markers
required to invoke the conversion of matching
brackets, '{' and '}', and '[' and ']', over the
syntactic brackets necessitates many more
productions.


Extension of these techniques to larger
portions of programming languages may not seem
feasible. However, recall that our basic
construction emanates from the introduction of
ambiguities to a suitable operator-precedence
grammar. Such a grammar for an Algol-like
language already exists, Floyd [9].


From the more pragmatic and practical 
standpoint we need to investigate implications of 
‘live-dead' variable analysis, Sheafer [10] etc. 
If this information could be detected and manipu- 
lated by syntactic means it would allow (i) for 
the inclusion of function calls within expressions 
and hence (ii) for the handling of statements as 
expressions with side-effects. 


Nevertheless, even without these further 
semantic constraints, what we have so far is a 
system that enforces the necessary structure on 
the evaluation of a language fragment but, because 


Parsing Considerations 


All the grammars presented in this paper are, 
or can be, written to conform to Context-Sensitive 
requirements. This is apparently the narrowest 
general classification that is satisfied by our 
grammars and hence before these grammars can be 
put to work we need to have a suitable parsing 
machine. In Abdollazadeh [1] various published 
algorithms are described and compared. Also 
included are details of several other parsing 
strategies applicable to certain of the grammars 
contained therein. However, most of these 
algorithms are more powerful than necessary since 
they can be used to analyse sentences of languages 
which are not context-free. 


This 'sledge-hammer to crack a nut' situation
has been studied before by Baker [5] and Aggarwal
[2]. Indeed Aggarwal defines the set of so-called
SVMT-bounded grammars which are CS but generate
CFLs. This class of grammars extends previous
classifications guaranteed to generate CFLs but
does not include all the CS grammars developed
here and in Abdollazadeh [1].


Because of the special form of our grammars and
the possibility that the most manageable forms
(from the parsing standpoint) may not be
context-sensitive but even more general, we need to
investigate, inter-dependently, parsing
methodology and grammar modification. From our initial
investigations it would appear that the most
important need is to lessen the non-determinism,
provided that this is consistent with the
retention of sufficient structural information.


Conclusion 


We have demonstrated that it is possible to
balance simple arithmetic trees syntactically and
indicated how the method can be extended to deal
with more complicated program fragments. The
balancing technique not only determines the
minimal height of the expression tree (subject to
certain semantic constraints) but also allows for
potential overlap of adjoining trees. Figures 17
and 18 depict a general and a specific situation.


As is seen from the latter figure, the diagonal
length, indicative of idealised execution time,
is related to tree height and, additionally, the
diagonal width is related to the number of
independent processors that could be usefully employed in
the evaluation. This visualization is also capable
of indicating store access operations and
inter-machine transfers.
Our work is now to stretch (semantically
motivated) syntactic methods as far as is practical
and then introduce the minimal semantic structure
(such as left-to-right attributes, Bochmann [6])
so as to move the less complex aspects of a
language into the earlier phases of compilation and
to facilitate more efficient code generation for
however many processors are available at
run-time.
Abstract


A pipelined solution method for tridiagonal systems of linear
equations is proposed which introduces parallelism in a way
that may be effectively exploited by static data flow
computers. It eliminates the substantial data rearrangement
overhead incurred by many existing parallel algorithms, and
sustains a relatively constant parallelism during various phases
of program execution. Using the new method, we outline a
pipelined code mapping scheme. The principle outlined in
this paper may be extended to other suitable parallel machine
architectures.
1. Introduction 


Tridiagonal systems of linear equations form a very important
class of linear algebraic equations. For example, the heart of
finite difference solutions of partial differential equations may
consist of tridiagonal systems of equations. A tridiagonal
system of linear equations can be solved on a conventional
computer using the classical Gaussian elimination algorithm.
However, such a solution method is sequential in nature and
hence unsuitable for parallel computers without drastic
alteration.


In the past decade, new techniques have appeared for solving
tridiagonal systems of equations with parallel computers [1],
[15]. The best known parallel algorithm is based on the cyclic
reduction technique, first proposed by Golub and Hockney
and applied by Buzbee et al. for solving tridiagonal systems of
equations efficiently [2]. One approach is to use such a parallel
technique to solve the recurrences established by the
LU-decomposition method of the Gaussian elimination
algorithm. One such algorithm, known as recursive doubling,
suggested by Stone and originally designed for the Illiac IV, was
later modified for other vector computers [16]. Another
approach has resulted from considering the needs of parallel
processing in the first place and trying to design
fundamentally new algorithms which are inherently more
parallel. The odd-even cyclic reduction algorithm is based on
such a principle [2]. A major difficulty with the algorithms
based on the cyclic reduction technique is the overhead of data
rearrangement between computation steps and the
considerable variation in the degree of parallelism between
computation steps.


*The research described in this paper and its two companion papers [8, 10]
was supported by the Department of Energy and the National Science
Foundation.


0190-3918/86/0000/0084 $01.00 © 1986 IEEE


In this paper, a new method for solving tridiagonal systems of
equations is proposed which introduces parallelism in a way
that may be effectively exploited by data flow computers. The
algorithm is based on the maximally pipelined solution of
linear recurrences presented in a companion paper [8]. It
performs a program transformation of the recurrences
generated in the Gaussian elimination method to produce
machine code which can be executed in a maximally pipelined
fashion. The new method eliminates the substantial data
rearrangement overhead incurred by many existing parallel
algorithms and it sustains a relatively constant parallelism
during various phases of program execution. Based on this
scheme, the code structure of a maximally pipelined
tridiagonal equation solver is outlined for a static data flow
supercomputer. The principle outlined in the paper may be
extended to other data flow computers.


2. Background and Related Work 


In this section we state briefly the problem of tridiagonal
systems of linear equations and review the direct methods
for solving them, such as the Gaussian elimination
algorithm, in particular the linear recurrences established by
the LU-decomposition technique. We also survey the related
work on parallel tridiagonal solution methods, such as the
well-known cyclic reduction technique.


2.1 Statement of The Problem 


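We consider the solution to the following tridiagonal set of
linear equations:

\[
\begin{pmatrix}
b_1 & c_1 & 0 & \cdots & \cdots & 0 \\
a_2 & b_2 & c_2 & \ddots & & \vdots \\
0 & a_3 & b_3 & c_3 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & a_{n-1} & b_{n-1} & c_{n-1} \\
0 & \cdots & \cdots & 0 & a_n & b_n
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ \vdots \\ \vdots \\ x_n \end{pmatrix}
=
\begin{pmatrix} k_1 \\ k_2 \\ \vdots \\ \vdots \\ \vdots \\ k_n \end{pmatrix}
\tag{2.1}
\]

or expressed in matrix-vector notation

\[
Ax = k \tag{2.2}
\]

In this paper, our major concern will be the case where the
coefficient matrix A is positive definite or at least pivoting is
not required.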


2.2 LU-Decomposition


There are a number of serial methods for solving the
tridiagonal system as expressed in (2.1). The maximally
pipelined solution method to be developed in this paper is
based on the well-known LU-decomposition technique [5].
Stone's recursive doubling algorithm, to be discussed later,
is also based upon this technique. In this method, we find
two matrices, L and U, such that


(1) LU = A;

(2) L is a lower bidiagonal matrix;

(3) U is an upper bidiagonal matrix with 1s on its
principal diagonal.


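When A is nonsingular, its LU decomposition is unique. In
fact, it is shown that

\[
L = \begin{pmatrix}
e_1^{-1} & 0 & \cdots & \cdots & 0 \\
a_2 & e_2^{-1} & \ddots & & \vdots \\
0 & a_3 & e_3^{-1} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & a_n & e_n^{-1}
\end{pmatrix}
\]

and

\[
U = \begin{pmatrix}
1 & u_1 & 0 & \cdots & 0 \\
0 & 1 & u_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & 1 & u_{n-1} \\
0 & \cdots & \cdots & 0 & 1
\end{pmatrix}
\]

where

\[
u_1 = c_1/b_1, \qquad u_i = c_i/(b_i - a_i u_{i-1}), \quad i = 2, 3, \ldots, n-1 \tag{2.3}
\]

\[
e_i = u_i/c_i \tag{2.4}
\]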
After computing L and U, it is relatively straightforward to
solve the system of equations by a two-step process. First,
letting $y = Ux$, we have

\[
Ly = k \tag{2.5}
\]

\[
Ux = y \tag{2.6}
\]

and together, we have $Ax = LUx = Ly = k$.


The equation $Ly = k$ can easily be solved for $y$ as follows.

\[
y_1 = k_1/b_1, \qquad y_i = (k_i - a_i y_{i-1})/(b_i - a_i u_{i-1}), \quad i = 2, 3, \ldots, n \tag{2.7}
\]


Note that in the solution process, as indicated by (2.7), there is
no need to compute $e_i$ explicitly unless the matrix L is needed
in other places. Next, we solve $Ux = y$ for $x$ by noting that

\[
x_n = y_n, \qquad x_i = y_i - u_i x_{i+1}, \quad i = n-1, n-2, \ldots, 1 \tag{2.8}
\]


The two steps of (2.7) and (2.8) are often called
forward-elimination and backward-substitution. The
recurrences (2.3), (2.7) and (2.8) constitute a complete solution
for $Ax = k$, and a sequential algorithm to perform such a
solution is the so-called Gaussian elimination algorithm.
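
As an illustration only, the three recurrences can be written as a short sequential routine. This is a minimal Python sketch under the assumption stated in 2.1 that no pivoting is required; the function name `solve_tridiagonal` and the 0-based list layout are ours, not the paper's.

```python
def solve_tridiagonal(a, b, c, k):
    """Solve Ax = k for tridiagonal A without pivoting.

    a[i], b[i], c[i] are the sub-, main- and super-diagonal entries of row i
    (0-based); a[0] and c[n-1] are ignored.
    """
    n = len(b)
    u = [0.0] * n  # u[i], recurrence (2.3); only u[0..n-2] are used
    y = [0.0] * n  # forward elimination, recurrence (2.7)
    if n > 1:
        u[0] = c[0] / b[0]
    y[0] = k[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * u[i - 1]  # shared by (2.3) and (2.7)
        if i < n - 1:
            u[i] = c[i] / denom
        y[i] = (k[i] - a[i] * y[i - 1]) / denom
    x = [0.0] * n  # backward substitution, recurrence (2.8)
    x[n - 1] = y[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = y[i] - u[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 3x3 example whose exact solution is x = (1, 2, 3).
    print(solve_tridiagonal([0.0, 1.0, 1.0], [2.0, 3.0, 2.0],
                            [1.0, 1.0, 0.0], [4.0, 10.0, 8.0]))
```

Each loop iteration depends on the result of the previous one, which is the sequential character the parallel methods below try to remove.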


2.3 Cyclic Reduction Technique 


In this paper, we make no attempt to survey all parallel
algorithms for tridiagonal linear equation solvers, but review
only two well-known methods which are based on cyclic
reduction. This discussion will motivate the pipelined solution
presented in this paper.


2.3.1 Recursive Doubling Algorithm 


The recursive doubling algorithm proposed by Stone [16]
began with the observation that the formulas required by LU
factorization, such as (2.7) and (2.8), are first-order linear
recurrences (FLR). Equation (2.3) appears not to be a
linear recurrence. However, if we introduce a new variable $q_i$
such that $u_i = -q_i/q_{i+1}$, then (2.3) can be transformed into the
following second-order linear recurrence (SLR):

\[
a_i q_{i-1} + b_i q_i + c_i q_{i+1} = 0, \quad i = 2, 3, \ldots, n-1 \tag{2.9}
\]

where $q_1 = 1$, $q_2 = -b_1/c_1$.
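
The substitution can be checked numerically. The sketch below is ours and purely illustrative: it computes $u_i$ once directly from (2.3) and once through the second-order recurrence (2.9). The function names are hypothetical, and the lists are padded at index 0 so that the code can keep the 1-based indices of the text.

```python
def u_direct(a, b, c):
    """u_1..u_{n-1} from recurrence (2.3); index 0 of every list is unused padding."""
    n = len(b) - 1
    u = [None] * n
    u[1] = c[1] / b[1]
    for i in range(2, n):
        u[i] = c[i] / (b[i] - a[i] * u[i - 1])
    return u


def u_via_q(a, b, c):
    """The same u_i recovered from q_i, using (2.9) solved for q_{i+1}."""
    n = len(b) - 1
    q = [None] * (n + 1)
    q[1] = 1.0
    q[2] = -b[1] / c[1]
    for i in range(2, n):
        q[i + 1] = -(a[i] * q[i - 1] + b[i] * q[i]) / c[i]
    return [None] + [-q[i] / q[i + 1] for i in range(1, n)]


if __name__ == "__main__":
    a = [None, 0.0, 1.0, 1.0, 1.0]
    b = [None, 4.0, 4.0, 4.0, 4.0]
    c = [None, 1.0, 1.0, 1.0, 0.0]
    print(u_direct(a, b, c)[1:])
    print(u_via_q(a, b, c)[1:])
```

In this example the $q_i$ grow quickly in magnitude (1, -4, 15, -56), so a practical implementation would need to guard against overflow for large n; that concern is outside the scope of this sketch.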


Stone pointed out that (2.9) can be transformed into a
first-order linear recurrence except that the sequence is now a
set of vectors instead of scalar values. The recursive doubling
algorithm uses the standard cyclic reduction method for handling
linear recurrences to solve them.


It is helpful to review the cyclic reduction technique for
solving linear recurrences before we outline its disadvantages.
For instance, we consider the evaluation of the sequence of $x_i$
from the following first-order linear recurrence relation.

\[
x_i = a_i x_{i-1} + b_i \quad \text{for } i = 2, \ldots, n \tag{2.11}
\]

where $x_1$ and the coefficients $a_i$ and $b_i$ are known values. The basic
idea of the standard cyclic reduction technique is to back up the
recurrence in (2.11) such that a new recurrence can be
obtained which relates every other term, every fourth term,
every eighth term, etc.


For example, from (2.11) we have

\[
x_i = a_i^{(1)} x_{i-2} + b_i^{(1)} \tag{2.12}
\]

where $a_i^{(1)} = a_i a_{i-1}$, $b_i^{(1)} = a_i b_{i-1} + b_i$. The superscript (1)
denotes the fact that this is a first level backup. Such a backup
process can be repeated (in a cyclic fashion) and we obtain a
set of equations as follows.


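\[
x_i = a_i^{(l)} x_{i-2^l} + b_i^{(l)} \tag{2.13}
\]

where

\[
a_i^{(l)} = a_i^{(l-1)} a_{i-2^{l-1}}^{(l-1)} \tag{2.14}
\]

\[
b_i^{(l)} = a_i^{(l-1)} b_{i-2^{l-1}}^{(l-1)} + b_i^{(l-1)} \tag{2.15}
\]

with $l = 0, 1, \ldots, \log_2 n$, $i = 2, 3, \ldots, n$. An important observation is
that if any of $a_i$, $b_i$ or $x_i$ is outside the defined range, its value
can be taken as zero. Therefore, when $l = \log_2 n$, all $x_i$ are
solved by

\[
x_i = b_i^{(\log_2 n)}
\]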
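
The backup process of (2.14) and (2.15) can be sketched in a few lines of Python. This is an illustrative sequential simulation, not the paper's code: the function names are ours, the initial value is folded in as $x_1 = 0 \cdot x_0 + x_1$ so that out-of-range terms count as zero, and the inner loop stands for the updates that could all be carried out in parallel at one level.

```python
def recurrence_sequential(x1, a, b):
    """x_1 = x1 and x_i = a_i * x_{i-1} + b_i; a and b hold the coefficients for i = 2..n."""
    xs = [x1]
    for ai, bi in zip(a, b):
        xs.append(ai * xs[-1] + bi)
    return xs


def recurrence_cyclic(x1, a, b):
    """Same sequence by repeated backup, following (2.14) and (2.15)."""
    A = [0.0] + list(a)  # A[0] = 0 encodes x_1 = 0 * x_0 + x1
    B = [x1] + list(b)
    n = len(B)
    step = 1             # 2**(l-1) at level l
    while step < n:
        new_A, new_B = A[:], B[:]
        # Every i is independent of the others: one parallel phase per level.
        for i in range(step, n):
            j = i - step
            new_A[i] = A[i] * A[j]          # (2.14)
            new_B[i] = A[i] * B[j] + B[i]   # (2.15)
        A, B = new_A, new_B
        step *= 2
    return B             # B[i] now holds x_{i+1}


if __name__ == "__main__":
    print(recurrence_sequential(1.0, [2.0, 3.0], [1.0, 1.0]))
    print(recurrence_cyclic(1.0, [2.0, 3.0], [1.0, 1.0]))
```

The number of entries that still change shrinks at every level (entries below `step` are already final), which is the decreasing useful parallelism noted in the observations that follow.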


We can make the following observations:


(1) high parallelism exists at certain phases (steps) of
the algorithm, i.e., at a fixed level $l$, (2.14) and (2.15)
can be evaluated for all $i$ in parallel;

(2) the parallelism grows roughly linearly with the size
of the vectors, i.e. $n$ in this case;

(3) the useful parallelism decreases as the computation
progresses through different phases.


The amount of parallelism varies between phases of
computation. This increases the difficulty of fully utilizing
the parallelism of the machine.
2.3.2 Odd-even Reduction Algorithm


The odd-even cyclic reduction algorithm is perhaps the most
successful cyclic reduction algorithm applied to solve
tridiagonal systems [2]. It starts directly from the system of
equations defined by (2.1), i.e.,

$a_i x_{i-1} + b_i x_i + c_i x_{i+1} = k_i$,  $i = 1, 2, \ldots, n$ *

The algorithm first eliminates the odd numbered variables in
the even numbered equations by performing elementary row
operations. In each level, we cut down the total number of
equations by 1/2, hence, in $\log n$ levels, the middle element
$x_{n/2}$ can be computed directly from the coefficients. The
remaining unknowns can be found by a refilling procedure.
This algorithm also involves the recursive calculation of
coefficients for equations at each level. One important
advantage of the odd-even reduction over the recursive
doubling algorithm is that it reduces the number of operations
considerably at each level, and the total number of operations
is on the order of O(n).
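As a rough illustration of the reduction and refilling phases, the following Python sketch solves a tridiagonal system recursively. It is an assumption-laden sketch, not the vectorized form used on the Cyber 205 or Cray: the name `odd_even_reduction` is hypothetical, indices are 0-based (so the 1-based "even numbered" equations are the odd list positions), `a[0]` and `c[-1]` are assumed to be zero, and no pivoting is done, so all diagonal entries met along the way must be nonzero.

```python
def odd_even_reduction(a, b, c, k):
    """Solve a[j]*x[j-1] + b[j]*x[j] + c[j]*x[j+1] = k[j] (0-based lists)."""
    n = len(b)
    if n == 1:
        return [k[0] / b[0]]
    ra, rb, rc, rk = [], [], [], []
    # Reduction: each retained equation j uses only its two neighbours,
    # so all j at one level could be processed in parallel.
    for j in range(1, n, 2):
        has_next = j + 1 < n
        alpha = a[j] / b[j - 1]
        gamma = c[j] / b[j + 1] if has_next else 0.0
        ra.append(-alpha * a[j - 1])
        rb.append(b[j] - alpha * c[j - 1] - (gamma * a[j + 1] if has_next else 0.0))
        rc.append(-gamma * c[j + 1] if has_next else 0.0)
        rk.append(k[j] - alpha * k[j - 1] - (gamma * k[j + 1] if has_next else 0.0))
    y = odd_even_reduction(ra, rb, rc, rk)  # half-size tridiagonal system
    x = [0.0] * n
    for idx, j in enumerate(range(1, n, 2)):
        x[j] = y[idx]
    # Refilling: recover the eliminated unknowns from their own equations.
    for j in range(0, n, 2):
        left = a[j] * x[j - 1] if j > 0 else 0.0
        right = c[j] * x[j + 1] if j + 1 < n else 0.0
        x[j] = (k[j] - left - right) / b[j]
    return x
```

The system size halves at every level, which is why the total operation count stays linear in n while the available parallelism shrinks from level to level.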


One major difficulty with odd-even reduction is the data
rearrangement of variable and coefficient vectors between
phases of computation. For example, on the Cyber 205 one
cannot apply vector operations directly to every other
element of a vector. Thus extra operations must be
employed to reformat those elements into a new vector [14].
On the Cray it is possible to access elements of a vector at a
fixed increment, but this may result in a performance
degradation [12]. Because of the overhead of data
rearrangement, the cyclic reduction algorithm may run slower
than a serial algorithm for sufficiently small n [15].


Another problem is the degree of variation of parallelism
between different phases of computation. Because the
parallelism decreases very rapidly, this problem becomes more
serious than that for the recursive doubling algorithm. A
parallel version of the odd-even reduction algorithm has been
proposed to keep a high parallelism throughout the
computation. However, it increases the number of operations
significantly to O(n log n) [11].
* In the remaining discussion of this section, we assume n is a power of 2, but
this is not an essential assumption.


3. A Pipelined Solution Scheme for Linear Recurrences
3.1 Overview of the Pipelined Solution Scheme


In the cyclic reduction scheme, the goal is to increase the speed
by fully exploiting the parallelism in the original
problem. High concurrency is obtained by replicating the
operations as much as necessary to compute all elements in
the result vector in parallel. In contrast, we propose a new
solution method which can explore and organize the
parallelism in a way that best matches a suitable computer
architecture, i.e., the static data flow architecture. It is based
on a maximally pipelined code mapping scheme developed in
our previous work [3, 6].


Two forms of parallelism exist in a data flow machine level
program, as shown in Figure 1, which consists of seven actors
divided into four stages. In Figure 1 (a), actors 1 and 2 are
enabled by the presence of tokens on their input arcs, and thus
can be executed in parallel.* This is called spatial parallelism.
Spatial parallelism also exists between actors 3 and 4, and
between actors 5 and 6. The second form of parallelism is
pipelining. In static data flow architecture, this means
arranging the machine code such that successive computations
can follow each other through one copy of the code. If we
present a sequence of values to the inputs of the data flow
graph, these values can flow through the program in a
maximally pipelined fashion, i.e., input/output values are
consumed/produced at the maximum rate allowed by the
machine architecture. In the configuration of Figure 1 (b),
two sets of tokens are pipelined through the graph, and the
actors in stages 1 and 3 are enabled and can be executed
concurrently. Thus, the two forms of parallelism are fully
exploited in the graph.


[Figure 1 graphic: seven actors arranged in Stage 1, Stage 2, Stage 3, Stage 4]


Fig. 1. Pipelining of data flow programs.


The power of pipelined computation in a data flow computer
can be derived from machine-level programs that form a large
pipeline in which many actors in various stages are executed
concurrently. Each actor in the pipe is activated in a totally
data-driven manner, and no explicit sequential control is
needed. With data values continuously flowing through the pipe,
a sustained parallelism can be efficiently supported by the
data flow architecture.


3.2 Maximally Pipelined Mapping of Linear Recurrences 


An attractive way to implement recurrence on data flow 
computers is to introduce feedback paths in the data flow 
graph. This, however, presents particular problems when
maximum pipelining of the program is desired. A direct
translation of the first-order recurrence is shown in Fig. 2. The
value $x_i$ depends on the value of $x_{i-1}$; therefore, a feedback
path, such as the one marked in the graph, is generated. The
key is to understand the role of the merge operator (denoted
by M in Fig. 2): (1) under the merge control input values
(<FT...T>), the initial output value of the loop is taken from
the second input of the merge, i.e., $x_1$; (2) the upper output of
M is routed under the feedback control values, i.e., <T...TF>,
therefore all but the last element of the array will be fed back;
and (3) the lower output of the merge is forwarded as the
output of the loop unconditionally. Due to the existence of
cycles, the data flow graph produced by such a scheme, in
general, cannot be fully pipelined. More specifically, the
feedback link between the output of cell 3 and the input of
cell 1 prevents the whole graph from being fully pipelined.


[Figure 2 graphic; label: Feedback Path]


Fig. 2. The pipelined mapping of a first-order linear recurrence.


The problem of the above example and its solution have been
studied by the author in [3, 6]. The problem is essentially a
mismatch between the dependence delay, i.e., the dependence
inherent in the recurrence ($x_i$ depends on $x_{i-1}$, therefore a
two-stage feedback delay is required), and the computational
delay, i.e., the actual length of the loop in the data flow graph
generated by the direct translation scheme (3 stages in this
example). In [8], the author described a solution for such a
problem on a static data flow computer, based on the concept
of companion functions [3, 13]. It is essentially a way to remove
the dependence of $x_i$ on $x_{i-1}$, thus easing the feedback
constraints in order to match the computational delay of the
data flow graph. For the above example, we have:


$x_1 = b_1$
$x_2 = a_2 b_1 + b_2$


$x_i = a_i a_{i-1} x_{i-2} + a_i b_{i-1} + b_i$  where $i \ge 3$    (3.1)


* A solid disk on an arc represents the presence of a token. 


This transformation is interesting to us because $x_i$ now
depends on $x_{i-2}$ instead of $x_{i-1}$. Therefore, we can map our
example, now expressed as in (3.1), into a data flow graph as
shown in Fig. 3. Note that we have introduced two additional
pipelines $a_i^{(1)}$ and $b_i^{(1)}$ as denoted by the dotted lined box C,
where


$a_i^{(1)} = a_i a_{i-1}$


$b_i^{(1)} = a_i b_{i-1} + b_i$  where $i \ge 3$.


This added pipeline is named the companion pipeline in [3],
and its structure is shown in Fig. 4. To understand how the
scheme works we first examine the loop in Fig. 3. The role of
the merge operator is as before except that two initial values
are presented to the second input of the merge, i.e., $x_1 = b_1$, $x_2
= a_2 b_1 + b_2$. The ID cell plays the role of a FIFO of size 1,
and is inserted to tune the computational delay in the
feedback path to match exactly the dependence delay. The
two boxes in Fig. 4 are also FIFOs, and they can introduce
the proper skew needed in the pipelining of array operations [6].
The rest of the graph is self-explanatory and the reader should
be convinced that the graph is maximally pipelined.


[Figure graphic; label: Feedback Path]


Fig. 4. The companion pipeline of example (3.1).


The transformation shown in the above example is equivalent
to dividing the first-order linear recurrence into two
equivalence classes of computation. In fact, since $x_i$ now
depends on $x_{i-2}$, we may split the sequence of results into two
subsequences:


$X' = x_1, x_3, x_5, \ldots, x_{2k-1}, \ldots$


$X'' = x_2, x_4, x_6, \ldots, x_{2k}, \ldots$


Either sequence can be computed independently by sharing
the same loop, and the companion pipeline provides the
appropriate input coefficients.
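A small Python sketch may help to check the one-level backup (3.1) and the split into two independent subsequences. It is a serial model only and says nothing about token timing in the data flow graph; the function names and the 0-based lists (index 0 holds $x_1$) are assumptions of this sketch. The dictionaries `a1` and `b1` play the role of the companion pipeline outputs.

```python
def serial_recurrence(a, b):
    """Direct evaluation of x[i] = a[i]*x[i-1] + b[i], with x[0] = b[0]."""
    x = [b[0]]
    for i in range(1, len(b)):
        x.append(a[i] * x[i - 1] + b[i])
    return x


def backed_up_recurrence(a, b):
    """Evaluate the same sequence through the one-level backup of (3.1)."""
    n = len(b)
    # Companion coefficients: they depend only on the inputs a and b,
    # never on x, so they can be streamed ahead of the loop.
    a1 = {i: a[i] * a[i - 1] for i in range(2, n)}
    b1 = {i: a[i] * b[i - 1] + b[i] for i in range(2, n)}
    x = [None] * n
    x[0] = b[0]
    if n > 1:
        x[1] = a[1] * b[0] + b[1]
    # Two interleaved chains: each element needs only x[i-2] of its own chain.
    for parity in (0, 1):
        for i in range(parity + 2, n, 2):
            x[i] = a1[i] * x[i - 2] + b1[i]
    return x
```

For example, with `a = [0, 2, 3, 4, 5]` and `b = [1, 1, 1, 1, 1]` both functions give the same sequence, and the two parity chains never read each other's results.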


The advantage of this scheme is obvious. First, it does not
require data rearrangement during the computation. In fact,
it even eliminates the requirement that the input vectors be
completely filled before the computation starts. If the input
coefficient vectors themselves are generated by some
preceding code block in a pipelined fashion, the
producer-consumer type of interface technique (as described
in [6]) can be applied to save considerable storage space for
the intermediate values. Moreover, we observe that the
degree of parallelism remains constant (5 floating point
arithmetic operations and several other operations) in the
computation.* The high throughput is achieved by the
maximally pipelined execution of each actor in the data flow
program, hence, the storage usage by the machine code is very
efficient. Finally, there is no essential limitation on the
length of the vectors which can be computed.


4. A Maximally Pipelined Method 


Starting from LU-decomposition, we can observe that the
major equations, such as (2.7) and (2.8), are first-order linear
recurrences. Therefore, the pipelined method as described in
the last section can be applied directly. Now consider (2.9)
which can be transformed into the following second-order
linear recurrence:


$q_i = \alpha_i q_{i-1} + \beta_i q_{i-2}$,  $i = 3, 4, \ldots, n$    (4.1)


where $q_1 = 1$, $q_2 = -b_1/c_1$ and


$\alpha_i = -b_{i-1}/c_{i-1}$,


$\beta_i = -a_{i-1}/c_{i-1}$


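Performing one level backup we obtain

$q_i = \alpha_i^{(1)} q_{i-2} + \beta_i^{(1)} q_{i-3}$,  $i = 4, 5, \ldots, n$    (4.2)

where $q_1 = 1$, $q_2 = -b_1/c_1$, $q_3 = b_1 b_2/(c_1 c_2) - a_2/c_2$, and


$\alpha_i^{(1)} = \alpha_i \alpha_{i-1} + \beta_i$


$\beta_i^{(1)} = \alpha_i \beta_{i-1}$,  $i = 4, 5, \ldots, n$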
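The backed-up coefficients follow from substituting the second-order recurrence into itself once. The short derivation below is a worked check, written with $\alpha_i$, $\beta_i$ for the two coefficients of the recurrence for $q_i$ (an assumption about the garbled symbols in the scanned text):

```latex
\begin{aligned}
q_i &= \alpha_i q_{i-1} + \beta_i q_{i-2} \\
    &= \alpha_i \left( \alpha_{i-1} q_{i-2} + \beta_{i-1} q_{i-3} \right) + \beta_i q_{i-2} \\
    &= \underbrace{\left( \alpha_i \alpha_{i-1} + \beta_i \right)}_{\alpha_i^{(1)}} q_{i-2}
     + \underbrace{\alpha_i \beta_{i-1}}_{\beta_i^{(1)}} q_{i-3}
\end{aligned}
```

After the backup, $q_i$ no longer depends on its immediate predecessor $q_{i-1}$, which is what relaxes the feedback constraint in the loop.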


Fig. 5 shows a maximally pipelined data flow machine level
program for mapping (4.2). The loop in the middle of Fig. 5
can easily be understood by noting its similarity with the loops
in Fig. 3. The code in the dotted lined box is the companion
pipeline generating values for $\alpha_i^{(1)}$ and $\beta_i^{(1)}$. The node labeled
N performs a negation of its input. The boolean value
sequences C0 - C5 can be found in Fig. 7. The boxes denote
the FIFO buffers which are introduced for balancing the
graph [4, 6] to achieve maximum pipelining, and the number
written inside the box is the number of stages in that buffer. It
is easy to check that Fig. 5 correctly computes (2.9) and it is
maximally pipelined.


[Figure 7 graphic. Boolean control sequences listed in the figure:]
C0: FT...T
C1: T...TF
C2: FT...T
C3: T...TF
C4: FFFT...T
C5: T...TFF
C6: FT...T
C7: FFT...T
C8: T...TFF


Fig. 7. Pipelined tridiagonal linear equation solver, Part 3.


* Here the variation of the parallelism during the start and finish time of the
computation is not considered.


We rewrite the first-order linear recurrence in (2.7) as


$y_i = g_i y_{i-1} + h_i$,  $i = 2, 3, \ldots, n$    (4.3)


where $y_1 = k_1/b_1$, $g_i = -a_i/(b_i - a_i u_{i-1})$, $h_i = k_i/(b_i - a_i u_{i-1})$.
Performing one level backup we obtain


$y_i = g_i^{(1)} y_{i-2} + h_i^{(1)}$,  $i = 3, 4, \ldots, n$    (4.4)


where $y_1 = k_1/b_1$, $y_2 = (-a_2 k_1 + b_1 k_2)/((b_2 - a_2 u_1) b_1)$,
and


$g_i^{(1)} = g_i g_{i-1}$

$h_i^{(1)} = g_i h_{i-1} + h_i$


Fig. 6 shows a maximally pipelined mapping of (4.4). The
dotted lined box is the companion pipeline and the boolean
sequences C1, C2, C7, C8 can be found in Fig. 7.


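Finally, (2.8) can be conveniently treated as a first-order linear
recurrence by introducing new variables $\bar{x}_i$, $\bar{y}_i$, $\bar{u}_i$ such that
$\bar{x}_i = x_{n-i+1}$, $\bar{y}_i = y_{n-i+1}$, $\bar{u}_i = u_{n-i+1}$. Hence, (2.8) can be
rewritten as


$\bar{x}_i = r_i \bar{x}_{i-1} + s_i$,  $i = 2, 3, \ldots, n$    (4.5)


where $\bar{x}_1 = \bar{y}_1$, $r_i = -\bar{u}_i$ and $s_i = \bar{y}_i$. We can note that (4.5) is
a standard first-order linear recurrence, hence we can solve it
by one level backup:


$\bar{x}_i = r_i^{(1)} \bar{x}_{i-2} + s_i^{(1)}$,  $i = 3, 4, \ldots, n$    (4.6)


where $\bar{x}_1 = \bar{y}_1$, $\bar{x}_2 = -\bar{u}_2 \bar{y}_1 + \bar{y}_2$, and


$r_i^{(1)} = r_i r_{i-1}$

$s_i^{(1)} = r_i s_{i-1} + s_i$


Fig. 7 shows a maximally pipelined mapping of (4.6).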


Now we have constructed a complete pipelined machine code
structure for a maximally pipelined tridiagonal solver as shown
by Fig. 5 - Fig. 7. Fig. 5 and Fig. 6 can be combined into one
maximally pipelined data flow graph by observing that the
sequence of values of $u_i$ produced by Fig. 5 can be directly fed
into Fig. 6. The outputs of Fig. 5 and
Fig. 6 cannot, however, be connected directly to the inputs of Fig. 7.
The main reason is that the order in which the elements $y_i$ and
$u_i$ are generated by Fig. 5 and Fig. 6 is opposite to the order in
which they are used for the maximum pipelining of Fig. 7.
Hence, we should first store the values of $u_i$ and $y_i$ into memory.
Then, Fig. 7 will access the arrays in reverse order. The code
in Fig. 5 and Fig. 6 will sustain a constant parallelism such that
20 floating point operations and a number of other
operations are concurrently in pipelined operation. When
only the code of Fig. 7 is in execution, the parallelism will be
reduced to a constant of 5. Although there is such a change of
degree of parallelism between the forward elimination and
backward substitution phases of the computation, the
parallelism remains entirely stable during each phase, hence
is easily handled by the processors. Furthermore, in
Fig. 5 - Fig. 7, the pattern of runtime data routing is regular,
thus eliminating the data rearrangement problem of cyclic
reduction. Moreover, it essentially can work for tridiagonal
systems regardless of their size, hence has more flexibility and
generality than the cyclic reduction scheme.
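To see the three backed-up recurrences working together, here is a hedged, purely serial Python sketch of the whole solver. Several things are assumptions of this sketch rather than statements from the paper: the function name, the 1-based internal arrays, the relation $u_i = -q_i/q_{i+1}$ (equation (2.9) is not reproduced in this section; the relation is merely consistent with $q_1 = 1$, $q_2 = -b_1/c_1$ and the usual elimination $u_i = c_i/(b_i - a_i u_{i-1})$), and back substitution of the form $x_i = y_i - u_i x_{i+1}$. The super-diagonal entries $c_1 \ldots c_{n-1}$ must be nonzero, and no attention is paid to numerical stability.

```python
def backed_up_tridiagonal_solve(a, b, c, k):
    """Solve a[i]x[i-1] + b[i]x[i] + c[i]x[i+1] = k[i] via one-level backups."""
    n = len(b)
    a, b, c, k = ([None] + list(v) for v in (a, b, c, k))  # shift to 1-based

    # Second-order recurrence for q, backed up one level.
    al = [None] * (n + 1)
    be = [None] * (n + 1)
    for i in range(2, n + 1):
        al[i] = -b[i - 1] / c[i - 1]
        be[i] = -a[i - 1] / c[i - 1]
    q = [None] * (n + 1)
    q[1] = 1.0
    if n >= 2:
        q[2] = al[2]
    if n >= 3:
        q[3] = al[3] * q[2] + be[3] * q[1]
    for i in range(4, n + 1):
        q[i] = (al[i] * al[i - 1] + be[i]) * q[i - 2] + (al[i] * be[i - 1]) * q[i - 3]
    u = [None] * (n + 1)
    for i in range(1, n):
        u[i] = -q[i] / q[i + 1]  # assumed relation between q and u

    # Forward elimination, backed up one level.
    g = [None] * (n + 1)
    h = [None] * (n + 1)
    for i in range(2, n + 1):
        d = b[i] - a[i] * u[i - 1]
        g[i], h[i] = -a[i] / d, k[i] / d
    y = [None] * (n + 1)
    y[1] = k[1] / b[1]
    if n >= 2:
        y[2] = g[2] * y[1] + h[2]
    for i in range(3, n + 1):
        y[i] = (g[i] * g[i - 1]) * y[i - 2] + (g[i] * h[i - 1] + h[i])

    # Back substitution on the reversed sequences, backed up one level.
    r = [None] * (n + 1)
    s = [None] * (n + 1)
    for i in range(1, n + 1):
        s[i] = y[n - i + 1]
        if i >= 2:
            r[i] = -u[n - i + 1]
    xb = [None] * (n + 1)
    xb[1] = s[1]
    if n >= 2:
        xb[2] = r[2] * xb[1] + s[2]
    for i in range(3, n + 1):
        xb[i] = (r[i] * r[i - 1]) * xb[i - 2] + (r[i] * s[i - 1] + s[i])
    return [xb[n - i + 1] for i in range(1, n + 1)]
```

The reversal in the last phase mirrors the point made above: the values of $u_i$ and $y_i$ are produced in increasing order of $i$ but consumed in decreasing order, so they have to be stored before the third loop can start.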


The reader may wonder whether the 5- to 20-fold parallelism
available in the pipelined algorithm can satisfy the appetite
of a supercomputer. We argue that the major concern should
be how the parallelism in the algorithm can be most
effectively used by a suitable architecture. First, the new
scheme maintains a relatively constant amount of parallelism
and a relatively simple data routing pattern. Thus, the resource
management and allocation problems are easier to handle,
thereby providing better opportunity for parallel processing
when the machine has extra power. Second, one is often faced
with solving a set of m independent tridiagonal systems (say,
m = 64), as frequently occurs in the solution of PDEs [1]. In
this case, the new scheme can be best used by generating m
independent pipelines, one for each system, to obtain 20 x m fold
parallelism (more than 1000 if m = 64). Finally, the new
scheme is flexible enough to be extended to obtain more
parallelism when such a requirement does occur [9].


5. Summary and Discussions


In developing new parallel algorithms, great importance
is attached not only to achieving the greatest computation
rate, but to numerical stability as well. The
stability aspect of the pipelined tridiagonal solver has been
studied by the author in a second companion paper [10].


The entire code for the pipelined tridiagonal linear equation 
solver has been translated into the machine code of a 
proposed static data flow supercomputer, and preliminary 
simulation results indicate that the projected maximally 
pipelined throughput can be sustained for each of the three
loops. 
The pipelined solution scheme has several important
advantages over other existing parallel algorithms. Although
the primary target machine used in this paper is a static data
flow computer, we expect that the principle can also be
applied to other data flow computers, such as the dynamic
data flow machine [1], although a different perspective of
pipelining may be required [7]. It is my belief that the basic
ideas may also be useful for a conventional parallel machine
architecture. A compiler which can perform the automatic
program transformation to implement the pipelined solution
scheme is an interesting challenge to the compiler construction
for such computers.
allel processors; D.1.1 [Applicative (Functional) Programming Techniques].
ternary trees, Gaussian elimination, Pivot Step.


Abstract 


The quadtree representation of matrices is explored, 
particularly as it admits a parallel matrix inversion algo- 
rithm. A version of Gaussian Elimination (full matrix piv- 
oting) is described as an applicative program which min- 
imizes process dispatch by folding the pivot search into 
the preceding pivot operation. The tree structure incorpo- 
rates incremental decomposition (for arbitrary, but small, 
numbers of processors), aids in load balancing, and pro- 
vides a uniform representation for both scalars and sparse 
matrices that eliminates all compatibility/bounds checking
within the important algorithms. Like other algorithms 
particularly suited to larger problems (where parallelism 
pays off), it may be used at the higher level in a hybrid 
strategy, for example, over pipelined vector-processing on 
smaller, conventionally represented submatrices. 


Section 1. Introduction 


Consider a scenario of parallel or multi-processing
with realistic constraints. A machine with p processors is 
available to implement a matrix algebra package for sparse 
matrices of size, say, n x n. Restrictions are that 1 < p << n
and that the cost to dispatch/recover a processor is signif- 
icantly greater than the cost to perform simple arithmetic. 


Those restrictions [12] preclude some popular solu- 
tions wherein processes are dispatched whenever a proces- 
sor might be (wished) available and often on processes that 
are so simple as to be trivial. For an algorithm to be useful 
under these restrictions, it must admit isolation of substantial
subprocesses, sufficiently high in the computation tree
(of the chosen algorithm) in order to assure that the p
processors can be loaded using as few process dispatches as 
possible while balancing the load and avoiding duplication 
of effort. To do that requires identification of independent, 
arbitrarily large processes as high in that tree as possible— 
particularly for the cases where p is indeed very small. 
What is really needed first is an algorithm presenting
such a tree with the suggested decomposition properties.
This paper presents such an algorithm, matrix inversion
via classic Gaussian elimination, over a new data
structure—the quadtree representation for matrices—in a 
purely applicative style [3, 5]. This formulation lends itself 
to satisfying the restrictions set forth above, regardless of 
the particular values of n and p. 


Applicative programs are necessarily presented as ex- 
pressions without assignment statements or control state- 
ments except for (pure) function application. Such pro- 
grams implicitly solve the problem of decomposition into 
independent processes [4] because each subexpression with- 
in the program is necessarily independent; therefore, the 
syntax tree may be strongly associated with the tree for 
process decomposition. Intermediate binding still allows 
intermediate results to be shared—rather than recomputed. 
While these matrix results are interesting and useful as 
they stand, an implicit goal of this paper is to encour- 
age further study of algorithms for parallel computation 
through the philosophy enforced by applicative (or func- 
tional) programming style. 

Gaussian elimination is hardly new, so what can be 
said that is really novel? Certainly, no improvement to 


well-studied asymptotic behavior will be offered. Three 
results, however, are offered that, together, promise a thor- 
oughly practical algorithm under the envisioned constraints, 
whether or not their ultimate realization follows the disci- 
pline of applicative programming. 

First, the relatively new idea of quadtrees for repre- 
senting matrices [10, 11] is developed further, in a way 
that unifies our approach both to matrix/scalar algebra 
and to sparse/dense matrix manipulation. All scalars, z, 
also represent diagonal matrices, blurring the distinction 
between scalar and matrix, between sparse representation 
and dense representation. We shall see that one family of 
algorithms handles all pathologies. 


Second is a version of matrix-inversion via Gaussian 
Elimination, through Pivot Step [6] with the pivot element 
selected as the largest candidate (in magnitude) from the 
entire matrix before each pivoting. Known to be most sta- 
ble, this algorithm is also shown to present a pattern of par- 
allelism. Each pivot step naturally decomposes by quad- 
rants which, most interestingly, is a pattern suitable for the 
search tree for identifying the next (largest in magnitude) 
pivot candidate. There is, therefore, no question whether 
a search for the next pivot element should be implemented 
using parallelism or on a cheaply-dispatched uniprocessor. 


While such a search surely could use parallel processing, 
eliminating the explicit search phase by folding it into each 
(sparse) pivot saves search time, and better amortizes the 
overhead to dispatch each pivoting process. 


Finally, an interesting relationship between padding
and processor allocation is proposed to balance the load
across independent processes. The quadtree matrix
representation appears to be suited only to representation of
$2^m \times 2^m$ matrices. When the size of a matrix is not a power
of two, some padding is necessary which only wastes space
proportional to m. What is interesting is that, having
embedded an $n \times n$ matrix in (the lower right of) a $2^m \times 2^m$
one, processor allocation can make profitable use of the
value of $(2^m - n)$ in partitioning the p processors among
subproblems. The argument is cast in terms of familiar
matrix operations.


The remainder of this paper is in four parts. The 
quadtree representation is introduced first, including dis- 
cussions of its restriction to vectors and generalizations to 
higher dimensions. The second section explores a Gaus- 
sian elimination algorithm under the new parallelism, as 
outlined above. The third offers a strategy for allocation 
of processor resources to discount the padding that might 
be necessary to fill out the quadtree representation. The 
final section offers some conclusions and hopes for further 
work. 


Section 2. Matrices 


Let any d-dimensional array be represented as a $2^d$-ary
tree. Here we consider only matrices and vectors, where 
d = 2 suggests quadtrees, and d = 1 suggests binary trees. 


Matrix algorithms will be arranged so that we may
(without loss) perceive any scalar, z, as a diagonal matrix
of arbitrary size, entirely of zeroes except for z's on
the main diagonal; that is, $z = [z\delta_{i,j}]$. Thus, a domain is
postulated that coalesces scalars and matrices, with every
scalar-like object conforming also as a matrix of any size.
Of particular interest are 0 and 1, which are at once the
unique additive and multiplicative identities, respectively,
for scalar/matrix arithmetic. Similarly, the scalar z as a
binary tree is interpreted as a vector of arbitrary length,
each of whose components is z (much like Daisy's [7] notation
(z*).) Inferring the conventional meaning from such a
matrix now requires additional information (viz. its size),
but we can proceed quite far without size information; it
only becomes critical upon Input or Output.


Lest it appear that this coalescing of hitherto disjoint
types hides too much, it is useful to draw an analogy
from ordinary computation on floating-point numbers
(FPNs), where details of internal representation are also
suppressed. The point is that the way that quadtree
matrices (and FPNs) are commonly represented outside
the machine has little to do with their internal representation.
There are, in fact, conventional styles for writing
matrices on paper, in row-major order (and for writing
FPNs in scientific notation), but these may differ wildly
from the way that they (and FPNs) are to be represented
within the machine. Although the algorithms for translating
between such representations are elegant, they are so
complicated that they are surely not the first thing that
should be shown to those unfamiliar with the nature of
these internal representations. Thus, an unfortunate barrier
rises in the path before those who would tinker on
them: one must first write the I/O translators, which are
among the least comprehensible, least efficient, and least
exercised programs over the structure.


1 2 0 0 0
3 1 2 0 0
0 3 1 2 0
0 0 3 1 2
0 0 0 3 1


Figure 1. A 5x5 band matrix ... quadtree representation.


A matrix (of otherwise-known size) is either a vague
‘scalar’ or it is a quadruple of four equally-sized submatrices.
So that this recursive cleaving works smoothly, we
embed a matrix of size $n \times n$ in a $2^{\lceil \lg n \rceil} \times 2^{\lceil \lg n \rceil}$ matrix,
justified at the lower, right (southeast) corner with
zero padding to the north and west, except for nonzeroes
(preferably ones) padded along the northwest diagonal (to
avoid introducing an unnecessary singularity; see Figure
1.) The matrix is justified to the southeast, rather than
the northwest, so that its eliminant [2] is properly defined.


There is also a normal form convention. Under this
quad representation, no submatrix will ever be composed
of four ‘scalar’ quadrants, of whom the northeast and southwest
are zero, and whose northwest and southeast coincide.
Such a matrix would be represented, instead, by the latter
‘scalar,’ standing alone. Thus, the two important
identity/annihilator matrices are represented uniquely by 1 and
0. If we require that the northwest padding, as in the previous
paragraph, is necessarily one, then a canonical form
results.


Elsewhere [10] I observed how algorithms for matrix 
addition, transpose, and multiplication follow the desirable
pattern of decomposition, generally into a power of 4 or of 8
independent processes that can be dispatched high in the 
computation tree, up to the capacity of the execution en- 
vironment. Of note is the role of 1 and, especially, of 0 
as a constituent quadrant to such an operation. When 
any addend’s quadrant is 0, the effort for matrix addition 
immediately simplifies by 25% because it is, therefore, un- 
necessary to descend and to traverse the corresponding 
quadrant of the other addend; all we need is a borrowed 
reference to it as one quadrant of the result. A factor’s 
quad of 1 reduces Strassen’s decomposition [9] of Gaussian 
multiplication from eight recursive multiplications to six; 
not only does a 0 quad similarly annihilate two recursive 
multiplications, but also it avoids two of the four subse- 
quent additions, as well. 
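The role of 0 in addition, and of the normal form, can be illustrated with a toy Python model. This is only a sketch under assumptions made here: a quadtree is either a Python number (a scalar standing for a diagonal matrix of any size) or a 4-tuple `(nw, ne, sw, se)`; the names `add`, `normalize` and `is_scalar` are hypothetical; and Python merely stands in for the applicative language the paper has in mind.

```python
def is_scalar(m):
    """A quadtree is either a number or a 4-tuple (nw, ne, sw, se)."""
    return not isinstance(m, tuple)


def normalize(nw, ne, sw, se):
    """Collapse four scalar quadrants of the form (z, 0, 0, z) to the scalar z."""
    if all(is_scalar(q) for q in (nw, ne, sw, se)) and ne == 0 and sw == 0 and nw == se:
        return nw
    return (nw, ne, sw, se)


def add(x, y):
    """Matrix addition over quadtrees; operands may differ in depth."""
    # A zero addend costs nothing: borrow a reference to the other operand.
    if is_scalar(x) and x == 0:
        return y
    if is_scalar(y) and y == 0:
        return x
    if is_scalar(x) and is_scalar(y):
        return x + y
    # A nonzero scalar z conforms as the diagonal matrix with quadrants (z, 0, 0, z).
    if is_scalar(x):
        x = (x, 0, 0, x)
    if is_scalar(y):
        y = (y, 0, 0, y)
    # The four quadrant sums are independent and could be dispatched in parallel.
    return normalize(*(add(p, q) for p, q in zip(x, y)))
```

For instance, `add(0, M)` returns `M` itself without descending into it, and adding `(1, 5, 0, 1)` to `(2, -5, 0, 2)` collapses to the scalar 3 because the off-diagonal quadrants cancel.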


These properties are particularly valuable for matrices
with regular patterns of non-zero entries, especially those
that are sparse or in diagonal form. No special code is necessary
to accelerate conventional operations on them (but
one can wish for hardware that accelerates specialized tests
like the ubiquitous tests for 0 submatrices.) It is, however,
necessary to maintain the normal form so that, like rational
numbers being always "reduced to lowest terms",
matrices are reduced to their corresponding scalars whenever
zero southwest/northeast quadrants and coincident
northwest/southeast quadrants permit.


Another advantage shows up upon deeper study of 
several matrix manipulation programs over quadtrees: al- 
gorithms written to accept canonical-form operands need 
not be sensitive to the usual compatibility requirements. 
That is, ordinary quadtree algorithms for various opera- 
tions will work regardless of the depth of their tree/oper- 
ands; when operands are of different depth, the shorter 
paths—ending at a scalar—will be interpreted as all-con- 
forming, diagonal matrices. Although their meaning may 
be questionable, the algorithms will run to completion in- 
stead of crashing with array-bounds-violations. In terms of 
domain theory, we have raised these operations to a higher 
point in their respective function-domain (given them more 
meaning by defining results where others fail because of in- 
compatibility), while simplifying the code (by eliminating 
all the incompatibility tests and signaling thereof). 

A “header” above each matrix quadtree might usefully
contain two values needed for output translation: the
length of the diagonal padding, and the exponent, m, for
a $2^m \times 2^m$ matrix. The value of m also suffices for run-time
compatibility checking. Another bit there indicates
whether the quadtree is to be interpreted as transposed,
recursively interchanging southwest/northeast quadrants
upon any access. Thus, not only does quadtree representation
allow us to transpose an entire matrix in constant
time (at the cost of building a new header) but also it
allows row and column traversal at equally high efficiency,
at the cost of symmetric-order traversal [6] of the appropriately
projected binary tree.
Figure 2. Knuth's Problem 2.2.6-18 [6, p. 556], worked.


Section 3. Pivot Step and Inversion 


This section describes an algorithm for matrix inver- 
sion, extended from Knuth’s [6] for an entirely different 
data structure (also sparse). His terminology is used be- 
cause it is readily available. Figure 2 outlines this algo- 
rithm applied to his problem, and should be compared 
against the solution that he provides. The algorithm is 
Gaussian elimination with full pivoting (although Knuth 
only selects pivots rowwise.) Much of the description ap- 
plies to a single Pivot Step, but its most interesting prop- 
erty relates one such step to the next. It is easily simpli- 
fied to one that only finds the root of the linear equation, 
Az = y, for Matrix A and Vector y. 


Let a non-singular matrix, M, be represented with a 
quadtree as described in the previous section. Each non- 
terminal node, however, is to be decorated with additional 
information: the magnitude of the largest element in that 
quadrant, and its local horizontal and vertical coordinates. 
(It will be necessary to qualify the decoration further to 
exclude any element in an already-pivoted row or column 
from decorations. At this point, however, assume that no 
pivoting has yet occurred.) These coordinates need not 
be traditional indices—though it is easy to think of them 
that way. A cheaper implementation is just a pair of bits 
at each node selecting one subtree; the catenation of these 
subtrees identifies a path through each quadtree to the 
appropriately largest scalar. 


Let us consider an algorithm to invert M, a matrix
represented as a canonical-form quadtree, padded along
its northwest diagonal as described in the previous section.
The first step in computing the inverse of M, therefore, is
to traverse its tree representation in postorder, installing
these decorations at all internal nodes. (Sibling subtrees,
of course, may be traversed in parallel.) Call the decorated
matrix $M_0$. Decorations appear in florets in Figure
2; to save space, however, only the local maxima (no local
coordinates) are shown.
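The postorder decoration pass can be sketched as follows. The representation is an assumption of this sketch: quadtrees are Python numbers or 4-tuples `(nw, ne, sw, se)`, a decorated node is a small dictionary, and the two-bit quadrant selector is stored as an index 0..3. The qualification about already-pivoted rows and columns is ignored here, and a scalar leaf is simply decorated with its own magnitude.

```python
def decorate(m):
    """Postorder pass: attach the largest magnitude and the quadrant holding it."""
    if not isinstance(m, tuple):
        return {"mag": abs(m), "pick": None, "quads": None}
    # Sibling subtrees are independent, so these four calls could run in parallel.
    quads = [decorate(q) for q in m]
    pick = max(range(4), key=lambda i: quads[i]["mag"])
    return {"mag": quads[pick]["mag"], "pick": pick, "quads": quads}


def pivot_path(node):
    """Catenate the per-node selectors into a path to the largest scalar."""
    path = []
    while node["pick"] is not None:
        path.append(node["pick"])
        node = node["quads"][node["pick"]]
    return path
```

With quadrants numbered 0 (northwest), 1 (northeast), 2 (southwest), 3 (southeast), the path read from the root identifies the next pivot candidate without a separate search.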


Also needed are two trivial binary trees, each initially
the scalar 0 indicating a boolean vector of all zeroes, and
one trivial quadtree, $P_0$, also initially 0, similarly. For
each $i$, Quadtree $P_i$ indicates the exact position of the
first $i$ pivot elements (none so far); it will be filled in to
become a permutation matrix, $P_{2^m}$, by the time the fully
pivoted matrix, $M_{2^m}$, is computed. The boolean vectors
indicate which rows/columns have already been pivoted;
they will be filled with 1's until they both indicate that
all rows/columns have been eliminated. The two vectors,
therefore, are merely row (column) projections from the
corresponding $P_i$, but in a format useful for directing
subsequent decorations.


Having established the initial values of M_i and P_i, for
the next of 2^m pivot steps, we discover that the decoration
at the root of this tree identifies the next pivot element.
We presume here that M_i is not a scalar. If it were (and
thereby identified as the pivot scalar), then its reciprocal
is the pivoted matrix.


Otherwise, M_i may be decomposed into quad-
rants, each distinguished by the decoration. One is
pivotquad, distinguished because the pivot element lies within.
Another is rowquad, named because it lies horizontally
from pivotquad. The third is columnquad, because it lies
above or below pivotquad. The last is offquad, so called
because it does not coordinate on pivotquad, but lies diag-
onally from it.


The description that follows discusses the transforma-
tions on the four quadrants in reverse order from that just
above. It turns out that offquad's transformation is sim-
plest, but it does depend on intermediate results derived
from processing the other three quadrants. So many addi-
tional results are needed from handling pivotquad, more-
over, that its description will be considerably eased by
working backwards in order to justify them.


An important property of functions is that one invo- 
cation may return several results of differing types. This 
feature has been lost in many programming languages (per- 
haps because their designs presume that a computer has 
but one accumulator,) but it is critical to applicative style 
and allows one function invocation to return all these in-
termediate results.


Knuth's transformation of the matrix,

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix},$$

where a is the pivot element, b is any element in the pivot
row, c is any element in the pivot column, and d is any off
pivot element coordinating on b and c, is

$$\begin{pmatrix} 1/a & b/a \\ -c/a & d - bc/a \end{pmatrix},$$

respectively. This is the transformation to be made, though
it is to be done here recursively and (likely) in parallel. 
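Before the recursive treatment, it may help to see the same step on an ordinary dense matrix. The sketch below is sequential and uses a list-of-rows representation, both assumptions made for illustration; it replaces the pivot a by 1/a, each pivot-row element b by b/a, each pivot-column element c by -c/a, and every other element d by d - bc/a. The name pivot_step is hypothetical.

```python
from fractions import Fraction

def pivot_step(m, r, c):
    """One pivot step on a dense square matrix m (list of rows) about m[r][c]."""
    a = Fraction(m[r][c])
    n = len(m)
    out = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == r and j == c:
                out[i][j] = 1 / a                             # a -> 1/a
            elif i == r:
                out[i][j] = m[i][j] / a                       # b -> b/a
            elif j == c:
                out[i][j] = -m[i][j] / a                      # c -> -c/a
            else:
                out[i][j] = m[i][j] - m[r][j] * m[i][c] / a   # d -> d - bc/a
    return out

# Pivot first on the largest element (4, at row 1, column 1), then on what remains.
step1 = pivot_step([[1, 2], [3, 4]], 1, 1)
step2 = pivot_step(step1, 0, 0)
print(step2)
```

Because both pivots here lie on the diagonal, step2 is already the inverse of the original matrix; off-diagonal pivots additionally require the permutation discussed at the end of this section.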


The transformation of offquad requires the values of
d contained therein, and two vectors of values b/a and c
coordinating on each d, occurring in the pivot row and col-
umn, respectively. Since these two vectors, represented as
binary trees in normal form, occur in rowquad and column-
quad they will have been extracted as intermediate results
while processing those quadrants. If either of these vectors
is zero, however, then the transformation collapses to the
identity function; all values of bc/a = 0 and not even the
internal decorations in offquad change, as neither the newly
eliminated pivot row nor pivot column cross it. (This is
the savings of sparse representation for Pivot Step.)


When offquad is a scalar, d, then (even if it is a normalized
representation of a larger matrix, notably if d = 0, and b/a
and c are vectors) the correct transformation is the
decorated form of d's difference with their outer product,
d - bc/a. In order to decorate the difference, the appro-
priate halves of the boolean vectors identifying rows and
columns eliminated from M_i are necessary parameters. (If
d is scalar and these vectors are non-trivial, then d must
be expanded into

$$\begin{pmatrix} d & 0 \\ 0 & d \end{pmatrix}$$

and decomposed, as below.)
When d - bc/a is a scalar, either zero or its absolute value
becomes its decoration, depending on whether it lies in an
already pivoted row/column, or not.


When offquad is decomposed into four quadrants, then 
each of these is treated as an offquad, with the vector pa- 
rameters cleaved in half to provide the four sets of vector 
arguments, and the four results decorated and normalized 
into the transformed quadtree. 
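The recursion on offquad can be sketched as follows. The sketch is deliberately reduced: quadtrees are nested 4-tuples (nw, ne, sw, se), vectors are nested pairs, a zero subvector is collapsed to the scalar 0, and decoration, normalization and the expansion of a scalar d are all omitted. These representations and the name off are assumptions made here, not the appendix code.

```python
def off(d, col, row):
    """Transform an off-pivot quadrant d.

    d   : scalar, or 4-tuple (nw, ne, sw, se) of subquadrants
    col : the values c from the pivot column (scalar, or pair (top, bottom))
    row : the values b/a from the pivot row  (scalar, or pair (left, right))
    """
    if col == 0 or row == 0:
        return d                      # identity: bc/a = 0 throughout this quadrant
    if not isinstance(d, tuple):
        return d - col * row          # scalar case: d - bc/a
    nw, ne, sw, se = d
    top, bottom = col                 # cleave the vector parameters in half
    left, right = row
    return (off(nw, top, left), off(ne, top, right),
            off(sw, bottom, left), off(se, bottom, right))

# Pivot-column values (2, 0) and scaled pivot-row values (1, 3):
# the southern half is untouched because its column entry is zero.
result = off((1, 2, 3, 4), (2, 0), (1, 3))
print(result)
```

The four recursive calls share no data, which is why all offquads may be dispatched in parallel.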


The treatments of rowquad and columnquad are simi-
lar, so only that of the former is presented here; the latter's
is nearly dual to what follows. The treatment of rowquad
requires the inverse of the pivot element, a, and a vector
that is the portion of the pivot column that lies in piv-
otquad, but for the pivot element, itself. It is sufficient to
represent both as a copy of the pivot-column vector with
1/a in place of the pivot element (which will have been
extracted during the treatment of pivotquad). Also needed
are a relative index locating the pivot row and halves from
the boolean vectors indicating which rows/columns cross-
ing rowquad have already been eliminated.


Results are the transformed rowquad and half of the
pivot row (as a binary tree) extracted from rowquad for
use in handling offquad. That vector is also needed for
handling most of rowquad itself because, unless rowquad
is trivial, it must be decomposed into two subrowquads and
two suboffquads, the latter of which coordinates on that
half of the pivot row.


Thus, the transformation of rowquad focuses first on 
the pivot row. When the row index indicates that rowquad
(call it b) is entirely within the pivot row, then the residue
column vector is just 1/a, and the needed results (matrix 
and row-vector) are each the product, b/a. Decorated as a 
matrix, the local maximum is 0 because this row is being 
eliminated. 


If rowquad is to be decomposed into submatrices, then 
the local index will indicate whether the pivot row crosses 
the upper or lower half. Accordingly, the pivot column and 
boolean vectors (identifying already eliminated rows/col-
umns) are split, and those two containing the pivot row are
treated (as rowquads) first. The pieces of the pivot row,
extracted thereby, are joined at a binary node to become
the vector-result of treating rowquad, and are passed as
arguments with the other two quadrants for treatment as 
offquads. Then these four quadrants are assembled and 
decorated into the matrix-result of this treatment. 


Finally, we consider the treatment of pivotquad, upon
which all the other three quadrants' treatments depend;
it must occur first and yield as a partial result the trans-
formed, decorated version of pivotquad; and as intermedi-
ate results: the row and column indices of the pivot ele-
ment, and vector copies of the pivot column (with 1/a in
place of the pivot element) and pivot row (with -1/a in
place of the pivot element). Moreover, while locating the
pivot element, updated versions for P_i, the permutation
matrix (which will only change in the quadrant correspond-
ing to pivotquad), and for its projections, the two boolean
vectors of eliminated rows/columns should be constructed.
That's eight results, four of which are intermediate (not
to be included directly in an answer, but to be used in
treating sibling quadrants).


Three observations complete this description. First,
the treatment of pivotquad is the same as the treatment
of the whole matrix, M_i; the only difference is the un-
needed, extra four results, beyond M_{i+1}, P_{i+1}, and the
two boolean vectors projected therefrom. Secondly, the
arguments to each Pivot Step (and pivoting successive
pivotquads) are M_i, P_i, and the two boolean projections from
P_i (and the corresponding quadrants/halves therefrom).
The last three arguments are used as seeds for updated
results, and the last two also help to place 0 decorations.

Finally, if M_i (correspondingly, pivotquad) is a scalar,
a, then all eight results are trivial: M_{i+1} = 1/a and is
decorated as 0 (now that both its row and column have
been eliminated); P_{i+1} = 1 and both its projections are 1,
also; the pivot column is 1/a, and the pivot row is -1/a;
and both relative indices are 1.


When M_i is not scalar, it decomposes into four quad-
rants, one of each type considered above. Algorithms for
three of them (rowquad, columnquad, offquad) have been
discussed above, and the fourth (pivotquad) is to be treated
recursively as a Pivot Step, with basis stated in the preceding
paragraph.


The parallelism in this algorithm manifests itself in
the interdependence of these recursive decompositions. For
instance, treatment of successive candidates for pivotquad
must precede transformation of all other quadrants; the
depth of this recursion is at most m. After each is com-
pleted, its associated rowquad and columnquad may be dis-
patched simultaneously, and, of these, half must be trans-
formed before the other half. Again, the depth of recursion
is at most m. Thereafter, however, all offquads at all lev-
els of the quadtree may be treated simultaneously. They
generate most of the effort in a Pivot Step, and their
transformations are mutually independent.


It should be clear that a pivot step can change lots 
of decorations from those in M_i; is there, then, sufficient
information to restore them? Yes, because decorations will 
only change where scalar values have changed, or because 
a local maximum is disqualified because it resides in ei- 
ther the current pivot row or current pivot column. Such 
decorations have already been visited by this algorithm! 
(This point is most important when inverting sparse ma- 
trices, where little traversal is necessary.) We need only 
arrange that, as each interior node in the quadtree (that 
becomes the pivoted matrix) is reassembled, it must be re- 
decorated with the appropriate maximum and local coor- 
dinates. Therefore, the position of each scalar encountered 
is resolved against the boolean vector indicating already- 
eliminated rows and columns; if it is to be excluded, treat 
its magnitude as zero for the purposes of finding the max-
imum local magnitude. If all four local maxima are zero,
then the subtree is decorated with zero (and the local co- 
ordinates may be left undefined.) 


As the four new quadrants are reassembled, it is neces- 
sary to find the maximum of their decorations and its two- 
bit coordinates, according to which of the four quadrants 
it came from. Internal zero decorations do not propagate, 
because some decoration must be positive if the original
matrix was non-singular. 
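The redecoration rule may be sketched on a single 2 x 2 node whose leaves are scalars. Boolean lists stand in for the binary trees of eliminated rows and columns; that simplification, and all function names, are assumptions made here for illustration.

```python
def leaf_decoration(x, row_done, col_done):
    """Magnitude of scalar x, or 0 if its row or column is already eliminated."""
    return 0 if (row_done or col_done) else abs(x)

def join_decorations(nw, ne, sw, se):
    """Combine four child decorations into (maximum, ibit, jbit)."""
    best = max(nw, ne, sw, se)
    if best == 0:
        return (0, None, None)          # coordinates may be left undefined
    quadrants = (((0, 0), nw), ((0, 1), ne), ((1, 0), sw), ((1, 1), se))
    for (ibit, jbit), mag in quadrants:
        if mag == best:
            return (best, ibit, jbit)   # two-bit coordinates of the winning quadrant

def decorate_2x2(block, rows_done, cols_done):
    """Decoration of a 2 x 2 node of scalars under the elimination vectors."""
    mags = [leaf_decoration(block[i][j], rows_done[i], cols_done[j])
            for i in (0, 1) for j in (0, 1)]
    return join_decorations(*mags)

block = [[5, -7], [2, 3]]
print(decorate_2x2(block, [False, False], [False, False]))
print(decorate_2x2(block, [True, False], [False, False]))
```

At an interior node the same join_decorations would be applied to the four children's stored maxima, so each reassembled node costs only three comparisons.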


That completes the description of a single pivot step, 
M_i -> M_{i+1}. It only remains to observe that the results of
one step, including the pivoted, decorated matrix, the two 
vectors (binary trees) of eliminated rows and columns, and 
the building permutation matrix are passed from one step 
directly along to the next. There is no need to search for
the next pivot element, because it has already been located. 
Moreover, it has been located by parallel processes already 
dispatched for the pivot step, itself, in parallel. Thus, there 
is no dispatch/recovery overhead for the parallel search! 


Finally, observe that the desired inverse, M^{-1}, is read-
ily available after permutations:


$$M^{-1} = P_{2^m}^{T} \times M_{2^m} \times P_{2^m}^{T}.$$


In fact, the code in the appendix builds up P_i^T rather than
P_i, anticipating this transpose.

This entire algorithm proceeds on non-singular matri-
ces without any counters; even the outer control over the
2^m successive pivot steps may be set up as a loop until the
decoration becomes zero. In some sense, then, it is more 
abstract, and more useful, than algorithms that depend 
heavily on size declarations and bounds tests. 
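The whole of this section can be mirrored on a dense matrix in a few lines. The sketch is sequential, and an explicit search for the largest remaining magnitude stands in for the decoration at the root of the quadtree; those are assumptions of the illustration, as are the names invert, matmul and perm_t. It loops until that maximum is zero, records each pivot position in the transposed permutation matrix, and finally permutes on both sides.

```python
from fractions import Fraction

def pivot_step(m, r, c):
    """Knuth's pivot step on a dense matrix about m[r][c]."""
    a = m[r][c]
    n = len(m)
    return [[1 / a if (i, j) == (r, c)
             else m[i][j] / a if i == r
             else -m[i][j] / a if j == c
             else m[i][j] - m[r][j] * m[i][c] / a
             for j in range(n)] for i in range(n)]

def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def invert(matrix):
    n = len(matrix)
    m = [[Fraction(x) for x in row] for row in matrix]
    row_done, col_done = [False] * n, [False] * n
    perm_t = [[0] * n for _ in range(n)]       # transpose of the permutation matrix
    while True:
        # Stand-in for the root decoration: largest magnitude outside
        # already-eliminated rows and columns.
        candidates = [(abs(m[i][j]), i, j) for i in range(n) for j in range(n)
                      if not row_done[i] and not col_done[j]]
        mag, r, c = max(candidates, default=(0, None, None))
        if mag == 0:
            break                               # loop until the decoration is zero
        m = pivot_step(m, r, c)
        row_done[r] = col_done[c] = True
        perm_t[c][r] = 1                        # P has its 1 at (r, c)
    if not all(row_done):
        raise ValueError("singular matrix")
    return matmul(matmul(perm_t, m), perm_t)

a = [[0, 2, 1], [1, 0, 0], [3, 1, 5]]
print(matmul(a, invert(a)))
```

No counter appears in invert beyond what the dense representation itself requires; termination is decided by the decoration alone.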


Section 4. Subprocess Balancing 

Suppose an n × n matrix is embedded in a 2^m × 2^m
matrix, where k = 2^m - n and 0 ≤ k < 2^{m-1}, as in Figure 3.
Section 2 suggested that the values of k and m might need 
to be available to the system at run-time, even though 
they remain unnecessary to (in particular) the abstract 
multiplicative operators. This section proposes another 
use for this same information: load balancing among the 
processors.


Figure 3. Areas within quadrants,
excluding padding of k rows/columns. 


The following discussion does not depend on dense- 
ness or sparseness of matrices. It does presume that the 
distribution of non-zero entries is uniform; if more patterns 
are known, then further inferences might be possible. The 
values of m and k indicate the size of a matrix and what 
proportion of it is trivial. From them we can determine 
what portion of each of the four quadrants is serious, i.e.
likely to cause serious effort for the processors, and we can 
use this information to distribute the p processors among 
the four quadrants. 


Consider matrix addition, for instance. The quad re- 
cursion pattern is simple; each quadrant requires addition 
effort in proportion to its “serious” area (Figure 3). The 
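serious area of such a matrix is (2^m - k)^2. It is divided up
among its four quadrants in the following proportions:


Quadrant     Relative share of area
Northwest    (2^{m-1} - k)^2 / (2^m - k)^2
Northeast    2^{m-1} (2^{m-1} - k) / (2^m - k)^2
Southwest    2^{m-1} (2^{m-1} - k) / (2^m - k)^2
Southeast    2^{2m-2} / (2^m - k)^2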


It is unlikely that these proportions have integer prod- 
ucts with p. If only a fraction of a processor is available 
to a quadrant, a good solution is to combine quadrants on 
shared processors in a way that the individual requests for 
a processor sum to an integer. A likely grouping is to set 
the northwest, northeast, and southwest sums (the partial 
quadrants) on one processor, and to set the southeast sum 
alone on another. 


Therefore, if one had p processors to add such matri- 
ces, one could use this information to distribute them to 
four quadrant processes in these proportions in a top-down
pattern. As mentioned in the introduction, it is important 
that processor allocation be done as high in the data struc- 
ture/computation tree as possible, so that the overhead of 
process dispatch/recovery not be paid repeatedly. 
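A small calculation makes the allocation concrete. The sketch assumes that the k padding rows and columns occupy the northwest corner, so that the serious areas of the quadrants are (2^{m-1} - k)^2, 2^{m-1}(2^{m-1} - k) twice, and 2^{2m-2}; the function names are hypothetical.

```python
from fractions import Fraction

def quadrant_shares(m, k):
    """Relative share of the serious area held by each quadrant."""
    half = 2 ** (m - 1)
    if not 0 <= k < half:
        raise ValueError("expected 0 <= k < 2**(m-1)")
    total = (2 ** m - k) ** 2
    return {
        "northwest": Fraction((half - k) ** 2, total),
        "northeast": Fraction(half * (half - k), total),
        "southwest": Fraction(half * (half - k), total),
        "southeast": Fraction(half * half, total),
    }

def processor_requests(p, m, k):
    """Number of processors (possibly fractional) each quadrant would request."""
    return {quadrant: p * share
            for quadrant, share in quadrant_shares(m, k).items()}

# A 6 x 6 matrix padded to 8 x 8 (m = 3, k = 2), added on 9 processors.
requests = processor_requests(9, 3, 2)
print(requests)
```

With these particular values every request is an integer; in general the requests are fractional and quadrants must be grouped on shared processors as described above.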


In this way it is likely that larger quadrants (south-
east) get more processors. When a quadrant receives but
one processor to compute its result, it operates as a unipro- 
cessor; if it receives more than one, it can apply the same 
idea to divide up its processor resource once again, and so 
forth. The first three quadrants make the most interest- 
ing further processor allocations; the southeast quadrant 
is presumed to be uniform and so its share will just be 
divided in even fourths. 


These same proportions could apply to the Pivot Step 
algorithm above. Unlike addition, however, we saw that 
there was some serial behavior to Pivot Step. Thus, (using 
terminology from before) rowquad and columnquad may be
dispatched together, but they are only two of four quad-
rants. Nevertheless, their computational effort may also
be approximated by the relative size of two areas (diag- 
onal from each other), in the table above. Although we 
may not know which processors until run time, we may 
select the proper proportion then and split the processor 
resources in that manner. 


Gaussian matrix multiplication (under Strassen's for-
mulation [9]) easily decomposes into eight products, which
are pairwise summed to form a four-quadrant answer. Ex-
cluding the effect of scalars (i.e. a quadrant of 1 or 0
avoids a quadrant multiplication) and asserting that the
O(n^3) algorithm does require a processor resource propor-
tional to (2^m - k)^3, we determine the proportion of this
resource that each of the eight products needs. This was
done by setting two matrices, as in Figure 3, as multipli-
ers and extrapolating the effort to multiply each of eight
sub-products from the areas of the sub-multipliers (quad-
rants).


The eight ratios are best described by combining them 
pairwise, as their associated products are to be added, to 
yield the proportion of effort invested in building each of 
the four sums that become the matrix product. The four 
ratios coincide with those already derived above! Further- 
more, proportions associated with each of the eight prod- 
ucts may be obtained by multiplying each of these four 
ratios by 2^{m-1}/(2^m - k) and by (2^{m-1} - k)/(2^m - k), re-
spectively. Thus, the four-quadrant ratios are exactly as
before, and the eight-quadrant ratios are uniform exten- 
sions from these. 
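The coincidence can be checked by a short derivation. Write h = 2^{m-1} and n = 2^m - k, and assume (as for addition) that the padding lies in the northwest, so that the serious parts of the quadrants of either multiplier measure (h-k) x (h-k), (h-k) x h, h x (h-k), and h x h. The symbols h, A, B and C are introduced here only for illustration, and the effort of multiplying a p x q block by a q x r block is taken to be pqr.

```latex
\begin{align*}
C_{NW} &= A_{NW}B_{NW} + A_{NE}B_{SW}: & (h-k)^3 + (h-k)^2 h &= (h-k)^2\, n,\\
C_{NE} &= A_{NW}B_{NE} + A_{NE}B_{SE}: & (h-k)^2 h + (h-k) h^2 &= h(h-k)\, n,\\
C_{SW} &= A_{SW}B_{NW} + A_{SE}B_{SW}: & h(h-k)^2 + h^2(h-k) &= h(h-k)\, n,\\
C_{SE} &= A_{SW}B_{NE} + A_{SE}B_{SE}: & h^2(h-k) + h^3 &= h^2\, n.
\end{align*}
```

Dividing each right-hand side by the total effort n^3 gives (h-k)^2/n^2, h(h-k)/n^2, h(h-k)/n^2, and h^2/n^2, the same four shares as for addition. Within each sum the first product carries the fraction (h-k)/n of the effort and the second carries h/n.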


Such process allocation could be determined statically 
at compile time [8] when the language requires matrices to 
be of declared, constant size and uniform sparseness. The 
algorithms proposed here do not require bounds declara- 
tions, but if they were available, it would be possible to 
avoid much communication with, and system saturation 
of, a dynamic scheduler. 


Section 5. Conclusions 


Because we assumed that the number of processors, 
p, is small compared to the size of the matrix, n, we need 
to cleave matrix manipulations into a few subprocesses of 
balanced size, so that the resource p can be allocated de- 
liberately. If one cleaving does not consume all the pro- 
cesses, the pieces may be further split. Avoiding repeated 
dispatch and recovery, these algorithms have the virtue of 
splitting at the base of the quadtree—at the root of the 
problem.


Although these algorithms are described as if only 
scalars could be leaves of the quadtree, that arrangement 
is not necessary. It is perfectly possible that sizable matri- 
ces dwell at the leaves, matrices that might be represented 
in traditional row-major order, and manipulated using tra- 
ditional iterative and pipelining algorithms, programmed 
in FORTRAN et fils. (Pipelines would need an extra in-
put stream of bits, identifying candidates for updated local
maximum, and each could then compute updated vector
decorations during a vector update.) The size of such leaf 
matrices should be chosen to balance the efficiencies of 
existing style and existing machinery against the obvious 
need for multiprocessing techniques to accelerate large ma- 
trix computation. 


Derived from a purely applicative approach, these algorithms bet-
ter suit a computing environment with many processors,
with memory banked at varying distances from the differ- 
ent processors, and with contention for access to shared 
resources as a real constraint on efficiency. The quadtree 
approach shows us how to represent matrices so that use- 
ful pieces may be localized where some processor can make 
computational headway on the problem without excessive 
interference from all its brethren. It has long been known 
[4] that applicative style solves the problem of decompos- 
ing an extant algorithm; the problem addressed here is the 
discovery of good algorithms that cleave into usefully sized
pieces.


One last issue raised by this work relates to all sparse 
matrix techniques. I have seen only vague definitions of
what makes matrices sparse, possibly because sparse rep-
resentations are so different from one another, and maybe
because sparseness has been purely an operational concept. 
Quadtrees, as we have seen, offer a reasonable representa- 
tion for sparse matrices that is consistent with what we 
would use for dense matrices; based on that observation, 
alone, I suggest a measure of sparseness for matrices: 
the ratio between its average path length in its quadtree 
representation (from root to leaf) and the logarithm of its 
size. Ratios closer to zero indicate sparser matrices.
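One reading of this measure is sketched below. The text leaves two choices open, and the sketch fixes them by assumption: the average is taken over the leaves of the quadtree, each counted once, and the logarithm is m, the base-2 logarithm of the order of the padded matrix. Quadtrees are nested 4-tuples with scalar leaves; the function names are hypothetical.

```python
from fractions import Fraction

def leaf_depths(tree, depth=0):
    """Yield the depth of every leaf; a tuple of four subtrees is an internal node."""
    if isinstance(tree, tuple):
        for quadrant in tree:
            yield from leaf_depths(quadrant, depth + 1)
    else:
        yield depth

def sparseness(tree, m):
    """Average root-to-leaf path length divided by m = log2 of the matrix order."""
    depths = list(leaf_depths(tree))
    return Fraction(sum(depths), len(depths) * m)

# Two 4 x 4 matrices (m = 2): one fully dense, one with three zero quadrants.
dense = ((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16))
sparse = (0, 0, 0, (1, 2, 3, 4))
print(sparseness(dense, 2), sparseness(sparse, 2))
```

Under these assumptions a fully dense matrix scores 1 and the zero matrix, a single scalar leaf, scores 0.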


Quadtree representation of matrices was motivated by
studies in applicative programming and as part of an effort 
to study its impact on Matrix Algebra for MIMD multi- 
processors. The results of the effort are more than satisfac- 
tory: not only does the technique apply to a well-worked 
problem, but also it yields new insight (through a distri- 
bution of searching across Pivot Step) on optimal hard-
ware/software solutions. Thus, we have more sup-
port for using applicative programming (and its algebra)
for programming parallel architectures and a suggestion 
that those architectures provide a multi-banked heap. 


There is already interest in using quadtrees as the uni-
form representation for matrices in a large computer alge-
bra system [1], a sophisticated piece of software running
on fairly conventional hardware. They also have an im- 
portant role to play within conventional languages used on 
very sophisticated hardware, an experiment that remains 
to be done. 


Acknowledgement: Research reported herein was 
sponsored, in part, by the National Science Foundation 
under Grant Number DCR 84-05241. I would like to thank 
John Lash and Kamal Abdali for ideas and encouragement 
that helped these ideas mature. 


References 
1. S. K. Abdali. Personal communication (1985). 


2. S. K. Abdali & D. D. Saunders. Transitive closure and related
semiring properties via eliminants. Theoretical Computer Science 40, 
2,3 (1985), 257-274. 

3. J. Backus. Can programming be liberated from the von Neu-
mann style? A functional style and its algebra of programs. Comm.
ACM 21, 8 (August, 1978), 613-641.


4. D. P. Friedman & D. S. Wise. Aspects of applicative program- 
ming for parallel processing. IEEE Trans. Comput. C-27, 4 (April, 
1978), 289-296. Preliminary version appeared as: The impact of ap- 
plicative programming on multiprocessing. Proc. 1976 International 
Conference on Parallel Processing, 263-272. 


5. S. D. Johnson. Synthesis of Digital Designs from Recursion 
Equations, M.I.T. Press, Cambridge, MA (1984). 


6. D. E. Knuth. The Art of Computer Programming, I, Funda- 
mental Algorithms, 2nd Ed., Addison-Wesley, Reading, MA (1975), 
299-318 + 401, 556. 


7. A.T. Kohlstaedt. Daisy 1.0 Reference Manual. Tech. Rept. 119, 
Computer Science Dept., Indiana University (November, 1981). 


8. V. Sarkar & J. Hennessy. Compile-time partitioning and schedul-
ing of parallel programs. Proc. SIGPLAN 86 Symp. on Compiler
Construction, SIGPLAN Notices 21, to appear. 


9. V. Strassen. Gaussian elimination is not optimal. Numer. Math. 
13, 4 (August 19, 1969), 354-356. 


10. D. S. Wise. Representing matrices as quadtrees for parallel 
processors (extended abstract). ACM SIGSAM Bulletin 18, 3 (August, 
1984), 24-25. 


11. D. S. Wise. Representing matrices as quadtrees for parallel 
processors. Information Processing Letters 20 (May, 1985), 195-199. 


12. M. F. Young. A functional language and modular arithmetic 
for scientific computing. In Jean-Pierre Jouannaud (ed.), Functional 
Programming Languages and Computer Architecture, Lecture Notes in 
Computer Science 201, Berlin, Springer (1985), 305-318. 


Appendix 


The following examples, all of which evaluate to 1, are useful as 
an introduction to the Daisy [7] code that follows. They exemplify a 
new style of applicative programming that depends on functional com- 
bination and data recursion [5] to specify multiple and interdependent 
results without cognizance of a necessary sequence of evalua- 
tion. All primitives used here, however, have been in Daisy from its 
birth. The forms 


let: [ identifierStruc binding result]
rec: [ identifierStruc binding result]

return result computed in an environment enhanced with identi-
fierStruc bound to binding. Evaluation is lazy, and the list structure
of binding must match the structure of identifierStruc, wherein bound
identifiers become bound according to their position within that list.
Functional combination [4] is indicated by a list structure as a function,
to the left of the colon, the "apply" operation. In the layout below,
one may perceive that the constituent components of that combination
are applied vertically to the (transposed) argument matrix. Thus, the
intermediate result of each of the functional combinations below is that of


(10 (3 1)).


let: [ [sum [quotient remainder] ]
       <add <div rem >>:<
       <6   < 10  10 >>
       <4   < 3   3  >> >
       remainder]
divide = *\[a b].<div:<a b> rem:<a b>> 
let: [ [sum [quotient remainder] ]
       <add divide >:<
       <6   10 >
       <4   3 >>
       remainder]
rec: [ [sum [quotient remainder] ]
       <add divide >:<
       <6   sum >
       <4   3 >>
       remainder]
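For readers unfamiliar with Daisy, functional combination can be imitated in Python. The sketch below is an analogy only: it is eager rather than lazy, so it cannot express the data recursion of the rec example, and the helper combine is a hypothetical name, not a Daisy primitive. Each function is applied to its own column of the argument rows, that is, to a row of the transposed argument matrix.

```python
def combine(funcs, rows):
    """Apply each function to its own column of the argument rows."""
    results = []
    for f, column in zip(funcs, zip(*rows)):
        if isinstance(f, (list, tuple)):
            results.append(combine(f, column))   # nested combination
        else:
            results.append(f(*column))
    return results

def add(a, b):
    return a + b

def divide(a, b):
    """Return quotient and remainder together, like the Daisy divide above."""
    return (a // b, a % b)

# Counterpart of the second example: add sees (6, 4), divide sees (10, 3).
total, (quotient, remainder) = combine([add, divide], [[6, 10], [4, 3]])
print(total, quotient, remainder)
```

One application thus delivers results of differing shapes, a number and a pair, which the destructuring assignment on the left picks apart by position.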
The skeleton of the Daisy code for Pivot Step follows. It presumes
that quadtree-matrix arguments/results are decorated at non-terminal 
nodes. This code has been cut back to remove all provision for normal- 
ized (sparse matrix) representation and for permuting the quadrants. 
Normalized matrices may require expansion of scalar arguments into 
a quadruple (two of which are zero; two of which are the scalar) upon 
function entry, and collapse of such quadruple patterns to the scalar
upon exit. Without provision for permuting, this code will pivot only 
on the northwest-most (upper left) scalar entry; the expanded code 
tests two bits of “decoration” (ibit and jbit), and provides permuta- 
tions on arguments and results to translate to/from this northwest 
orientation. 


Notice that the instances of functional combination in PIVOT, 
ROW, COLUMN, and OFF are 


[PIVOT ROW COLUMN OFF],
[ROW OFF ROW OFF],
[COLUMN OFF COLUMN OFF],
[OFF OFF OFF OFF],


respectively, reflecting the recursive decomposition of each kind of 
quadrant. It is also important that OFF immediately tests whether
either vector argument (the projection from the pivot row or pivot col-
umn) is zero, and acts as an identity function in that instance. Also,
this code builds the transpose of the permutation matrix (P^T from
the paper) directly. 


PIVOT = *)\(decoratedMtx elimrow-col PermutT]. if:< . 
Scalar? :decoratedMtx let: [ inverse reciprocal :decoratedMtx 
< inverse [TRUE TRUE] 1 <negate:inverse inverse> [1 1]> } 
let:{ [[max ibit jbit] ! mtx] 
decoratedMtx 


let: cl {epivot [] [] [eright epot]] (permutHEAD ! Revert] } 


spread4:elimrow-col Permut 
rec: = 
[ {i [eleft etop] permutPIVOT [pleft eee [ipos Jpos] ] 
fe pright) [111 pbot] 
<PIVOT COLUMN ones < 
mtx 

<epivot <eright etop> <eleft ebot> <eright ebot>> 

<permutHEAD ptop pleft <pright pbot>> 

< ipos jpos 7) > > 
let: [ [elimrow-col pivotrow-col] 


<<<eleft eright> <etop ebot>> 
<decorate:<elimrow-col <1 11 111 1v> > 
elimrow-col <permutPIVOT ! permutTAIL> 
pivotrow-col <twice:ipos twice: jpos> > 


<<pleft pright> <ptop pbot>> > 


33> 


ROW = “\[decoratedMtx elimrow-col pivcol index]. 
one?:index let: [ BoverA 
decorateproduct:<elimrow-col pivcol decoratedMtx> 
<BoverA BoverA> ] 
let: [ {iresidue ibit] divide: Brg 2> 
rec:[ [ [i left] Ar right) 
<ROW 


ifs< 


iv J 
OFF OFF >3< 
tall: py eee 
spread4:elimrow-col 
let: [ [top bot] pivcol 
<top top <bot left> <bot right> >] 
<lresidue iresidue iresidue iresidue >> 
< decorate:<elimrow-col <i 11 i141 iv>> <left right> > jJ> 


COLUMN = “)[decoratedMtx elimrow-col prow index]. 
one?: index 
<decorateproduct:<elimrow-col prow decoratedMtx> decoratedMtx> 


ifs< 


let: [ [jresidue jbit] divide:<index 2> 
rec:({ [ [1 top} 41 (111 bot] iv } 
<COLUMN OFF ‘COLUMN OFF >: < 


tail:decoratedMtx 

spread4:elimrow-col 

let: [[left right] prow 
<bot right>>] 

jresidue > > 
<top bot > > }]}> 


<left <top right> Ieft 
<jresidue jresidue jresidue 
< decorate:<elimrow-col <1 11 44141 iv>> 


OFF = “\[decoratedMtx elimrow-col prow-col index] . 
if:<
    anyzero?:prow-col   decoratedMtx
    one?:index          decoratedifference:<elimrow-col decoratedMtx
                            decorateproduct:<elimrow-col prow-col> >
    decorate:<elimrow-col
              <OFF OFF
               OFF OFF >:<
                  tail:decoratedMtx
                  spread4:elimrow-col
                  spread4:prow-col
                  <half:index half:index half:index half:index> > >>
Abstract


A unified class of banyan networks is proposed which 
presents better distance properties than SW-banyans. This new 
class is called SK-banyans for the SKewing scheme that is adopted 
for the connections between nodes at different levels. Its definition 
is given along with the characteristic properties relative to SW- 
banyans. A consideration is given for the virtue of symmetric con-
nection schemes in relation to their distance properties. Examples
of different Characteristic Connection Patterns (CCP’s) for SK- 
banyans are presented. A general construction scheme is proposed 
and a comparison of distance properties of SK- and SW-banyans is 
presented. Optimality criteria have been developed to determine 
whether a given CCP provides the best distance properties. 


1 Introduction 


Interconnection networks have become an important com- 
ponent of parallel or distributed computer systems, and as such 
have stimulated much interest in recent years [16]. Topics such as 
topologies, routing, data manipulation, fault-tolerance and VLSI 
design, to name but a few, have by themselves spawned new 
research areas where so much is yet to be done. 


On the issue of topologies, the topic to which this paper is 
related, new structures have been proposed recently [6,10,16], and
old structures have been the subject of renewed study [11].


Banyan networks [5,14] constitute one such class of the 
many proposed topologies, and reports on their fault-tolerance
properties, resource allocation algorithms and performance evalua- 
tion can be found in the literature [3,4,7,8,12,15]. 


It was with much hesitation that we considered putting for-
ward a paper on "yet another network topology". The reason,
however, for doing so is twofold. First, the SK-banyan represents 
a unification and not a fragmentation of the networks issue, be- 
cause the SW-banyan (and all of its isomorphic counterparts) is a 
member of the class of SK-banyans. Second, significant under-
standing of the issue of network connectivity and distance
properties has been attained by the study of SK-banyans.


Our research focuses on the evaluation of the distance 
properties of a specific class of banyans, motivated by results 
reported in [13] regarding the distance properties of KYKLOS 
multiple-tree networks. These results showed that, by changing the
connections of the second tree in an appropriate and regular way,
the distance properties of the modified double-tree become
better than those of the conventional double-tree. This
represented a break from the traditional symmetric interconnection
schemes of the past, which have led to symmetric redundancy. The
connections, while a break from the past trends, are regular and 
predictable from stage to stage. As the banyan networks have an 
embedded tree structure, they might also show some improvement 
by using this modified connection scheme. It must also be pointed 
out that the modified connections still provide a tree topology be-
tween the apex and base of the structure.


The emphasis on the study of the distance properties of this
new banyan network is justified by the fact that the delay ob- 
served in the transmission of messages across an interconnection 
network is closely related to its distance properties [1]. Since the
apex-distance is always log, N, the distance referred to here is 


from any node to any other node. This is particularly useful when 
studying single-sided networks, because base-to-base distance be- 
comes quite important. Also, the fault-tolerance properties of the 
ICN may show some improvements as well, due to the existence of 
alternative routes. This is true for both KYKLOS and SK-banyans 
where better distance properties are obtained by avoiding redun- 
dant connections between sets of nodes at different levels. 


2 Banyan Networks 


A banyan network is defined in terms of a directed graph 
representation, which is a Hasse diagram of partial ordering with a 
unique path between every base and apex. A base is defined as a 
node of out-degree 0 and an apex is defined as a node with in-
degree 0. In this paper, we assume the convention that levels are
enumerated from apex to base, with level 0 corresponding to the 
apex. Two examples of banyan networks are shown in figure 1. 
The left network is irregular and is restricted to special applica- 
tions, such as mapping an algorithm into a network. The right 
network, being regular, is very appropriate for use in interconnec- 
tion networks, both because data routing algorithms can be easily 
specified, and because of its good properties for data manipula- 
tion, due to its embedded tree structure. A regular banyan is 
characterized by the fact that all of its nodes (with the exceptions 
mentioned above) have the same in-degree (called spread in 
banyan terminology) and the same out-degree (called fanout). 
These parameters are represented by the letters s and f, respec- 
tively. 
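
The unique-path condition in the definition above can be checked mechanically. The following is a minimal sketch in Python, assuming a layered representation that is not taken from the paper: a list of level sizes plus, for each pair of adjacent levels, a list of directed (upper, lower) edges. The names `path_counts` and `is_banyan` are illustrative only.

```python
def path_counts(levels, edges):
    """Count directed paths from every apex to every base.

    levels: list of node counts, level 0 being the apex level.
    edges:  edges[i] is a list of (u, v) pairs linking node u of level i
            to node v of level i+1 (this encoding is an assumption).
    """
    n_apex = levels[0]
    # counts[a][v] = number of paths from apex a to node v of the current level
    counts = [[1 if a == v else 0 for v in range(n_apex)] for a in range(n_apex)]
    for lvl, level_edges in enumerate(edges):
        nxt = [[0] * levels[lvl + 1] for _ in range(n_apex)]
        for u, v in level_edges:
            for a in range(n_apex):
                nxt[a][v] += counts[a][u]
        counts = nxt
    return counts


def is_banyan(levels, edges):
    """True when exactly one path joins every apex to every base."""
    return all(c == 1 for row in path_counts(levels, edges) for c in row)


if __name__ == "__main__":
    # A one-level crossbar with spread 2 and fanout 3.
    crossbar = [[(u, v) for u in range(2) for v in range(3)]]
    print(is_banyan([2, 3], crossbar))

    # Two levels of 2x2 switching that reuse the same bit are not a banyan:
    # some apex-base pairs get two paths and others get none.
    same_bit = [[(u, w) for u in range(4) for w in (u, u ^ 1)]] * 2
    print(is_banyan([4, 4, 4], same_bit))
```

The check only tests the path-uniqueness property; it does not test regularity, which would additionally require equal in-degree and out-degree at the interior nodes.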


Regular banyans can be further classified into different net- 
works, according to the connection scheme applied to the nodes. 
Two of them have been reported in the literature, the SW-banyan 
and the CC-banyan. The former is defined as a recursive expansion 
of a crossbar structure (which itself can be thought of as a 
one-level SW-banyan), by interconnecting $s^{l-1}$ such crossbars and f 
identical $(l-1)$-level SW structures, all with the same fanout and spread.¹ 


SW-banyans can also be further divided into rectangular 
and non-rectangular SW-banyans, the only difference being that 
for the former the spread and fanout are the same. SW-banyans 
are described by a set of numbers in the format (s,f,l) for non- 
rectangular banyans, and by (s,l) or (f,l) for rectangular banyans. 
This latter class of SW-banyans has been shown to be topologically 
equivalent to a number of other proposed topologies [16], and 
as such the results presented here could also be applied to these 
isomorphic topologies. 
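
The (s,f,l) notation fixes the size of every level. As a hedged illustration, assume the top level is made of $s^{l-1}$ crossbars with s apexes each, and that every node has fanout f and spread s. Counting the links between adjacent levels then gives $n_i \cdot f = n_{i+1} \cdot s$, and therefore $n_i = s^{l-i} f^i$. This closed form is derived here from the recursive definition rather than quoted from the paper, and the function names are illustrative.

```python
def level_sizes(s, f, l):
    """Nodes per level of an (s, f, l) SW-banyan, level 0 being the apex level."""
    return [s ** (l - i) * f ** i for i in range(l + 1)]


def links_between_levels(s, f, l):
    """Links between level i and level i+1: every upper node has f outgoing links."""
    return [n * f for n in level_sizes(s, f, l)[:-1]]


if __name__ == "__main__":
    sizes = level_sizes(2, 3, 3)
    print(sizes)                          # nodes at levels 0..3
    print(links_between_levels(2, 3, 3))  # links between adjacent levels
    # Consistency check: links leaving level i equal links entering level i+1.
    assert all(sizes[i] * 3 == sizes[i + 1] * 2 for i in range(3))
```

Because an SK-banyan is required to have the same number of nodes per level and links between levels as the corresponding SW-banyan, the same counts apply to both.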


¹ l is the number of levels in the SW-banyan. 


3 Skewed Banyan Topology 


3.1 Basic principle 


One property of (s,f,l) SW-banyans is that s nodes at level i 
have every one of their links connected to the same f nodes at level 
i+1. In other words, we can say that f nodes at level i+1 are 
connected to the same s nodes at level i. This property is 
illustrated in figure 2 for a (2,3,2) SW-banyan. 


With this connection scheme we can notice that f edge-disjoint 
paths of length 2 exist between the s nodes of every set at 
level i. Although the redundant paths improve the reliability of 
connections between these s nodes, and only between these, they 
represent a bottleneck if we consider their distance to other nodes 
at the same level or to nodes at different levels (the numbers at 
each node in figure 2 represent their distance to the node marked 
'0'). If the connections in the original network are rearranged in 
such a way that two nodes at level i have no more than a single 
node in common at level i+1, the remaining f-1 links of each 
node at level i can be connected to nodes at level i+1 that 
connect to different nodes at level i, rather than back to the original 
two nodes. The basic idea then is to reduce the distance between 
nodes at the same level by avoiding redundant connections between 
them through nodes at level i+1. For the original network 
of figure 2, the connections can be rearranged as shown in figure 
3. Comparing the distance properties of these two connection 
schemes, we can observe some improvement caused by the use 
of a skewing scheme (the numbers represent distances to node '0' 
as in figure 2). 


That these improvements are scalable to larger size networks 
or to networks with different fanouts and spreads can be easily 
verified, and examples in a later section will show this. 


3.2 Definition of SK-banyans 


As an SK-banyan is closely related to an SW-banyan, we will 
define the former in relation to the latter. 


Definition 1: An (s,f,l) SK-banyan is a banyan 
network for which the following conditions hold: 

1. it is regular; 

2. it has the same number of nodes per level as an 
(s,f,l) SW-banyan; 

3. it has the same number of links between levels 
as an (s,f,l) SW-banyan. 


We continue by defining an important property of SK-banyans, 
namely, the existence of two characteristic levels. 


Definition 2: The characteristic levels of an 
(s,f,l) SK-banyan are the two levels that correspond, on 
an (s,f,l) SW-banyan, to the levels whose respective 
connected subgraphs have the following property: 

$$n_i = \begin{cases} s \times f & \text{if } s \le f \\ s \times s & \text{if } s > f \end{cases} \qquad n_{i+1} = \begin{cases} f \times f & \text{if } s \le f \\ f \times s & \text{if } s > f \end{cases}$$

where: 
$n_i$ - number of nodes at level i 
$n_{i+1}$ - number of nodes at level i+1 
s - spread 
f - fanout 


Another related definition is presented as follows. 


Definition 3: The Characteristic Connection 
Pattern (CCP) of an (s,f,l) SK-banyan is an undirected 
graph formed by the characteristic levels and the connection 
pattern between them. 


In figure 4, a (2,3,3) SW-banyan is shown. For this case, s < 
f and then $n_i = s \times f = 6$ and $n_{i+1} = f \times f = 9$. Of all the 
connected subgraphs in this particular banyan, the only ones that 
satisfy Definition 2 are those marked. They will correspond then to 
the characteristic levels on a (2,3,3) SK-banyan. If the connections 
between these levels are rearranged, the particular connection pattern 
chosen will constitute the Characteristic Connection Pattern 
(CCP) of the SK-banyan, as defined. Two of the many possible 
choices for the CCP for a (2,3,l) SK-banyan are shown in figure 5. 


The importance of defining the characteristic levels and the 
characteristic connection pattern will become clear after the fol- 
lowing definition. 


Definition 4: A uniform (s,f,l) SK-banyan is an 
(s,f,l) SK-banyan for which the connection patterns between 
levels are reproductions of the Characteristic 
Connection Pattern, taken on a group-wise basis. 


In figure 6, a (2,3,3) SK-banyan is shown, for which the CCP 
is that shown in figure 5(a), and the nodes at levels 1 and 2 
constitute the characteristic levels. If we look at a group of connections 
between levels 2 and 3, each group with three links, we can 
recognize the same pattern as the CCP. This is what is meant by 
"group-wise" in the above definition. Between levels where $n_i > s \times f$ 
and $n_{i+1} > f \times f$ (for $s \le f$), a group of connections, rather 
than individual connections, will follow the CCP, no matter how 
large the group. 


Since the CCP completely defines a uniform SK-banyan, 
once it is chosen we are able to construct the whole network and 
to define a routing algorithm. Although not proved here, it may 
also be anticipated that non-uniform SK-banyans, where the con- 
nection patterns between levels are not necessarily reproductions 
of the characteristic connection pattern, may be more difficult to 
construct and as complex as irregular banyans for data routing, 
and thus will not be considered here. Also, by Definition 1, SW-banyans 
can be considered a special case of SK-banyans, for which a 
specific CCP is used. 


For simplicity, we will assume in the following discussion 
only the case for s <= f. The case for s > f obeys the same 
principles. 


We now introduce a matrix representation of the CCP, in 
order to facilitate its analysis and classification. The CCP is a 
bipartite graph, where one set has s * f nodes and the second set 
has f * f nodes, as given by Definition 2. The CCP can be further 
divided into s * f bipartite subgraphs, each one with f nodes in 
each of its two sets. The number of different subgraphs is equal 
to f!, and as there are s * f of these subgraphs in a CCP, the total 
number of different CCP’s for a given s and f is equal to $(f!)^{s \times f}$. 
Table 1 shows these values for different spreads and fanouts. 
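
The size of the search space follows directly from the counting argument above: each of the s * f subgraphs can independently take any of f! forms. A short sketch of that arithmetic for the case s <= f is given below. The function name `ccp_count` is illustrative, and the printed values are computed from the formula, not copied from Table 1.

```python
from math import factorial


def ccp_count(s, f):
    """Number of distinct CCPs for spread s and fanout f (case s <= f)."""
    return factorial(f) ** (s * f)


if __name__ == "__main__":
    for s, f in [(2, 2), (2, 3), (3, 3), (2, 4)]:
        print(f"s={s} f={f}: {ccp_count(s, f)} possible CCPs")
```

Even for s = 2 and f = 3 there are 6^6 = 46656 candidate patterns, which is why an exhaustive distance computation quickly becomes expensive.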


The natural partition of the CCP into bipartite subgraphs of 
a larger bipartite graph leads to the following definitions. 


Definition 5: The Connection Matrix (CM) of a 
bipartite subgraph of a CCP is an f x f binary matrix 
$CM = [x_{ij}]$ such that: 

$$x_{ij} = \begin{cases} 1 & \text{if node } i \text{ is connected to node } j \\ 0 & \text{if they are not connected} \end{cases}$$


Definition 6: A Primitive Connection Matrix 
(PCM) is one element of the set that is composed of all 
the f! possible Connection Matrices. 


Definition 7: The Characteristic Connection 
Matrix (CCM) of a CCP is an s x f matrix whose ele- 
ments are Connection Matrices taken from the set of 
Primitive Connection Matrices. 


Given the CCP of figure 5(a), we can partition it into six 
bipartite subgraphs as shown in figure 7, where the respective 
CM’s are also shown. The CCM will be formed by these CM’s, as 
shown in figure 8. The set of PCM’s for this case (s = 2; f = 3) is 
shown in figure 9, along with the respective graphs. Numbering 
each of the PCM’s from 0 to (f!) - 1, the CCM can be 
represented as an s x f matrix whose elements represent the number 
of the PCM that corresponds to each CM in the CCM. For the 
CCM of figure 8, this notation would lead to the matrix shown in 
figure 10. This notation is more compact and as complete as the 
CCM itself. 


3.3 Optimality criteria 


It is clear, by looking at table 1, that an algorithm or a set of 
criteria is needed in order to determine which of the 
large range of CCP’s provides the best distance properties. Given 
the size of the problem, and the fact that in order to gather meaningful 
results we must deal with reasonably large networks, an 
analytical approach seems infeasible. Instead, we performed a 
computation of the distance properties of a subset of all the possible 
(2,3,4) SK-banyans. From these results, we were able to 
derive two criteria that can determine, based on the CCM, if the 
resulting (2,3,l) SK-banyan is optimal in terms of distance 
properties. 


The first criterion is related to the connectivity of the nodes 
of the upper level of the CCP. As pointed out in section 3.1, the 
connections of these nodes should be made in a way such that they 
link the maximum number of nodes in the lower level. This 
spreading of connections is represented by a matrix, called Spread 
Connectivity Matrix, defined as follows. 


Definition 8: The k-th Spread Connectivity 
Matrix (SCM) $SCM^k$ of a CCP is an f x f matrix 
whose elements are given by the k-th-row summation of 
the Connection Matrices of the corresponding Characteristic 
Connection Matrix, or: 

$$SCM^k_{ij} = \sum_{l=0}^{f-1} \left( CCM_{kl} \right)_{ij} \qquad \text{for } 0 \le i,j \le f-1,\ 0 \le k \le s-1$$


As the CCM is an s x f matrix, there will be s SCM’s. We 
can now state the first optimality criterion. 


Definition 9: Connectivity Criterion: a (2,3,l) 
SK-banyan presents optimal distance properties if all 
the elements of every SCM are equal to 1, or: 

$$SCM^k_{ij} = 1 \qquad \text{for } 0 \le i,j \le f-1,\ 0 \le k \le s-1$$

This criterion was established by verifying that from the set 
of PCM’s, f of them can always be selected such that their sum- 
mation, as defined above, leads to a matrix with all the elements 
equal to 1. This corresponds to the optimal case. 


The second criterion is related to the redundancy in connec- 
tions between two nodes of the upper level of the CCP. As also 
pointed out in section 3.1, nodes at the upper level should have 
only one node in common at the lower level. This redundancy is 
represented by a matrix, called Redundancy Matrix, defined as fol- 
lows. 


Definition 10: The Redundancy Matrix (RM) of 
a CCP is an f x f matrix whose elements represent the 
number of occurrences of the indices i,j of the nodes of 
the upper level of the CCP to which each of the nodes 
of the lower level are connected. 


The nodes at the upper level are numbered as shown in 
figure 11. The numbers at each node of the lower level indicate the 
indices of the nodes of the upper level to which they are con- 
nected. In this case, each ordered pair occurs only once, and so the 
Redundancy Matrix will have all of its elements equal to 1. We 
state the second criterion as follows. 


Definition 11: Redundancy Criterion: a (2,3,l) 
SK-banyan presents optimal distance properties if all 
the elements of the Redundancy Matrix are equal to 1, 
or: 

$$RM_{ij} = 1 \qquad \text{for } 0 \le i,j \le f-1$$


This criterion was established by noticing that any redundant 
connection results in an entry in the Redundancy Matrix 
equal to 2, whereas the absence of a connection results in an entry 
equal to 0. The only case where there is no redundancy is when all 
the entries in the Redundancy Matrix are equal to 1, which means 
that there is neither redundancy nor absence of connections between 
the nodes. 


As an example of the application of these criteria, figure 12 
shows three cases of different CCP’s with their respective CCM’s 
and the Connectivity and Redundancy matrices. Figure 12 (a) 
refers to a (2,3,l) SW-banyan, (b) refers to a non-optimal CCP, 
and (c) to an optimal CCP. As can be seen, only in the last case 
are the two optimality criteria satisfied. 
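
The two criteria can be evaluated directly from a CCM. The sketch below is one possible reading of Definitions 8 to 11, under encoding assumptions that are not stated in the paper: each CM is stored as a permutation `p`, where upper node `i` of group `k` is linked to lower node `p[i]` of block `l`, and the Redundancy Matrix is computed for s = 2 only. The example CCMs built from cyclic shifts are illustrative and are not the patterns of figure 12.

```python
def spread_connectivity_matrices(ccm):
    """Return the s matrices SCM^k, each the sum of the f CMs in row k of the CCM."""
    s, f = len(ccm), len(ccm[0])
    scms = []
    for k in range(s):
        scm = [[0] * f for _ in range(f)]
        for l in range(f):
            for i, j in enumerate(ccm[k][l]):
                scm[i][j] += 1
        scms.append(scm)
    return scms


def redundancy_matrix(ccm):
    """rm[i][j] counts the lower nodes whose two upper neighbours have
    local indices i (group 0) and j (group 1). Assumes s = 2."""
    if len(ccm) != 2:
        raise ValueError("this sketch handles s = 2 only")
    f = len(ccm[0])
    rm = [[0] * f for _ in range(f)]
    for l in range(f):
        inv0 = {m: i for i, m in enumerate(ccm[0][l])}
        inv1 = {m: j for j, m in enumerate(ccm[1][l])}
        for m in range(f):
            rm[inv0[m]][inv1[m]] += 1
    return rm


def all_ones(matrix):
    return all(x == 1 for row in matrix for x in row)


def satisfies_both_criteria(ccm):
    connectivity = all(all_ones(m) for m in spread_connectivity_matrices(ccm))
    return connectivity and all_ones(redundancy_matrix(ccm))


def shift(f, c):
    """Cyclic-shift permutation: node i is linked to node (i + c) mod f."""
    return tuple((i + c) % f for i in range(f))


if __name__ == "__main__":
    f = 3
    skewed = [[shift(f, l) for l in range(f)],
              [shift(f, 2 * l) for l in range(f)]]
    repeated = [[shift(f, l) for l in range(f)],
                [shift(f, l) for l in range(f)]]
    print(satisfies_both_criteria(skewed))
    print(satisfies_both_criteria(repeated))
    print(redundancy_matrix(repeated))
```

In this sketch the `repeated` pattern passes the connectivity criterion but fails the redundancy criterion, because both upper groups use the same shifts and so every lower node sees the same local index twice.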


Using these criteria to compute the Spread Connectivity and 
the Redundancy matrices of all the (2,3,4) SK-banyans takes less 
than 1/1000 of the processor time needed to compute their dis- 
tance matrices and average distances. By selecting the PCM’s 
such that the connectivity criterion is satisfied, the range of 
CCM’s to be searched for optimal cases can be significantly 
reduced. A procedure to perform these selections systematically is 
being investigated. 


3.4 Construction Scheme 


Given the definitions from the previous section, we now 
present a construction scheme for (s,f,l) SK-banyans, based on the 
CM’s. One reason for using these matrices is that, once s and f are 
selected, the PCM’s are defined and any CCM can be constructed 
by assigning to each of its CM’s the corresponding PCM. 


for i = 0 to (number of levels - 1) do 

    gs_i = s * f^i 
    bs_i = f 

    for k = 0 to (number of nodes at level i) - 1 do 

        offset = k / gs_i 
        gn = k mod gs_i 
        bn = (k mod gs_i) / bs_i 

        for l = 0 to (f - 1) do 
            skew = skew_factor[ pcm[ cm_num, ] ] 
            and connect 
                node(k, i) to 
                node(offset + l * bs_i + skew, i+1) 

where: 

    gs_i - group size at level i 
    bs_i - block size at level i 
    k, i - node number (number, level) 
    offset - lowest node number of the group 
    gn - group number of the node 
    bn - block number of the node 
    cm_num - CM number (0 <= CM <= (s*f)-1) 
    pcm - PCM number (0 <= PCM <= f-1) 


This algorithm has been implemented in Pascal, and was 
used to generate all of the examples used in this paper. 


4 Distance Properties 


The distance properties of some examples of SW- and SK- 
banyans will now be presented, and two metrics will be used. One 
measures the percentage of nodes that are within a given distance 
from a particular set of nodes. We will call this measure the reach 
factor. The second measures the average distance between sets of 
nodes. 
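
Both metrics reduce to shortest-path computations on the undirected network graph. The following sketch shows one way to compute them with breadth-first search. It assumes an interpretation the paper does not spell out: the reach factor is taken per source node (percentage of the other nodes within the given distance) and then averaged over the source set. The adjacency-dictionary encoding, the function names, and the small crossbar example are all illustrative.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from source in an undirected graph {node: [neighbours]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def reach_factor(adjacency, sources, max_distance):
    """Average, over the sources, of the percentage of other nodes
    lying within max_distance hops."""
    others = len(adjacency) - 1
    total = 0.0
    for src in sources:
        dist = bfs_distances(adjacency, src)
        within = sum(1 for node, d in dist.items()
                     if node != src and d <= max_distance)
        total += 100.0 * within / others
    return total / len(sources)


def average_distance(adjacency, sources, targets):
    """Mean hop distance over all (source, target) pairs with source != target.
    Assumes the graph is connected."""
    total, pairs = 0, 0
    for src in sources:
        dist = bfs_distances(adjacency, src)
        for dst in targets:
            if dst != src:
                total += dist[dst]
                pairs += 1
    return total / pairs


# Example graph: a one-level crossbar with spread 2 and fanout 3.
apexes = [("apex", a) for a in range(2)]
bases = [("base", b) for b in range(3)]
adjacency = {node: [] for node in apexes + bases}
for a in apexes:
    for b in bases:
        adjacency[a].append(b)
        adjacency[b].append(a)

if __name__ == "__main__":
    print(reach_factor(adjacency, apexes, 1))
    print(average_distance(adjacency, bases, bases))
```

For the full networks studied here, the same functions would be applied with the nodes of one level as `sources` and either the same level or all nodes as `targets`.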


4.1 (2,5) SW- and SK-banyans 


It is also necessary to consider rectangular banyans, hence 
table 2 shows the distance properties of both banyans, in terms of 
reach factor. It can be seen that SK-banyans provide better dis- 
tance properties than SW-banyans. This table also shows a very 
important property, namely, that the improvement obtained for 
some levels is not at the expense of others. As can be seen, level 0 
in both banyans presents the same distance properties, but starting 
at level 1, the SK-banyan provides improved distance at every 
level. Figure 13 shows the same results in a graphical form, for 
levels 0 and 4. 


4.2 (2,3,4) SW- and SK-banyans 


Table 3 shows the distance properties for (2,3,4) SW- and 
SK-banyans. Again, the SK-banyan shows an improvement over 
SW-banyans, this time higher than in the previous case. Figure 14 
compares the reach factor for the two types of banyans. As before, 
the improvements obtained for any level are not at the expense of 
the distance properties of any other level. 


4.3 (2,3,4) SK-banyans with different CCP’s 


As mentioned in the previous section, different CCP’s may 
provide different distance properties. Computing the distance 
matrices for the CCP’s of figure 5, we obtain the results shown in 
table 4. It is important to realize that not only are some improve- 
ments achieved by using SK- over SW-banyans, but also by choos- 
ing one CCP over another. Figure 15 shows how different CCP’s 
compare to each other and to the respective (2,3,4) SW-banyan. 


4.4 Average distances 


For a specific subset of (2,3,4) SK-banyans (which included 
the sub-class of SW-banyans), we computed the average distances 
between sets of nodes at each level. Three results are shown in 
table 5, respectively for the SK-banyans whose CCM’s are those 
shown in figure 12. As shown, the improvement in average distance 
varies with the level. For better visualization, figure 16 
shows the variation of the average distance with the CCP type, for 
this particular subset of SK-banyans. 


5 Conclusions 


A new class of banyan networks has been presented which 
has been shown to have better distance properties than SW- 
banyans. Although these improvements have been demonstrated 
only in relation to distance properties, it should be expected that 
similar improvements would occur for other properties, especially 
traffic and fault-tolerance. Justification for this reasoning is 
provided by way of analogy with KYKLOS and many other 
topologies, which have shown the connection between distance and 
traffic [9]. 


Improvements in the distance properties are essential for 
minimizing communication overhead. An example of this is the architecture 
proposed for database applications [2]. In this architecture, 
a set of host computers at the apex of an ICN processes the 
queries and sends requests to a set of I/O nodes at the base of the 
same ICN. The storage allocation scheme proposed assumes heavy 
traffic between these I/O nodes, both to decrease the effects of 
"hot spots" and to reallocate the database as it changes. It is 
desirable in this case that the distance between these nodes be as 
low as possible, without aggravating the distance properties of the 
apex nodes. 


The matrix representation of CCP’s allowed the construction 
and classification of SK-banyans, as well as the proposal of 
criteria for determining the optimality of a CCP. Although results 
here were provided only for the case where s = 2; f = 3, the formulations 
and definitions are general in nature. 


Because SW-banyans are a case of just one CCP, the concept 
of SK-banyans represents a generalization of the topology, and allows 
a more unified view of the properties resulting from interconnection 
variation. The symmetry applied to conventional networks, 
while providing simple routing, has led to redundant connections. 
Minimization of this redundancy results in networks with a richer 
set of properties such as distance and traffic. 


Table 2: Distance properties of (2,5) SW- and 

ABSTRACT -- This paper extends the modeling and analysis of 
packet switched multistage cube interconnection networks to 
include the use of multiple-packet message formats. A multiple- 
packet message format is needed within an interconnection net- 
work when the message to be transmitted exceeds the network’s 
packet size. The packet offset time is introduced and used to 
gauge the transmission rate differential between the network inter- 
change boxes and the system PEs. Based on assumed network 
and system operating characteristics, optimum network perfor- 
mance offset values are given. A methodology for evaluating the 
relative performance of circuit and packet switching transmission 
modes is developed. For an assumed set of network and system 
parameters, results indicate that, in contrast to inter-computer net- 
works, the performance of packet switching networks is better 
than that of circuit switching networks for long message lengths. 


1. INTRODUCTION 


An integral part of any parallel processing system is its interconnection 
network, used to link the processors and, possibly, 
memory modules together. One class of networks suitable for use 
in a parallel processing system is the multistage cube networks 
[24]. This class includes topologically equivalent networks such 
as the baseline [26], the indirect binary n-cube [18], the general- 
ized cube [22], the omega [14], the flip [1], and the SW-banyan 
(S=F=2) [11]. Multistage cube networks have been used in the 
STARAN [1] and BBN Butterfly [4] and have been proposed for 
use in many future systems such as: the IBM RP3 [19], PASM 
[23], Ultracomputer [10], and database machines [7]. 


Multistage cube networks can be designed to function using 
circuit or packet switching methodologies for data transmission 
within the networks. A general framework is developed in this 
paper that quantifies system and network parameters and thereby 
facilitates the comparison of circuit and packet switched networks. 
A specific set of network and system parameters is used to 
demonstrate the utility of the framework. The general framework 
can be used by network designers to select which network switch- 
ing methodology best meets their system requirements. Networks 
can also be designed that incorporate both circuit and packet 
switching modes [20]. In networks where both modes are avail- 
able, this analysis framework can be used to identify which 
transmission mode should be used for a given set of operating 
conditions. 


In a packet switched network, when the amount of data to be 
transmitted exceeds the network’s packet size, multiple data packets 
must be generated by the network source and then be transmitted 
to the appropriate destination. The performance analysis of 
networks operating in a multiple-packet environment is considered 
in this paper. Packet switching network models are extended, 
through the use of simulations, to include the transmission of 
multiple-packet messages. A tradeoff analysis, based on assumed 
network and system operating characteristics, is made to compare 
the performance of this format to that of circuit switching. This 
research has been motivated by the design of the PASM parallel 
processing system at Purdue University [23, 16]. 

Section 2 defines the system and network operating environ- 
ments. The performance analysis of multiple-packet message 
transmissions is then presented in Section 3. Performance comparisons 
between circuit and packet switching network operations 
are discussed in Section 4. Conclusions are presented in Section 5. 


2. NETWORK MODEL AND SYSTEM OPERATION 


In this paper, the generalized cube network will be used as a 
model of all of the networks within the multistage cube class. 
The network model will be the same as the models developed in 
previous analyses for omega networks [2, 3] and baseline networks 
[15]. Assume the network has N inputs (sources) and N 
outputs (destinations), where $N = 2^n$. The network consists of n 
stages with each stage being composed of N/2 2-by-2 interchange 
boxes. Interchange boxes in stage i pair I/O lines with labels that 
differ only in the i-th bit position. The same labeling is used for 
both the input and the output lines connected to an interchange 
box. A generalized cube network is shown in Figure 1, with N=8. 
The network’s sources and destinations are assumed to be processing 
elements (PEs), processor-memory pairs; i.e., PE i is connected 
to network input port i and network output port i, 
$0 \le i < N$. This configuration is generally referred to as a PE-to-PE 
system architecture. Distributed routing control is assumed, 
with the settings of the individual interchange boxes (straight or 
exchange -- broadcasting will not be considered in the model) 
determined by routing information contained within the routing tag 
of each message [14, 22]. Prior to the transmission of data 
between a source-destination pair in a circuit switched network 
implementation, a complete path linking the pair must be established 
and then held until the completion of the data transfer. 
Using packet switching, a message consists of one or more data 
packets. Each packet makes its way from stage to stage releasing 
links and interchange boxes immediately after using them. 
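
The pairing rule for the interchange boxes can be written down compactly. The sketch below lists, for a given stage, the pairs of line labels that share a box, and traces the lines a message occupies under destination-tag routing. Two points are assumptions made for illustration and are not stated in the text above: stages are traversed from n-1 down to 0, and at stage i the message leaves on the line whose i-th bit equals the i-th bit of the destination address. The function names are illustrative.

```python
def stage_pairs(n_bits, stage):
    """Pairs of line labels that share an interchange box in the given stage:
    the two labels differ only in bit position `stage`."""
    mask = 1 << stage
    return [(line, line | mask) for line in range(1 << n_bits) if not line & mask]


def route(n_bits, source, destination):
    """Line labels occupied by a message, assuming destination-tag routing
    with stages visited from n_bits-1 down to 0 (an assumption)."""
    line = source
    path = [line]
    for stage in reversed(range(n_bits)):
        mask = 1 << stage
        # Straight keeps bit `stage`; exchange flips it. Either way the
        # outgoing label takes bit `stage` of the destination.
        line = (line & ~mask) | (destination & mask)
        path.append(line)
    return path


if __name__ == "__main__":
    for stage in (2, 1, 0):
        print(stage, stage_pairs(3, stage))
    print(route(3, 5, 2))
```

For N=8 each stage contains N/2 = 4 boxes, and any source reaches any destination after exactly n = 3 box traversals, which is the single source-destination path assumed in the model.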


The network is assumed to be operating under the following 
assumptions [8]. Each source PE generates its messages independently 
from all other sources. Each source generates messages for 
each destination PE with equal probability. Messages will be generated 
based on a predetermined, fixed loading factor. The destination 
PE can receive data from the network faster than the network 
can transmit the data (this simplification implies that an output 
device will not act as a bottleneck on the operation of the network 
itself). Finally, a message, or a message packet, consists of two 
parts -- the routing tag and one or more data words. 



Figure 1. The generalized cube network for N=8. 


In an interchange box, if two different message requests 
(routing tags) attempt to reserve the same output link, or (in the 
case of circuit switching) if a request is blocked by an already 
established message path, a conflict is said to have occurred. A 
conflict resolution algorithm must be invoked to resolve the 
conflict. In packet switching networks, the algorithm selects one 
of the requests and permits it to pass through the interchange box. 
The other request is held in an input buffer and will attempt to 
traverse the box at a later time. In circuit switching networks, the 
message that is blocked at the switch is either dropped and then 
resubmitted to the network (the drop algorithm) or is held in the 
box until the blockage is removed (the hold algorithm). These 
two circuit switched algorithms have been investigated in, for 
example, [15, 6]. 


3. MULTIPLE-PACKET NETWORK OPERATION 


In this section, the effects of multiple-packet messages on 
the operation of a packet switching generalized cube network are 
investigated. The system is assumed to be operating in the MIMD 
mode. The message to be transmitted will consist of a routing tag 
and one or more data words. If, however, the message size 
exceeds the size of the network transmission packet, multiple 
dependent packets must be generated. Each packet will contain 
the same routing tag and a portion of the data to be transferred. 
The packets are processed sequentially by the network since only 
one network path exists between any source-destination pair. 


Operational dependencies exist between multiple-packet mes- 
sages that do not exist when single-packet message formats are 
used. These dependencies center on the sequential generation of 
packets (within any given message) all being routed to the same 
network output port (recall that in single-packet messages, the 
routing of all of the packets is assumed to be a strictly indepen- 
dent process [8]). As an example, consider two multiple-packet 
messages, where the packets of one message are being held in one 
input buffer of an interchange box and the second message’s pack- 
ets are held in the other input buffer. Any conflict (or lack of a 
conflict) that occurs when the first packets in the two buffers are 
processed will also occur when the succeeding packets are pro- 
cessed. 


The dependencies that exist between packets in multiple- 
packet message transmissions cause network performance analysis 
using "standard" Markov chain modeling techniques, as in [8, 3], 
to be extremely complex. Interconnection network simulators 
such as PUGS [5] can, however, be used to accurately predict the 
performance of multiple-packet messages. PUGS is capable of 
simulating both single- and multiple-packet message formats. 
When using the multiple-packet format, performance statistics for 
both individual packets and complete messages (composed of 
dependent packets) can be obtained. The accuracy of PUGS out- 
put data has been validated through comparisons with other 
researchers’ published network simulation and Markov chain performance 
data, such as [6, 3, 15, 8]. To increase the accuracy of 
the statistics, the length of the simulation runs was adjusted so as 
to ensure the simulated network had reached a "steady-state" 
operating condition. This occurred after the first 250-300 simulated 
packet cycles. The actual simulation lengths were then set to 
approximately ten times this value -- 2000 simulation cycles. 
Statistics gathered before steady-state was reached were discarded. 
Lastly, for each combination of network variables that was 
evaluated, each simulation was repeated ten times and the results 
were averaged so as to further reduce the statistical variations. Data 
and statistics obtained from PUGS will be used throughout this 
paper. 

Assume that the data to be transferred is stored in the 
memory of the source PE. As a result of the slow memory access 
time within the PE (due to memory device speed limitations and 
long inter-IC and inter-circuit board transmission delays) and the 
comparatively fast interchange box logic (small, fast packet 
buffers and straightforward control logic), the time required by a 
PE to construct a data packet will be much longer than the time 
required by an interchange box to process the same size packet. 
This operational speed differential between the PE and the net- 
work (i.e., the interchange boxes) requires that packets be held in 
a PE’s output buffer and not released to the network until the 
packet is completely formed. 


In analyzing multiple-packet message performance, the 
packet cycle time, defined as the time delay incurred by a packet 
in traversing a network interchange box, will be used as the basic 
unit of system/network timing. This delay is the amount of time 
required to move a packet from the front of one of an interchange 
box’s input buffers to the input buffer of an interchange box in the 
next stage. As discussed above, this time delay will be much 
shorter than the time delay associated with the generation of a 
packet within a PE. A packet offset time, the time (in packet 
cycles) between packet generations of a multiple-packet message, 
can be used to quantify the speed differential between the network 
and the system PEs. If the time to generate a packet equals the 
time to process a packet in an interchange box, the packet offset 
would be one. If a message consisted of w packets and the first 
packet was generated and then submitted to the network at time t, 
then, in general, a y-cycle packet offset would cause packets to be 
submitted at times $t,\ t+y,\ t+2y,\ \ldots,\ t+y \cdot (w-1)$. The total message 
transmission delay is the time difference between the beginning 
of the generation of the first packet of the message and the 
completed transmission time of the last packet of the message. 
Note that the time from the generation of the first packet to its 
submission to the network is y. The total delay will therefore be 
$y \cdot w$ + (the transmission delay of the last packet), where the first 
term is the delay incurred by the message before the last packet is 
available for transmission. Recall that, because of pipeline effects, 
the delays of the other packets in the message will be reflected in 
the transmission delay of the last packet. 
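
The timing relations in this paragraph amount to simple arithmetic, sketched below. The function names are illustrative, and the value passed as `last_packet_delay` is a placeholder: in the paper that quantity comes from the PUGS simulations, not from a formula.

```python
def submission_times(t, y, w):
    """Times at which the w packets of a message enter the network,
    given the first submission at time t and a y-cycle packet offset."""
    return [t + y * j for j in range(w)]


def total_message_delay(y, w, last_packet_delay):
    """Delay from the start of generation of the first packet to the
    completed transmission of the last packet, in packet cycles."""
    return y * w + last_packet_delay


if __name__ == "__main__":
    # Hypothetical values: 4 packets, offset of 5 cycles, first submission at t = 10.
    print(submission_times(10, 5, 4))
    # Hypothetical transmission delay of 7 cycles for the last packet.
    print(total_message_delay(5, 4, 7))
```

The term `y * w` is consistent with the submission times: generation of the first packet starts at t - y and the last packet is submitted at t + y(w-1), which is y * w cycles later.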


In [13], Kruskal and Snir have presented an alternate, more 
constrained, discussion of multiple-packet network operation. 
They assumed that all of the message packets were generated 
simultaneously (this corresponds to a packet offset of zero) and 
that the packets moved through the network as a single "group." 
Additionally, the inclusion of the packet generation time in the 
overall transmission time has not been incorporated into previous 
analyses. In [8], for example, timing delay calculations start when 
a packet is accepted by a network input port. Time devoted to 
generating a packet and any time lost waiting for the availability 
of a network input port are not included in their analysis. 


Representative performance analysis results of multiple-packet 
messages, obtained using PUGS, are shown in Table 1. 
Results in this table are for a 5-stage network (N=32) and 
multiple-packet messages consisting of 2, 4, and 8 packets. The 
packet offset ranges from 1 to 15 packet cycles and the loading 
factor is 100% in each case. The loading factor is the probability 
in a given time cycle of a network source generating a new mes- 
sage, given that it is not currently transmitting a message. Buffers 
at the inputs to each interchange box, capable of holding four 
packets, are assumed to be in use (buffer size is based on analysis 
results in [8]). The packet and message delay times can be used 
to gauge the effects of the packet offset on network performance. 
Ideally, a packet would require k packet cycles to traverse a 
k-stage network. An m-packet message would, in turn, require 
k+(m-1) packet cycles to completely traverse the network (this 
time value assumes the packets are transmitted through the network 
in a pipeline fashion). The normalized packet and message 
delays can therefore be computed by dividing the packet and message 
delay times by k and k+(m-1), respectively. 
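
As a minimal sketch of this normalization (the function name and the sample delay values are assumptions for illustration, not entries from Table 1):

```python
def normalized_delays(packet_delay, message_delay, k, m):
    """Divide measured delays (in packet cycles) by the ideal traversal
    times: k for one packet, k + (m - 1) for a pipelined m-packet message."""
    ideal_packet = k
    ideal_message = k + (m - 1)
    return packet_delay / ideal_packet, message_delay / ideal_message


# Hypothetical measurements for a 5-stage network and 4-packet messages.
print(normalized_delays(packet_delay=10, message_delay=24, k=5, m=4))
```

A normalized delay of 1 means the packet or message crossed the network with no conflict delays at all.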


Number of packets per message | Packet offset | Delay (Packet, Message) | Normalized delay (Packet, Message) 

1    3.58 
2    3.27   3.49 
5    1.35   2.77 
10   1.31 
15   1.31 
1 
2 
3 
Table 1. Performance results for multiple-packet message 
transmissions (5-stage network (N=32) and 100% 
network load). 


From Table 1, it can be seen that, for any given message 
size, as the packet offset is increased, the delay experienced by a 
packet in the network decreases and approaches the minimum 
traversal delay (normalized delay of 1). This is to be expected 
since the apparent network loading decreases as the packet offset 
is increased, thus reducing the network conflict delays that a 
packet will experience. The overall delay experienced by the 
entire message is also seen to decrease as the packet offset begins 
to increase from its initial value of 1, again a result of decreased 
network conflicts. However, as the packet offset continues to 
increase, the message delay time is seen to drop to a minimum 
level and then begin to increase. The increase is due to the longer 
delays between packet submissions that unnecessarily increase the 
message transmission delay. Packet offsets that lead to minimized 
overall message delays are in the range of 2 to 5. For example, 
message delays for two and four packet messages are minimized 
with a packet offset of 5, while an offset of 2 produces minimum 
delays in an eight-packet message. PUGS was also used to collect 
statistics for many different combinations of network operating 
parameters, such as: network size, number of packets per message, 
network loading, and packet offset values. In all cases, packet 
offsets in the area of 2-5 have been seen to produce the minimum 
message delays through the network. 


In a typical system development process, the speed of the 
PEs is determined by the selection of a target microprocessor 
chip/chip set. The packet offset will then be fixed by the 
specification of the network interchange boxes and their packet 
cycle times. The analysis discussed above allows the most 
cost-effective tradeoff point for the network’s hardware design to be 
predicted. The network’s estimated performance over a range of 
cycle times can be used as a guide in determining the required 
hardware sophistication of the network design. The most desir- 
able design, in terms of the cost-performance tradeoff, would be 
the slowest and typically the least expensive network that did not 
act as a bottleneck to the anticipated inter-PE message flow. 


4. PACKET vs. CIRCUIT SWITCHED OPERATIONS 


In inter-computer networks, the classic tradeoff between cir- 
cuit and packet switching is one of message length. Long mes- 
sage formats are best supported by circuit switching whereas 
short, "bursty" formats are ideally suited for packet switching. 
The design of these networks tends to neglect propagation delays 
through the switch points in the network and instead concentrates 
on the transmission channel (link) delays [21, 25]. The 
differences in the two switching methods are not as clear-cut in 
intra-computer networks found in parallel processing systems. 
Because the processors are generally "close" physically, the interchange 
box delays dominate over the transmission link delays. 
System characteristics such as the operational mode (e.g., SIMD 
or MIMD), the architecture supporting the network, as well as the 
anticipated message format and system loading, greatly influence 
the selection of the switching method. 


Systems that utilize the network solely for inter-PE data 
transfers (e.g., the PE-to-PE system architecture) can be character- 
ized by lightly loaded, low message conflict networks. In this 
case, circuit switched networks can give very acceptable performances 
using circuitry that is much less complex than a comparable 
packet switched network (e.g., message queues and their control 
logic are not needed in the interchange boxes). Implementation 
of a circuit switched network is particularly advantageous in 
SIMD systems performing data permutations with no network 
conflicts where, once the path connecting the source-destination 
PE pair is established, the interchange boxes contribute only 
minimal gate delays to the transmission time. 


In systems where the network is used to connect processors 
to memory modules (e.g., the processor-to-memory system archi- 
tecture), inter-processor communication and instruction fetch 
operations utilize the network. The network loading would be 
expected to be high due to the preponderance of instruction and 
data fetch operations that must be supported. Packet switched net- 
works have been proposed for this environment [8, 9]. Note that, 
through the use of caching techniques, the network loading can 
be substantially reduced, in effect giving a network performance 
similar to that of the PE-to-PE architecture. PE-to-PE architectures 
will be assumed in this paper. 


When packet switching is used for inter-PE transfers, if the 
message exceeds the packet length, multiple packets must be sent 
through the network. The processors must absorb the added over- 
head of message packetization (at the source PE) and packet 
recombination (at the destination PE). Each packet must indepen- 
dently perform its own routing/path establishment. These delays, 
coupled with the queueing delays at each switch, can increase the 
overall message transmission time in a packet switching network. 


In this section, performance tradeoffs between packet and 
circuit switched interconnection networks will be explored. The 
analysis will be done using the same external system conditions in 
both switching modes. SIMD and MIMD operations will be con- 
sidered separately. 


4.1 SIMD operations 


SIMD system operations that utilize interconnection network 
transfers generally employ data permutations, where data is 
transferred from a set of network sources to a set of network des- 
tinations using a particular transfer scheme. Cube-type networks 
have been found to perform most useful SIMD permutations in 
one pass through the network, with no conflicts [14, 18]. Permutations 
that cannot be performed without conflicts in a single pass 
can be performed in multiple, conflict free, passes through the net- 
work. As a result, the analysis for SIMD operations will assume 
that data transfers through the network will take place without net- 
work conflicts. 


To enable direct comparisons of the performance of circuit 
and packet switched networks, a number of assumptions must be 
made concerning the operation of the computer system external to 
the interconnection network. These assumptions are listed below: 


1. Data that is to be transferred through the network will originate 
   from the same point within the source PEs (i.e., data 
   for circuit switched transfers will not be in a PE’s memory 
   while the data for packet switched transfers is in the PE’s 
   registers). This ensures that the time overhead required by 
   the source PEs to fetch the data will be the same for both 
   switching modes. 

2. The message transmission time will start when the first word 
   of the message is generated by the source PE. 

3. The overhead of computing the routing tag is not included in 
   the message transmission time. In the case of multiple 
   packet messages, the routing tag is computed only once (at 
   the beginning of the message) and is stored for later use in 
   subsequent packets. 

4. Processing of data at the destination PEs will not degrade the 
   performance of the network or impede the message generation 
   process of the source PEs. 

5. Packets must be fully formed at the source PE before they 
   can be submitted to the network. 


In the following discussions, assumptions are made about 
values for various network parameters to demonstrate the metho- 
dology developed. The framework established here can, of 
course, also be used to evaluate different systems by modifying 
the values of these parameters. The values used below are estimates 
developed from our design of the PASM system prototype 
[16]. 

Using circuit switching, the message transmission process 
can be decomposed into two phases -- set-up (or path establish- 
ment) and the actual data transfer. During the set-up phase, the 
message request (routing tag) must propagate through interchange 
boxes in each stage of the network and establish the desired data 
path. Let the time to establish a path through an interchange box 
be $(T_{cu} + T_{ib} + T_l)$. $T_{cu}$ is the time required by the box’s control 
unit to check the routing tag and establish the needed box setting 
and $T_{ib}$ is the propagation delay through the interchange box’s 
switching logic. $T_l$ is the link transmission time between 
interchange boxes in adjacent stages as well as between the PEs 
and the network input and output stages. From this, the path 
set-up time, $T_{setup}$, can be calculated and will require 


$n(T_{cu} + T_{ib} + T_l) + T_l + n(T_{ib} + T_l) + T_l$   (1) 


time units to establish a path through an n-stage network. Note 
that the first two terms represent the total time required by the 
message request to traverse the network from the source to the 
destination. The second two terms represent the time required by 
a message grant signal to be returned from the destination to the 
source. 


Once the path through the network has been established, the 
actual data transfer can begin. The delay experienced by a data 
word in traversing an n-stage network is 


$T_{delay} = 2(n+1)T_l + 2nT_{ib}$ .   (2) 


The factors of 2 represent the time to transmit the data word itself 
to the destination plus a returned data acknowledge signal gen- 
erated by the destination PE. Let L be the total message length, 
excluding the routing information. Additionally, let $T_{generation}$ be 
the time required by a PE to fetch a data word from memory and 
send it to the network input port. The actual data transmission 
time, $T_{xmit}$, will be 


$$T_{xmit} = \begin{cases} T_{generation} + L \cdot T_{delay}, & T_{delay} > T_{generation} \\ L \cdot T_{generation} + T_{delay}, & T_{delay} \le T_{generation} \end{cases} \qquad (3)$$


If the generation time of a word is less than the transmission 
delay of a word, the total delay will be the generation time of the 
first word plus L transmission delays. Under these conditions, the 
performance of a circuit-switched network could possibly be 
improved through the use of data pipelining (with data buffers 
placed between the network stages) or the use of a message-switched 
network design [12]. If, on the other hand, the generation 
time is greater than the transmission delay, the total time will 
be L generation delays plus the transmission delay of the last data 
word. In other words, if the transmission delay of the network is 
small with respect to the time required by the source PEs to generate 
the data words, the overall data transmission time of a message 
will be determined by the speed of the PEs -- not the speed of the 
network. In these instances, circuit-switched network performance 
enhancements such as data pipelining will not improve the overall 
message transmission time. Hybrid techniques such as pipelining 
and message-switching will not be considered in this paper. 


For example, the Motorola MC68010 microprocessor, 
operating at a clock frequency of 10 MHz, is capable of performing 
one memory-to-memory data transfer in, at best, 800 ns (the 
network source and destination ports are considered to be memory 
addresses by the PE) [17]. If the 2-by-2 switching logic of the 
interchange boxes is implemented using multiplexers or by two 
levels of tri-state buffers, $T_{ib}$ can be expected to be approximately 
20 ns (worst-case, TTL logic). $T_l$ can be assumed to be no more 
than 5 ns. A reasonable value for $T_{cu}$ is 100 ns. The total delay 
experienced by a data word would then be (from Equation 2) 


Network delay = 10(n+1) + 40n 
              = 50n + 10 .   (4) 


Clearly, for system and network sizes that are currently implementable, 
the speed of the PE will be the determining factor in the 
data transmission time. Using the representative timing values, 
listed above, the set-up time (from Equation 1) and the data 
transmission times can be combined into the overall message 
transmission time, $T_{message}$: 


$T_{message} = T_{setup} + T_{xmit}$ 
           = (150n + 10) 
           + L·(PE word generation rate) .   (5) 
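
The circuit switched timing model can be checked numerically with a short sketch. The constant and function names below are assumptions chosen for illustration; the values are the representative ones quoted in the text (in nanoseconds). Note that `t_message` follows Equation 5 as printed, which adds only the L word generation times to the set-up time.

```python
# Representative values from the text, in nanoseconds.
T_CU = 100   # control unit: check the routing tag and set the box
T_IB = 20    # propagation through the box's switching logic
T_L = 5      # link transmission time
T_GEN = 800  # PE word generation time (MC68010 example)


def t_setup(n):
    """Equation 1: request travels forward, grant signal travels back."""
    return n * (T_CU + T_IB + T_L) + T_L + n * (T_IB + T_L) + T_L


def t_delay(n):
    """Equation 2: data word out plus acknowledge back."""
    return 2 * (n + 1) * T_L + 2 * n * T_IB


def t_xmit(n, length):
    """Equation 3: the slower of the PE and the network sets the pace."""
    d = t_delay(n)
    if d > T_GEN:
        return T_GEN + length * d
    return length * T_GEN + d


def t_message(n, length):
    """Equation 5 as printed: set-up time plus L word generation times."""
    return t_setup(n) + length * T_GEN


for n in (4, 5):
    print(n, t_setup(n), t_delay(n), [t_message(n, L) for L in (1, 2, 4, 8, 16)])
```

With these values `t_setup(n)` reduces to 150n + 10 and `t_delay(n)` to 50n + 10, matching the expressions above.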

Next, consider the operation of a packet switched network. 
Note that, because there are no conflicts within the network, there 
will be no queueing delays within the network. The size of the 
interchange box buffers will not affect the overall network operation. 
With the exception of the logic to form and control the 
interchange box buffers, the architecture and speed of a packet 
switched interchange box will be the same as that of the circuit 
switched box, discussed above. The values for $T_{cu}$, $T_{ib}$, and $T_l$ 
will be the same. Let $T_q$ be the time needed to access a buffer in 
an interchange box. Because the buffer will be physically close to 
the rest of the circuitry for the box, if not on the same VLSI chip, 
$T_q$ will be substantially less than the time required to perform a 
PE memory access. Assume that buffer reads and writes can be 
overlapped. The delay, $T_{packet}$, that a data packet would experience 
in traversing a packet switched box would be 


$T_{packet} = T_{cu} + (T_q + T_{ib} + T_l)(P+1)$ .   (6) 


The factor (P+1) is the actual packet size including the routing tag 
(P data words plus one routing word). Time required for 
handshaking between adjacent interchange boxes can be overlapped 
with the other phases of $T_{packet}$. 


Using dual-ported memories for the packet buffers, $T_q$ would 
be at least 100 ns. Substituting this and the previously assumed 
values, Equation 6 then reduces to 


$T_{packet} = 125(P+1) + 100$ ns .   (7) 


The time required by a PE to generate a packet would be 
(P+1)(PE word generation time). Continuing with the MC68010 
example, this would be, at best, 800(P+1) ns. The ratio of the 
packet generation time and $T_{packet}$, the packet offset, will be 
approximately 6. This leads to an alternate view of the packet 
offset time for SIMD operations. 
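
The ratio quoted above can be evaluated for a few packet sizes. This is a sketch under the representative timing values from the text; the function names are illustrative.

```python
T_CU, T_IB, T_L, T_Q = 100, 20, 5, 100  # ns, representative values from the text
T_GEN = 800                              # ns per word, MC68010 example


def t_packet(p):
    """Equation 7: delay of a (P+1)-word packet through one interchange box."""
    return T_CU + (T_Q + T_IB + T_L) * (p + 1)


def packet_offset(p):
    """Packet generation time divided by the per-box packet delay."""
    return T_GEN * (p + 1) / t_packet(p)


for p in (1, 2, 4, 8, 16):
    print(p, t_packet(p), round(packet_offset(p), 2))
```

Under these assumptions the ratio grows from about 4.6 for one-word packets toward 6.4 as the packet size increases, which is consistent with the "approximately 6" figure.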


When more than one packet is generated per message, if the 
number of stages in the network is less than or equal to the packet 
offset, a packet will completely traverse the network before the 
next packet in the message can be formed by the source PE and 
submitted to the network. The pipelining of message packets will 
not occur -- effectively eliminating one of the general advantages 
of packet switched networks. An L-word message, divided into 
$\lceil L/P \rceil$ data packets will have a transmission time of 


$T_{message} = \lceil L/P \rceil \cdot (\text{packet generation time}) + n \cdot T_{packet}$ .   (8) 


The first term represents the delay between the generation of the 
first word in the message (contained in the first packet) and the 
time the last packet is submitted to the network. The second term 
is the time required by the last packet to traverse the network. 
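
Equation 8 can be written out directly. The sketch below uses the representative timing values as default arguments; the function and parameter names are assumptions, and the formula applies only in the non-pipelined case described above (number of stages no greater than the packet offset).

```python
import math


def t_packet(p, t_cu=100, t_q=100, t_ib=20, t_l=5):
    """Equation 6: per-box delay (ns) of a packet with P data words."""
    return t_cu + (t_q + t_ib + t_l) * (p + 1)


def t_message_packet(length, p, n, t_gen=800):
    """Equation 8: ceil(L/P) packet generation times plus the time for
    the last packet to cross n stages."""
    packets = math.ceil(length / p)
    generation = (p + 1) * t_gen  # P data words plus one routing word
    return packets * generation + n * t_packet(p)


# 8-word message, 4-word packets, 4-stage network (result in ns).
print(t_message_packet(length=8, p=4, n=4))
```

Dividing the result by 1000 gives microseconds, the unit used in Table 2.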


Table 2 lists the message transmission times for various 
sized messages using both packet and circuit switching modes. 
The message transfer times were calculated using the representa- 
tive timing values for the interchange box components and PEs, 
discussed above. It is clear that, for SIMD operations, the circuit 
switched mode provides lower message transmission delays since 
the transmission of individual data words can be overlapped with 
the PEs’ generation of the next data words to be transmitted. The 
overlap or pipeline performance normally associated with packet 
switched operations is precluded by the processing speed 
differential between the PEs and the packet switched network. 
From Table 2, it is also observed that packet sizes of two or four 
words yielded generally lower message transmission times than 
the other packet sizes. Small packet sizes incur an additional gen- 
eration overhead of having to write the routing tag to many 
different packets (e.g., for a 4-word message, four 1-word packets 
with routing tags must be generated, compared to only two 2-word 
packets or one 4-word packet). As the packet size increases, the 
packet delay time through an interchange box, $T_{packet}$, increases, as 
does the packet generation time. The combination of these two 
effects causes an increase in the total transmission delay through 
the network of each packet and negates the relative advantage of 
having to transmit fewer total packets for a given message size. 
The performance of the 2- and 4-word packets was a compromise 
between the two extremes. 
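
The compromise can be made concrete by sweeping the packet size for each message length. This is an illustrative sketch only: it assumes the representative timing values, the non-pipelined transmission time formula, and a hypothetical helper `best_packet_size`.

```python
import math

T_CU, T_IB, T_L, T_Q = 100, 20, 5, 100  # ns, representative values from the text
T_GEN = 800                              # ns per word, MC68010 example


def message_time_ns(length, p, n):
    """Packet switched message time with the representative values."""
    packets = math.ceil(length / p)
    generation = T_GEN * (p + 1)
    per_box = T_CU + (T_Q + T_IB + T_L) * (p + 1)
    return packets * generation + n * per_box


def best_packet_size(length, n, sizes=(1, 2, 4, 8, 16)):
    """Packet size with the smallest message time (ties go to the smaller size)."""
    return min(sizes, key=lambda p: (message_time_ns(length, p, n), p))


for length in (1, 2, 4, 8, 16):
    p = best_packet_size(length, n=4)
    print(length, p, message_time_ns(length, p, 4) / 1000)
```

For a 4-stage network, the sweep selects 2-word packets for 2- and 4-word messages and 4-word packets for 8- and 16-word messages.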


4.2 MIMD operations 


The performance analysis and comparison of circuit and 
packet switched networks presented in the last subsection was 
based on the assumption that network transfers would take place 
without conflicts. In MIMD operations, this assumption is not 
valid. Here, the performance of a network will be affected by 
conflicts within the network and the associated queueing delays. 


For MIMD networks, the circuit switched message transmis- 
sion process can still be decomposed into the set-up and the data 
transfer phases, as before. The conflict and blocking effects on 
the network’s MIMD performance can be predicted using the Mar- 
kov analysis of [15] or by simulation. The analysis of packet 
switched networks must also include the effects of queueing 
delays within the buffers associated with each interchange box. 
The simulation methodology described in Section 3 for multiple- 
packet messages can also be used here. 


Any attempt to directly compare the performance of circuit 
and packet switched networks must use the same operational 
environment for both networks. External (to the network) factors 
such as the system size, processing speed of the PEs, and the network 
loading factor must be the same for both networks. Additionally, 
the fundamental design of the two networks must also be 
the same (e.g., data path width, the technology used in the interchange 
box implementations, and the PE-network interface methodology). 
When these internal and external factors are fixed, a 
comparison between the two switching methods can be made. 


Network    Switching   Packet   Message transfer time, by message length (words) 
size       mode        size         1       2       4       8      16 

4 stages   circuit       -        1.41    2.21    3.81    7.01   13.41 
           packet        1        3.00    4.60    7.80   14.20   27.00 
                         2        4.30    4.30    6.70   11.50   21.10 
                         4        6.90    6.90    6.90   10.90   18.90 
                         8       12.10   12.10   12.10   12.10   19.30 
                        16       22.50   22.50   22.50   22.50   22.50 

5 stages   circuit       -        1.56    2.36    3.96    7.16   13.56 
           packet        1        3.35    4.95    8.15   14.55   27.35 
                         2        4.78    4.78    7.18   11.98   21.56 
                         4        7.63    7.63    7.63   11.63   20.53 
                         8       13.33   13.33   13.33   13.33   20.53 
                        16       24.73   24.73   24.73   24.73   24.73 


Table 2. SIMD performance comparison of circuit and 
packet switched networks (times in microseconds, 
packet size does not include routing tag). 


Assume the networks are constructed such that the timing 
values used in Section 4.1 remain valid. Specifically, 
$T_{cu}$ = 100 ns, $T_{ib}$ = 20 ns, $T_l$ = 5 ns, and $T_q$ = 100 ns. Furthermore, 
assume that the PEs are implemented using a clock frequency of 
10 MHz, as in an MC68010 microprocessor (the microprocessor 
requires 800 ns to write one word from memory to the network). 
Under these assumed conditions, Tables 3 and 4 list simulation 
results for the message transmission times and throughput performances 
of circuit and packet switched networks for a range of 
loading values and message lengths. Comparing the data in these 
tables to that of Table 2, it can be seen that the MIMD perfor- 
mance of both network types approaches the SIMD performance 
values as the network loading is decreased (as would be 
expected). The performance relationship between the circuit 
switched conflict resolution algorithms is as predicted by [15] -- 
the drop algorithm provides generally lower transmission times 
and higher message throughputs than does the hold algorithm. 


Over the range of message lengths considered in Tables 3 
and 4, minimum message transmission times for a given data 
transfer size (message length) can be obtained in the packet 
switched network when the entire message is contained in only 
one or two packets. As in SIMD operations, the performance tra- 
deoff is lighter network loading (number of packets) with larger 
packet sizes versus shorter packet transmission times of the indivi- 
dual packets with smaller packet sizes. 


A representative plot of the message transmission times for 
both circuit and packet switched networks is shown in Figure 2. 
For each plot, the data used was taken from the conflict resolution 
algorithm (for circuit switching) or the packet size (for packet 
switching) that provided the lowest transmission times. The circuit 
switched network provided superior network performance for 
smaller message sizes while packet switching performed best with 
larger message sizes. This can be attributed to, in circuit switching, 
significantly higher conflict rates and blocking times that 
result when the network paths are held for periods of time 
corresponding to the transmission of "long" messages, as reported 
in [3, 15]. This does not occur in packet switched networks. In 
contrast, packet switched networks tend to perform better with 
longer messages, where the overhead of the packet generation 
(i.e., the routing tag) is proportionately smaller (when compared to 
the actual message transmission time) and the delays due to network 
conflicts do not increase as fast as in circuit switched networks. 
For small message transfers, the blocking delays within a 
circuit switched network will be reduced, thus producing good 
overall message delay characteristics. This, coupled with the relatively 
high overhead of packet generation and queueing delays 
experienced by packet switched messages (when compared to the 
transmission time), enables circuit switched networks to function 
better than packet switched networks for small message lengths. 


MIMD message throughputs for circuit and packet 
switched networks (4-stage network (N=16), 
throughputs expressed as the number of messages 


Figure 2. Comparison of message transmission delays for 
both circuit and packet switched networks (4-stage 
network (N=16), 100% network loading, times in 
microseconds). 


5. CONCLUSIONS 


This paper has centered on the modeling and analysis of 
multiple-packet message formats and their relative performance 
advantages when compared to circuit switched networks. A 
multiple-packet message format is needed within an interconnec- 
tion network when the message to be transmitted exceeds the 
network’s packet size. In order to quantify network performance, 
the operation of interchange boxes, networks, and external systems 
was expressed as time functions. General equations relating these 
time functions to the performance of a network were developed. 
By using these equations and the PUGS simulator, the effects of 
any set of internal and external environment assumptions could be 
evaluated and network performance -- either packet or circuit 
switched -- could be predicted. 


During the analysis of packet switched networks, the effects 
of various packet sizes were examined. The packet offset figure of 
merit was introduced and used as a performance gauge of the 
transmission rate differential between the network and the PEs. It 
was observed that the optimum network performance (in terms of 
low transmission delays and reasonable network component costs) 
could be obtained with a packet offset in the range of 2 to 5. 


Comparing the performance of circuit and packet switched 
networks for a specific set of time function values, circuit switching 
was shown to provide minimum transmission delays in all 
SIMD operations and in short message transfers in MIMD operations. 
Packet switched networks functioned better than circuit 
switched networks as the message length increased. The dominant 
element of the network performance was seen to be the processing 
rate of the PEs themselves. 


A general methodology has been developed that quantifies 
system and network parameters, thus allowing the performance of 
a network design to be evaluated under a set of anticipated operat- 
ing conditions. By adjusting the assumed values of these parame- 
ters, this methodology can be used by other researchers as an aid 
in the design of their own networks. 
ABSTRACT: A reconfigurable multicomputer architecture 
based on a rectangular CC-banyan multistage 
interconnection network is presented. A graph- 
theoretical approach is used to study this network's 
permuting and structural properties. It is shown that 
a CC-banyan has a modular structure and hence can be 
recursively defined. A method for evaluation of the 
total number of permutations in CC-banyans is pre- 
sented. Using this method, we derive the analytical 
expressions for the number of permutations in 
CC-banyans with fanouts two and three. 
1. INTRODUCTION 

There is growing interest in multicomputer systems 
for increased performance and improved reliability. 
The operation of these systems is critically dependent 
on the interconnection network which connects system 
resources such as processors and memories (or compu- 
ters). As a result of the rapid advances in LSI and 
VLSI technology, extensive research is being conducted 
on multicomputer systems. Many of these systems are 
based on multistage interconnection networks 
(MINs). 

Various MINs implemented with 2x2 switching elements, 
such as baseline, modified data manipulator, 
flip, omega, indirect binary n-cube and regular rec- 
tangular SW-banyan (s=f=2) are known to be topologi- 
cally equivalent to each other [1]. These equivalent 
networks possess a “buddy property," i.e., the outputs 
of two switching elements at stage i are connected as 
inputs to only two switching elements at the (i+1)th 
stage [2]. 

Banyans [3] represent a large class of intercon- 
nection networks. An SW-banyan with fanout two de- 
scribes a class of equivalent MINs implemented with 
2x2 switching elements. Also, many non-equivalent 
MINs [4] represent special cases of cylindrical cross- 
hatch (CC) banyans [3]. It has been shown in [5] that 
the so-called ‘non-equivalent’ delta network [6] is 
isomorphic to a CC-banyan, and that the ADM network 
[7] can be viewed as two overlapping CC-banyans. Permuting 
and partitioning properties of SW-banyans and 
of equivalent MINs have been extensively studied elsewhere 
[2, 3, 8, 9, 10, 11]. In this paper we focus on 
non-equivalent networks. We analyze the recursive 
structure and evaluate the total number of permuta- 
tions supported by a CC-banyan. Detailed derivations 
aa presented in this paper can be found in 
2. SYSTEM DESCRIPTION 

The system model is composed of the resources and 
the interconnection network. In this paper we use the 
rectangular CC-banyan interconnection network to 
study system mapping properties. 

In the following we give a brief description of 
CC-banyans. A banyan network [3] is a network with a 
unique path from each source to each sink vertex. A 
multistage interconnection network is a network in 
which vertices can be arranged in stages, with all the 
source vertices at stage (level) 0, and all the out- 
puts at stage i connected to inputs at stage i+1. 
An L-level banyan, or an L-stage MIN, is a network in 
which every path from any source (base) to any sink 
(apex) has length L. An (f,L) banyan is an L-level 
banyan in which the indegree (spread-s) of every 
intermediate vertex equals its outdegree 
(fanout-f). 

(f,L) regular rectangular CC-banyan: An (f,L) 
banyan in which there is an edge from vertex i at vertex 
level k+1 to a vertex j at vertex level k whenever 
$j = (i + m f^k) \bmod f^L$ for $m = 0, 1, 2, \ldots, f-1$; $0 \le i \le f^L - 1$ 
and $0 \le k \le L-1$. 
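
The definition can be explored with a small sketch. It assumes the edge rule j = (i + m·f^k) mod f^L with 0 ≤ i ≤ f^L − 1 and 0 ≤ k ≤ L − 1 (the reading of the definition adopted here); the function names are illustrative. The second function counts apex-to-base paths, which should all equal one if the graph is a banyan.

```python
def cc_banyan_edges(f, L):
    """Edges of an (f, L) regular rectangular CC-banyan as tuples
    (k, i, j): vertex i at level k+1 is joined to vertex j at level k."""
    n = f ** L
    return [(k, i, (i + m * f ** k) % n)
            for k in range(L) for i in range(n) for m in range(f)]


def count_paths(f, L):
    """Number of distinct paths from every apex (level L) to every base
    (level 0), returned as a dict keyed by (apex, base)."""
    n = f ** L
    down = {}
    for k, i, j in cc_banyan_edges(f, L):
        down.setdefault((k, i), []).append(j)
    counts = {}
    for apex in range(n):
        frontier = {apex: 1}
        for k in reversed(range(L)):  # walk from level L down to level 0
            nxt = {}
            for i, c in frontier.items():
                for j in down[(k, i)]:
                    nxt[j] = nxt.get(j, 0) + c
            frontier = nxt
        for base in range(n):
            counts[(apex, base)] = frontier.get(base, 0)
    return counts


paths = count_paths(2, 3)
print(len(cc_banyan_edges(2, 3)), set(paths.values()))
```

Because every residue modulo f^L has a unique base-f expansion, each apex reaches each base by exactly one choice of m at every stage.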

Examples of (3,2) and (2,3) regular rectangular 
CC-banyans are shown in Fig. 1 and 2, respectively. 
The vertex levels are numbered from bases (level 0) to 
apexes (level L). The interpretation of the network 
graph is the one commonly used with respect to banyan 
networks: a vertex represents a tie point and an edge 
represents a switch contact. According to this inter- 
pretation, connections through vertices in a banyan 
graph are mutually exclusive. 

An example of a multicomputer system based on a 
rectangular CC-banyan network is shown in Fig. 2. In 
this system each computer (processor with local 
memory) is attached to both sides of an (f,L) 
CC-banyan. A similar approach utilizing a different 
network topology but similar connection of processing 
nodes to both sides of the MIN is used in the Star 
network [13]. The communication through the network 
is unidirectional and the control is based on the 
routing tag technique. The labeling and routing 
schemes for a system based on a CC-banyan are 
discussed in [14]. We just note here that the routing 
scheme in CC-banyan leads to relative addressing, 
based on the difference between destination and source 
address, in contrast to the systems based on 
SW-banyans which use absolute addressing. 

3. RECURSIVE STRUCTURE OF CC-BANYAN 

Whereas the original definition of CC-banyans is 
not recursive, it will be shown here that CC-banyans 
can be recursively defined. Conceptually, it is convenient 
to view an (f,L) rectangular CC-banyan as 
a union of L stages, where each stage represents a 
bipartite graph [4]. For example, three bipartite 
graphs for (2,3) rectangular CC-banyan are shown in 
Fig. 3a. 
Theorem 1: In an (f,L) rectangular CC-banyan a bi- 
partite graph of stage k (0 < k < L-1) contains f 
identical components (disjoint subgraphs). 

Different components of a (2,3) CC-banyan are 
shown in Fig. 3b. 

Corollary 1: In an (f,L) rectangular CC-banyan, 
the bipartite graphs of all L stages are not isomor- 
phic (to each other). 

Note that in the case of equivalent MINs (e.g., 
baseline, omega) the bipartite graphs of all stages 
are isomorphic, Since these MINs have the "buddy prop- 
erty" [2]. 


Corollar 
< L-1) has ant: k vertices. 


Corollar In an (f,L) rectangular CC-banyan 
the bipartite ph of stage-0 at the base side con- 
tains one component; and the bipartite 2 graph of stage- 
(L-1) at the apex side contains f'7* components, 
which are (fxf) crossbar graphs. 

Note that the recursive decomposition of a 
CC-banyan into networks of smaller size may be only 
possible from the base side (due to Corollary 3). In 
contrast to CC-banyans, SW-banyans and networks 
equivalent to them can be recursively decomposed from 
either side. 
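
The component structure of the stages can be checked numerically. The sketch below is only an illustration under a stated assumption: it takes the stage-k connection rule of an (f,L) CC-banyan to be i -> (i + m*f^k) mod f^L for m = 0,...,f-1, which is not spelled out in this section. The function name `stage_components` is hypothetical.

```python
def stage_components(f, L, k):
    """Count connected components of the stage-k bipartite graph.

    ASSUMPTION (not stated in the text): vertex i of level k is joined to
    vertices (i + m * f**k) mod f**L of level k+1, for m = 0..f-1.
    """
    n = f ** L
    # Union-find over 2n vertices: level k is 0..n-1, level k+1 is n..2n-1.
    parent = list(range(2 * n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for m in range(f):
            j = (i + m * f ** k) % n
            parent[find(i)] = find(n + j)
    return len({find(x) for x in range(2 * n)})


if __name__ == "__main__":
    for k in range(3):
        print(k, stage_components(2, 3, k))
```

Under this assumed rule, stage 0 forms a single component and stage L-1 splits into f^(L-1) crossbars, in agreement with Corollary 3.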

Theorem 2: An (f,L+1) CC-banyan can be recur- 
sively decomposed from the base side into f disjoint 
(f,L) CC-banyans. 

Based on Theorem 2, we give the following recursive
definition.

Definition: An (f,1) CC-banyan is simply an (f×f)
crossbar graph. An (f,L+1) CC-banyan is constructed
from f (f,L) CC-banyans by the following
procedure:

1. Assign a number m (m = 0,1,...,f-1) to each (f,L)
CC-banyan.

2. In the m-th (f,L) CC-banyan, label the vertices
at every vertex level by the numbers

m+jf, where $j = 0,1,2,\ldots,f^L-1$
(see Fig. 4a).

3. Form stage-0 (base stage) of the composite
(f,L+1) CC-banyan by connecting the bases numbered
i ($i = 0,1,\ldots,f^{L+1}-1$) to the bases of the component
(f,L) CC-banyans numbered $(i+m) \bmod f^{L+1}$ for all
m = 0,1,...,f-1 (see Fig. 4b).

4. Rearrange the vertices of the composite 
(f,L+1) CC-banyan in the order of increasing label 
numbers, thus resulting in the conventional represen- 
tation (see Fig. 4c). 




4. PERMUTING PROPERTIES

In this section we evaluate the number of various
permutations performable by an (f,L) CC-banyan in one
pass. We view an (f,L) rectangular CC-banyan as a
permutation network that performs a one-to-one mapping
(permutation) of a set of N ($N = f^L$) apexes onto a set of
N bases.

Since in a banyan network every input-output
connection (base-apex path) is unique, the total number
of permutations is equal to the product of the numbers of
permutations performable by each stage:

$$P = \prod_{k=0}^{L-1} P_k \qquad (1)$$

where $P_k$ is the number of stage-k permutations.
In view of Theorem 1 and Corollary 3,

$$P = (f!)^{f^{L-1}} \prod_{k=0}^{L-2} \left[ C_{f,k} \right]^{f^k}$$

Here $C_{f,k}$ denotes the number of permutations
performable by each of the $f^k$ disjoint subgraphs of stage k.
The indices indicate that this number, generally,
depends on both fanout f and stage k. An exact analytical
evaluation of $C_{f,k}$ does not seem feasible for
arbitrary fanout f. However, we present a general
approach which enables systematic enumeration of all
permutations. Our analysis uses an unconventional
representation for a stage-k subgraph, in which the
vertices of a bipartite subgraph are located on two
concentric circles corresponding to the vertex levels
k and k+1. An example of such a symmetric circular
representation, or circular diagram, is shown in Fig.
5. Consider the circular diagram of a stage-k subgraph.
As follows from Corollary 2, this subgraph performs
permutations on $f^{L-k}$ numbers. A permutation in this
subgraph is a one-to-one mapping of its level-(k+1)
vertices onto its level-k vertices.

A stage-k circular diagram may be divided into
radial sectors of equal size (f-1) using radial cuts,
as shown in Fig. 5. The radial cuts and radial sectors
are numbered clockwise from 0 to $(f^{L-k}-1)/(f-1)$,
so that radial sector m is formed by radial cuts m
and m+1, $m = 0,1,\ldots,(f^{L-k}-1)/(f-1)$. Then each
sector except the last has (f-1) vertices, and the last sector
has only one vertex. Consider a set of edges
corresponding to a permutation in a circular diagram. An
edge corresponding to a permutation is called a cross
edge if it crosses a radial cut in the circular diagram.
Now we can state an important invariant property of
permutations in CC-banyans.

Theorem 3: For any stage-k subgraph permutation
in a circular diagram, all radial cuts have the same
number of cross edges i ($0 \le i \le f-1$).

An example of two different permutations and corres- 
ponding cross edges in a circular diagram is given in 
Fig. 5. 

Theorem 4: In a circular diagram of a stage-k
subgraph, there exists only one permutation corresponding
to 0 or (f-1) cross edges:

$$c^{(0)}_{f,k} = c^{(f-1)}_{f,k} = 1 \qquad (2)$$

where $c^{(i)}_{f,k}$ denotes the number of stage-k subgraph
permutations with exactly i cross edges.
Specifically, the identity permutation has 0 cross
edges, and the cyclic shift $i \to (i+f-1) \bmod f^{L-k}$ has f-1
cross edges (see Fig. 5). Theorem 4 also implies that
in a CC-banyan with fanout f=2 all stage-k subgraphs
can perform exactly 2 permutations, regardless of k.

Theorem 5: The total number of permutations performable
by a rectangular CC-banyan with fanout f=2 and
number of stages L is:

$$P = 2^{2^L - 1}$$

A systematic enumeration of all permutations in
the case of arbitrary f uses a "divide-and-conquer"
approach based on Theorems 1 and 3. Specifically,
the number of permutations performable by a stage-k
subgraph can be represented as

$$C_{f,k} = \sum_{i=0}^{f-1} c^{(i)}_{f,k}$$

where $c^{(i)}_{f,k}$ is the number of permutations which have
exactly i cross edges.

In view of Theorem 4, the above expression is
simplified to:

$$C_{f,k} = 2 + \sum_{i=1}^{f-2} c^{(i)}_{f,k} \qquad (3)$$

The value of $c^{(i)}_{f,k}$ ($1 \le i \le f-2$) can be found
by enumeration of all permutations having the same
number of cross edges i. A method for a systematic
enumeration of such permutations is discussed below.
Consider a set of permutations having the same number
of cross edges i ($1 \le i \le f-2$). We may enumerate all
permutations in a circular diagram by choosing,
successively, i cross edges for every radial cut m,
$m = 0,1,2,\ldots,(f^{L-k}-1)/(f-1)$. Since the i cross edges in a
radial cut (m+1) can be chosen independently from the i
cross edges in the preceding cut m, we can construct a
tree of all possible outcomes (decision tree) for a
systematic enumeration of all permutations. We
illustrate this method using an example of a CC-banyan with
fanout f=3. Only the case when the number of cross
edges i=1 has to be considered, since
$c^{(0)}_{3,k} = c^{(2)}_{3,k} = 1$.

In this decision tree, a tree level corresponds to the
radial cut number (see Fig. 6). Since every sector
has f-1=2 vertices, a vertex incident to a cross edge
can be chosen in 2 different ways, and the decision
tree fanout is two. If the vertices in each sector
are numbered from 0 to f-2, as shown in Fig. 5, then
the vertex m ($0 \le m \le f-2$) can be incident to f-1-m = 2-m
different cross edges. Therefore, two different cross
edges can be chosen for vertex m=0, and only one cross
edge can be chosen if m=1. Hence, we assign the
weights 1 and 2 correspondingly to every "right" and
"left" outgoing link in the decision tree. The total
number of permutations can be found by adding the
weights of all leaves, where the weight of each leaf is
the product of all link weights along the path from the root
to this leaf. Since we may choose outgoing links
(with weights 1 or 2) independently at every tree
level, the total weight represents a binomial sum:

$$\sum_{j=0}^{h} \binom{h}{j} 1^{j}\, 2^{h-j} = 3^{h}$$

where h is the decision tree height, or the number of
radial cuts whose i cross edges can be chosen
independently:

$$h = (f^{L-k}-1)/(f-1)$$

Therefore, the number of permutations having the
number of cross edges i=1 in a stage-k subgraph of a
(3,L) CC-banyan is:

$$c^{(1)}_{3,k} = 3^{(3^{L-k}-1)/2}, \qquad 0 \le k \le L-2 \qquad (4)$$
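
The leaf-weight sum described above can be reproduced by brute force. The following is a minimal sketch, assuming only what the text states: a binary decision tree of height h whose two outgoing links carry weights 1 and 2. The function name `decision_tree_total` is hypothetical.

```python
from itertools import product


def decision_tree_total(h, weights=(1, 2)):
    """Sum, over all root-to-leaf paths of a tree of height h,
    the product of the link weights along the path."""
    total = 0
    for path in product(weights, repeat=h):
        leaf_weight = 1
        for w in path:
            leaf_weight *= w
        total += leaf_weight
    return total


if __name__ == "__main__":
    f, L, k = 3, 2, 0
    h = (f ** (L - k) - 1) // (f - 1)   # number of independent radial cuts
    print(h, decision_tree_total(h), 3 ** h)
```

Because the choice at each level is independent, the sum factors as (1 + 2)^h, which is the binomial identity used in the derivation.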

The total number of permutations in a (3,L) CC-banyan
follows immediately from (2), (3) and (4).

Theorem 6: The number of permutations in an (f,L)
CC-banyan with fanout f=3 is:

$$P = (3!)^{3^{L-1}} \prod_{k=0}^{L-2} \left[ 2 + 3^{(3^{L-k}-1)/2} \right]^{3^k}$$


The method of decision trees can also be used for
enumeration of all permutations in the case of general
fanout f. However, the process of constructing decision
trees becomes quite complex for large values
of f.
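
The closed forms of this section can be evaluated directly. The sketch below simply plugs the per-subgraph counts stated in the text ($C_{2,k} = 2$ and $C_{3,k} = 2 + 3^{(3^{L-k}-1)/2}$) into the product over stages; it does not enumerate permutations of an actual network, and the helper names are hypothetical.

```python
from math import factorial


def total_permutations(f, L, subgraph_count):
    """Evaluate (f!)^(f^(L-1)) * prod_{k=0}^{L-2} C(k)^(f^k),
    where subgraph_count(k) supplies the per-subgraph count C_{f,k}."""
    p = factorial(f) ** (f ** (L - 1))      # stage L-1: f^(L-1) crossbars
    for k in range(L - 1):                  # stages 0 .. L-2
        p *= subgraph_count(k) ** (f ** k)
    return p


def perms_f2(L):
    # Per the text, every stage-k subgraph performs exactly 2 permutations.
    return total_permutations(2, L, lambda k: 2)


def perms_f3(L):
    # Per the text, C_{3,k} = 2 + 3^((3^(L-k) - 1) / 2).
    return total_permutations(3, L, lambda k: 2 + 3 ** ((3 ** (L - k) - 1) // 2))


if __name__ == "__main__":
    for L in range(1, 5):
        print(L, perms_f2(L), 2 ** (2 ** L - 1))
    print(perms_f3(1), perms_f3(2))
```

For f=2 the product collapses to $2^{2^L-1}$ because the exponents $2^{L-1} + \sum_{k=0}^{L-2} 2^k$ add up to $2^L - 1$.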

5. CONCLUSIONS 

In this paper we presented a graph-theoretic
approach to the analysis of rectangular CC-banyan
networks with an arbitrary fanout f and an arbitrary
number of stages L. It is shown that CC-banyans, like
many other networks, can be constructed recursively.
This recursiveness enables the modular structure of
CC-banyans and can be used in the analysis of their
partitioning properties.

We presented a general approach to the analysis
of permuting properties of CC-banyans. The analytical
expressions for the number of permutations performable
by CC-banyans with fanouts 2 and 3 are derived. In the
case of general fanout f, we propose a method, based
on decision tree analysis, for a systematic
enumeration of all permutations.

[Figure caption fragments: "... with an example of routing."; Fig. 3: a) Bipartite graphs, b) Disjoint subgraphs.]


Rearrangeability of the 5-Stage
Shuffle/Exchange Network for N = 8 *

ABSTRACT


In this paper we prove the rearrangeability of a multistage
shuffle/exchange network with 8 inputs and outputs
consisting of five stages. A lower bound of $(2\log_2 N - 1)$
stages for rearrangeability of a shuffle/exchange network with
$N = 2^n$ inputs and outputs is known; we show its sufficiency for
N = 8. We not only prove the rearrangeability, but also
describe an algorithm for routing arbitrary permutations on the
network and prove its correctness. In contrast to previous
efforts to prove rearrangeability, which rely on topological
equivalence to the Benes class of rearrangeable networks, our
approach is based on first principles. We also show that two
switches in the network are redundant. The results in this paper
are useful for establishing an upper bound of $(3\log_2 N - 4)$
stages for rearrangeability of a multistage shuffle/exchange
network with N > 8, as demonstrated in [14].


1. INTRODUCTION 


Shuffle/exchange networks, initially proposed by Stone
[13], have been the subject of extensive treatment by several
researchers (for a survey on shuffle/exchange networks, see [3]).
A shuffle/exchange network with $N = 2^n$ inputs and outputs
consists of a perfect shuffle permutation [13] followed by an
exchange stage with N/2 switches, each of size 2 x 2. A
multistage shuffle/exchange network is constructed by cascading
multiple shuffle/exchange stages in series. A multistage
shuffle/exchange network for N = 8 with five stages is shown
in Figure 1.
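
A single shuffle/exchange stage can be sketched as follows. This is an illustrative model, not the authors' code: it takes the perfect shuffle to be a one-bit left rotation of the n-bit line address and represents each 2 x 2 switch by one bit (0 = straight, 1 = exchange). The function names are hypothetical.

```python
def perfect_shuffle(p, n):
    """Perfect shuffle of an n-bit address: rotate left by one bit."""
    mask = (1 << n) - 1
    return ((p << 1) | (p >> (n - 1))) & mask


def shuffle_exchange_stage(data, switch_states):
    """Apply one shuffle/exchange stage to a list of N = 2^n items.

    switch_states[w] is 0 (straight) or 1 (exchange) for switch w,
    which serves lines 2w and 2w+1 after the shuffle.
    """
    n = len(data).bit_length() - 1
    out = [None] * len(data)
    for p, item in enumerate(data):
        q = perfect_shuffle(p, n)
        if switch_states[q >> 1]:
            q ^= 1                      # exchange the two lines of the switch
        out[q] = item
    return out


if __name__ == "__main__":
    print(shuffle_exchange_stage(list(range(8)), [0, 0, 0, 0]))
    print(shuffle_exchange_stage(list(range(8)), [1, 1, 1, 1]))
```

Cascading five calls with independent switch settings models the five-stage network considered in this paper.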


Interconnection networks with the capability of passing
all the N! permutations on N elements in one pass through the
network are known as rearrangeable networks [2]. The Omega
network [5], which is a multistage shuffle/exchange network
with $\log_2 N$ stages, can perform only a small fraction of the N!
permutations when N is large. However, the permutation
capability can be improved by adding more stages.
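
To give a sense of scale, the fraction can be estimated with a short calculation. The sketch assumes that every switch setting of the Omega network yields a distinct permutation (a consequence of its unique-path property, not stated in this section), so the count is 2 raised to the number of switches. The function name is hypothetical.

```python
from math import factorial


def omega_permutation_count(N):
    """Permutations of an N-input Omega network, ASSUMING each of the
    2^((N/2) * log2 N) switch settings gives a distinct permutation."""
    n = N.bit_length() - 1              # log2 N for N a power of two
    return 2 ** ((N // 2) * n)


if __name__ == "__main__":
    for N in (4, 8, 16):
        count = omega_permutation_count(N)
        print(N, count, factorial(N), count / factorial(N))
```

Under that assumption the ratio for N = 8 is 4096 out of 40320, and it shrinks rapidly as N grows.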


The universality of shuffle/exchange networks was
studied by researchers in an attempt to determine how many
stages are required to attain the capability of performing
arbitrary permutations, i.e., rearrangeability [4, 7, 10, 11, 12, 14,
16]. An asymptotic lower bound of $(2\log_2 N - 1)$ stages of
2 x 2 switching elements for rearrangeability has long been
known [15]; the validity of this lower bound for
shuffle/exchange networks can be established for any $N = 2^n$
by showing the existence of permutations which cannot be
performed with fewer than $(2\log_2 N - 1)$ stages [14]. The
sufficiency of this many stages for shuffle/exchange networks has
neither been proved nor disproved. Stone observed that, by
using an algorithm due to Batcher [1], sorting of arbitrary
sequences of data can be performed by cycling through a single
shuffle/exchange stage $(\log_2 N)^2$ times when the switching
elements are replaced by 2-input, 2-output sorting elements [13].
This algorithm can be used to map arbitrary permutations on a
multistage network with $(\log_2 N)^2$ shuffle/exchange stages. An
algorithm for performing arbitrary permutations on a single-stage
shuffle/exchange network in $2(\log_2 N)^2$ passes was also
described by Siegel [10], and the number of passes was
subsequently improved to $(3/2)(\log_2 N)^2 - (\log_2 N)/2$ [11]. Parker
improved this bound by showing that $3\log_2 N$ stages are
sufficient for rearrangeability [7], though he did not specify a
control algorithm for finding the states of the switching elements
for arbitrary permutations. Wu and Feng observed that
$(3\log_2 N - 1)$ stages are indeed sufficient for rearrangeability
[16]; they also showed how to compute the switch settings for
arbitrary permutations. Kothari et al. observed that
$(3\log_2 N - 3)$ stages are sufficient for N = 16 and 32 [4].
More recently, Varma and Raghavendra showed that, if the
rearrangeability of $(2\log_2 R - 1)$ stages can be proved for a
shuffle/exchange network of size $R = 2^r$, then
$3\log_2 N - (r+1)$ stages are indeed sufficient for
rearrangeability of a network of size N > R [14].

* This research is supported by the NSF Presidential Young Investigator Award,
an AT&T Grant, and DARPA/ARO Contract No. DAAG


For networks of small size, it may be possible to use
exhaustive techniques to prove or disprove rearrangeability.
Notice that, for N = 4, a cascade of three shuffle/exchange
stages is identical to the Benes binary network and hence
rearrangeable. Parker showed, by exhaustive enumeration, that 5
stages of shuffle/exchange are sufficient for rearrangeability
when N = 8 [8], thereby suggesting that the limit of
$(2\log_2 N - 1)$ may be reachable, at least for small networks.
In this paper we provide a constructive proof showing that
the lower bound of $(2\log_2 N - 1)$ stages for rearrangeability
of a multistage shuffle/exchange network is also an upper
bound for N = 8 by establishing the rearrangeability of the
network in Figure 1. As opposed to previous efforts to prove
rearrangeability, which relied on reduction of the network to the
topology of the Benes rearrangeable network, our approach is
based on first principles. It is possible, in fact, to show that the
network in Figure 1 cannot be reduced to a topology equivalent
to that of the Benes network. We also describe an algorithm for
routing arbitrary permutations on the network and prove its
correctness. When used in conjunction with the results in [14],
this establishes an upper bound of $(3\log_2 N - 4)$ stages for
rearrangeability of a multistage shuffle/exchange network for all
N > 8. Our analysis also shows that two switches in the
network are redundant and could be eliminated without affecting
the rearrangeability. (Note that the Benes binary network of size
8 has three such switches.)


2. PARTITIONING OF PERMUTATIONS 


The procedure for routing arbitrary permutations on the
5-stage network is based on partitioning the connections of the
permutation into four connection sets, consisting of two
connections each, and assigning each connection set to one of the
switches in the middle stage of the network. This partitioning
is done such that no conflicts arise in either the first two stages
or the last two stages of the network.


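Definition 1: Let $\pi$ be a permutation on the set of
integers {0,1,...,7}. A connection set C of $\pi$ is a collection of k
input-output pairs of $\pi$, $k \le 8$, expressed as

$$C = \left\{ \begin{pmatrix} x^1 \\ \pi(x^1) \end{pmatrix}, \begin{pmatrix} x^2 \\ \pi(x^2) \end{pmatrix}, \ldots, \begin{pmatrix} x^k \\ \pi(x^k) \end{pmatrix} \right\} \qquad (1)$$

where $\{x^1, x^2, \ldots, x^k\} \subseteq \{0,1,\ldots,7\}$. If k = 8, then C defines
the whole permutation.

Definition 2: The set of input terminals of a connection
set C, denoted by $C^-$, is the set of all inputs x, $0 \le x \le 7$, such
that the pair $\begin{pmatrix} x \\ \pi(x) \end{pmatrix}$ belongs to C. Formally,

$$C^- = \left\{ x \;\middle|\; \begin{pmatrix} x \\ \pi(x) \end{pmatrix} \in C \right\} \qquad (2)$$

Similarly, the set of output terminals of C, denoted by $C^+$, is the
set of output terminals of the pairs in C:

$$C^+ = \{ \pi(x) \mid x \in C^- \} \qquad (3)$$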


Definition 3: A valid partition of a permutation $\pi$ is a
partition of $\pi$ into four disjoint connection sets $C_0, C_1, C_2, C_3$,
with two elements each, satisfying the following properties:

Property 1: Each $C_i^-$, $0 \le i \le 3$, contains exactly one element
from each of the sets of input terminals $I_0$ and $I_1$ defined
by

$$I_0 = \{0,2,4,6\}; \quad I_1 = \{1,3,5,7\}. \qquad (4)$$

Property 2: For every $0 \le x \le 7$ with $x = x_2 x_1 x_0$ (in binary), if
x is contained in the set $C_0^- \cup C_1^-$, then the element
$x' = \bar{x}_2 x_1 x_0$ is in the set $C_2^- \cup C_3^-$.

Property 3: The sets $C_0^+ \cup C_2^+$ and $C_1^+ \cup C_3^+$ each contain exactly
one representative from each of the sets of output terminals
$O_0, O_1, O_2, O_3$, defined by

$$O_0 = \{0,1\}; \quad O_1 = \{2,3\}; \quad O_2 = \{4,5\}; \quad O_3 = \{6,7\}. \qquad (5)$$

Property 4: Each $C_i^+$, $0 \le i \le 3$, contains exactly one element
from each of the sets $O_0 \cup O_1$ and $O_2 \cup O_3$.

Properties 1-4 have been selected so that, if Switch i in
the middle stage of the network carries the connections in the
set $C_i$, for $0 \le i \le 3$, no conflicts can arise in any of the
switching stages of the network.


The following algorithm constructs a valid partition of
the permutation $\pi$:

Algorithm 1

Step 1: Partition the set of output terminals of $\pi$ into four sets
$P_0, P_1, P_2, P_3$, with two elements each, as defined below:

$$P_0 = \{\pi(0), \pi(4)\}; \quad P_1 = \{\pi(1), \pi(5)\}; \quad P_2 = \{\pi(2), \pi(6)\}; \quad P_3 = \{\pi(3), \pi(7)\} \qquad (6)$$

Step 2: Construct two sets $X_0$ and $X_1$, of four elements each,
such that each contains exactly one representative from
each $P_j$ and $O_j$, for $0 \le j \le 3$. That is,

$$|X_i \cap P_j| = |X_i \cap O_j| = 1, \qquad (7)$$

for all $0 \le j \le 3$ and i = 0,1. The set $X_0$ can be constructed
in a strictly sequential manner as follows: Start from one
of the P's, say $P_i$, and select an element arbitrarily for
inclusion in $X_0$. Let this element be 'a' and let $a \in O_j$ for
some $0 \le j \le 3$. If 'b' is the other element in $O_j$ and
$b \in P_k$ for some k, then b cannot be selected for inclusion
in $X_0$ and the other element in $P_k$ has to be chosen. Thus
the selection is performed by looping from $P_i$ to $O_j$ to $P_k$ and
so on, until we reach a set $P_l$ from which an element has
already been chosen, which indicates the completion of a
cycle. At this point, a fresh cycle may be started by
choosing any of the unvisited P's, until all P's are
processed. This procedure is identical to the looping algorithm
for control of the Benes network [6], and can be likened to
the vertex-coloring of a bipartite graph.

Once $X_0$ is constructed, the remaining elements naturally
fall into $X_1$. That is, $X_1 = \{0,1,\ldots,7\} - X_0$.

Step 3: Construct sets $R_0, R_1, R_2, R_3$ as

$$R_0 = X_0 \cap \pi(I_0); \quad R_1 = X_0 \cap \pi(I_1); \quad R_2 = X_1 \cap \pi(I_0); \quad R_3 = X_1 \cap \pi(I_1) \qquad (8)$$

where $\pi(I_0) = \{\pi(x) \mid x \in I_0\}$ and $\pi(I_1) = \{\pi(x) \mid x \in I_1\}$.

Step 4: Construct two sets $Y_0$ and $Y_1$, of four elements each,
such that each contains exactly one representative from
each $R_j$ and $O_j$, for $0 \le j \le 3$. That is,

$$|Y_i \cap R_j| = |Y_i \cap O_j| = 1, \quad \forall\, 0 \le j \le 3 \text{ and } i = 0,1. \qquad (9)$$

This construction can be performed in an identical manner
as the construction of $X_0$ and $X_1$ in Step 2.

Step 5: Construct two sets $X'_0$ and $X'_1$ from $R_0, R_1, R_2, R_3$ as
follows:

Case (i): If $(X_0 \cap Y_0)$ has representatives from both
{0,1,2,3} and {4,5,6,7} then $X'_0 = X_0$ and $X'_1 = X_1$.

Case (ii): If $(X_0 \cap Y_0)$ has representatives from only one of
the sets {0,1,2,3} and {4,5,6,7} then

$$X'_0 = R_0 \cup R_3; \quad X'_1 = R_1 \cup R_2. \qquad (10)$$

Step 6: The connection sets $C_0$, $C_1$, $C_2$, and $C_3$ are now obtained
as

$$C_{2i+j} = \left\{ \begin{pmatrix} x \\ \pi(x) \end{pmatrix} \;\middle|\; \pi(x) \in X'_i \cap Y_j \right\}, \qquad 0 \le i,j \le 1. \qquad (11)$$


Before proceeding to prove that the algorithm produces
a valid partition of the permutation $\pi$, we show an example to
illustrate the idea.

Example 1: Consider the permutation

$$\pi = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 0 & 1 & 4 & 6 & 2 & 3 & 5 & 7 \end{pmatrix}.$$

In this example, $\pi(I_0) = \{0, 4, 2, 5\}$ and $\pi(I_1) = \{1, 6, 3, 7\}$.

Step 1:
$P_0 = \{0, 2\}$; $P_1 = \{1, 3\}$; $P_2 = \{4, 5\}$; $P_3 = \{6, 7\}$;
$O_0 = \{0, 1\}$; $O_1 = \{2, 3\}$; $O_2 = \{4, 5\}$; $O_3 = \{6, 7\}$.

Step 2:
$X_0 = \{0, 3, 4, 7\}$; $X_1 = \{2, 1, 5, 6\}$.

The set $X_0$ was constructed by first choosing the element
0 from $P_0$. Since $0 \in O_0$, the element 1 cannot be included. As
$1 \in P_1$, the element 3 from $P_1$ has to be selected; this completes
a cycle. Since $P_2 = O_2$ and $P_3 = O_3$, the representatives from $P_2$
and $P_3$ can be chosen arbitrarily.

Step 3:
$R_0 = X_0 \cap \pi(I_0) = \{0, 4\}$
$R_1 = X_0 \cap \pi(I_1) = \{3, 7\}$
$R_2 = X_1 \cap \pi(I_0) = \{2, 5\}$
$R_3 = X_1 \cap \pi(I_1) = \{1, 6\}$

Step 4:
$Y_0$ and $Y_1$ are constructed by choosing one element from
each of $R_0, R_1, R_2, R_3$ so that they have one element in common
with each of $O_0, O_1, O_2, O_3$. If we force $Y_0$ to contain the
element 0, then

$Y_0 = \{0, 6, 3, 5\}$; $Y_1 = \{4, 1, 7, 2\}$.

Step 5:
Since $X_0 \cap Y_0 = \{0, 3\} \subseteq \{0, 1, 2, 3\}$, case (ii) applies.
Hence,

$X'_0 = R_0 \cup R_3 = \{0, 1, 4, 6\}$;
$X'_1 = R_1 \cup R_2 = \{2, 3, 5, 7\}$.

Step 6:
$X'_0 \cap Y_0 = \{0, 6\}$; $X'_0 \cap Y_1 = \{1, 4\}$;
$X'_1 \cap Y_0 = \{3, 5\}$; $X'_1 \cap Y_1 = \{2, 7\}$.

Therefore, the connection sets $C_0$ - $C_3$ are given by:

$$C_0 = \left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 3 \\ 6 \end{pmatrix} \right\}; \quad C_1 = \left\{ \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 2 \\ 4 \end{pmatrix} \right\};$$
$$C_2 = \left\{ \begin{pmatrix} 5 \\ 3 \end{pmatrix}, \begin{pmatrix} 6 \\ 5 \end{pmatrix} \right\}; \quad C_3 = \left\{ \begin{pmatrix} 4 \\ 2 \end{pmatrix}, \begin{pmatrix} 7 \\ 7 \end{pmatrix} \right\}.$$

It may be verified that these sets satisfy Properties 1
through 4.
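
Algorithm 1 can be sketched in code as follows. This is an illustrative implementation under stated assumptions, not the authors' program: the arbitrary choices in Steps 2 and 4 are resolved by always selecting the smallest unvisited element first, so the sets it produces for Example 1 may differ from those shown above while still satisfying Properties 1 through 4. All function names are hypothetical.

```python
def two_color(groups_a, groups_b):
    """Split {0..7} into two sets, each holding one element of every group.

    groups_a and groups_b are partitions of {0..7} into pairs; the loop
    alternates between them, as in the looping procedure of Step 2.
    """
    partner_a, partner_b = {}, {}
    for groups, partner in ((groups_a, partner_a), (groups_b, partner_b)):
        for u, v in map(tuple, groups):
            partner[u], partner[v] = v, u
    color = {}
    for start in range(8):
        x = start
        while x not in color:
            color[x] = 0            # x is selected
            y = partner_a[x]
            color[y] = 1            # the other member of x's a-group is not
            x = partner_b[y]        # so y's b-group partner must be selected
    zero = {x for x, c in color.items() if c == 0}
    return zero, set(range(8)) - zero


def valid_partition(pi):
    """Return [C_0, C_1, C_2, C_3] as sets of (input, output) pairs."""
    inv = {pi[x]: x for x in range(8)}
    O = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]
    P = [{pi[j], pi[j + 4]} for j in range(4)]                  # Step 1
    X0, X1 = two_color(P, O)                                    # Step 2
    img_I0 = {pi[x] for x in (0, 2, 4, 6)}
    img_I1 = {pi[x] for x in (1, 3, 5, 7)}
    R = [X0 & img_I0, X0 & img_I1, X1 & img_I0, X1 & img_I1]    # Step 3
    Y = two_color(R, O)                                         # Step 4
    if len(X0 & Y[0] & {0, 1, 2, 3}) == 1:                      # Step 5 (i)
        Xp = (X0, X1)
    else:                                                       # Step 5 (ii)
        Xp = (R[0] | R[3], R[1] | R[2])
    return [{(inv[d], d) for d in Xp[i] & Y[j]}                 # Step 6
            for i in (0, 1) for j in (0, 1)]


def is_valid_partition(C):
    """Check Properties 1-4 of Definition 3 for C = [C_0, C_1, C_2, C_3]."""
    ins = [{s for s, _ in c} for c in C]
    outs = [{d for _, d in c} for c in C]
    covers = set().union(*ins) == set(range(8))
    p1 = all(len(i) == 2 and {s % 2 for s in i} == {0, 1} for i in ins)
    p2 = {s ^ 4 for s in ins[0] | ins[1]} == ins[2] | ins[3]
    p3 = all({d // 2 for d in outs[a] | outs[b]} == {0, 1, 2, 3}
             for a, b in ((0, 2), (1, 3)))
    p4 = all(len(o & {0, 1, 2, 3}) == 1 for o in outs)
    return covers and p1 and p2 and p3 and p4


if __name__ == "__main__":
    C = valid_partition([0, 1, 4, 6, 2, 3, 5, 7])   # permutation of Example 1
    print(C, is_valid_partition(C))
```

The helper `two_color` serves both Step 2 (groups $P_j$ and $O_j$) and Step 4 (groups $R_j$ and $O_j$), reflecting the remark that the two constructions are identical.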


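We now prove that Algorithm 1 produces a valid
partition of the permutation $\pi$ as per Definition 3.

Theorem 1: The connection sets $C_0, C_1, C_2, C_3$ produced
by Algorithm 1 satisfy Properties 1 through 4.

Proof: Since $X'_0 \cap X'_1 = Y_0 \cap Y_1 = \emptyset$ and
$X'_0 \cup X'_1 = Y_0 \cup Y_1 = \{0,1,\ldots,7\}$, the connection sets $C_0, C_1,
C_2, C_3$ form a partition of $\pi$. We now show that they satisfy each
of Properties 1-4.

Property 1:

$$C_0^+ \cap \pi(I_0) = (X'_0 \cap \pi(I_0)) \cap Y_0. \qquad (12)$$

Assuming that case (i) holds in Step 5 of Algorithm 1,

$$\begin{aligned}
C_0^+ \cap \pi(I_0) &= ((R_0 \cup R_1) \cap \pi(I_0)) \cap Y_0 \\
&= (R_0 \cap \pi(I_0)) \cap Y_0, \quad \text{since } R_1 \cap \pi(I_0) = \emptyset \\
&= R_0 \cap Y_0, \quad \text{since } R_0 \subseteq \pi(I_0). \qquad (13)
\end{aligned}$$

Now, if case (ii) holds,

$$\begin{aligned}
C_0^+ \cap \pi(I_0) &= ((R_0 \cup R_3) \cap \pi(I_0)) \cap Y_0 \\
&= R_0 \cap Y_0, \quad \text{since } R_3 \cap \pi(I_0) = \emptyset \text{ and } R_0 \subseteq \pi(I_0). \qquad (14)
\end{aligned}$$

By construction, $|R_i \cap Y_j| = 1$, for $0 \le i \le 3$ and
$0 \le j \le 1$. Hence $|C_0^+ \cap \pi(I_0)| = 1$. Similarly, it can be
shown that $|C_i^+ \cap \pi(I_j)| = 1$ for all $0 \le i \le 3$ and $0 \le j \le 1$.
Thus Property 1 is satisfied.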


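Property 2: If case (i) holds during Step 5, then $X_0 = C_0^+ \cup C_1^+$
and $X_1 = C_2^+ \cup C_3^+$. Hence, Property 2 is satisfied by
construction of $X_0$ and $X_1$.

If case (ii) were true, then $X'_0 = C_0^+ \cup C_1^+ = R_0 \cup R_3$ and
$X'_1 = C_2^+ \cup C_3^+ = R_1 \cup R_2$. Since $R_0 \subseteq X_0$ and $R_3 \subseteq X_1$, the
elements in $R_0$ and $R_3$ independently satisfy Property 2. Also,
since $R_0 \subseteq \pi(I_0)$ and $R_3 \subseteq \pi(I_1)$, $\pi(x_2 x_1 x_0) \in R_0$ implies
$\pi(\bar{x}_2 x_1 x_0) \notin R_3$. Hence,

$$\pi(x_2 x_1 x_0) \in (R_0 \cup R_3) \Rightarrow \pi(\bar{x}_2 x_1 x_0) \notin (R_0 \cup R_3). \qquad (15)$$

Similarly,

$$\pi(x_2 x_1 x_0) \in (R_1 \cup R_2) \Rightarrow \pi(\bar{x}_2 x_1 x_0) \notin (R_1 \cup R_2). \qquad (16)$$

Thus, $X'_0$ and $X'_1$ satisfy Property 2.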


Property 3:

$$C_0^+ \cup C_2^+ = (X'_0 \cap Y_0) \cup (X'_1 \cap Y_0) = Y_0 \qquad (17)$$

and

$$C_1^+ \cup C_3^+ = Y_1. \qquad (18)$$

By construction of $Y_0$ and $Y_1$, Property 3 is satisfied.


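Property 4: $C_0^+ = X'_0 \cap Y_0$.

If case (i) holds during Step 5 of the construction procedure,
then $X'_0 = X_0$ and $|(X_0 \cap Y_0) \cap \{0,1,2,3\}| =
|(X_0 \cap Y_0) \cap \{4,5,6,7\}| = 1$. This is true of $C_1^+, C_2^+, C_3^+$ also.
Hence Property 4 holds.

If case (ii) were true, then we have

$$\text{either } (X_0 \cap Y_0) \subseteq \{0,1,2,3\}, \qquad (19)$$
$$\text{or } (X_0 \cap Y_0) \subseteq \{4,5,6,7\}. \qquad (20)$$

Equation (19) implies

$$\begin{aligned}
& ((R_0 \cup R_1) \cap Y_0) \subseteq \{0,1,2,3\} \\
\Rightarrow\ & ((R_0 \cap Y_0) \cup (R_1 \cap Y_0)) \subseteq \{0,1,2,3\} \\
\Rightarrow\ & (R_0 \cap Y_0) \subseteq \{0,1,2,3\}. \qquad (21)
\end{aligned}$$

Also, by construction, $|Y_0 \cap \{0,1,2,3\}| =
|Y_0 \cap \{4,5,6,7\}| = 2$. Hence,

$$\begin{aligned}
& (X_1 \cap Y_0) \subseteq \{4,5,6,7\} \\
\Rightarrow\ & ((R_2 \cup R_3) \cap Y_0) \subseteq \{4,5,6,7\} \\
\Rightarrow\ & (R_3 \cap Y_0) \subseteq \{4,5,6,7\}, \qquad (22)
\end{aligned}$$

and $C_0^+ = (R_0 \cap Y_0) \cup (R_3 \cap Y_0)$.
By construction, $|R_0 \cap Y_0| = |R_3 \cap Y_0| = 1$. Hence
by Equations (21) and (22), $C_0^+$ has exactly one representative
from each of {0,1,2,3} and {4,5,6,7}. Similarly Equation (20)
can be shown to imply $(R_0 \cap Y_0) \subseteq \{4,5,6,7\}$ and
$(R_3 \cap Y_0) \subseteq \{0,1,2,3\}$. Hence $C_0^+$ satisfies Property 4.

In a similar way, each $C_i^+$, $1 \le i \le 3$, can be shown to
possess representatives from both {0,1,2,3} and {4,5,6,7}.
Thus Property 4 is satisfied.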


3. ROUTING OF PERMUTATIONS IN THE NETWORK 


Once a valid partition of the permutation $\pi$ satisfying
Definition 3 is obtained, actual routing of the permutation on
the network is accomplished as follows:

A 5-bit routing tag is assigned to each of the
input-output connections of $\pi$. The routing tag for the
connection from input terminal $s = s_2 s_1 s_0$ to output terminal
$d = d_2 d_1 d_0$ has the form $c_1 c_0 d_2 d_1 d_0$, where the bits $c_1$ and $c_0$ are set
depending on which of the sets $C_0, C_1, C_2, C_3$ the connection
belongs to ($c_1 c_0$ = 00 for $C_0$, 01 for $C_1$, 10 for $C_2$, and 11 for $C_3$).
The switches in the network are set according to the individual
bits of the routing tag: the top output is chosen if the bit is 0
and the bottom output if it is 1. The bit $c_1$ controls Stage 0, $c_0$
controls Stage 1, and $d_2$, $d_1$, $d_0$ control Stages 2, 3, 4,
respectively.


Theorem 2: If $C_0, C_1, C_2, C_3$ form a valid partition of the
permutation $\pi$, the routing described above produces no
conflicts in any of the switching stages of the network.

Proof: Refer to [9].

Figure 1 shows the routing of the permutation in
Example 1 on the 5-stage shuffle/exchange network. The 5-bit
vectors shown in parentheses on the input side are the routing
tags for the corresponding connections. Notice that the
connections in the set $C_i$ pass through Switch i of the middle stage
for all $0 \le i \le 3$.
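
The tag-based routing can be simulated with a few lines of code. The sketch below is an illustration under stated assumptions: stages are numbered 0 to 4, each stage is a perfect shuffle followed by a switch whose selected output (0 = top, 1 = bottom) becomes the low-order bit of the new line address, and the connection sets are those of Example 1. The function name `route` is hypothetical.

```python
def route(partition):
    """Route all connections of a partition [C_0, C_1, C_2, C_3] through
    five shuffle/exchange stages using tags c1 c0 d2 d1 d0.

    Returns True if no two connections ever share a switch output and
    every connection arrives at its destination.
    """
    conns = []
    for i, C in enumerate(partition):
        for s, d in C:
            tag = [(i >> 1) & 1, i & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1]
            conns.append((s, d, tag))
    pos = {s: s for s, _, _ in conns}
    for stage in range(5):
        for s, _, tag in conns:
            p = pos[s]
            shuffled = ((p << 1) | (p >> 2)) & 7      # perfect shuffle, N = 8
            pos[s] = (shuffled & 6) | tag[stage]      # tag bit picks the output
        if len(set(pos.values())) < len(conns):
            return False                              # conflict at this stage
    return all(pos[s] == d for s, d, _ in conns)


if __name__ == "__main__":
    example_1 = [{(0, 0), (3, 6)}, {(1, 1), (2, 4)},
                 {(5, 3), (6, 5)}, {(4, 2), (7, 7)}]
    print(route(example_1))
```

In this model a connection tagged with $c_1 c_0 = i$ sits on line $2i$ or $2i+1$ when it enters the middle stage, which is why it passes through Switch i there.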


4. REDUNDANCY IN THE NETWORK 


Due to the flexibility in the construction of $X_0$ and $X_1$ in
Step 2 of Algorithm 1, and in the construction of $Y_0$ and $Y_1$ in
Step 4, it is possible to fix the state of one switch in each of the
last two stages (or the first two stages) of the 5-stage
shuffle/exchange network. For example, if we force $X_0$ and $Y_0$
to contain the element 0, then Switch 0 of the last two stages
can be set permanently in the straight position. Similarly, if $X_0$
and $Y_0$ are forced to contain the element $\pi(0)$, Switch 0 of the
first two stages can be set straight permanently. This shows that
two switches in the network are redundant. Notice that a
Benes binary network of the same size contains three redundant
switches [6].


5. CONCLUDING REMARKS 


In this paper we showed that the lower bound of
$(2\log_2 N - 1)$ stages for rearrangeability of a multistage
shuffle/exchange network is also an upper bound for $N \le 8$.
An algorithm for routing arbitrary permutations on the network
was also described. The results in this paper have been used in
[14] for establishing an upper bound of $(3\log_2 N - 4)$ stages for
rearrangeability of a multistage shuffle/exchange network with
N > 8.


Abstract


The cube networks are equipped with very efficient
local control algorithms: the bit control by the
destination tags. In this paper we present an equally
efficient local control algorithm called the signed bit
difference (SBD) tag control for the ADM
(IADM). The signed bit $(0, 1, \bar{1})$ control in place
of the bit (0, 1) control is a natural consequence of
the (3x3) switch size of the ADM (IADM).

We show that the $\Omega(\Omega^{-1})$-passable permutations
are a subset of the ADM(IADM)-passable permutations
under the SBD tag control. We also show
that for the $\Omega(\Omega^{-1})$-passable permutations the SBD
tag can be replaced by the destination tag together
with a single bit of information from a switch number,
yielding the destination tag control for the ADM
(IADM).

A global control algorithm which realizes all the
IADM-passable permutations is also given.


1. Introduction 


In multiple processor systems an interconnection
network provides communication paths for processor-processor
or processor-memory information exchanges.
The (NxN) augmented data manipulator (ADM) network
[8] is an interconnection network which consists
of n switching stages, each with $N = 2^n$ (3x3)
switches, plus one output stage. The inverse ADM
(IADM) is identical to the ADM except that the
direction from the input side to the output side is
reversed. An (8x8) IADM is shown in Figure 1
together with its switching stage numbers and switch
numbers of each stage. The properties of the ADM
(IADM) have been studied actively
[1, 11, 8, 13, 9, 10, 6, 4, 5].


The ADM control algorithm is based on the routing
tag T, which is a representation of (D-S), where
D is the destination and S is the source:

$$T = t_{n-1} \cdots t_1 t_0 = D - S \pmod{N}, \qquad -(N-1) \le T \le (N-1),$$

where each $t_i$ can be 0, 1, or $\bar{1}$ (that is, $-1$).

For a given D and S there exist many different
representations of (D-S) modulo N, and each different
representation (or tag) corresponds to a possible
path from S to D. To realize a permutation we
have to choose N such tags without causing a conflict.
Clearly, some systematic generation of tags
is required. One candidate is the signed magnitude
difference (SMD) tag (called the natural
tag in [11])

$$T = t_n t_{n-1} \cdots t_0 = D - S, \qquad -(N-1) \le T \le (N-1),$$

where $t_i$ ($0 \le i < n$) is 0 or 1, and $t_n$ is the sign bit. Thus the
value of T is

$$\mathrm{value}(T) = (-1)^{t_n} \sum_{i=0}^{n-1} 2^i t_i .$$
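
The multiplicity of tags for one source-destination pair can be seen by enumeration. The sketch below is a hypothetical helper, not part of the paper: it lists every n-digit tag with digits in {-1, 0, 1} whose weighted sum is congruent to D - S modulo N.

```python
from itertools import product


def signed_digit_tags(S, D, n):
    """All tags (t_{n-1}, ..., t_0), t_i in {-1, 0, 1}, with
    sum(t_i * 2^i) congruent to D - S modulo N = 2^n."""
    N = 1 << n
    target = (D - S) % N
    tags = []
    for t in product((-1, 0, 1), repeat=n):
        value = sum(digit * 2 ** i for i, digit in enumerate(reversed(t)))
        if value % N == target:
            tags.append(t)
    return tags


if __name__ == "__main__":
    for tag in signed_digit_tags(0, 1, 3):
        print(tag)
```

Each listed tag corresponds to one candidate path from S to D; a control scheme has to select one tag per source so that the N chosen paths do not conflict.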

Unfortunately it was shown in [3] that only a very small
subset of ADM-passable permutations can be realized
with this type of tag. In Section 2, a new tag
called the signed bit difference (SBD) tag is
defined. It can be regarded as a natural extension
of the bit control of a cube network for the
ADM(IADM), which is based on (3x3) switches. It
was known that the ADM passes all the omega-passable
permutations ($\Omega$) [8]. We show that the
set of ADM-passable permutations under the SBD
tag control contains $\Omega$. The 2n-bit full routing tag
obtained by bitwise exclusive-oring of the source and
the destination for the cube-passable permutations
[12] is in essence identical to the SBD tag. Also in
Section 2 we show that the destination tag together
with a single bit of information from a switch number
can replace the SBD tag for $\Omega(\Omega^{-1})$.

In Section 3 a global control algorithm of the
IADM is given, which realizes all the IADM-passable
permutations. The conclusions are drawn in Section
4.


2. Local Control Algorithms 


2.1. The Signed Bit Difference Tag Control Algorithm

Let the source $S = s_{n-1} \cdots s_1 s_0$, and the
destination $D = d_{n-1} \cdots d_1 d_0$, where $s_i, d_i \in \{0,1\}$ for
$0 \le i < n$. Define the signed bit difference (SBD)
tag

$T = t_{n-1} \cdots t_1 t_0$,
where $t_i = 0$ if $d_i = s_i$,
$t_i = 1$ if $d_i = 1$ and $s_i = 0$,
$t_i = \bar{1}$ if $d_i = 0$ and $s_i = 1$.

In other words the bit $t_i$ of the SBD tag T is the
bit difference $(d_i - s_i)$ represented in the fully
redundant binary number system. The IADM can be
controlled locally at each switch as follows.


SBD Tag Control Algorithm of IADM
(ADM)

The switch j of the switching stage i is controlled
using the SBD tag in the following way:

if $t_i = 0$, choose the straight connection,

if $t_i = 1$ ($\bar{1}$), choose the $+2^i$ ($-2^i$) connection,

for $0 \le j < N$, $0 \le i < n$.


The SBD tag control of the ADM (IADM) is very 
natural, in that it covers each bit difference of the 
destination and the source at each corresponding 
switching stage, without generating any carry to or 
borrow from neighbor stages. It is a direct one-to- 
one mapping of the signed bit difference (d;-s;) to 
the i-th switching stage of the ADM (IADM). 
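
The carry-free behaviour of the SBD tag can be illustrated with a small simulation. This is a sketch under stated assumptions: the network is modelled only by switch numbers, and stage i moves a message from switch j to switch (j + t_i * 2^i) mod N, as in the control rule above. The function names are hypothetical.

```python
def sbd_tag(S, D, n):
    """Signed bit difference tag, returned as [t_0, t_1, ..., t_{n-1}]."""
    return [((D >> i) & 1) - ((S >> i) & 1) for i in range(n)]


def route_sbd(S, D, n):
    """Switch numbers visited in an IADM under SBD tag control."""
    N = 1 << n
    pos = S
    path = [pos]
    for i, t in enumerate(sbd_tag(S, D, n)):
        pos = (pos + t * (1 << i)) % N     # straight, +2^i or -2^i
        path.append(pos)
    return path


if __name__ == "__main__":
    print(sbd_tag(5, 3, 3), route_sbd(5, 3, 3))
```

Since stage i only corrects bit i of the current switch number, the message reaches D after n stages for every source-destination pair.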


Let $\Omega$ ($\Omega^{-1}$) denote the set of the omega (inverse
omega) network passable permutations [2]. Similarly
let $A$ ($A^{-1}$) denote the set of the ADM (IADM) passable
permutations. It has been known [8] that

$$\Omega \subset A.$$

Because of the symmetry this also implies that

$$\Omega^{-1} \subset A^{-1}.$$

In the following we show that the SBD tag is the
control tag for $\Omega(\Omega^{-1})$ on the ADM(IADM).


Theorem 1.

The SBD tag control algorithm realizes $\Omega^{-1}$ on the
IADM.

Proof

For all i, $0 \le i < n$, the SBD tag has the property:

if $s_i = 0$ then $t_i = d_i$ = 0 or 1,
if ($s_i = 1$ and $d_i = 1$) then $t_i = 0$,
else if ($s_i = 1$ and $d_i = 0$) then $t_i = \bar{1}$.

Consider two input sources $S_1 = s_{1,(n-1)} \cdots s_{1,0}$
and $S_2 = s_{2,(n-1)} \cdots s_{2,0}$, where $s_{1,i} = 0$ and $S_2 = S_1 + 2^i$
for an i, $0 \le i < n$. Let $D_1 = d_{1,(n-1)} \cdots d_{1,0}$,
$D_2 = d_{2,(n-1)} \cdots d_{2,0}$, $T_1 = t_{1,(n-1)} \cdots t_{1,0}$, and
$T_2 = t_{2,(n-1)} \cdots t_{2,0}$ be the destinations of $S_1$, $S_2$, and the
SBD tags of $S_1$, $S_2$, respectively. Then by the
above property of the SBD tag:

$(t_{1,i}, t_{2,i}) = (0,0)$, $(1,\bar{1})$, $(0,\bar{1})$, or $(1,0)$.

These four cases correspond to the four possible
settings of a (2x2) switch in the inverse omega
network: straight, cross, upper broadcast, lower
broadcast. (Only the first two are relevant for the
permutations.)

Q.E.D.


Corollary 1.

The SBD tag control algorithm realizes $\Omega$ on the
ADM.


2.2. The Destination Tag Control Algorithm 


Next we show that 27! can be realized on the 
IADM using a different tag -- the destination tag 
together with a single bit information of a switch 
number as well. 


Destination Tag Control Algorithm of IADM 
(ADM) 


The switch number j of the switching stage i,
$j_i = j_{i,(n-1)} \cdots j_{i,0}$, is controlled by $d_i$ of the incoming
destination tag $D = d_{(n-1)} \cdots d_0$ and $j_{i,i}$ as follows,
$0 \le i < n$, $0 \le j_i < N$:


if ($j_{i,i} = 0$ and $d_i = 0$) or
   ($j_{i,i} = 1$ and $d_i = 1$) then
   choose the straight connection,

if ($j_{i,i} = 0$ and $d_i = 1$) then
   choose the $+2^i$ connection,

if ($j_{i,i} = 1$ and $d_i = 0$) then
   choose the $-2^i$ connection.


Note that IADM (ADM) destination tag control is
exactly like the cube destination control with the
($j_{i,i} = 0$) condition substituted for the upper input port,
and the ($j_{i,i} = 1$) condition for the lower input port
of a switch in a cube network.
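
The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: switches are numbered 0 to N-1 with N = 2^n, the $\pm 2^i$ links wrap around modulo N, and the function name `route_by_destination_tag` is hypothetical, not from the paper.

```python
def route_by_destination_tag(source, dest, n):
    """Trace a message through an IADM of size N = 2**n.

    Returns the switch numbers visited at stages 0..n (entry n is the output).
    At stage i only bit i of the current switch number j and bit i of the
    destination tag are inspected, as in the rules above.
    """
    size = 1 << n
    j = source
    path = [j]
    for i in range(n):
        j_bit = (j >> i) & 1
        d_bit = (dest >> i) & 1
        if j_bit == d_bit:
            step = 0            # straight connection
        elif d_bit == 1:
            step = 1 << i       # +2^i connection (sets bit i, no carry)
        else:
            step = -(1 << i)    # -2^i connection (clears bit i, no borrow)
        j = (j + step) % size
        path.append(j)
    return path


# Source 5 (101) to destination 2 (010) on an 8x8 network.
print(route_by_destination_tag(5, 2, 3))
```

Because each stage only sets or clears bit i of the current switch number, the message reaches the destination from any source; the sketch does not check for conflicts between different messages.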


Theorem 2 


The IADM destination tag control algorithm realizes
$\Omega^{-1}$ on the IADM.


Proof 


The two cases of the SBD tag $(t_{1,i}, t_{2,i})$ being
$(0,0)$ or $(1,\bar{1})$ correspond to ($s_{1,i} = d_{1,i} = 0$, and
$s_{2,i} = d_{2,i} = 1$) or ($s_{1,i} = 0$, $d_{1,i} = 1$, and $s_{2,i} = 1$,
$d_{2,i} = 0$), respectively. The switch $j_i$ receives its
destination from one of possible $2^i$ sources of stage 0.
If $j_{i,i}$ is equal to the i-th bit of every possible source
number, the IADM destination tag control algorithm
is equivalent to the SBD tag control algorithm for
$\Omega^{-1}$, thus realizing $\Omega^{-1}$ on the IADM.


Let $S_j = s_{j,(n-1)} \cdots s_{j,0}$ for $0 \le j < N$ denote the N
sources at stage 0. Then $j_0 = S_j$ for $0 \le j < N$.
At stage 1, by the destination tag control algorithm
applied to the stage 0, the output from $j_0$ is the
input to $j_1$ of the stage 1,

$j_1 = s_{j,(n-1)} \cdots s_{j,1}\, d_{j,0}$

In general at stage i, $0 \le i < n$, by the destination tag
control algorithm

$j_i = s_{j,(n-1)} \cdots s_{j,i}\, d_{j,(i-1)} \cdots d_{j,0}$

receives the input destination created by $j_0$ of stage
0. Thus

$j_{i,i} = s_{j,i}$

for $0 \le j < N$, $0 \le i < n$.
Q.E.D.
Q.E.D. 


Corollary 2 


The destination tag control algorithm of ADM realizes
$\Omega$ on the ADM.


The SBD tag control, the destination tag control of
the IADM and the destination tag control of the
cube networks are shown in Figure 2.


3. A Global Control Algorithm 


3.1. Background 


In the previous section we showed two local control
algorithms which can realize $\Omega^{-1}$ on the IADM.
The implication of the previous section is that,
although the ADM is more powerful than a
cube network, once a fixed local control algorithm is
imposed upon it, it is generally equal to a cube
network. The power of the ADM lies rather in the
variety of control structures; many different types of
local control tags with different permuting power can
be generated, whereas only two control tags (the
destination tag, and the tag obtained by bit
exclusive-oring the destination and the source [11],
which have the same permuting power) exist for a
cube network. A global control algorithm therefore
seems to be needed to achieve the full
potential of the ADM. In this section we describe a
global control algorithm which is simple, systematic,
and can pass all the IADM-passable permutations.

If a given permutation is not IADM-passable, it
prints the fact.


Define a control matrix C(0:(N-1), 0:(n-1)) of
which the element C(i,j) contains the switch setting
information $0$, $1$, $\bar{1}$ for the i-th switch of the j-th
stage. As before the straight path is chosen for the
control signal 0, and the $+2^j$ ($-2^j$) path is chosen for $1$ ($\bar{1}$).
At the beginning, C is initialized by setting C(i,j) =
j-th bit of the SBD tag at source i. The column
C(-,j) represents the switch settings of N switches of
the switching stage j, called the stage setting. A
stage setting creates a conflict if there is any pair
C(i,j), C(k,j) choosing the paths leading to the same
switch at the next switching stage (j+1).
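
The conflict definition above translates directly into a small check. The following Python sketch is an illustration only: it assumes the signed bit $\bar{1}$ is encoded as -1, that links wrap around modulo N = 2^n, and the helper names `next_switches` and `has_conflict` are hypothetical.

```python
def next_switches(setting, stage, n):
    """Switch reached at stage+1 by each switch i under its signed control bit.

    setting[i] is 0 (straight), 1 (+2**stage) or -1 (-2**stage).
    """
    size = 1 << n
    return [(i + c * (1 << stage)) % size for i, c in enumerate(setting)]


def has_conflict(setting, stage, n):
    """True if two switches of this stage choose paths to the same next switch."""
    targets = next_switches(setting, stage, n)
    return len(set(targets)) != len(targets)


# Stage 0 of an 8x8 IADM: switches 0 and 1 both reach switch 0, so this conflicts.
print(has_conflict([0, -1, 1, 0, 0, -1, 1, 0], 0, 3))
# Swapping neighbours pairwise (1 followed by -1) is conflict free.
print(has_conflict([0, 1, -1, 0, 0, 1, -1, 0], 0, 3))
```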


A permutation is IADM-passable if all the stage
settings are conflict free. In [3] a ring model is used
to describe the IADM. Due to the $0, \pm 2^i$ connection
patterns at stage i, all the switches of stage i are
partitioned into $2^i$ equivalence classes of $2^{(n-i)}$
switches each (see Figure 3). The stage setting of stage i
consists of $2^i$ independent settings of $2^i$ partitions
(called partition settings).


A stage setting is conflict free if all the $2^i$ partition
settings are conflict free. A permutation is
IADM-passable if every stage setting is conflict free.
Each conflict free partition setting (called a good
string) can be divided into two types [3]. In the
following * (+) denotes a repetition of zero or more
(one or more) times, and the length of the string is
equal to the cardinality of a partition.


(I) Good string type 0.
    There is at least a pair of 0s,
    and it can be represented as
    $(0(1\bar{1})^*0 + 0^*)^*$
    Each type 0 string creates a unique
    partition at the next stage.
(II) Good string type 1.
    There is no 0 switch setting in a
    string. There are four type 1
    strings, and all these strings
    create identical partitions at
    the next stage:
    $(1)^+$, $(1\bar{1})^+$, $(\bar{1})^+$, $(\bar{1}1)^+$

The good string type 0 is based upon the fact that
once a pair of switches in a partition are set
straight, any switch between the pair should be set
$1$ followed by $\bar{1}$ repeatedly to avoid a conflict. The
basis for the good string type 1 is that if there is
no switch set straight, then the given four switch
settings are the only conflict free switch settings.
There are $(G_k - 2)$ type 0 strings, where $G_k = G_{(k-1)}
+ G_{(k-2)}$; k (= string size) $\ge 3$, $G_2 = 3$, $G_1 = 1$. The
relation of $G_N$ to the number of IADM-passable
permutations $P_N$ is given as [3]

$P_N = (G_N - 1)\, P_{N/2}^{2}$

with all $(G_N - 2)$ type 0 strings included in $(G_N - 1)$
plus one coming from type 1 strings.
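
The count of good strings can be cross-checked by brute force for small partitions. The Python sketch below is not from the paper: it assumes a partition of k switches is modeled as a ring in which a control signal of 1 (-1, standing for $\bar{1}$) moves a message one position forward (backward) modulo k, and the function names are hypothetical.

```python
from itertools import product


def g(k):
    """G_1 = 1, G_2 = 3, G_k = G_(k-1) + G_(k-2) for k >= 3."""
    a, b = 1, 3
    if k == 1:
        return a
    for _ in range(k - 2):
        a, b = b, a + b
    return b


def count_good_strings(k):
    """Count conflict free settings of a ring of k switches by exhaustive search.

    Returns (type0, type1): settings containing at least one 0, and settings
    containing no 0 at all.
    """
    type0 = type1 = 0
    for setting in product((-1, 0, 1), repeat=k):
        targets = {(p + c) % k for p, c in enumerate(setting)}
        if len(targets) != k:
            continue  # two switches lead to the same next switch: conflict
        if 0 in setting:
            type0 += 1
        else:
            type1 += 1
    return type0, type1


for k in (2, 4, 8):
    print(k, g(k), count_good_strings(k))
```

For the power-of-two partition sizes that occur in the network, the exhaustive count of type 0 strings agrees with $G_k - 2$, and exactly four type 1 strings are found.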


Thus the majority of the IADM-passable permutations
have unique switch settings. A choice of
switch settings exists only when a partition
setting does not require any straight switch
setting, and thus belongs to the type 1 good strings.
Although the four different type 1 strings, $(1)^+$, $(1\bar{1})^+$,
$(\bar{1})^+$, $(\bar{1}1)^+$, create identical partitions, they create
two different permutations: one by $(1)^+$ or $(1\bar{1})^+$,
and the other by $(\bar{1})^+$ or $(\bar{1}1)^+$. Hence given an
overall permutation for the IADM a partition can be
set at most in two different ways.


3.2. IADM-Global-Control Algorithm 


We are ready to introduce the IADM-global-control 
algorithm. First a very brief description of the 
general flow is given. 


General Flow of the IADM-global-control 


Start with the SBD control tags for 
the IADM; 

loop for all stages 0 to (n-1); 

loop for all partitions of each stage; 


If the partition setting has any 0, then
there is a unique partition setting
$(0(1\bar{1})^*0 + 0^*)^*$. Change the original partition
setting into this form. If it cannot be done,
backtrack to the most recent stage with an
alternative. If there is no such stage, the
given permutation is not IADM-passable.
Stop.


If the partition setting does not have any
0, change it into $(1)^+$ (an arbitrary choice
between $(1)^+$ and $(1\bar{1})^+$), and mark it for
backtracking as there is an alternative, $(\bar{1})^+$ (an
arbitrary choice between $(\bar{1})^+$ and $(\bar{1}1)^+$).


end loop 
end loop 


Conflict free control tags have been 
obtained. 


end IADM-global-control 


Next we give the algorithm in detail. The
procedures called by IADM-global-control are omitted
due to space limitation.


Procedure IADM-global-control (D, C);

/* IADM-global-control generates control signals in
   C if the given permutation D is passable. It prints
   "not passable" if D is not passable. */

bit array D(0: N-1, 0: n-1);
    /* N n-bit destinations */
signed bit array C(0: N-1, 0: n-1);
    /* control bits 0, 1 or $\bar{1}$
       for the entire IADM */
signed bit array STRING(0: N-1);
    /* control bits 0, 1, or $\bar{1}$
       for a partition of a stage */
integer array BTR.STAGE(0: n-2),
              BTR.PART(0: 2^(n-2)-1);
    /* stacks for backtracking */
integer j, k, m, length, btr-ptr;
boolean success;

initialize C with SBD tags;
btr-ptr = 0; success = false;

for all stage j=0 to n-1;
  for all partition k=0 to 2^j-1;

    /* build a string for stage j, partition k */
    call build-string(STRING, j, k, length);

    if any 0 in the STRING
    then
      /* type 0 string: unique setting */
      call make-good-string(STRING, j, k, length, 0, success);
      if success = false
      then
        /* can we backtrack to change $(1)^+$ to
           $(\bar{1})^+$ in previous stages? */
        if btr-ptr = 0
        then
          /* no backtracking possible */
          print ("not passable permutation");
          return;
        fi;
        /* start backtracking */
        j = BTR.STAGE(btr-ptr);
        k = BTR.PART(btr-ptr);
        btr-ptr = btr-ptr - 1;
        call build-string(STRING, j, k, length);
        call make-good-string(STRING, j, k, length, $\bar{1}$, success);
      fi;
    else
      /* type 1 string: two alternative ways of
         setting, $(1)^+$ or $(\bar{1})^+$ */
      /* $(1)^+$ tried first */
      call make-good-string(STRING, j, k, length, 1, success);
      /* store in the backtrack stacks */
      btr-ptr = btr-ptr + 1;
      BTR.STAGE(btr-ptr) = j;
      BTR.PART(btr-ptr) = k;
    fi;

    Simulate the switch setting by moving
    the rows of C according to the
    control signals for this partition
    (only stages j+1 through n-1).

  end for partition;
end for stage;

/* C contains the complete, correct
   switch setting signals for the IADM */

endproc IADM-global-control;


3.3. An Example


As an example consider the unshuffle permutation
on an (8x8) IADM. We obtain the SBD tags C =
$c_2 c_1 c_0$ by calculating the signed bit difference $d_i - s_i$
for $c_i$, $0 \le i < 3$.


    D          S          C
d2 d1 d0   s2 s1 s0   c2         c1         c0
 0  0  0    0  0  0   0          0          0
 1  0  0    0  0  1   1          0          $\bar{1}$
 0  0  1    0  1  0   0          $\bar{1}$  1
 1  0  1    0  1  1   1          $\bar{1}$  0
 0  1  0    1  0  0   $\bar{1}$  1          0
 1  1  0    1  0  1   0          1          $\bar{1}$
 0  1  1    1  1  0   $\bar{1}$  0          1
 1  1  1    1  1  1   0          0          0
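
The SBD tags of this example can be recomputed mechanically. The Python sketch below is an illustration, not the authors' code: it assumes the unshuffle maps address $s_2 s_1 s_0$ to $s_0 s_2 s_1$ (as in the table), encodes $\bar{1}$ as -1, and uses the hypothetical names `sbd_tag` and `unshuffle`.

```python
def sbd_tag(source, dest, n):
    """Signed bit difference tag [t_(n-1), ..., t_0] with t_i = d_i - s_i."""
    return [((dest >> i) & 1) - ((source >> i) & 1) for i in range(n - 1, -1, -1)]


def unshuffle(source, n):
    """Rotate the n-bit address right by one bit: s2 s1 s0 -> s0 s2 s1."""
    return (source >> 1) | ((source & 1) << (n - 1))


n = 3
tags = [sbd_tag(s, unshuffle(s, n), n) for s in range(2 ** n)]
for s, tag in enumerate(tags):
    print(format(unshuffle(s, n), "03b"), format(s, "03b"), tag)

# Column c_0 is the stage setting of stage 0.
stage0 = [tag[-1] for tag in tags]
print(stage0)
```

Reading off the last entry of every tag gives the stage 0 setting that the global control algorithm examines first.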


The unshuffle is not IADM-passable under the SBD
tag control because at the very first stage (stage 0)
we have the stage tag

$0\ \bar{1}\ 1\ 0\ 0\ \bar{1}\ 1\ 0$

which does not satisfy the condition of type 0 good
string $(0(1\bar{1})^*0 + 0^*)^*$ as it is. Figure 4 shows
the actual conflicts caused by $0$ followed by $\bar{1}$.


So by the IADM-global-control C(-,0) is changed to
(Figure 5) a good type 0 string $(0\ 1\ \bar{1}\ 0)^2$ or

$0\ 1\ \bar{1}\ 0\ 0\ 1\ \bar{1}\ 0$

resulting in new tags ($C_1$). When we simulate the
effect of switch settings at stage 0, we obtain $C_2$.
For stage 1, we have two independent partitions of
switches (0, 2, 4, 6), (1, 3, 5, 7). Both partition
tags are $(0\ \bar{1}\ 1\ 0)$ obtained by $c_1$. These should be
changed to $(0\ 1\ \bar{1}\ 0)$ as shown in $C_3$. $C_4$ is
obtained from $C_3$ by simulating switch settings of stage
1. The stage 2 has four independent partitions of
switches (0, 4), (2, 6), (1, 5), (3, 7). All the partition
tags are (0 0) which is a good type 0 string.
Thus $C_4$ contains a complete, conflict free switch
setting for the unshuffle. The application of $C_4$ to
IADM is given in Figure 6. This permutation did
not require any backtracking, as all the partition
settings were type 0 tags.


3.4. ADM-global-control Algorithm 


The ring model of ADM [3] defines two types of
good strings. Based on these the IADM-global-control
was developed. We can obviously develop an
ADM-global-control in much the same way. The
only difference is that the ADM-global-control works
from stage (n-1) to stage 0, reversing the order. The
reverse direction does complicate the conflict free tag
calculation. When we start from the lowest order
bit, value 0 in that position cannot change, thus
rendering some degree of determinism. This fact is
propagated to bit positions 1, 2, ..., (n-1). But this
is not the case when we work from the reverse
direction. Thus we might end up with a conflicting
stage setting for stage i, while producing conflict free
stage settings for stages < i, requiring much more
backtracking than is required for the IADM in
general. Hence, the IADM can be controlled faster
than the ADM for the realization of all the possible
permutations. Instead of presenting another procedure
very similar to the IADM-global-control, we
simply show an example of the effect of such a
procedure.


Consider a shuffle on the (8x8) ADM. We first
calculate SBD tags C = $c_2 c_1 c_0$.


    D          S          C
d2 d1 d0   s2 s1 s0   c2         c1         c0
 0  0  0    0  0  0   0          0          0
 0  1  0    0  0  1   0          1          $\bar{1}$
 1  0  0    0  1  0   1          $\bar{1}$  0
 1  1  0    0  1  1   1          0          $\bar{1}$
 0  0  1    1  0  0   $\bar{1}$  0          1
 0  1  1    1  0  1   $\bar{1}$  1          0
 1  0  1    1  1  0   0          $\bar{1}$  1
 1  1  1    1  1  1   0          0          0


The stage 2 consists of four independent partitions
of switches (0, 4), (2, 6), (1, 5), (3, 7) with partition
settings $(0\ \bar{1})$, $(1\ 0)$, $(0\ \bar{1})$, $(1\ 0)$, respectively.
These partition settings are not good type 0 strings,
so we change all of them into (0 0) in $C_1$ (Figure
7). The stage 1 consists of two independent partitions
of switches (0, 2, 4, 6) and (1, 3, 5, 7) with
partition settings $(0\ 1\ \bar{1}\ \bar{1})$, $(1\ 1\ \bar{1}\ 0)$, respectively.
We change both into $(0\ 1\ \bar{1}\ 0)$ in $C_2$. By simulating
stage 1 stage settings $C_3$ is produced. The stage
tag of stage 0 in $C_3$ is a good type 0 string,
$(0\ 1\ \bar{1}\ 0\ 0\ 1\ \bar{1}\ 0)$. The switch setting of the (8x8) ADM
for the shuffle by $C_3$ is shown in Figure 8.


4. Conclusions 


The signed bit difference (SBD) tag was defined.
Due to the three possible paths of each switch in
the ADM a signed bit ($1$, $0$, $\bar{1}$) is required to
control each switch. The signed bit control algorithm
for the ADM can be regarded as a natural extension
of the bit control algorithm for a cube network.
The SBD tag control algorithm is the signed bit
control algorithm, which maps the bit difference in
the destination and the source directly onto the
corresponding switching stage. The $\Omega$ ($\Omega^{-1}$)-passable
permutations are shown to be a subset of
ADM (IADM)-passable permutations under the SBD
tag control.


We also showed that for the cube passable per- 
mutations the SBD tag can be replaced by the des- 
tination tag itself together with a single bit infor- 
mation from each switch (the i-th bit of the switch 
number in the i-th stage). Thus the destination tag 
control algorithm for the ADM was defined. 


A global control algorithm which realizes all the 
IADM-passable permutations was presented. 


Both local control algorithms are applicable to the
gamma network [7]. A global control algorithm for
the gamma network is being investigated.


ABSTRACT


The performance evaluation of concurrency
control algorithms (CCA) has been extensively
studied in both the centralized and decentralized
environments [GARE79, HSIA81, SHMU82, CHES83,
SHUM81, AGRA85, MITR84]. However, differing
performance models, assumptions and parameters
studied have led to contradictory results
[AGRA85, HELA86].


In this paper the behaviour and performance
of two fundamentally different CCAs in single
site databases are investigated. These are
the dynamic two-phase locking (2PL) and the
commit-time validation (CTV). 2PL represents a
pessimistic approach to concurrency control
whereas CTV is an optimistic approach [KUNG81].
For each algorithm a performance model was
constructed and simulation was used
to perform the study. Three parameters affecting
data contention are studied in this paper. These
parameters are: the degree of multiprogramming
(the load effect), the read/write mix (ratio of
query to update) and the database granularity.
Unlike previous studies, in this paper we study
the three aforementioned parameters together,
thus providing insight into the composite effect
of these parameters.


1. Introduction


Although the parallel processing of
transactions enhances the system's performance,
data misuse may result in the absence of
concurrency control mechanisms. There are three
generically different approaches that can be used
to design concurrency control algorithms. These
are: locking, timestamping, and the optimistic
approach [BERN81]. In this paper we study and
compare the performance of two of these three
approaches. These are the
Dynamic Two-Phase Locking (locking approach)
[Mena78, Rose78] and the Commit-Time Validation
algorithms (optimistic approach) [Bada79, Baye80,
Kung81].


Dynamic Two-Phase Locking synchronizes
reads and writes by explicitly detecting and
preventing conflicts between concurrent
operations. It regulates data access as follows.
Before reading a data item X, a transaction must
own a readlock on X. Before writing into X, it
must own a writelock on X. The ownership of
locks is governed by two rules: (1) A
transaction may own a readlock on an item X as
long as no other transaction owns a writelock on
X, and a transaction may own a writelock on an
item X if no other transaction owns a readlock or
a writelock on the same item X. (2) Once a
transaction surrenders ownership of a lock it may
never obtain any additional locks. The 2PL
controller may block a transaction by causing it
to wait for unavailable locks. Due to this
blocking, deadlocks may result. Deadlocks could
either be prevented or detected and then resolved
[ELMA85]. In this paper, 2PL uses a deadlock
detection scheme.
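
Rule (1) above can be illustrated with a minimal lock table. The Python sketch below is not the simulator used in the study; the class name `LockTable`, its methods, and the string modes "read" and "write" are assumptions made for illustration. Rule (2) and blocking are deliberately left out.

```python
from collections import defaultdict


class LockTable:
    """Readlock/writelock ownership following rule (1) only."""

    def __init__(self):
        self.readers = defaultdict(set)  # item -> transactions owning a readlock
        self.writer = {}                 # item -> transaction owning the writelock

    def can_grant(self, txn, item, mode):
        other_writer = self.writer.get(item) not in (None, txn)
        if mode == "read":
            # A readlock needs no other transaction to own a writelock on the item.
            return not other_writer
        # A writelock needs no other transaction to own any lock on the item.
        other_readers = bool(self.readers[item] - {txn})
        return not other_writer and not other_readers

    def grant(self, txn, item, mode):
        """Record the lock if it is compatible; return whether it was granted."""
        if not self.can_grant(txn, item, mode):
            return False
        if mode == "read":
            self.readers[item].add(txn)
        else:
            self.writer[item] = txn
        return True


table = LockTable()
print(table.grant("T1", "X", "read"))   # shared read
print(table.grant("T2", "X", "read"))   # shared read
print(table.grant("T3", "X", "write"))  # refused: X has other readers
```

A refused request is where a real 2PL controller would either block the requester or run its deadlock detection.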


In CTV (also called serial validation), the
concurrency controller keeps track of the
write-sets of recently committed transactions.
Transactions run freely until commit time, at
which point each transaction is submitted to a
validity test to see if committing it will leave
the database in a consistent state. For a
committing transaction $T_j$, the test considers all
recently committed transactions $T_{rc}$, where a
recently committed transaction is one that
has committed since $T_j$ started running. The test
results in $T_j$ being committed iff

$\mathrm{Readset}(T_j) \cap \mathrm{Writeset}(T_{rc}) = \emptyset$

for all $T_{rc}$, and being restarted otherwise.
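
The validity test is a set-intersection check, which the following Python sketch makes concrete. The function name `validate` and the representation of read-sets and write-sets as Python sets of item identifiers are assumptions for illustration.

```python
def validate(read_set, recent_write_sets):
    """Commit-time validation test.

    read_set: items read by the committing transaction T_j.
    recent_write_sets: write-sets of the transactions that committed
    since T_j started running.
    Returns True if T_j may commit, False if it must be restarted.
    """
    return all(read_set.isdisjoint(write_set) for write_set in recent_write_sets)


print(validate({"a", "b"}, [{"c"}, {"d", "e"}]))  # no overlap: commit
print(validate({"a", "b"}, [{"c"}, {"b", "e"}]))  # item b was overwritten: restart
```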


Two-phase commit, a recovery protocol
normally used in distributed database systems, is
utilized both by the 2PL and the CTV algorithms
(in a single site database) in order to preserve
transaction atomicity and database consistency in
the presence of system failures. No
logging, shadowing, or differential files are used
in this work. In two-phase commit (2PC),
transactions write their updates in the main
memory. When a transaction ends, its updates are
committed in two phases. In the first phase, the
updates in the main memory are copied to a secure
storage. The second phase starts by copying each
single update to its place in the stored database.
If a failure occurs before the second phase
starts, then the database is still consistent.
If a failure occurs during the second phase, the
database may become inconsistent but this
inconsistency can be rectified using the secure
storage.


In this study it is assumed that the
transaction size is large. This gives worst case
results for a system with transactions of mixed
sizes. In the next two sections, the simulation
performance model is explained for each of the
2PL and the CTV algorithms. Section 4 lists the
unified model assumptions. Section 5 explains
the experiments and the results. Finally, the
concluding remarks are given in section 6.


2. Simulation of the Two-Phase Locking algorithm 


A fixed number of transactions are
continuously cycled around the model shown in
figure 1. Initially, all transactions are put on
the lock request queue. A transaction then goes
through the following stages.


(1) A transaction requests the next needed 
lock from the lock manager which in turn tries to 
get this lock. If the lock manager succeeds, the 
transaction is granted the lock. If the lock 
granted was a read lock, the transaction is put 
on the low priority disk queue in order to read 
the item needed. When the read operation ends, 
the transaction is put back on the lock request 
queue to request the next needed lock when the 
lock manager is free. If the lock granted was a 
write lock, no disk write operation takes place 
(as the two-phase commit is employed, all writes 
take place at commit time) and again, the 
transaction is put back on the lock request queue 
to request the next needed lock when the lock 
manager is free. 


(2) If the lock manager fails to get the 
lock on the item needed, because other 
transaction(s) hold(s) a lock on that item, the 
lock manager tries to let the requesting 
transaction wait for the unavailable lock until 
it is released. If this waiting would result in 
a deadlock, the lock manager then restarts the 
requesting transaction. Otherwise, the 
requesting transaction is put on one of the block 
queues (virtually, each database item has a block 
queue) in which case the transaction is said to 
be blocked. 


(3) When the restarter restarts a
transaction, it releases all locks held by that
transaction. Consequently, some blocked
transactions should become unblocked and they are
put on the front of the lock request queue. The
workspace of the restarted transaction is reset
and it reexecutes from the beginning.


(4) When a transaction completes, it is
put on the high priority disk queue and all its
writes are committed following the two-phase
commit protocol. When a transaction is
committed, another transaction arrives
instantaneously at the system.
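
Stage (2) requires deciding whether blocking a requester would create a deadlock. The text does not say which detection tests the simulator runs, so the Python sketch below swaps in a standard technique as an assumption: a cycle check on a waits-for graph. The function name `would_deadlock` and the dictionary representation are hypothetical.

```python
def would_deadlock(waits_for, requester, holders):
    """Check whether making `requester` wait for `holders` closes a cycle.

    waits_for: dict mapping a blocked transaction to the set of
    transactions it is currently waiting for.
    holders: transactions owning the lock the requester needs.
    A deadlock would result if some holder (transitively) waits for
    the requester.
    """
    stack = list(holders)
    seen = set()
    while stack:
        txn = stack.pop()
        if txn == requester:
            return True
        if txn in seen:
            continue
        seen.add(txn)
        stack.extend(waits_for.get(txn, ()))
    return False


# T2 waits for T3, and T3 waits for T1. If T1 now waits for T2, a cycle forms.
graph = {"T2": {"T3"}, "T3": {"T1"}}
print(would_deadlock(graph, "T1", {"T2"}))
print(would_deadlock({"T2": {"T3"}}, "T1", {"T2"}))
```

When the check returns True the lock manager in the model would restart the requester; otherwise the requester joins the block queue of the item.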


The concurrency control is modeled by two
servers: the lock manager and the restarter.
The lock manager provides two kinds of services.
The first is the "Try to lock" service for a
requesting transaction. This service takes a
time delay $\Delta_{ttl}$ to be completed. $\Delta_{ttl}$ is found
by evaluating the size of the actually executed code
for this service, which
is found to be around 100 instructions. Assuming a
one MIPS processor, this service takes about
0.0001 seconds to be completed. The second
service provided by the lock manager is the "Try
to block" service in which the lock manager
performs several deadlock detection tests to see
if blocking a transaction would result in a
deadlock or not. This service takes a time delay
$\Delta_{ttb}$ to be completed. The size of code for this
service is not estimated; rather, an actual
deadlock detection algorithm is coded and a
counter is used to count the number of actually
executed instructions while calling this code.
The restarter server releases locks held by an
aborted transaction and advances the block queues.
This service takes a time delay $\Delta_R$ to be
completed. The size of the actually executed code for
this service is found to be about 5000
instructions. Assuming a one MIPS processor, this
service takes about 0.005 seconds to be
completed. It should be noted that modeling
the concurrency controller as a set of servers
allows the utilization of multiprocessing
whenever possible. In the case of a one-processor
computer, at most one server is functioning at
any time.
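
The delay figures quoted above follow from dividing the instruction count by the processor speed. A small Python sketch of that arithmetic, with the hypothetical helper name `service_time_seconds` and one MIPS taken as one million instructions per second:

```python
def service_time_seconds(instructions, mips):
    """Time to execute `instructions` on a processor rated at `mips` MIPS."""
    return instructions / (mips * 1_000_000)


print(service_time_seconds(100, 1))    # "Try to lock" service
print(service_time_seconds(5000, 1))   # restarter service
print(service_time_seconds(5000, 2))   # same service on a two MIPS processor
```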


[Figure 1: Performance Model of the 2PL Algorithm]


The parameters which control and drive the
simulation are the following.
1- Number of database granules (items): M
2- Number of transactions in the system
(the degree of multiprogramming or
load): T
3- Number of granules accessed by a
transaction: I
4- Read/Write ratio: THRESHOLD 
5- Disk average service rate: $\mu$
6- Lock Management CPU time delay: $\Delta_{ttl}$
7- Restart time delay: $\Delta_R$
8- Processor Power in MIPS


The following performance indices are
measured for each simulation run (experiment).


1- Average transaction response time (RT): seconds
2- System throughput (TP): transactions/second
3- Degree of concurrency (DC)
4- Deadlock rate (DR): deadlocks/second
5- Conflict rate (CR): conflicts/second
6- Maximum number of times a transaction
was restarted (MXRST)


The degree of concurrency is defined as
follows. Let $T_n$ be the duration of time through
which the number of active transactions in the
system was n. The degree of concurrency is then
given by:

$DC = \frac{1}{T \cdot T_s} \sum_{n=1}^{T} n \cdot T_n$

where $T_s$ is the system time and T is the number
of transactions in the system. Clearly,
$0 \le DC \le 1$.
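
As a worked illustration of the definition, the Python sketch below evaluates DC from a table of durations. The function name `degree_of_concurrency` and the sample numbers are made up for the example and are not results from the study.

```python
def degree_of_concurrency(durations, num_transactions, system_time):
    """DC = (1 / (T * Ts)) * sum over n of n * T_n.

    durations: dict mapping n (number of active transactions) to T_n,
    the time during which exactly n transactions were active.
    """
    weighted = sum(n * t_n for n, t_n in durations.items())
    return weighted / (num_transactions * system_time)


# Hypothetical run: 4 transactions observed for 10 seconds.
sample = {1: 2.0, 2: 3.0, 4: 5.0}
print(degree_of_concurrency(sample, 4, 10.0))
```

With these sample durations the weighted sum is 2 + 6 + 20 = 28, so DC = 28 / 40 = 0.7.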


3. Simulation of the Commit-Time Validation 


A fixed number of transactions are 
continuously cycled around the model shown in 
figure 2. Initially, all transactions are put on 
the low priority disk queue. A transaction then 
goes through the following stages. 


(1) Each time a transaction submits a read 
operation, it is put on the low priority disk 
queue. No disk write operation takes place when 
a transaction submits a write operation (as the 
two-phase commit is employed, all writes take 
place at commit time). When a read or a write
operation is complete, the transaction
immediately submits its next read or write
operation, if any.




(2) When the transaction completes, it is 
validated to see whether its writes could be 
committed or not. The transaction is put on the 
validator queue. When the transaction is 
considered by the validator, a validity test is 
performed. 


(3) If the transaction is successfully
validated, it is put on the committer queue in
order to commit its writes. If the transaction
is a pure query (read only), no commitment occurs
and the transaction terminates.


(4) If the validation test fails, the 
transaction is restarted by the restarter. The 
transaction workspace is reset and it reexecutes 
from the beginning. 


(5) When a transaction is committed, it 
finishes execution and it then terminates. When 
a transaction terminates, another one 
instantaneously arrives at the system. 


The concurrency control is modeled by a set 
of servers. These are the validator, the 
committer, and the restarter. The validator 
tests the completed transaction against all 
recently committed transactions. If the readset 
of the former intersects with the writeset of at 
least one recently committed transaction, the 
test fails, otherwise it succeeds. The
validation service takes a time delay $\Delta_V$ to be
completed. The committer inserts information
about the committing transactions into the tables
of the recently committed transactions and it
also maintains these tables by deleting obsolete
entries (entries with timestamps smaller (older)
than the smallest timestamp of all active
transactions are deleted). The commitment
service takes a time delay $\Delta_C$ to be completed.
The restarter resets the workspace of the
unsuccessfully validated transaction and inserts
it back into the system to reexecute from the
beginning. The restarter service takes a time
delay $\Delta_R$ to be completed. $\Delta_V$, $\Delta_C$ and $\Delta_R$ are
estimated in a similar way as $\Delta_{ttb}$ and $\Delta_R$ of the
2PL. $\Delta_V$, $\Delta_C$, and $\Delta_R$ are all estimated to be
around 0.005 seconds. Again, the concurrency
controller is modeled as a set of servers so that
multiprocessing may be utilized.


The parameters which control and drive the
simulation are the following.


1- Number of database granules (items): M
2- Number of transactions in the system
(the degree of multiprogramming or
load): T
3- Number of granules accessed by a
transaction: I
4- Read/Write ratio: THRESHOLD
5- Disk average service rate: $\mu$
6- Validation test time delay: $\Delta_V$
7- Commitment time delay: $\Delta_C$
8- Restart time delay: $\Delta_R$
9- Processor Power in MIPS


The following quantities are measured for
each simulation run (experiment).


1- Average transaction response time (RT): seconds
2- System throughput (TP): transactions/second
3- Conflict rate (CR): conflicts/second
4- Maximum number of times a transaction
was restarted (MXRST)


4. Unified Model Assumptions


Following is the set of model assumptions 
made in constructing the performance model of 
both the 2PL and the CTV algorithms. 


(1) Two-phase commit is incorporated with
the 2PL (CTV) algorithm, thus a write operation is
treated as a non-disk operation, and on
transaction completion, two-phase commit begins
by issuing 1 + |Writeset| disk operations (one in
the first phase and |Writeset| in the second
phase).


(2) Resources are finite: the physical 
database is stored in one disk unit with a given 
service rate. The case of more than one disk 
unit can equivalently be studied by one disk unit 
with a higher service rate. Also, there is only 
one CPU. 


(3) Elements of the readset and the 
writeset of a transaction are distinct. 


(4) Computation time needed by each
transaction is negligible. This is justifiable
because most transactions are I/O-bound
processes.


(5) Transaction size is large. This gives
worst case results for a system with mixed
sized transactions. How large a transaction is,
is explained in the next section.


(6) A transaction workspace consists of
two parts: the transaction definition and the
transaction data section. When a transaction is
restarted, only its data section (or part of it)
is reset.


5. Experiments and Results 


In all experiments, a fixed number of
transactions is cycled around the 2PL (CTV)
simulator. This number represents the system
load. In 2PL, a given transaction j may be
inactive (blocked) or actively running. An
actively running transaction may be taking a
service at some server (lock manager or
restarter) or else be submitting a new lock
request to the simulator. This lock request is
denoted $L_{ijk}$ (transaction j requests a lock on
item i in mode $k \in \{\text{Read}, \text{Write}\}$). In CTV, a
transaction j is always actively running, taking a
service at some server (validator, committer or
restarter) or submitting a new access request to
the simulator. This access request is denoted
$R_{ijk}$ (transaction j submits an access request on
item i in mode $k \in \{\text{Read}, \text{Write}\}$). The process of
submitting an $L_{ijk}$ ($R_{ijk}$) is repeated for an
"Observation Interval" number of times.
The observation interval ranges from 2000 to 40,000
lock (access) requests. In the following set of
experiments, the database size (M) was assumed
to be a total of 640 items. The effect of
granularity on the performance was investigated
by studying granularities of 80, 160, 320, and
640 granules.


According to Yao [Yao77], the value of the
effective transaction size changes with varying
granularity. In these experiments, the
effective transaction size is almost invariant to
changing granularity over [80,160,320,640] for a
database of 640 items and for transactions of
size 15 items.
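
The near invariance can be checked numerically. The Python sketch below assumes that the cited result is Yao's block-access estimate, i.e. that when k of n items are chosen at random without replacement and the items are grouped evenly into m granules, the expected number of granules touched is $m\,[1 - \binom{n - n/m}{k} / \binom{n}{k}]$. That reading of [Yao77], and the function name `effective_size`, are assumptions of this sketch.

```python
from math import comb


def effective_size(items, granules, accessed):
    """Expected number of distinct granules touched (Yao's estimate).

    items: database size n; granules: number of granules m (must divide n);
    accessed: number of items k a transaction accesses.
    """
    per_granule = items // granules
    untouched = comb(items - per_granule, accessed) / comb(items, accessed)
    return granules * (1 - untouched)


for m in (80, 160, 320, 640):
    print(m, round(effective_size(640, m, 15), 2))
```

Under this estimate a 15-item transaction touches between roughly 14 and 15 granules at every granularity studied, which is consistent with the statement above.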


The size of each transaction, I, was assumed
to be large (15 granule lock (access) requests).
This amounts to about 2.5% of the whole database
(640 granules). Usually, a transaction accesses
less than 1% of a database, so the 2.5% figure
represents a large transaction. This gives
a worst case study of a database system with
transactions of mixed sizes.


The effect of the Read/Write mix associated
with the transactions was investigated by
conducting experiments for mixes of 100/0, 90/10,
80/20, ..., and 0/100.


The hardware parameters were assumed to be a
disk average service rate of 20 access
requests/second and a one MIPS processor. These
figures were fixed in all the experiments but
were doubled and reduced to a half in one
experiment in order to understand their effect on
the performance.


In 2PL (CTV) an actively running transaction
j submits an $L_{ijk}$ ($R_{ijk}$) by uniformly selecting
an item $i \in [1 \ldots M]$ from the set of items not
yet accessed by j and by selecting
$k \in \{\text{Read}, \text{Write}\}$ according to a predefined Read/Write
ratio (k is selected to be a "Read" with
probability THRESHOLD and a "Write"
otherwise).
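
The request generation just described can be sketched as follows. This Python fragment is illustrative only: it assumes THRESHOLD is expressed as a fraction between 0 and 1 (for example 0.9 for a 90/10 mix), and the function name `next_request` is hypothetical.

```python
import random


def next_request(rng, accessed, num_items, threshold):
    """Pick the next (item, mode) request of one transaction.

    rng: a random.Random instance.
    accessed: set of items this transaction has already requested.
    num_items: M, items are numbered 1..M.
    threshold: probability that the request is a "Read".
    """
    remaining = [i for i in range(1, num_items + 1) if i not in accessed]
    item = rng.choice(remaining)          # uniform over items not yet accessed
    mode = "Read" if rng.random() < threshold else "Write"
    return item, mode


rng = random.Random(0)
accessed = set()
for _ in range(15):                       # a transaction of size I = 15
    item, mode = next_request(rng, accessed, 640, 0.9)
    accessed.add(item)
print(len(accessed))
```

Excluding already accessed items is what keeps the elements of a transaction's read-set and write-set distinct.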


The domain of settings for all experiments
conducted is listed below.

1- Number of transactions (T): [5, 10, 20, 30, 40]
2- Number of database granules (M): [80, 160, 320, 640]
3- Transaction size in granules (I): [15]
4- Read/Write mix (THRESHOLD): [100/0, 90/10, ..., 0/100]
5- Disk average service rate ($\mu$): [20, 40] access/second
6- Processor power in MIPS: [1, 2]


5.1 Response time and Throughput 


Response time and throughput of the 2PL and 
the CTV are studied under variation of three 
parameters. These are the system load, the 
transaction Read/Write mix and the granularity. 
In these experiments, response time and
throughput are plotted for various Read/Write
mixes, while the degree of multiprogramming is
fixed. This process is repeated for various
degrees of multiprogramming (loads) and the
resulting curves are grouped into a frame. The
frame is repeated for various granularities.
Thus the effect of the Read/Write mix, the effect
of granularity, and the load effect are all
incorporated into detailed design curves. Only
response time curves are included (Fig 3).


5.1.1 Load Effect 


In both the 2PL and CTV, increasing the 
system load (T) results in the following: 


1- Response time linearly increases. 
2- Throughput linearly decreases. 


This effect is shown in Figure 4. 


5.1.2 Effect of the Read/Write mix 


In both the 2PL and the CTV, the R/W mix 
affects the response time as follows: 

1- The R/W mix has a great impact on the RT, 
especially when the system is loaded (T=30,40). 

2- For small loads (T=10), the R/W mix has a 
negligible effect on RT. 

3- Starting from the 100% read (query) case, 
increasing the write percentage first impairs the 
RT until R/W mixes of 70/30 to 50/50, after which 
the RT improves, and in the case of the 2PL it 
reaches its minimum value at the pure update 
case. 


Explanation: 


The third observation about response time 
is explained below. 


(a) For query-dominant transactions (R>50, 
W<50), unresolvably conflicting transactions are 
restarted after making most of their access 
requests. Restarting a mostly finished transaction 
contributes to an increase in the useless I/O 
percentage. 


(b) For update-dominant transactions 
(R<50, W>50), since two-phase commit is 
incorporated with the 2PL and the CTV, no disk 
write takes place during transaction 
execution, and hence most of a transaction's disk 
access operations will not have been done by the 
time it is restarted because of an unresolvable 
conflict. Thus, we have the case of restarting a 
transaction that has completed only a small 
fraction of its requests. This results in a small 
useless I/O percentage. On the other hand, for 
update-dominant transactions, the unresolvable 
conflict rate increases, resulting in many 
restarts. However, the restart operation neither 
contributes to a high useless I/O percentage nor 
incurs a high setup cost. 


5.1.3 Granularity Effect 


In both the 2PL and the CTV, granularity 
affects the response time as follows: 

1- For small loads (T=10), granularities of 
160, 320, and 640 have almost the same RT. 
Granularity of 80 incurs higher response time. 
This is further illustrated in Figure 5. 

2- Granularities of 80 and 160 highly 
impair the RT, especially when the system is 
loaded (T=30,40). 

3- For the 2PL, the effect of doubling the 
granularity from 320 to 640 (that is from M/2 to 
M, where M is the database size) is almost 
negligible. This is true for the CTV only for 
small loads (T=10). 


5.1.4 Effect of Hardware Parameters 


The effect of doubling the disk average 
service rate and the processor speed was 
examined for the response time only. The 
observation is shown in Figure 6. It is observed 
that doubling the disk service rate approximately 
halves the response time, whereas doubling the 
processor speed results in only a trivial 
improvement in the response time. 


5.2 The Comparison 


In this section, we compare the response 
time incurred by the 2PL and the CTV algorithms. 
In this comparison, a load of 20 transactions is 
fixed. However, the conclusion of this comparison 
is similar for all other loads. Figure 7 
depicts the response time of both the 2PL and the 
CTV. From these curves the following are 
observed: 


1- CTV is better than the 2PL for 
granularity of 80, whereas 2PL is better than the 
CTV for granularities of 320 and 640. For 
granularity of 160, 2PL is better for high read 
percentages of the R/W mix, and for write 
percentages greater than 50%, the two algorithms 
incur approximately the same response time. 

2- For granularity of 80, 2PL incurs 
unacceptable response time. 

3- In the 2PL with granularity of 640, the 
R/W mix has almost no effect on the response 
time. 

4- For high granularities, and for high read 
percentages of the R/W mix, the two algorithms 
incur reasonable (acceptable) response time. 


Figure 3 Response Time Design Curves (vertical axis: Response Time (RT)) 

Figure 4 Load Effect (vertical axis: Response Time (RT)) 

Figure 5 Small Loads (vertical axis: Response Time (RT)) 

Figure 6 (vertical axis: Response Time (RT)) 

Figure 7 (vertical axis: Response Time (RT)) 


6. Conclusion 


We have presented two simulation models, one 
for dynamic two-phase locking (2PL) and one for 
commit-time validation (CTV). Model assumptions 
were unified and a comparison was made. We concluded 
that CTV can work better than 2PL under very 
coarse granularity whereas 2PL performance is not 
acceptable for the same very coarse granularity. 
On the other hand, 2PL outperforms the CTV for 
granularities of 320 and 640. Also, we 
concluded that for query-dominant transactions, 
the two algorithms incur the same performance. 


In 2PL, we have found that for large 
transactions, the response time incurred by a 
granularity M is almost the same as that incurred 
by a granularity M/2, where M is the database 
size. 


Under the assumption that two-phase commit 
is incorporated with the 2PL and the CTV, the 
Read/Write mix associated with transactions was 
found to affect the performance as follows: 
query dominant and update dominant transactions 
result in better performance than transactions 
with nearly equal read and write percentages. 


INCREASING PROCESSOR UTILIZATION DURING PARALLEL COMPUTATION RUNDOWN 

Abstract -- Some parallel processing environments 
provide for asynchronous execution and 
completion of general purpose parallel computations 
from a single computational phase. When 
all the computations from such a phase are complete, 
a new parallel computational phase is 
begun. Depending upon the granularity of the 
parallel computations to be performed, there may 
be a shortage of available work as a particular 
computational phase draws to a close (computational 
rundown). This can result in the waste 
of computing resources and the delay of the overall 
problem. 


In many practical instances, strict sequen- 
tial ordering of phases of parallel computation 
is not totally required. In such cases, the 
"beginning" of one phase can be correctly com- 
puted before the "end" of a previous phase is 
completed. This allows additional work to be 
generated somewhat earlier to keep computing 
resources busy during each computational rundown. 
This paper identifies the conditions under which 
this can occur, reports the frequency of occur- 
rence of such overlapping in an actual parallel 
Navier-Stokes code, suggests a language con- 
struct, and discusses possible control strategies 
for the management of such computational phase 
overlapping. 


Introduction 


General purpose parallel computations are 
usually divided into phases that must execute 
sequentially in order to guarantee algorithmic 
integrity. For instance, the checkerboard 
approach to the successive over-relaxation solu- 
tion of the potential field problem divides into 
two such phases: the "odd" locations phase and 
the "even" locations phase. On the parallel 
phase level, the iterated values of the previous 
phase must be complete before the new values of 
the next phase can be correctly computed. 


In the checkerboard algorithm, the execution 
time of each location is definite (nominally, 
the time for four additions and a divide). Thus, 
the distribution of work among processors can be 
accurately planned. Under ideal conditions 
(involving the number of checkerboard locations 
in comparison to the number of processors), the 
distribution of work can be arranged so that 
each processor shares an exactly even portion of 
the work and, as a consequence, each processor 
completes its work at exactly the same time. 
Perfect computation resource utilization is rea- 
lized (at least in a practical sense) since the 
next computational phase can begin immediately. 


Unfortunately, ideal conditions are infre- 
quently found in real applications. Continuing 
with the checkerboard algorithm, consider the 
situation when the potential grid is 1024 points 
on a side (2**20 grid points) and 1000 processors 
are available. Each computational phase will 
provide 524,288 individual computations, or 524 
computations for each of the 1000 processors; 
however, 288 computations will be left over for 
distribution among the 1000 processors. This 
will leave 712 processors with nothing to do 
while the final 288 computations are carried out. 
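
The arithmetic of this example can be checked with a few lines of Python. The function name `rundown_idle` is an illustrative assumption, and each computation is assumed to take the same unit time, as in the checkerboard case.

```python
def rundown_idle(grid_side, processors):
    """Return (computations per phase, full rounds, leftover, idle processors)."""
    # Each checkerboard phase updates half of the grid points.
    per_phase = grid_side * grid_side // 2
    # Every processor performs `full_rounds` computations; `leftover` remain.
    full_rounds, leftover = divmod(per_phase, processors)
    # During the final partial round only `leftover` processors have work.
    idle = processors - leftover if leftover else 0
    return per_phase, full_rounds, leftover, idle


print(rundown_idle(1024, 1000))  # (524288, 524, 288, 712)
```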


The burden of experience gained by the 
author suggests that even this example is opti- 
mistic. Most computations carried out by the 
author's parallel Navier-Stokes solver (the Com- 
bined Aerodynamic and Structural Dynamic Problem 
Emulating Routines or CASPER [1] which was con- 
trolled by the Parallel, Asynchronous Executive 
or PAX [2]) could not even be assigned 
definite execution times. In some instances, 
whether or not the computation was even to be 
carried out in a particular instance was a con- 
ditional part of the algorithm. No control over 
the computation-count-to-processor ratio was 
attempted -- processors were allocated as they 
became available on a the-more-the-merrier basis. 
Also, shared information access times were 
unpredictable and unrepeatable from instance to 
instance. As a result, there was no assurance 
that individual processors could be kept busy as 
a particular computational phase drew to a close. 


The PAX/CASPER project provided the experi- 
ence base cited later in this paper. PAX/CASPER 
was focused on a parallel, general purpose, 
Navier-Stokes solver. Thus, this experience 
base is presented not as a grand generalization 
for all of parallel processing, but as a specific 
example in practical parallel processing. 


Certain other situations that might seem of 
interest in the overlapping of computational 
phases (for instance, the possibilities for over- 
lapping in a tight iterative loop) are not trea- 
ted for the simple reason that they did not occur 
in the PAX/CASPER project. PAX/CASPER was not 
so much a research project in parallel processing 
as an exploratory development of a far-term aero- 
dynamic tool. Thus, the motivation was to solve 
the problems that occurred rather than to solve 
the problems that one could imagine. 


It has been suggested that scheduling and 
overhead will be a particular problem 
in PAX/CASPER. So far, this has not been the 
case. Operational experience shows that the 
ratio of computation to management has been 
running at something in the neighborhood of 
200. This paper is an effort to chart a method 
of improving upon this situation so as to stave 
off any backsliding that might occur as the 
ratio of computational to management resources 
increases. There are additional strategies 
which have been identified for development. 
These include a middle management scheme to 
parallelize the serial management function, a 
direct worker-to-worker lateral communication 
scheme, and a data-proximity work assignment 
algorithm. These strategies combined with the 
overlapping of computational phases should 
enhance the management overhead situation. 


Various solutions to the computational run- 
down problem may be acceptable. Some parallel 
processing schemes for general purpose computa- 
tion may choose simply to accept the lower pro- 
cessor utilization as a minor design flaw. 
Another alternative is to create a multi-parallel- 
job-stream environment that allows computational 
work of one job stream to fill in when another 
job stream enters a computational rundown situ- 
ation. This will bring processor utilization 
up; however, it should be recognized that the 
primary goal of parallel processing is to reduce 
elapsed wall-clock time for a given job. The 
introduction of such a "batch" environment will 
inevitably distribute processor resources among 
the several job streams and, thus, reduce the 
total processing power on any particular job and 
lengthen its elapsed wall-clock time. 


Overlapping Computational Phases 


The goal then is to find more ready-to- 
compute work from the parallel algorithm that is 
being computed. As mentioned previously, this 
is not possible at the parallel phase level: 
each phase must be completed before the next is 
begun in order to guarantee algorithmic inte- 
grity; however, if an examination is made at a 
deeper (sub-phase or, in the terminology of the 
author, task) level, it is frequently discovered 
that the completion of portions (tasks) of one 
phase will allow the correct computation of por- 
tions (tasks) of the succeeding phase. 


Consider again the checkerboard algorithm. 
If all the "odd" locations adjacent to a particular 
"even" location have been updated with 
new values from the current computational phase, 
then the new value for that particular "even" 
location for the next computational phase can be 
correctly computed. Additionally, since all the 
computations requiring as an input the current 
value of that particular "even" location have 
been completed, the value for that "even" loca- 
tion can be updated without affecting the results 
of the current computational phase. 
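
The checkerboard enablement rule can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, locations are indexed from zero, parity is taken as (i + j) mod 2, and every grid point is treated as a computed location whose neighbors are simply clipped at the grid edge (a real potential-field solver would hold boundary values fixed).

```python
def neighbors(i, j, n):
    """Yield the grid locations adjacent to (i, j) on an n-by-n grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < n and 0 <= b < n:
            yield a, b


def enabled_even_locations(updated_odd, n):
    """Return the even locations whose odd neighbors have all been updated."""
    ready = []
    for i in range(n):
        for j in range(n):
            is_even = (i + j) % 2 == 0
            if is_even and all(nb in updated_odd for nb in neighbors(i, j, n)):
                ready.append((i, j))
    return ready


# On a 3-by-3 grid, updating odd locations (0,1) and (1,0) enables (0,0).
print(enabled_even_locations({(0, 1), (1, 0)}, 3))  # [(0, 0)]
```

Every neighbor of an even location is an odd location, so checking membership in the set of updated odd locations is sufficient.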


At this point, it is necessary to make cer- 
tain assumptions (or, alternatively, set certain 
system design constraints) about the nature of 
computational phase rundown. Two basic situ- 
ations arise: one in which task assignments and 
releases are statically determined and one in 
which such matters are dynamically determined. 


The static situation is much simpler from 
the standpoint of next-phase task release timing 
since everything is determined ahead of time. 
In this case, it can be acceptable for computational 
rundown to begin almost immediately since 
the scheduling of the next-phase task has already 
been statically determined. No completion processing 
of current-phase tasks is required to 
schedule the release of the next-phase task. 
(In fact, work in this area for the purposes of 
real-time simulation has been conducted for some 
years at NASA Lewis [3, 4].) 


The dynamic scheduling situation is sub- 
stantially more interesting. Some time delay 
must be available between the completion of the 
first current-phase tasks and the onset of com- 
putational rundown. This delay is needed to 
provide time to process the completion of the 
early current-phase tasks and, in so doing, 
schedule the next-phase tasks that are thus 
enabled. During this delay, there must be 
enough current-phase tasks to keep the processing 
resources busy in order to avoid a computational 
load dip while the next-phase tasks are 
scheduled. 


In the dynamic scheduling situation, enable- 
ment relationships between the current-phase 
tasks and the next-phase tasks (i.e., the rela- 
tionship that enables a next-phase task based 
upon the completion of a current-phase task) may 
be either static or dynamic. That is, the com- 
pletion of a particular current-phase task may 
always enable the same next-phase task (the 
static enablement case) or it may enable some 
next-phase task that can only be identified at 
the time of execution (the dynamic enablement 
case). The nature of the enablement relationships 
is important because it is involved in 
setting the time delay from the completion of 
the first current-phase tasks to the availability 
of the first enabled next-phase tasks. 


Considering these characteristics of the 
dynamic scheduling situation (i.e., the time to 
process current-phase task completion, the time 
to recognize enablement relationships, and the 
time to schedule enabled next-phase tasks), it 
can be observed that tasks should 
substantially outnumber processors. 
Certainly, there should be at the outset of the 
current-phase work at least two tasks for each 
processor so that at least one task execution 
time will be available to process the completion 
of the first task assigned to the processor and 
to schedule the enabled next-phase task. This 
presumes that completion processing and task 
scheduling time is small with respect to task 
execution time. In particular, it assumes that 
one such completion, enablement, and scheduling 
cycle for each of the processors in the system 
can be completed in a single task execution time. 
(The author's experience with PAX suggests that 
this is reasonable even for dynamic managerial 
style parallel processing systems. Systems that 
use hardware-level synchronization primitives 
presumably would be at even greater advantage in 
this area.) 


The conditions under which this overlapping 
of computational phases can correctly occur are 
the same as those that allow parallel computations 
within a particular phase. Let the logical 
predicate PARALLEL(x,y) return the condition 
TRUE when x and y are such that parallel compu- 
tations are allowed. Clearly, PARALLEL(n,m) 
must always be TRUE if n and m are distinct com- 
putational granules of the same parallel com- 
putational phase. Let q be an uncompleted 
granule of the current phase and r be a granule 
of the next phase that has been enabled by some 
completed granule, p, of the current phase. If 
PARALLEL(q,r) necessarily returns the value TRUE, 
then the current-phase and next-phase can be 
correctly overlapped. 


The exact nature of the logical predicate 
PARALLEL(x,y) is, of course, of substantial prac- 
tical interest; however, it has no direct impact 
upon the ability to overlap phases as outlined 
above. Different parallel systems may identify 
different logical predicates. 
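
As one possible illustration, the Python sketch below assumes a PARALLEL predicate based on disjoint read and write sets. This particular predicate, the `Granule` class, and the helper names are assumptions made for the example; the paper leaves the exact predicate to each system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Granule:
    name: str
    reads: frozenset
    writes: frozenset


def parallel(x, y):
    """Hypothetical PARALLEL(x, y): neither granule writes data the other uses."""
    x_uses = x.reads | x.writes
    y_uses = y.reads | y.writes
    return not (x.writes & y_uses) and not (y.writes & x_uses)


def may_overlap(r, uncompleted):
    """True when next-phase granule r is parallel to every uncompleted q."""
    return all(parallel(q, r) for q in uncompleted)


def phase1(i):
    # Models a current-phase granule B(i) = A(i).
    return Granule(f"p{i}", frozenset({("A", i)}), frozenset({("B", i)}))


def phase2(i):
    # Models a next-phase granule C(i) = B(i).
    return Granule(f"r{i}", frozenset({("B", i)}), frozenset({("C", i)}))


uncompleted = [phase1(2), phase1(3)]
print(may_overlap(phase2(1), uncompleted))  # True
print(may_overlap(phase2(2), uncompleted))  # False
```

Here r1 may start early because it touches nothing that the uncompleted granules p2 and p3 use, whereas r2 must wait because p2 has not yet written B(2).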


Identifying Enabled Granules 


The first challenge to be met is to find a 
way of identifying enabled next-phase granules 
for overlapping. It is easy to postulate that 
some mapping function exists either to map from 
the set of completed granules, p, to the set of 
enabled granules, r, or to map from the set of 
uncompleted granules, q, to the set of enabled 
granules, r. It is very difficult to establish 
what this mapping function might be in any gen- 
eral way. Fortunately, this mapping function is 
much more easily identified when each concrete 
situation is faced. 


First, consider the simplest imaginable 
case as represented by the following Fortran 
code segment: 


      DO 100 I=1,N             ! First computational phase 
        B(I)=A(I) 
  100 CONTINUE 
      DO 200 I=1,N             ! Second computational phase 
        D(I)=C(I) 
  200 CONTINUE 


Assuming that there are not shared output area 
constraints, it can be observed that these two 
parallel computational phases can be computed in 
parallel with each other. This represents what 
might be called a universal mapping function 
wherein any granule of the second computational 
phase is enabled by any granule or set of granules 
(including the null set) of the first computational 
phase. 


PAX/CASPER experience shows that 6 out of 
22 (or 27 percent) of the parallel computational 
phases allow universal mapping enablement of the 
succeeding phases. This represents 266 out of 
1188 lines (or 22 percent) of the code that is 
executed in parallel in PAX/CASPER. 


This universal mapping usually occurs in 
PAX/CASPER when the nature of the larger computational 
process is changing. For instance, 
the changeover from power of compression computations 
to interpolator matrix generation is 
one such character change. The two computations 
do not involve shared information of any kind 
and, thus, they can be entirely overlapped. Of 
course, the two phases could be merged into one 
by a preprocessor of the parallel control stream; 
however, since the mechanisms necessary to handle 
this case would be a subset of those needed for 
the following case, it might well be simpler to 
support this enablement mapping. 


For the next case, consider the following 
Fortran fragment that is to be computed in 
parallel as two succeeding computational phases: 


      DO 100 I=1,N             ! First computational phase 
        B(I)=A(I) 
  100 CONTINUE 
      DO 200 I=1,N             ! Second computational phase 
        C(I)=B(I) 
  200 CONTINUE 


Again assuming that there are not shared output 
area constraints, it can be observed by inspec- 
tion that the identity mapping function (I = I) 
maps from completed granules, p, to enabled 
granules, r. This is also a simple and easily 
identified mapping. 
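
The benefit of the identity mapping during rundown can be illustrated with a small Python simulation. It is a sketch under simplifying assumptions that are not from the paper: every granule takes one unit time step, scheduling is free, current-phase granules always take priority, and the function name `steps_for_two_phases` is hypothetical.

```python
def steps_for_two_phases(n, procs, overlap):
    """Count unit time steps needed to finish two phases of n granules each."""
    done1, done2 = set(), set()
    steps = 0
    while len(done2) < n:
        # Current-phase granules still to run take priority.
        batch1 = [i for i in range(n) if i not in done1][:procs]
        free = procs - len(batch1)
        if overlap:
            # Identity mapping: completed granule i enables next-phase granule i.
            ready2 = [i for i in range(n) if i in done1 and i not in done2]
        elif len(done1) == n:
            # Strict ordering: the next phase waits for the whole current phase.
            ready2 = [i for i in range(n) if i not in done2]
        else:
            ready2 = []
        batch2 = ready2[:free]
        done1.update(batch1)
        done2.update(batch2)
        steps += 1
    return steps


print(steps_for_two_phases(5, 4, overlap=False))  # 4
print(steps_for_two_phases(5, 4, overlap=True))   # 3
```

With five granules and four processors, the three processors left idle in the second step of the strictly ordered run are instead given enabled next-phase granules, which saves one step.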


PAX/CASPER experience indicates that it 
applies in 9 out of 22 (or 41 percent) of the 
parallel computational phases (representing 551 
of 1188 code lines, or about 46 percent of the 
parallel code in PAX/CASPER). Combining this 
direct mapping with the simpler universal mapping 
above indicates that (at least in PAX/CASPER 
experience) 68 percent of the parallel computa- 
tional phases and 68 percent of the code executed 
in parallel can be easily overlapped to defeat 
computational rundown. These two enablement 
mapping possibilities are the most frequently 
occurring situations in PAX/CASPER experience. 


The next most frequently occurring enable- 
ment mapping in PAX/CASPER experience is what 
could be called null mapping, that is, the situ- 
ation in which no overlapping is possible. This 
occurs in 4 out of 22 (or 18 percent) of the 
computational phases and represents 262 out of 
1188 (or 22 percent) of the lines of code execut- 
ing in parallel. In all cases the cause was not 
that such an overlapping did not exist between 
the parallel computations but was, in fact, that 
serial actions and decisions had to occur between 
the phases. This is important since it allows 
one to assess how often the extra effort of sup- 
porting overlapping features will be entirely 
defeated, regardless of the sophistication of 
the overlapping phase support features. 


Another enablement mapping occurring in 
PAX/CASPER experience is a reverse indirect 
mapping. Consider the following Fortran 
fragment: 


      DO 10 I=1,N                ! Set up source mapping 
      DO 10 J=1,10 
        IMAP(J,I)=IRAND()        ! IRAND produces an integer 
   10 CONTINUE                   ! in the range 1 to N 
      DO 100 I=1,N               ! First computational phase 
        A(I)=FUNC(I)             ! generates some number in A(x) 
  100 CONTINUE 
      DO 200 I=1,N               ! Second computational phase 
      DO 200 J=1,10              ! sums subsets of the results 
        B(I)=A(IMAP(J,I))        ! of the first computational 
  200 CONTINUE                   ! phase 


Clearly, this computation can be overlapped; 
however, determining the enablement mapping is 
very difficult. This is because knowing that a 
particular first phase granule is complete does 
not directly identify any distinct second phase 
granule as computable; however, a reverse mapping 
from desired second phase granule to required 
first phase granules is possible. 


In PAX/CASPER experience, this situation 
occurs in 2 of 22 (or 9 percent) of the computa- 
tional phases representing 78 out of 1188 (or 7 
percent) of the lines of code executing in paral- 
lel. While this is not a frequently occurring 
situation in PAX/CASPER experience, it cannot be 
ignored out of hand. Some engineering judgement 
must be made to weigh the cost (in terms of man- 
agement overhead, computational resource trans- 
ferred from workers to management, etc.) of some 
reverse enablement mapping solution against the 
cost of computational rundown in 9 percent of 
the parallel computational phases. 


Certainly, a solution exists for the reverse, 
indirect enablement mapping. Once the values of 
the information selection map (represented in 
the code fragment by the array IMAP) have been 
determined, it is a simple matter to produce a 
composite map of first phase granules that must 
be completed in order to enable a particular 
second phase granule. The executive can then 
use this map upon each first phase granule com- 
pletion to determine the computability of par- 
ticular second phase granules. This map could 
also be used to direct a preferred order of first 
phase granule dispatching so as to enable a known 
second phase granule as early as possible. 
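
A composite map of this kind can be sketched in Python as follows. The sketch is illustrative only: the names `build_composite_map` and `newly_enabled` are hypothetical, granules are indexed from zero, and `imap[i]` is assumed to list the first phase granules read by second phase granule i.

```python
def build_composite_map(imap):
    """Map each second phase granule to the first phase granules it requires."""
    return {i: set(sources) for i, sources in enumerate(imap)}


def newly_enabled(composite, completed, already_released):
    """Return second phase granules whose prerequisites are all complete."""
    return [i for i, needs in composite.items()
            if i not in already_released and needs <= completed]


# Three second phase granules, each reading two first phase results.
composite = build_composite_map([[0, 1], [1, 2], [2, 2]])
print(newly_enabled(composite, {2}, set()))      # [2]
print(newly_enabled(composite, {0, 1, 2}, {2}))  # [0, 1]
```

The executive would call something like `newly_enabled` on each first phase granule completion. The same map could also be scanned for the second phase granule with the fewest outstanding prerequisites in order to choose a preferred dispatching order.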


Two important facts about this reverse 
enablement mapping must be included. First, 
both occurrences of this situation involved a 
dynamically generated information selection map. 
Thus, the composite granule map would have to be 
generated by the executive at or after first 
phase initiation but before any second phase 
enablements. Second, the impact of executive 
computation must be considered. In the PAX/CASPER 
UNIVAC 1100 test bed, executive computation 
was done at the direct expense of worker 
computation. Thus, extensive composite granule 
map generation could be self defeating. Some 
real parallel machines may provide separate 
executive computing resources, in which case the 
generation and use of composite granule maps 
would not be out of the question. 


A final enablement form was observed in 
PAX/CASPER that could be characterized as a for- 
ward, indirect mapped situation. Consider the 
following Fortran fragment: 


      DO 10 I=1,M                ! Generate forward map 
        IMAP(I)=IRAND() 
   10 CONTINUE 
      DO 100 I=1,M               ! Use forward map to operate 
        B(IMAP(I))=A(IMAP(I))    ! on a subset of the arrays 
  100 CONTINUE 
      DO 200 I=1,N               ! Perform some further operation 
        C(I)=B(I)                ! on the complete arrays 
  200 CONTINUE 


This situation is somewhat easier than the 
reverse, indirect mapping in that the identifi- 
cation of a particular granule in the first phase 
can be directly mapped to an enabled granule in 
the successor phase; however, much of the com- 
plication of a mapped enablement remains. This 
form was the least frequently occurring situation 
in PAX/CASPER showing up only once (5 percent of 
the phases) and accounting for only 31 of 1188 
lines of code executed in parallel. 


No other forms of enablement mapping were 
observed in PAX/CASPER. Certainly, extensions 
of the forms already presented can be imagined. 
Additionally, a seam mapping problem (such as 
would be appropriate for the checkerboard 
approach to the successive over-relaxation 
problem) can be foreseen. These other forms 
are beyond the scope of the present paper. 
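
The PAX/CASPER counts quoted for the five enablement forms can be tallied as a consistency check. The Python sketch below uses only the phase and line counts given in the text; the dictionary keys and the function name `tally` are labels chosen for illustration.

```python
# (phases, parallel code lines) for each enablement form, as reported in the text.
FORMS = {
    "universal": (6, 266),
    "identity": (9, 551),
    "null": (4, 262),
    "reverse indirect": (2, 78),
    "forward indirect": (1, 31),
}


def tally(forms):
    """Return total phases, total lines, and rounded percentage shares."""
    total_phases = sum(p for p, _ in forms.values())
    total_lines = sum(n for _, n in forms.values())
    shares = {name: (round(100 * p / total_phases), round(100 * n / total_lines))
              for name, (p, n) in forms.items()}
    return total_phases, total_lines, shares


total_phases, total_lines, shares = tally(FORMS)
print(total_phases, total_lines)  # 22 1188
```

The five forms account for all 22 phases and all 1188 lines, and the rounded shares reproduce the percentages quoted in the text.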


Language Construction 


The developing PAX/CASPER language is 
simple and requires the user to make specific 
statements concerning choices for the management 
of each parallel computational phase. Statements 
involving the enablement of a succeeding phase 
could be made at two times: during the definition 
of a computational phase to the management system 
and during the invocation of the phase for actual 
computations. The difficulty to be faced is that 
the statements no longer apply solely to the 
phase being referenced, but rely also on the 
characteristics of the succeeding phase. 


The simplest approach is to require the 
user to specify the appropriate enablement 
mapping method when the phase is invoked. It 
might appear as in the following PAX parallel 
language fragment: 


DISPATCH phase-name 
ENABLE/MAPPING=option 


This is simple and explicit; however, it 
leaves the door wide open to user mistakes. 
There is no interlock between this phase and the 
next that can be verified by the executive. A 
simple solution to this would be to identify the 
name of the enabled next phase so that the execu- 
tive system (or language processor) can verify 
that, in fact, that phase is following. This 
might appear as follows: 


DISPATCH phase-name 
ENABLE [ phase-name/MAPPING=option ] 


This allows the desired verification, but 
also brings up a new possibility. Occasionally, 
a conditional branch that is not dependent on 
the computational phase separates that phase 
from two or more succeeding phases, each of which 
may (or may not) be overlappable. If each of 
these phases were identified in the above con- 
struct, the executive could preprocess the branch 
and overlap the appropriate phase. This could 
look as follows: 


DISPATCH phase-name 
ENABLE/BRANCHINDEPENDENT 
[ phase-name-1/MAPPING=option 
phase-name-2/MAPPING=option ] 
IF ( IMOD(LOOPCOUNTER,10).NE.0 ) 
THEN GO TO branch-target 
DISPATCH phase-name-1 
GO TO rejoin 
branch-target: 
DISPATCH phase-name-2 
rejoin: 


Finally, the matching of mapping selections 
and phases and the invocation of the appropriate 
overlapping services is something that could be 
done when the parallel phase is defined to the 
system; however, it would still be necessary to 
identify pre-processable branches at the compu- 
tation invocation site. This could appear as 
follows: 

DEFINE PHASE phase-name 
ENABLE [ 
phase-name-1/MAPPING=option 
phase-name-2/MAPPING=option 
] 

DISPATCH phase-name 
ENABLE/BRANCHINDEPENDENT 


The ENABLE/BRANCHINDEPENDENT would be 
deleted when branch pre-processing was either 
not appropriate or not needed. The executive 
system could perform the appropriate lookahead 
to see whether any of the named succeeding phases 
was actually following and apply, as appropriate, 
the specified enablement mapping. 


Control Strategies 


Control strategies for enabling and schedul- 
ing overlapped parallel computational phases are, 
of course, highly dependent upon the overall 
parallel processing strategies. As alluded to 
earlier, some approaches to parallel processing 
may do all of this before any computations are 
begun. Indeed, the entire process may be done 
manually by a human being when the pattern of 
parallel processing is fixed for the life of the 
system. 


Within the PAX system, the opposite is true: 
the identification and scheduling of computable 
granules is entirely automatic. A scheduling 
mechanism for enabled computational granules 
already exists within the PAX system. It was 
developed to schedule dynamically created compu- 
tations that conflicted (usually in terms of 
shared data access) with pre-existing computational 
granules. 


Within PAX, each internal description of 
one (or more) computational granules included a 
queue head for a doubly, circularly linked list 
of computable but conflicting computational 
granules. Upon completion of the described computation, 
all the queued conflicting computations 
became unconditionally computable and were placed 
in the waiting computation queue. The waiting 
computation queue was kept in a known order and, 
for the purposes of the conflicting computation 
problem, it was determined that such conflicting 
computations would be placed ahead of the normal 
computations in the queue and, thus, given higher 
priority. 
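
The release of conflicting computations can be sketched in Python as follows. This is not the PAX implementation: a `collections.deque` stands in for the doubly, circularly linked list, and the class and function names are hypothetical.

```python
from collections import deque


class ComputationDescription:
    """Describes a computation plus the queue of work that conflicts with it."""

    def __init__(self, name):
        self.name = name
        self.conflicted = deque()  # computable but conflicting computations


def complete(description, waiting):
    """Move queued conflicting computations to the front of the waiting queue."""
    # Popping from the right and pushing on the left keeps their relative order.
    while description.conflicted:
        waiting.appendleft(description.conflicted.pop())


current = ComputationDescription("current")
current.conflicted.extend([ComputationDescription("x"),
                           ComputationDescription("y")])
waiting = deque([ComputationDescription("n1"), ComputationDescription("n2")])
complete(current, waiting)
print([d.name for d in waiting])  # ['x', 'y', 'n1', 'n2']
```

Placing the released computations at the head of the waiting queue gives them priority over the normal computations already waiting.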


The scheduling of universally mapped suc- 
cessor phases within this system is very easy 
indeed. At the time of phase initiation, the 
successor phase is also initiated and the result- 
ing computation description placed in the waiting 
computation queue behind the current phase des- 
cription. 


The scheduling of directly enabled successor 
phases is similarly easy at first sight. At the 
time of phase initiation, the successor phase is 
also initiated and the resulting computation 
description placed in the conflicted computation 
queue of the current phase description. Thus, 
when the current phase computation is completed, 
the now-enabled successor computation will be 
placed in the waiting computation queue to be 
considered for scheduling. 


The above approach for directly enabled 
successor phases is fine if each indivisible 
granule of computation is described separately. 
Unfortunately, this is usually not economical 
(in terms of storage space and task search times, 
among other things) and was not the choice taken 
in PAX design. Computations were, instead, des- 
cribed as large, contiguous collections of gran- 
ules. The descriptions were split apart as 
necessary to produce conveniently sized tasks 
for workers and then merged back into single 
descriptions when the work was completed. This 
splitting of descriptions requires that queued 
computation descriptions also be split so that 
each queued description will accurately reflect 
the enablement relationship between the computa- 
tion and its queued successor computation. 


While this is certainly possible, it forces 
a further design decision for the executive 
software. PAX computation splitting was demand- 
driven by the presence of an idle worker. It 
was felt that the delay while splitting a task 
description was acceptable; however, the addi- 
tional delays of splitting queued successor com- 
putation descriptions may represent an unaccept- 
able situation. Two possible solutions exist. 
One possibility is to pre-split the tasks before 
idle workers present themselves to the executive. 
This would allow the executive to work ahead in 
otherwise idle time. Alternatively, the split- 
ting of a computation could generate a successor- 
splitting task that could be quickly queued for 
later attention when the executive would again 
be idle. 


The successor computation description could 
be removed from the current computation descrip- 
tion and included in the successor-splitting 
task information. When the successor-splitting 
task is executed the successor computation could 
be split and requeued to the appropriate current 
computation descriptions. 


Management of indirectly (both forward and 
reverse) mapped successor computations is a good 
deal more interesting. The description of the 
successor computation cannot simply be queued to 
the description of the current computation since 
there is no guarantee of the enablement relation- 
ship. Additionally, it would seem wise to get 
the current phase into execution without the 
delay of constructing the necessary information 
for enabling successor computations. Both for- 
ward and reverse indirection would seem well 
handled by much the same mechanisms since the 
only significant difference is the direction of 
the indirection. Each leads naturally to a list 
of current phase granules that must be completed 
to enable a particular successor phase granule. 


It would seem appropriate to identify a 
subset group of successor-phase granules that 
are to be the subject of the enablement operation 
so as to avoid solving an unnecessarily large 
enablement problem. Once this subset has been 
identified, the current-phase granules that 
enable the successor subset can be identified. 
Since these are not necessarily the current phase 
granules that would be naturally selected by the 
scheduling mechanism, they should be split into 
individual descriptions and placed in the waiting 
computation queue in such a manner as to elevate 
their computational priority. 


It is important to note that the description 
of the successor subset cannot simply be queued 
to any one of the identified current-phase gran- 
ules since it is enabled not by the completion 
of any one such granule but by the completion of 
all the identified granules. This enablement on 
completion of all identified current-phase gran- 
ules can be handled by any number of simple 
mechanisms. For instance, during completion 
processing, a status bit (set when the current- 
phase granules were identified and split into 
individual descriptions) can be checked and, if 
it is set, an enablement counter decremented. 
When the enablement counter reaches zero, it can 
be taken as a signal that the successor-phase 
granules are computable. 
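One of the simple mechanisms mentioned above, the status bit plus enablement counter, might look like the following sketch. All names here (Granule, SuccessorSubset, identify, complete) are hypothetical and chosen for illustration; the source does not give the PAX data structures.

```python
from collections import deque

class SuccessorSubset:
    """Hypothetical record for a group of successor-phase granules."""
    def __init__(self, name, count):
        self.name = name
        self.counter = count      # enablement counter

class Granule:
    """Hypothetical description of one split-out current-phase granule."""
    def __init__(self, gid):
        self.gid = gid
        self.enables = None       # plays the role of the status bit

def identify(granules, successor_name):
    """Mark the granules that together enable one successor subset."""
    subset = SuccessorSubset(successor_name, len(granules))
    for g in granules:
        g.enables = subset
    return subset

def complete(granule, waiting):
    """Completion processing: check the status and decrement the counter."""
    subset = granule.enables
    if subset is not None:
        subset.counter -= 1
        if subset.counter == 0:   # all enabling granules have finished
            waiting.append(subset.name)

waiting = deque()
granules = [Granule(i) for i in range(3)]
identify(granules, "successor-subset")
for g in granules:
    complete(g, waiting)
print(list(waiting))
```

The successor subset becomes computable only on the last completion, which is the property the text asks for: enablement by all identified granules rather than by any one of them.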


Concluding Remarks 


This paper has discussed the possibilities 
for overlapping parallel computations in a gen- 
eral purpose parallel-computation environment so 
as to minimize loss of computational resources. 
Practical experience with PAX/CASPER, a parallel 
Navier-Stokes solver, suggests that simple and 
plausible steps could provide such overlapping 
in 68 percent of the computational phases and 
that, with extended effort, more than 90 percent 
of the computational phases are amenable to some 
form of phase overlapping. 


ABSTRACT 


New analytical techniques are presented for probabilistic 
systems with concurrency and applied to computing the 
expected execution time and speedup of parallel programs. 
These new techniques are formulated for a program description 
language called PEL (Performance Evaluation Language), 
which allows essential control and timing characteristics of 
parallel programs to be expressed in a precise manner. The 
analysis techniques for PEL are an extension of standard linear 
system techniques to the realm of concurrency, based on the 
development of a new mathematical operator called the "join 
product." Efficient approximation techniques are also 
introduced for estimating the execution time and speedup of 
parallel programs. 


1. INTRODUCTION 


Since the main purpose of concurrency is improved 
performance, it is important to have analysis tools for 
evaluating the expected performance of various concurrent 
algorithms with respect to their sequential counterparts. One 
practical analysis technique is to simply write the program and 
run it on the target computer. However, the execution time 
for different input data may vary considerably, and it may be 
difficult to generalize the results from a small number of data 
sets. Also, the execution time will be highly dependent on 
uncontrollable environmental factors such as system load and 
placement of data files on the disk. Another common analysis 
technique is to do a direct complexity analysis of the abstract 
algorithm itself. This method has proved useful in computing 
overall relative complexity of algorithms in terms of "big O" 
properties, but is not exact enough for performance prediction 
of real programs. Also, this type of complexity analysis is 
difficult for large programs, especially those involving disk 
access. 


This paper presents an alternative analysis technique for 
concurrent programs which is somewhere in between these two 
techniques. The actual program is written out in a special 
language which allows the programmer to specify the 
estimated execution time of primitive operations, and to model 
the effect of differing input data with conditional branching 
probabilities. One of the earliest efforts to use probabilistic 
analysis techniques for computer programs is found in a paper 
by Ramamoorthy [1], in which traditional flow graph 
techniques for Markov processes are used to estimate the 
execution time of programs expressed as flow charts, with 
conditional branches represented by probabilities. Using 
traditional z-transform techniques, the complete probability 
distribution for the execution time of the whole program is 
calculated, from which important statistics such as average 
execution time and variance are easily computed. Deo [2] gives 
an efficient technique for finding the expected execution time 
using a stochastic model of a program in conjunction with the 
standard mean first passage time techniques used for Markov 
processes [3]. Similar techniques are also presented by Trivedi 
[4]. However, none of these sequential program techniques is 
directly applicable to programs with concurrency. 


With the increasing importance of parallel computation, 
there is a growing body of research on performance 
evaluation of parallel systems. Most of this research uses 
timed or stochastic Petri nets as the underlying model for 
the parallel system. Research results are encouraging and 
show that analytical or simulation techniques can be used to 
analyze the behavior of parallel systems represented as a 
Petri net with timed transitions and probabilistic choices. 
Zuberek [5] and Molloy [6] present techniques and examples 
of application to communication protocols, and Marsan [7] 
applies Petri net techniques to performance evaluation of 
multiprocessor systems. The disadvantage of these 
techniques is they are based on generating the reachability 
graph of all possible states of the Petri net, which may be 
exponential in the size of the net. Thus, these techniques are 
not practical for analyzing concurrent computer programs, 
typically involving large numbers of potentially concurrent 
operations. Sahner and Trivedi [8] describe a tool for 
performance evaluation of concurrent systems called SPADE, 
which uses an acyclic directed graph to model precedence 
constraints among events. SPADE is useful for analyzing 
concurrent program task structures, but is not general enough 
for detailed internal analysis of concurrent programs. 


2. PERFORMANCE EVALUATION LANGUAGE 


For the purpose of performance evaluation of computer 
programs, a simple language called PEL (Performance 
Evaluation Language) is used to represent important 
characteristics of the program, including the basic control 
structure of the program, concurrency, operation execution 
times, conditional branching probabilities, and loop repetition 
counts. A description of a program in PEL may be written 
directly by the programmer, or created by a compiler from the 
source code of the original program. If the compiler technique 
is used, then the programmer must supply additional 
information about branching probabilities and loop repetition 
counts in the program. Individual operation execution times 
may be already known to the compiler or may also be specified 
by the programmer. 


A similar technique is used by the Parafrase system [9] at 
the University of Illinois to measure program execution times 
for the purpose of analyzing speedups resulting from various 
types of program restructuring techniques that create 
concurrency. In the Parafrase system, the user may specify 
conditional branching probabilities in Fortran programs with 
special "assertions". If no assertion is made concerning a given 
branch, then Parafrase assumes that all branches have equal 
probability. Parafrase also estimates the number of iterations 
of each DO-loop, with the help of user assertions and user- 
definable global default values. Estimates for operation 
execution times such as arithmetic operations, logical 
operations, subscript calculations, store operations, and 
memory fetches are built into Parafrase, and thus not specified 
by the user. 


The syntax of PEL is quite simple with five basic types of 
statements: delay, if-then-else, sequence, loop, and fork-join. 
The three statement types if-then-else, sequence, and loop 
correspond to the traditional control structures found in any 
block-structured language like PASCAL. The delay statement 
is used to model the execution time delays associated with 
primitive operations in the program such as arithmetic or 
assignment operations. The fork-join statement is used as the 
basic control structure for concurrency. PEL is used to create a 
probabilistic model of the program control structure and 
timing delays, in order to evaluate the overall execution time 
of the whole program. PEL programs have no input data, and 
their execution is based entirely on probabilities, which are 
supposed to represent average values for typical input data 
sets of the original program. Figure 1 shows an example of a 
simple PEL program. 


PROGRAM; 
IF RANDOM < .3 
THEN BEGIN DELAY(5); DELAY(2); DELAY(7) END 
ELSE 
FORK 
IF RANDOM < .5 THEN DELAY(20) 
ELSE DELAY(35); 
LOOP 10 DO DELAY(3); 
IF RANDOM < .4 THEN DELAY(5) 
ELSE DELAY(10) 
JOIN; 
END. 


Figure 1 - A Sample PEL Program 


Each time an if-then-else statement in PEL is executed a 
new random number between 0 and 1 is generated (represented 
by the word "RANDOM") and then compared with a constant, 
which represents the probability of choosing the "THEN" 
clause. In the above program, the outer "IF" chooses the 
"THEN" clause with probability .3 and the "ELSE" clause 
with probability .7 . The "THEN" clause consists of a 
sequence of three "DELAY" statements whose execution time 
is given in parenthesis. The "ELSE" clause contains a fork- 
join statement which executes three separate statements 
concurrently. The loop statement executes its body the number 
of times specified by the fixed constant loop count (in this 
example the loop count is 10). Statements may be nested to 
an arbitrary depth as in ordinary block-structured languages 
like PASCAL. 


The use of fork-join primitives for specifying concurrency 
in programming languages originated with Dennis and 
VanHorn [10], and has become a standard which is found in 
many languages and operating systems today [11]. There are 
three types of fork-join statements in PEL to represent some of 
the major types of concurrency primitives used in existing 
programming languages: fork-list, fork-n, and fork-gen. The 
fork-list statement is simply a list of statements which are all 
executed concurrently — this is similar to the "cobegin-coend" 
concurrency primitive found in languages such as Concurrent 
Pascal [12], or the "PAR" constructor in the parallel 
programming language OCCAM [13]. The fork-n statement 
has the following form: 


FORK(n): statement ; JOIN (where n is any positive integer) 


The fork-n statement creates n concurrent executions of its 
statement body, and is used to model the "forall" statement 
which is very common among concurrent programming 
languages such as SISAL [14], ARGUS [15], and OCCAM [13]. 
The major use of this "forall" construct is as a parallel form of 
a DO-loop for doing highly parallel operations on vectors and 
arrays, and is found in many parallel array languages [16]. 


The fork-gen statement in PEL consists of a list of labeled 
statements along with a partial ordering giving the execution 
precedence constraints. Such partial orderings result when flow 
analysis is performed on ordinary sequential programs by a 
compiler to automatically locate potential parallelism (see 
Kuck [17]). Figure 2 shows an example of a fork-gen 
statement: 


FORK 
1: LOOP 5 DO DELAY(4); 
2: IF RANDOM < .6 THEN DELAY(23) 
ELSE DELAY(10); 
3: DELAY(5); 
4: DELAY(3); 
5: IF RANDOM < .7 THEN DELAY(15) 
ELSE DELAY(5) 
PREC 
1<4;2<3;2<5;3<4 
JOIN 


Figure 2 - A Fork-gen Statement 


The statement labels are used in the "PREC" section to 
specify the precedence constraints for executing the 
statements. The symbol "<" is used to represent an execution 
precedence constraint and may be interpreted as "is an 
immediate predecessor of." For example, in the above fork-gen, 
statements 1 and 2 may be executed concurrently, but 
statement 4 may begin only after 1 is completed and statement 
2 must precede both 3 and 5. The PREC list for this fork-gen 
statement defines the execution precedence graph shown in 
figure 3. 


3. EXECUTION TIME OF PROGRAMS 


A system has been developed to compute the probability 
distribution for the overall program execution time of any PEL 
program, including the average execution time and the speedup 
resulting from concurrency. In PEL, all of the execution time 
delays in the program are assumed to be represented by the 
delay statements. The control operations such as IF 
branching, LOOP, FORK, and JOIN are assumed to involve 
no time delays. However, if it is desired for the PEL program 
to model time delays in these statements, then delay 
statements may be added to account for actual execution times 
of these control operations. 


The program execution time is represented by a probability 
mass function f(t), which gives the probability that the 
program executes in exactly t time units. For simplicity only 
integral values of t are permitted. (Fractional values of t can 
be accommodated by using a smaller time unit.) The function f 
satisfies the usual rule for probability distributions that the 
sum of all probabilities gives 1: 

$$\sum_{t=0}^{\infty} f(t) = 1$$

For the purposes of computing the overall program execution 
time, the probability mass functions will be represented by 
their geometric transform: 

$$f^{g}(z) = \sum_{t=0}^{\infty} f(t)\, z^{t}$$

In the literature on probability theory, this is sometimes called 
the z-transform, discrete transform, or generating function. 
Transform techniques are useful in the analysis of all 
stochastic processes, especially Markov processes. The value of 
the transform is that it turns a stochastic process into a linear 
system, which can then be analyzed using a powerful set of 
linear system techniques such as flow graph analysis [3]. When 
parallelism is allowed, however, then the system is no longer 
strictly linear, and so the standard linear flow graph 
techniques are not directly applicable. To analyze PEL 
programs, we have developed new flow graph techniques which 
can be used to analyze probabilistic systems with parallelism. 


For any PEL statement S, T(S) is used to denote the 
transform of the probability mass function for the execution 
time of S. The transform for any PEL statement may be 
computed recursively from its component statements according 
to the following rules: 


1. If S is a delay statement of the form DELAY(t), then 
   $T(S) = z^{t}$. 


2. If S is a sequence statement of the form 
   BEGIN $S_1; S_2; \ldots; S_n$ END, then 
   $T(S) = T(S_1)\,T(S_2) \cdots T(S_n)$. 


3. If S is an if-then-else of the form 
   IF RANDOM < p THEN $S_T$ ELSE $S_F$, 
   then $T(S) = p\,T(S_T) + (1-p)\,T(S_F)$. 


4. If S is a loop statement of the form LOOP n DO $S_1$, 
   then $T(S) = [T(S_1)]^{n}$. 


These rules for the sequential portions of the program are 
derived from the standard properties of geometric transforms. 
Similar formulas are given in Appendix E of ref. [4] for the 
Laplace transforms of structured program control statements. 
However, in order to compute the transform for the concurrent 
fork-join statements, an entirely new operation on transforms 
is needed called a "join product." 
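Rules 1-4 can be sketched in code by storing a transform as a mapping from the exponent of z to its coefficient. This is a minimal illustration, not the authors' LISP system; the function names (delay, sequence, branch, loop, expected) are assumptions made for the example.

```python
from collections import defaultdict

def delay(t):
    """Rule 1: DELAY(t) has transform z^t."""
    return {t: 1.0}

def multiply(f, g):
    """Ordinary product of two transforms: exponents add."""
    out = defaultdict(float)
    for a, p in f.items():
        for b, q in g.items():
            out[a + b] += p * q
    return dict(out)

def sequence(*parts):
    """Rule 2: product of the component transforms."""
    result = {0: 1.0}
    for part in parts:
        result = multiply(result, part)
    return result

def branch(p, then_t, else_t):
    """Rule 3: p T(S_T) + (1 - p) T(S_F)."""
    out = defaultdict(float)
    for a, c in then_t.items():
        out[a] += p * c
    for a, c in else_t.items():
        out[a] += (1 - p) * c
    return dict(out)

def loop(n, body):
    """Rule 4: the body transform raised to the n-th power."""
    return sequence(*([body] * n))

def expected(f):
    """Expected execution time: sum of t times its probability."""
    return sum(t * p for t, p in f.items())

# Pieces of the program of figure 1.
then_clause = sequence(delay(5), delay(2), delay(7))
inner_if = branch(0.5, delay(20), delay(35))
inner_loop = loop(10, delay(3))
print(then_clause, inner_if, inner_loop)
```

A dictionary keeps only the nonzero terms, which matters because loops with high repetition counts can make the exponents large while the number of terms stays small.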


Definition: The join product (denoted $\odot$) is a binary infix 
operator with the following properties: 

A. Let $F^{g}$ denote the set of all geometric transforms. Then 
   $\odot : F^{g} \times F^{g} \rightarrow F^{g}$. 
B. $\odot$ is commutative and associative. 
C. $\odot$ distributes over addition. 
D. Let a, b, x, y be any real numbers. Then 
   $a z^{x} \odot b z^{y} = ab\, z^{\max(x,y)}$. 

Properties A-C of the join product are identical to 
multiplication. The only difference is in property D where the 
maximum of exponents is used instead of the sum. This simple 
join product operation is used to help compute the execution 
time transform for fork-join statements according to the 
following rules: 


5. If S is a fork-list statement of the form 
   FORK $S_1; S_2; \ldots; S_n$ JOIN, then 
   $T(S) = T(S_1) \odot T(S_2) \odot \cdots \odot T(S_n)$. 


6. If S is a fork-n statement of the form 
   FORK(n): $S_1$; JOIN, then 
   $T(S) = T(S_1) \odot T(S_1) \odot \cdots \odot T(S_1)$ (n repetitions). 

For a more complete presentation of all of these rules including 
proofs of their validity, the reader is referred to Lester [18]. 
Applying rules 1-4 to the program of figure 1 gives the 
following transforms for the execution time of each of the three 
statements inside the fork-join: 


$$.5 z^{20} + .5 z^{35}$$

$$(z^{3})^{10} = z^{30}$$

$$.4 z^{5} + .6 z^{10}$$


Performing a join product on these three transforms according 
to rule 5 gives the transform for the whole FORK-JOIN as 
follows: 


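$$\begin{aligned}
& (.5 z^{20} + .5 z^{35}) \odot z^{30} \odot (.4 z^{5} + .6 z^{10}) \\
&= (.5 z^{20} \odot z^{30} + .5 z^{35} \odot z^{30}) \odot (.4 z^{5} + .6 z^{10}) \\
&= (.5 z^{30} + .5 z^{35}) \odot (.4 z^{5} + .6 z^{10}) \\
&= .5 z^{30} \odot .4 z^{5} + .5 z^{30} \odot .6 z^{10} + .5 z^{35} \odot .4 z^{5} + .5 z^{35} \odot .6 z^{10} \\
&= .2 z^{30} + .3 z^{30} + .2 z^{35} + .3 z^{35} = .5 z^{30} + .5 z^{35}
\end{aligned}$$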


For the outer IF of the program, the true branch has transform 
$z^{5} z^{2} z^{7} = z^{14}$ according to rules 1 and 2. Combining this with 
the FORK transform from the false branch using rule 3 gives 
the following: 


$$.3 z^{14} + .7\,(.5 z^{30} + .5 z^{35}) = .3 z^{14} + .35 z^{30} + .35 z^{35}$$


This is the transform for the execution time of the whole 
program. The expected (average) execution time of the 
program can be computed directly from the transform to be 
26.95 time units. 
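The worked example can be checked with a short sketch of the join product. As before, a transform is stored as a mapping from exponent to coefficient; the helper names (combine, product, join, mix, expected) are assumptions for illustration and do not come from the paper.

```python
from collections import defaultdict
from functools import reduce

def combine(f, g, op):
    """Combine two transforms term by term; op merges the exponents."""
    out = defaultdict(float)
    for a, p in f.items():
        for b, q in g.items():
            out[op(a, b)] += p * q
    return dict(out)

def product(f, g):
    """Ordinary product (sequence): exponents add."""
    return combine(f, g, lambda a, b: a + b)

def join(f, g):
    """Join product (fork-join): exponents take the maximum."""
    return combine(f, g, max)

def mix(p, f, g):
    """If-then-else: p f + (1 - p) g."""
    out = defaultdict(float)
    for a, c in f.items():
        out[a] += p * c
    for a, c in g.items():
        out[a] += (1 - p) * c
    return dict(out)

def expected(f):
    return sum(t * p for t, p in f.items())

# The three statements inside the fork-join of figure 1.
branches = [{20: 0.5, 35: 0.5}, {30: 1.0}, {5: 0.4, 10: 0.6}]
fork = reduce(join, branches)                     # rule 5
then_clause = reduce(product, [{5: 1.0}, {2: 1.0}, {7: 1.0}])
program = mix(0.3, then_clause, fork)             # rule 3
print(program, expected(program))
```

The only difference between product and join is the function applied to the exponents, which mirrors property D of the definition.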


The transform computation rule for fork-gen statements is 
more complex and is only summarized here. The reader is 
referred to Lester [18] for a more complete detailed 
presentation. The transform for a fork-gen is computed 
recursively from the transform for each of the component 
statements $S_i$ having the following form: 


$$T(S_i) = \sum_{t} f_i(t)\, z^{t}$$


In a transform, the coefficient of each power of z is a real 
number representing a probability. For the purposes of 
computing the overall transform of a fork-gen statement, these 
coefficients are represented symbolically as a character string 
called an f-symbol of the form $f_i(t)$, where i and t are positive 
integers. This symbolic form of the execution time transform 
is called the transmission. For example, the transmission for 
statement 2 in the fork-gen of figure 2 is 
$f_2(10) z^{10} + f_2(23) z^{23}$. 


There is a special multiplication rule for these f-symbols as 
follows: 


Let $f_i(t_a)$ and $f_i(t_b)$ be any f-symbols, then 


$$f_i(t_a)\, f_i(t_b) = \begin{cases} f_i(t_a) & \text{if } t_a = t_b \\ 0 & \text{otherwise} \end{cases}$$


Using the execution precedence graph for the fork-gen 
statement, the flow is computed for each arc in the graph 
according to the following rules: 


1. The flow on the input arc to the fork-gen is 1. 


2. The flow on each output arc of a statement is the join 
product of all its input flows multiplied by the statement 
transmission. 


The execution time transform for the whole fork-gen is found 
by computing the flow on the output arc of the execution 
precedence graph and then substituting the corresponding 
numerical values for the f-symbols. For example, the above 
flow computation rules can be used on the fork-gen of figure 3 
to compute the following flow for the output arc: 


$$f_2(10) f_5(5) z^{23} + f_2(10) f_5(15) z^{25} + f_2(23) f_5(5) z^{31} + f_2(23) f_5(15) z^{38}$$


Substituting the specific values of the f-symbols gives the 
transform for the whole fork-join: 


$$.12 z^{23} + .28 z^{25} + .18 z^{31} + .42 z^{38}$$


The expected execution time for the whole fork-gen can easily 
be computed from this transform as 31.3 time units. 
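The flow computation can be sketched as follows for the fork-gen of figure 2. The representation is an assumption made for this example: each term of a flow is a set of (statement, time) f-symbols mapped to its exponent of z, and a conflict between two f-symbols of the same statement makes the term vanish, as in the multiplication rule above. Unlike the printed flow, the sketch also carries the probability-1 symbols of statements 1, 3, and 4.

```python
from itertools import product as cartesian
from functools import reduce

# Transmission of each labeled statement as (time, probability) pairs.
STATEMENTS = {
    1: [(20, 1.0)],                # LOOP 5 DO DELAY(4)
    2: [(23, 0.6), (10, 0.4)],
    3: [(5, 1.0)],
    4: [(3, 1.0)],
    5: [(15, 0.7), (5, 0.3)],
}
PREC = [(1, 4), (2, 3), (2, 5), (3, 4)]

def join(flow_a, flow_b):
    """Join product of two symbolic flows."""
    out = {}
    for (sym_a, ea), (sym_b, eb) in cartesian(flow_a.items(), flow_b.items()):
        merged = dict(sym_a)
        # f-symbol rule: f_i(ta) f_i(tb) is zero unless ta == tb.
        if any(merged.setdefault(i, t) != t for i, t in sym_b):
            continue
        out[frozenset(merged.items())] = max(ea, eb)
    return out

def transmit(flow, label):
    """Multiply a flow by a statement transmission: exponents add."""
    out = {}
    for symbols, e in flow.items():
        for t, _ in STATEMENTS[label]:
            out[symbols | {(label, t)}] = e + t
    return out

def output_flow():
    unit = {frozenset(): 0}        # flow on the input arc is 1
    preds = {i: [a for a, b in PREC if b == i] for i in STATEMENTS}
    flows = {}
    def flow_out(i):
        if i not in flows:
            inputs = [flow_out(p) for p in preds[i]] or [unit]
            flows[i] = transmit(reduce(join, inputs), i)
        return flows[i]
    sinks = [i for i in STATEMENTS if all(a != i for a, _ in PREC)]
    return reduce(join, [flow_out(s) for s in sinks])

def substitute(flow):
    """Replace each f-symbol by its numerical probability."""
    prob = {(i, t): p for i, opts in STATEMENTS.items() for t, p in opts}
    transform = {}
    for symbols, e in flow.items():
        weight = 1.0
        for s in symbols:
            weight *= prob[s]
        transform[e] = transform.get(e, 0.0) + weight
    return transform

transform = substitute(output_flow())
print(sorted(transform.items()))
print(sum(e * p for e, p in transform.items()))
```

Keeping the coefficients symbolic until the end is what prevents statement 2, which feeds both statements 3 and 5, from being counted as two independent random choices.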


4. PARALLEL SPEEDUP FACTOR 


For large programs, the computation of the complete 
transform may be very time consuming since the number of 
terms in the transforms may grow quite large, especially if 
there are lots of complex loops with high repetitions. For such 
programs it is necessary to have an efficient approximation 
technique for estimating bounds on the expected execution 
time. Since the branching probabilities, delay times, and loop 
repetition counts are usually only approximations themselves, 
a good approximation technique for analyzing the whole 
program may be sufficient in many cases. The parallel speedup 
factor may be computed exactly using transforms or 
approximated efficiently with one of these approximation 
techniques. 


The recursive definition of PEL statements allows a simple 
recursive computation rule for the lower and upper bounds on 
the expected execution time: the bounds for each statement are 
computed from the bounds of its component statements. The 
lower bound will be calculated by using expectations or 
averages, and the upper bound by using maximums. For most 
programs, the lower bound will be much tighter than the upper 
bound; in fact, for purely sequential programs with no parallelism, 
the lower bound will be exactly the expected execution time. 
However, we shall see later that for certain programs with a 
high degree of concurrency, the actual expected execution time 
may be nearer to the upper bound. Similar approximation 
techniques were introduced for performance evaluation of 
asynchronous systems in Lester [19]. 


Definition: Let S denote any PEL statement. The average 
approximation (denoted AVE(S)) is a nonnegative number 
defined recursively as follows: 


(a) If S is a delay statement of the form DELAY(t), then 
    $AVE(S) = t$. 


(b) If S is a sequence statement of the form 
    BEGIN $S_1; S_2; \ldots; S_n$ END, then 
    $AVE(S) = AVE(S_1) + AVE(S_2) + \cdots + AVE(S_n)$. 


(c) If S is a loop statement of the form LOOP n DO $S_1$, 
    then $AVE(S) = n\,AVE(S_1)$. 


(d) If S is an if statement of the form 
    IF RANDOM < p THEN $S_T$ ELSE $S_F$, then 
    $AVE(S) = p\,AVE(S_T) + (1-p)\,AVE(S_F)$. 


(e) If S is a fork-join statement with component statements 
    $S_1, S_2, \ldots, S_n$, then AVE(S) is the length of the longest path 
    through S (critical path), where path length is the sum of 
    $AVE(S_i)$ for all statements $S_i$ encountered on the path. 


Figure 3 — Execution Precedence Graph 


For any PEL statement S, the maximum approximation 
(denoted MAX(S)) is defined according to the same rules as 
AVE, except for rule (d) where the maximum of the true and 
false branches is used instead of the average. It is possible to 
show that for any PEL statement S, AVE(S) and MAX(S) 
represent lower and upper bounds on the expected execution 
time E(S): 

$$\mathrm{AVE}(S) \le E(S) \le \mathrm{MAX}(S)$$


In fact, MAX(S) is also an upper bound on the execution time 
itself, not just the expectation. The definition of AVE and 
MAX provides a linear time recursive algorithm for computing 
an upper and lower bound on the expected execution time of 
any PEL program. Recall that for the program of figure 1, the 
exact expected execution time was calculated earlier as 26.95 
time units. For this program, AVE is 25.2 and MAX is 35. 
The reason that AVE is so close to the actual expected time is 
that the program has a relatively low level of parallelism. For 
the fork-gen statement of figure 2, the exact flow methods and 
f-symbol technique were used to calculate the expected 
execution time as 31.3 time units. Using the approximation 
techniques, AVE is computed as 29.8, which compares 
favorably to the exact expected time, and MAX is 38. 
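The two bounds can be computed by one recursive routine that differs only in how an if statement combines its branches. The sketch below uses a hypothetical tuple encoding of PEL statements, chosen for this example; a fork is given as a table of labeled statements plus a list of precedence pairs (empty for a fork-list).

```python
def bound(stmt, pick):
    """AVE when pick averages the branches, MAX when it takes the larger."""
    kind = stmt[0]
    if kind == "delay":
        return stmt[1]
    if kind == "seq":
        return sum(bound(s, pick) for s in stmt[1])
    if kind == "loop":
        return stmt[1] * bound(stmt[2], pick)
    if kind == "if":
        _, p, then_s, else_s = stmt
        return pick(p, bound(then_s, pick), bound(else_s, pick))
    if kind == "fork":
        _, parts, prec = stmt
        finish = {}
        def longest(i):
            # Longest path ending at statement i (critical path length).
            if i not in finish:
                start = max((longest(a) for a, b in prec if b == i), default=0)
                finish[i] = start + bound(parts[i], pick)
            return finish[i]
        return max(longest(i) for i in parts)
    raise ValueError(kind)

def ave(stmt):
    return bound(stmt, lambda p, a, b: p * a + (1 - p) * b)

def max_bound(stmt):
    return bound(stmt, lambda p, a, b: max(a, b))

def D(t):
    return ("delay", t)

FIG1 = ("if", 0.3,
        ("seq", [D(5), D(2), D(7)]),
        ("fork", {1: ("if", 0.5, D(20), D(35)),
                  2: ("loop", 10, D(3)),
                  3: ("if", 0.4, D(5), D(10))}, []))
FIG2 = ("fork", {1: ("loop", 5, D(4)),
                 2: ("if", 0.6, D(23), D(10)),
                 3: D(5),
                 4: D(3),
                 5: ("if", 0.7, D(15), D(5))},
        [(1, 4), (2, 3), (2, 5), (3, 4)])

print(ave(FIG1), max_bound(FIG1))
print(ave(FIG2), max_bound(FIG2))
```

Each statement is visited a bounded number of times, which is consistent with the linear-time claim in the text as long as the precedence lists are short.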


For any PEL statement S, the sequential execution time 
(denoted SEQ(S)) is defined according to the same rules as 
AVE, except for rule (e) where the sum of the times for all the 
component statements of each fork-join is used instead of the 
critical path. It is easily shown that SEQ corresponds exactly 
to the expected execution time assuming that fork-join 
statements are executed sequentially rather than concurrently. 
So far in our computation of the execution time, we have 
assumed an unlimited number of processors so that all the 
concurrency represented in the fork-join statements may be 
exploited. SEQ is the expected execution time assuming the 
program is executed on a one processor system with no 
concurrency. Just as for the AVE and MAX approximations, 
SEQ can also be calculated in time which is linear in the 
number of statements in the program. 


For any PEL program, the parallel speedup factor is 
defined as the expected sequential execution time divided by 
the expected execution time. That is, the parallel speedup 
factor is the expected execution time assuming one processor 
divided by the expected time assuming an unlimited number of 
processors. The expected parallel execution time can be 
calculated exactly by using the transform method, or it can be 
approximated using the AVE or MAX approximations. The 
parallel speedup factor can then easily be calculated by 
dividing into SEQ. For the example program of figure 1, SEQ 
is 50.05 and the parallel speedup factor is therefore 1.86. For 
the fork-gen of figure 3, the sequential execution time (SEQ) is 
57.8 and the parallel speedup factor is 1.85. 
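The figures quoted above can be reproduced by hand from the SEQ rules, using the exact expected times 26.95 and 31.3 computed in section 3. The arithmetic below is a worked check, not part of the original text.

```latex
\begin{aligned}
\mathrm{SEQ}(\text{figure 1}) &= .3\,(5+2+7) + .7\,\bigl[(.5\cdot 20 + .5\cdot 35) + 10\cdot 3 + (.4\cdot 5 + .6\cdot 10)\bigr] \\
 &= 4.2 + .7\,(27.5 + 30 + 8) = 4.2 + 45.85 = 50.05, \\
\text{speedup} &= 50.05 / 26.95 \approx 1.86; \\[4pt]
\mathrm{SEQ}(\text{fork-gen}) &= 5\cdot 4 + (.6\cdot 23 + .4\cdot 10) + 5 + 3 + (.7\cdot 15 + .3\cdot 5) \\
 &= 20 + 17.8 + 5 + 3 + 12 = 57.8, \\
\text{speedup} &= 57.8 / 31.3 \approx 1.85.
\end{aligned}
```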


5. NORMAL APPROXIMATION TECHNIQUE 


The methods of section 4 provide a linear time algorithm to 
compute upper and lower bounds on the expected execution 
time of any PEL program. For programs with a relatively low 
level of parallelism, the lower bound (AVE) is a good 
approximation to the actual expected execution time; and for 
purely sequential programs, AVE is the exact value of the 
expected time. However, programs with moderate or high 
parallelism may have expected execution times that differ 
significantly from AVE, and in some cases even approach the 
MAX approximation. This section introduces an additional 
approximation technique called the "normal approximation," 
which uses the degree of parallelism to compute a more exact 
approximation to the expected execution time somewhere 
between the upper and lower bounds provided by MAX and 
AVE. 


The normal approximation improves significantly on the 
AVE approximation by considering both the expectation and 
the variance of the execution time for each statement. There 
is a simple recursive computation rule for the normal 
approximation that is similar to the AVE approximation, 
except that two numbers are computed for each statement: the 
NORM (an approximation to the expectation) and the VAR 
(an approximation to the variance). It is interesting that for 
purely sequential programs, the variance does not affect the 
expected execution time, but for highly parallel programs, 
changes in the variance may significantly affect the overall 
expectation. Thus, by considering the variance, the normal 
approximation is much more accurate than the AVE or MAX 
approximations, and yet is still efficiently computable. 


For delay, sequence, loop, and if statements in PEL, 
standard probabilistic techniques provide simple formulas for 
computing the expectation (NORM) and the variance (VAR) 
from the component statements. For the fork-join statements, 
a new technique has been developed which assumes a normally 
distributed execution time for each component statement and 
then uses the exact transform techniques outlined in section 3 
to compute the overall expectation and variance. To reduce the 
complexity of the transform computation, a discrete 
approximation to a normal distribution is used for each 
component statement. This discrete approximation has only 
seven nonzero terms computed directly from the expectation 
and variance of the component statement, thus ensuring that 
the transform techniques are fast and efficient. 


Definition: Let S be any PEL statement. The normal 
approximation is a pair of numbers NORM(S) and VAR(S) 
defined recursively by the following rules: 


(a) If S is a delay statement of the form DELAY(t), then 
    $\mathrm{NORM}(S) = t$ and $\mathrm{VAR}(S) = 0$. 

(b) If S is a sequence statement of the form 
    BEGIN $S_1; S_2; \ldots; S_n$ END, then 
    $\mathrm{NORM}(S) = \mathrm{NORM}(S_1) + \mathrm{NORM}(S_2) + \cdots + \mathrm{NORM}(S_n)$ 
    and $\mathrm{VAR}(S) = \mathrm{VAR}(S_1) + \mathrm{VAR}(S_2) + \cdots + \mathrm{VAR}(S_n)$. 

(c) If S is a loop statement of the form LOOP n DO $S_1$, 
    then $\mathrm{NORM}(S) = n\,\mathrm{NORM}(S_1)$ and 
    $\mathrm{VAR}(S) = n\,\mathrm{VAR}(S_1)$. 

(d) If S is an if statement of the form 
    IF RANDOM < p THEN $S_T$ ELSE $S_F$, then 
    $\mathrm{NORM}(S) = p\,\mathrm{NORM}(S_T) + (1-p)\,\mathrm{NORM}(S_F)$ and 
    $\mathrm{VAR}(S) = p\,(\mathrm{NORM}(S_T)^2 + \mathrm{VAR}(S_T)) + (1-p)\,(\mathrm{NORM}(S_F)^2 + \mathrm{VAR}(S_F)) - \mathrm{NORM}(S)^2$. 

(e) If S is a fork-join statement, then for each component 
    statement $S_i$, let N denote $\mathrm{NORM}(S_i)$ and D denote 
    $(\mathrm{VAR}(S_i))^{1/2}$. Create the following discrete normal 
    approximation to the transform of $S_i$: 

    $$T(S_i) = .00621\,z^{N-3D} + .0606\,z^{N-2D} + .2417\,z^{N-D} + .3829\,z^{N} + .2417\,z^{N+D} + .0606\,z^{N+2D} + .00621\,z^{N+3D}$$ 

    (where the exponents of z in $T(S_i)$ are rounded to the 
    nearest whole number). The usual transform techniques 
    of section 3 are applied to these approximate transforms 
    to compute an approximate transform T(S) for the 
    whole fork-join. NORM(S) is defined as the expectation 
    for T(S), and VAR(S) is defined as the variance for 
    T(S). 
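The recursive rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' LISP implementation: each statement is represented as a (NORM, VAR) pair, the function names are hypothetical, and the fork-join step assumes that the join completes at the maximum of independent component times (the exact transform technique of section 3 is not reproduced here, so the product-of-distributions step below stands in for it).

```python
import math
from collections import defaultdict

# (offset in standard deviations, probability) for the seven-term
# discrete normal approximation quoted in the text.
WEIGHTS = [(-3, 0.00621), (-2, 0.0606), (-1, 0.2417), (0, 0.3829),
           (1, 0.2417), (2, 0.0606), (3, 0.00621)]


def delay(t):
    """DELAY(t): expectation t, variance 0."""
    return (float(t), 0.0)


def sequence(*parts):
    """BEGIN ... END: expectations and variances add."""
    return (sum(m for m, _ in parts), sum(v for _, v in parts))


def loop(n, body):
    """LOOP n DO body: n times the expectation and the variance."""
    mean, var = body
    return (n * mean, n * var)


def branch(p, s_true, s_false):
    """IF RANDOM < p THEN s_true ELSE s_false."""
    (mt, vt), (mf, vf) = s_true, s_false
    mean = p * mt + (1 - p) * mf
    second_moment = p * (mt ** 2 + vt) + (1 - p) * (mf ** 2 + vf)
    return (mean, second_moment - mean ** 2)


def discrete_normal(mean, var):
    """Seven-point distribution with exponents rounded to whole numbers."""
    d = math.sqrt(var)
    pmf = defaultdict(float)
    for k, w in WEIGHTS:
        pmf[round(mean + k * d)] += w
    total = sum(pmf.values())  # the quoted weights sum to 1.00012
    return {t: w / total for t, w in pmf.items()}


def fork_join(*parts):
    """Assumed semantics: join time is the max of independent components."""
    pmfs = [discrete_normal(m, v) for m, v in parts]
    support = sorted({t for pmf in pmfs for t in pmf})
    mean = second = prev_cdf = 0.0
    for t in support:
        cdf = 1.0
        for pmf in pmfs:  # P(max <= t) is the product of the component CDFs
            cdf *= sum(w for u, w in pmf.items() if u <= t)
        prob = cdf - prev_cdf
        mean += t * prob
        second += t * t * prob
        prev_cdf = cdf
    return (mean, second - mean ** 2)
```

For instance, `branch(0.25, delay(2000), delay(500))` gives an expectation of 875 and a variance of 421875, and `fork_join` applied to two statements that each have expectation 100 and variance 100 returns an expectation above 100, reflecting the wait for the slower branch.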


The accuracy of this normal approximation technique 
depends on how closely the actual execution time distributions 
of the individual statements follow a normal distribution. 
There is an important result in standard probability theory 
called the central limit theorem, which states that the 
probability distribution for a sum of n independent random 
variables approaches a normal distribution as n grows large. 


Thus, if the individual statements in the fork-join contain a 
loop or have many component statements, then chances are 
good that their execution time will indeed begin to approach a 
normal distribution. We have found that loop statements with 
more than a few repetitions can be approximated 
by normal distributions with an accuracy of 1-2%. Our 
empirical studies of a wide range of parallel programs have 
shown that the normal approximation is accurate within 1-5%, 
representing a significant improvement over the AVE and 
MAX approximations, and yet still efficiently computable even 
for large programs. 


6. IMPLEMENTATION 


For research purposes, a prototype system has been 
developed for analyzing PEL programs using the exact 
transform techniques and the various approximation 
techniques presented in this paper. The system is written in 
LISP and is currently running on a VAX 11/780 under the 
VMS operating system. The PEL program is entered into the 
system as a simple text file in a format similar to that shown 
in the figures of this paper. To illustrate the practical use of 
this system in designing and analyzing real parallel programs, 
we give two simple examples here: highly parallel vector 
division and parallel file processing with double buffering. For 
both these examples, the process of formulating the algorithm 
and creating the PEL program is illustrated, then the actual 
results from the prototype PEL analysis system are given. 

The first example is a simple vector division function with 
a special rule for handling division by zero: if the divisor and 
dividend are both zero, then the quotient is one, but if the 
dividend is nonzero, then the quotient is positive or negative 
infinity (represented by the maximum size number for the 
computer). Using the standard "forall" primitive to represent 
the parallel form of a loop, the function to compute a vector C 
by dividing the sixteen component vector A by vector B is as 
follows: 


FORALL i := 1 TO 16 PARDO 
    IF B[i] <> 0 THEN C[i] := A[i]/B[i] 
    ELSE IF A[i] = 0 THEN C[i] := 1 
    ELSE C[i] := MAXNUM * SIGN(A[i]) 
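For readers less familiar with the "forall" notation, the same division rule can be written as an ordinary sequential Python function. This is a sketch only: `MAXNUM` is assumed here to be the largest finite float, `sign` is a hypothetical stand-in for the predefined SIGN function, and the loop runs one element at a time rather than as sixteen concurrent activations.

```python
import sys

# Assumption: MAXNUM is the largest finite floating-point number.
MAXNUM = sys.float_info.max


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def vector_divide(a, b):
    """Element-wise a/b with the special rule for division by zero."""
    c = []
    for a_i, b_i in zip(a, b):
        if b_i != 0:
            c.append(a_i / b_i)           # ordinary division
        elif a_i == 0:
            c.append(1)                   # 0/0 is defined as one
        else:
            c.append(MAXNUM * sign(a_i))  # nonzero/0 is +/- "infinity"
    return c
```

Each loop iteration is independent of the others, which is what allows the "forall" form to activate all sixteen bodies at once.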


In the above statement, there are sixteen concurrent 
activations of the body of the "forall", one for each value of 
the index i. "MAXNUM" is a predefined constant, and "SIGN" 
is a predefined system function. To translate this statement 
into PEL, some estimates are needed for the relative execution 
time of the primitive operations in the language. Let us assume 
that assignment requires 5 time units, array subscripting and 
the forall setup each require 10 time units, the SIGN function 
requires 40 time units, and division, multiplication, and 
boolean comparison each require 10 time units. Also, 


assume that 80% of the vector entries are nonzero in a typical 
input data set. The "forall" primitive can be represented in 
PEL by a fork-n statement, and the above statement is 
translated into the following PEL program: 


PROGRAM; 
BEGIN 
DELAY (10); 
FORK(16): 
DELAY(25); 
IF RANDOM > .8 THEN DELAY(50) 
ELSE BEGIN 
DELAY(25); 
IF RANDOM > .2 THEN DELAY(15) 
ELSE DELAY(80) 
END. 


For this program, the PEL analysis system computed the 
following results: 


Expected Execution Time: 136.6 
Sequential Execution Time: 1494.4 
Parallel Speedup Factor: 10.93 


In a second example, we consider a file processing program 
which reads a sequence of student records from a disk file, 
computes the grade point average of each student, and stores 
the results as a new sequential disk file. The program uses 
double buffering of records for both files so that all three 
activities of reading, computing, and writing can occur 
concurrently. The overall flow of activity is shown in figure 4. 
It seems from this diagram that the parallel speedup factor will 
be a maximum of three since there are three concurrent 
activities. However, a more detailed analysis reveals that the 
parallel speedup factor is highly dependent on file storage 
techniques, disk access times, and execution times for language 
primitives. 


Let us assume that the Student File is sequential and 
contains three student records per disk block, so that a disk 
access to read in a new block is required with probability (.25) 
for each student record on the average. Also, assume that the 
G.P.A. file is sequential with a much smaller record size for 
which 19 records can be stored in each disk block. Thus, the 
probability that writing a single G.P.A. record will require a 
disk access is (.05) on the average. When the program issues 
an operating system call to read the next record for a 
sequential disk file, let us assume that the execution time is 
2000 time units if a disk access is required, but only 500 time 
units if the needed block is already in memory and no new disk 
access is required. Thus the PEL statement for reading each 
student record is as follows: 


IF RANDOM < .25 THEN DELAY(2000) ELSE DELAY(500) 


The PEL statement for writing each G.P.A. record is as 
follows: 


IF RANDOM < .05 THEN DELAY(2000) ELSE DELAY(500) 


The execution time for computing the G.P.A. of each student 
will depend on some timing assumptions for the primitives of 
the source programming language and the size of the student 
record. Assuming that each student record contains an 
average of 40 course grades, and the primitive arithmetic and 
other operations of the language vary from 8 to 12 time units, 
a simple looping program for the "Compute G.P.A." activity 
shown in figure 4 translates into the following PEL statement: 


BEGIN 
DELAY(8); 
LOOP 40 DO BEGIN DELAY(12); DELAY(28) END; 
DELAY(8); 
DELAY(8) 
END; 


To execute the Read, Compute, and Write activities 
concurrently, a fork-list statement is needed in PEL, and this 
statement will be executed once for each student record. 
Assuming there are 1000 student records in the file, there will 
be an outer loop with 1000 repetitions. Finally, some timing 
assumptions are needed regarding file and buffer initialization 
times: opening a sequential file for input requires 2500 time 
units, opening a new sequential file for output requires 600 
time units, and initializing buffers requires 40 time units. 
From all of the above assumptions, the following PEL program 
results: 


PROGRAM; 
BEGIN 
    DELAY(2500); DELAY(600); DELAY(40); DELAY(40); 
    LOOP 1000 DO 
    BEGIN 
        DELAY(12); 
        DELAY(12); 
        FORK DELAY(8); DELAY(8) JOIN; 
        DELAY(12); 
        FORK 
            IF RANDOM < .25 THEN DELAY(2000) 
            ELSE DELAY(500); 

            BEGIN 
                DELAY(8); 
                LOOP 40 DO 
                BEGIN 
                    DELAY(12); 
                    DELAY(28); 
                END; 
                DELAY(8); 
                DELAY(8) 
            END; 

            IF RANDOM < .05 THEN DELAY(2000) 
            ELSE DELAY(500); 
        JOIN 
    END; 
END. 


Using the prototype PEL analysis system for this program 
results in the following output: 


Expected Execution Time: 17.79 x 10^5 
Sequential Execution Time: 31.18 x 10^5 
Parallel Speedup Factor: 1.75 
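The expected execution time can be checked by hand, because every branch of the main fork-join takes one of only two values. The Python sketch below is not the PEL analysis system; it enumerates the four read/write outcomes, assumes that the fork-join completes when its slowest branch does, and uses the "Compute G.P.A." statement given earlier (8 + 40 x (12 + 28) + 8 + 8 time units). The function names are illustrative.

```python
from itertools import product

# (probability, time) outcomes of the Read and Write statements.
READ = [(0.25, 2000), (0.75, 500)]
WRITE = [(0.05, 2000), (0.95, 500)]

# Deterministic time of the "Compute G.P.A." statement.
COMPUTE = 8 + 40 * (12 + 28) + 8 + 8


def expected_fork_time():
    """Expected time of the Read/Compute/Write fork-join (max of branches)."""
    total = 0.0
    for (p_r, t_r), (p_w, t_w) in product(READ, WRITE):
        total += p_r * p_w * max(t_r, COMPUTE, t_w)
    return total


def expected_program_time(records=1000):
    """Setup time plus the expected time of the outer loop."""
    setup = 2500 + 600 + 40 + 40
    per_record = 12 + 12 + max(8, 8) + 12 + expected_fork_time()
    return setup + records * per_record
```

Under these assumptions `expected_program_time()` evaluates to about 1,779,280 time units, in agreement with the reported 17.79 x 10^5.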


Thus, it is seen that waiting time and lack of balance between 
the execution times of the three concurrent activities in figure 4 
have reduced the apparent concurrency factor of 3 to an actual 
parallel speedup factor of only 1.75. If the time unit used in 
making the initial estimates for the basic operations is 10 
microseconds, then the expected execution time for the 
program is 17.8 seconds. 


7. CONCLUSIONS 


The PEL language is a formal system for describing the 
overall control structure and timing delays in a parallel 
program, as an aid in computing the parallel speedup factor. 
Real parallel programs are sufficiently large and complex that 
such a formal description language is useful as a supplement to 
standard algorithm analysis techniques. The transform 
techniques and efficient approximation techniques presented in 
this paper for PEL programs can then be used to compute the 
expected execution time and parallel speedup factor for the 
program. The mathematical complexities introduced by 
program parallelism make formal analysis techniques necessary 
even for small programs. 


There is a large body of standard analysis techniques for 
sequential probabilistic systems using z-transforms and linear 
flow graph reduction. The major contribution of this paper is 
the extension of these techniques to parallel systems in the 
form of PEL programs. This paper also presents efficient 
approximation techniques for estimating the expected 
execution time and parallel speedup factor for large programs. 
Future work in this area will focus on the development of 
compilers to translate parallel programs directly into PEL for 
analysis. Also, future research is needed to extend the PEL 
primitives to allow message passing between parallel processes, 
a feature which is contained in many concurrent programming 
languages. 


Introduction 


Execution of parallel programs on parallel computer 
architectures requires scheduling, coordination, and resource 
management. These control activities themselves consume 
computer resources. Centralized, serial resource control will 
become the performance bottleneck when large numbers of 
processes must be scheduled and managed. 

This paper studies parallel structuring of process control 
and resource management. The environment for the study is 
the Run-time System of the Computation Structures Language 
(CSL) parallel programming system. The approach has been 
to design and evaluate a parallel version of the control 
and resource management functions of the CSL run-time 
system. The performance characteristics of the parallel 
version are predicted and an algorithm for determining the 
optimal degree of parallelism for the run-time system is given. 

Parallel structuring of resource management is essential 
in order to alleviate the well-known serialization bottleneck 
that occurs whenever a large number of parallel processes 
must access a shared resource through a serial monitor 
[MAD74, CEZ83]. 

The serialization bottleneck of a centralized resource 
management system can be demonstrated by considering the 
fact that the resource manager is a shared resource of the 
tasks. The resource manager can impose a constant upper 
bound on the speedup obtainable in the system even if the 
tasks themselves run completely in parallel without 
interdependencies and have a linear speedup potential. 

The Computation Structures Language (CSL) is a lan- 
guage for specifying the structure of parallel computations 
and controlling the execution of these structures. (The next 
section gives an introduction to and examples of CSL 
programs.) The CSL Run-time System (RTS) thus forms a 
serialization bottleneck when the execution of a complex 
computation structure must be managed, and limits the 
maximum speedup obtainable in parallel programs in the 
CSL system. The design of a Parallel Run-time System 
(PARTS) consists of a method of parallelizing the resource 
management functions of the original Run-time System. It 
consists of a restructuring of the system tables such that 
they support multiple copies of the original RTS along with 
specifications for these access procedures. 

A simulation model for PARTS executing on a certain 
class of CSL programs was developed. The data used to 
parameterize the model is taken from a working RTS on a 
real system where appropriate. The performance of the 
model under varying conditions is shown. 

An analytical equation that closely approximates the 
performance of the simulation is presented. The equation is 
used to predict the performance characteristics of PARTS 
under variations of degree of parallelism and to develop a 
schedule for creating multiple copies of the RTS to execute 
in parallel that maximizes the speedup under the condition 
that all tasks have equal execution times. The restriction to 
tasks with equal execution times is not as limiting as it 
might seem at first glance. We have found that execution of 
CSL programs is often dominated by execution of replicated 
arrays of processes which will have closely similar execution 
times. 

The results reported here should have at least conceptual 
application to any system for control and resource 
management of complex dynamic parallel computation structures. 


Parallelizing the Centralized RTS 


The Computation Structures Language (CSL) is a lan- 
guage for expressing the synchronization and communication 
requirements of a set of tasks [Browne 82]. CSL specifies the 
traversal of a computation graph. The CSL program does 
not, however, perform any of the computations associated 
with the parallel program. A simple CSL program for the 
producer/consumer problem is shown below. 


JOB Sample-CSL; 

CONSTRUCT 
    TASKS 
        T[i] : File1; RANGE i=1 to 2; 
        T3 : File2; 
    CHANNELS 
        Ch[i] DATACHANNEL from T[i] to T3; 
            RANGE i=1 to 2; 
END; (* Construct *) 
BEGIN (* Executable Code *) 
    COBEGIN 
        // EXECUTE T[1]; 
           SEND X to Ch[1]; 
           RECEIVE X from Ch[1]; 
        // EXECUTE T[2]; 
           SEND Y to Ch[2]; 
           RECEIVE Y from Ch[2]; 
    COEND; 
    EXECUTE T3; 
END. (* CSL Program *) 


In this program it is specified that T[1] and T[2] may 
execute in parallel. The three statements following each "//" 
comprise a parallel stream. Each statement within a stream 
is executed serially. After T3 receives both X and Y, the 
Cobegin completes and T3 may execute. 

The CSL RTS is the resource manager for a running 
parallel program. It can be thought of as defining a local 
operating system for the parallel program. When a Cobegin 
is encountered, the RTS sends messages to tasks telling them 
to begin (Initializing), waits for completion, and determines 
status information through a message sent back from the 
completed task (Finishing). The RTS also coordinates the 
passing of data between tasks. The result of any action by 
the RTS is a modification of the global state of the parallel 
computation. The global state of the computation is 
recorded in a set of system tables maintained by the 
run-time system. 

By definition of CSL, tasks only communicate with each 
other before or after their execution. Thus the computation 
granularity of the tasks depends on the frequency of 
communication among the tasks. In programs written in the 
CSL environment, this has varied from 2 MS to 1000 MS. 

The set of statements managed within the Cobegin at a 
given time are serviced in a round-robin fashion with time 
being allocated on the basis of what work needs to be done. 
Thus when a statement is ready to be Initialized or Finished, 
it may have to wait for service. This causes a serialization 
bottleneck to develop within the system. 

The PARTS system partitions the Cobegin into groups of 
statements, each managed by a separate RTS. The 
partitioning process attempts to balance the workload between 
RTS’s. When a Cobegin is encountered, the current RTS 
"Initializes" the other RTS’s, telling each to commence 
execution. This is an overhead factor in the PARTS system. 
As soon as the RTS’s are Initialized, they begin managing 
their portion of the Cobegin statements. Since a global state 
is preserved, the system tables are made available for access 
by all RTS’s in the system. Since RTS’s modify the global 
state, the system tables must be accessed serially, which may 
introduce wait time into RTS execution. 

In summary, the serialization bottleneck of Initializing 
and Finishing statements in the Cobegin has been parallel- 
ized in PARTS, but has been compromised somewhat by the 
time to Initialize and Finish the new RTS’s plus the ad- 
ditional wait time due to shared memory contention for sys- 
tem table access. However, the first-order serialization point 
has been removed. 


Simulation Model 


The simulation model developed for PARTS is an exten- 
sion of a model of a single RTS managing a given number of 
streams. Since the Cobegin is the major feature of the lan- 
guage, and the schema by which parallel computation may 
take place in CSL, the simulation models just this activity. 
(Often an entire parallel program structure can be described 
by a series of Cobegin statements.) 

The simulation was constructed using the Performance 
Analyst’s Workbench System (PAWS) [IRA84], an event driven 
simulation language especially designed for modelling 
computer and information systems. The operation of PAWS is 
described here through Information Processing Graphs 
(IPG’s). 

Figure 1 shows the IPG of a single RTS managing 3 
computation streams, each comprised of a single statement. 
This, for instance, would model the following: 


COBEGIN 
// EXECUTE T1; 
// EXECUTE T2; 
// EXECUTE T3; 
COEND ; 


The result of a Cobegin being encountered is a trans- 
action leaving Source and proceeding to Fork. There, 3 new 
transactions are generated with phases 1, 2, and 3. After 
waiting in series in the Queue IANDF for Initialization, the 
transactions cause the operation of the stream to commence. 
E1 through E3 model the execution times of each of the 
three streams. When a stream completes, a transaction 
passes through the Compute node for routing purposes, 
enters the IANDF Queue to be Finished, and then waits at 
Join for the rest of the streams to complete and be Finished. 
When all 3 transactions enter Join, the simulation completes. 
For large numbers of streams IANDF will be nonempty for 
the duration of the simulation. 

An implicit assumption made within the model is that 
the streams execute independently. This was done 
deliberately to expose the severe performance degradation 
experienced by parallel computations without sufficient 
parallel resource management, even when the underlying 
computation is experiencing optimal performance. Any com- 
munication between streams would increase the workload of 
the RTS’s, increasing the serialization bottleneck. 

Figure 2 shows the extension of the model to a PARTS 
configuration with 3 RTS’s, each managing its portion of a 
partitioning of the streams in a Cobegin. In this model, 
RTS1 Initializes RTS2 and RTS3 before proceeding on to 
manage its streams. Since RTS2 and RTS3 are tasks 
themselves, they must pass through the IANDF node of RTS1 to 
be Finished. When all three RTS’s have completed, the 
Cobegin is complete. 

An allowance for the effect of shared memory contention 
between RTS’s is made by introducing a delay node 
representing the time taken by each RTS to acquire control 
of the shared tables. In PARTS this is nearly constant for 
any given stream. 

The parameters of the model, then, are: 

    X - the number of streams in the Cobegin 
    Y - the number of RTS’s defined to execute in parallel 
    I - the task Initializing time 
    F - the task Finishing time 
    RTSi - the RTS Initializing time 
    RTSf - the RTS Finishing time 
    E - the execution time of a given stream (task) 
    SM - the shared memory access time 


An Analytical Model 


An analytical model is shown that describes the behavior 
of the PARTS system when all streams have equal execution 
times. CSL allows for convenient methods of executing tasks 
in parallel that are replicas of each other. This feature has 
been used extensively in CSL programs written to date. 


Speedup in PARTS is defined as: 

    (I + E + F) / f(X, Y, SM) 

where "I + E + F" is the time for a single RTS to 
execute 1 stream of execution time E. The function "f" is the 
completion time of a PARTS configuration (for a given 
Cobegin) normalized to that of a single stream. 

    f(X, Y, SM) = (A + B + C + r + D) / X 

where: 

    A = (X/Y) (I + F) 
    B = (Y-1) (RTSi + RTSf) 
    C = 2 (X/Y) SM (Y-1) p 
    p = MIN [((Y-1) SM) / (SM + I), 1] 
    D = MAX [0, E - ((X/Y) - 1) F] 

"A" represents the savings achieved by parallelizing the 
resource management functions of Initializing and Finishing 
tasks. "B" is derived from the fact that (Y-1) RTS’s must 
be Initialized and Finished by the master RTS. "C" models 
the added time due to the serialization of shared memory 
access. When p=1, the queue for shared memory access is 
saturated. "D" models the idle wait time an RTS spends 
waiting for work before the first task completes, but after all 
streams have been initialized. If an RTS is kept completely 
busy, then D=0. 
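The analytical model is easy to evaluate numerically. The following Python sketch transcribes the terms A, B, C, p, and D as defined above; the function names are hypothetical. The numerator of f as printed contains a symbol between C and D that the text does not define, and the sketch leaves it out, which is an assumption.

```python
def completion_time(x, y, i, f, e, rts_i, rts_f, sm):
    """f(X, Y, SM): completion time normalized to a single stream.

    Assumption: the undefined symbol printed between C and D in the
    numerator is omitted.
    """
    p = min((y - 1) * sm / (sm + i), 1.0)      # shared-memory queue load
    a = (x / y) * (i + f)                      # Initialize/Finish X/Y tasks
    b = (y - 1) * (rts_i + rts_f)              # start and finish other RTS's
    c = 2 * (x / y) * sm * (y - 1) * p         # shared-table contention
    d = max(0.0, e - ((x / y) - 1) * f)        # idle wait for first task
    return (a + b + c + d) / x


def speedup(x, y, i, f, e, rts_i, rts_f, sm):
    """Speedup of a PARTS configuration over one RTS running one stream."""
    return (i + e + f) / completion_time(x, y, i, f, e, rts_i, rts_f, sm)
```

With I = F = RTSi = RTSf = 4, E = 25, and SM = 0.5, a single RTS managing 64 streams gives `speedup(64, 1, 4, 4, 25, 4, 4, 0.5)` = 4.125, which equals (I + E + F)/(I + F) and illustrates the constant bound imposed by a centralized resource manager. Four RTS's under the same parameters give roughly 12.6.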


Results 


Simulation runs were made for many values of the 
parameters. Space precludes detailed presentation of results. 
Additional results may be obtained from the authors by 
request. 

Figure 3 shows the speedups for tasks with a typical 
execution length (25 MS) observed in the CSL environment. 
The values for I, F, RTSi, and RTSf were determined to be 4 
+/- 1.25 MS on a Dual Cyber 170/750 system. This time 
consisted mainly of Cyber operating system overhead in the 
message passing system. The value of SM was given the 
pessimistic value of 0.5 MS. 

If the right choice is made for the number of RTS’s to be 
used in a given configuration, PARTS is an effective way to 
increase the speedup potential. This scheduling choice can 
be easily accomplished by differentiating the speedup 
equations with respect to Y. This gives the schedule of RTS’s for 
maximum speedup as: 

    Y = SQRT(X (I - 2 SM p) / (RTSi + RTSf)) 

With this choice of schedule, the speedup can be shown 
to be proportional to SQRT(X). This holds even though the 
performance curve of any given RTS reaches an upper bound. 
If the execution time of each task is the same (as we have 
assumed above), the maximum speedup is independent of the 
execution time. Of course, the speedup using that schedule 
may vary widely. 
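The schedule follows from the model terms: with p held fixed and D in its nonzero branch, the derivative of A + B + C + D with respect to Y is (RTSi + RTSf) - X (I - 2 SM p) / Y^2, and setting it to zero yields the expression above. A minimal Python sketch, with a hypothetical function name and with p supplied by the caller rather than recomputed from Y, is:

```python
import math


def optimal_rts_count(x, i, sm, p, rts_i, rts_f):
    """Y = SQRT(X (I - 2 SM p) / (RTSi + RTSf)).

    Assumes I > 2 SM p; p is treated as a given constant here, although
    in the model it depends on Y.
    """
    return math.sqrt(x * (i - 2 * sm * p) / (rts_i + rts_f))
```

With the measured values I = RTSi = RTSf = 4 and SM = 0.5, and taking the saturated case p = 1, 64 streams give Y = SQRT(24), or about 4.9 RTS's. Quadrupling X doubles Y, which is the square-root behavior noted in the text. In practice Y must be rounded to a whole number of RTS's.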

A setting in which PARTS turns out to be less promising 
is one in which increasing numbers of tasks compute over a 
constant workload (decreasing granularity of units of 
computation). (In the previous examples, increasing X 
implied increasing the problem size. Each task continued to 
have the same execution time regardless of the number of 
tasks.) In this case, we define the sum of the execution times 
of all tasks to be constant. 

A nonintuitive result is shown in Figure 4. In this case 
the simulation of Figure 5 is modified by allowing the 
execution times of the tasks to follow a hyperexponential 
distribution with a mean of 25 MS. An increase in the 
variance-to-mean ratio to 0.4 caused the results shown. For 
a certain range of X, the speedup is actually better for a 
larger variance! 

This is explained in the following way. The initial 
performance decrease is due to the fact that the increased 
variance causes some streams to finish much later than 
others. Time is lost while an RTS waits for them to complete. 
When more streams are introduced, however, the sheer 
number of streams causes each RTS to have no idle wait time in 
finishing streams. The higher variance, then, means that the 
RTS has a higher probability of "starting" the finishing 
work earlier and thus "completing" it earlier. This results 
in greater parallelism among RTS’s and less contention for 
system tables. 


Conclusion 


This paper has presented a study of the effect of parallel 
structuring of control and resource management algorithms 
upon the execution of parallel computations. A simple 
analytic model of general parallel computation incorporating 
overhead effects was developed and applied to this example. 
Speed-up shows a square root dependency on the degree of 
parallelism in the control and resource management al- 
gorithms. 

Variance in processing time for the tasks being controlled 
was observed to be beneficial over certain ranges of 
execution to the parallelism obtainable by the control system. 


ESTIMATING THE SPEEDUP IN PARALLEL PARSING * 


Abstract. A method for estimating the speedup for 
asynchronous bottom-up parallel parsing is presented. 
Two models for bottom-up parallel parsing are 
proposed, and the speedup for each of the two models 
is estimated using the technique developed here. The 
speedup obtained for one model is very close to the 
simulation result already available in the literature. The 
second model shows a greater speedup than the first. 


Index Terms: compilers, parallelism, parsing 


I. INTRODUCTION 

The term parallel computer means different things 
to different people. In this paper it is an MIMD machine: 
a general-purpose, tightly-coupled collection of a fixed 
number of identical processors, working asynchronously 
under one operating system to solve a single 
computational problem. Such parallel machines have been in 
existence for some time. Two primary types of such machines 
seem to be emerging: (i) the fixed-connection model, such 
as Intel’s iPSC family, and (ii) the shared-memory 
model, such as the HEP computer. When a large number of 
processors are to be connected together, the former has 
an advantage from the hardware point of view, but the 
latter is more convenient to construct an algorithm on. 

It is too early to tell which of the two organizations 
is going to be the dominant commercial machine. For a 
good introduction to various architectural issues see [11, 
13]. For this paper either of the two models can be assumed. 

Although a good deal of work has been reported 
on parallel algorithms in various application areas, the 
amount of work done on compilation in parallel has been 
relatively meager. From the results of the earlier work it 
is clear that parallelism in compilation is promising. But 
the questions of speedup, efficiency, and so forth are yet 
to be answered satisfactorily. In this paper we consider 
one such aspect of parallel compilation. 

The compilation can be divided, broadly, into three 
tasks — lexical analysis, parsing, and code generation [1,2]. 
Lexical analysis and code generation appear more paral- 
lelizable than parsing [7,8,12]. Therefore, we deal only 
with parsing in this paper. 

Cohen and Kolodner [5] proposed an asynchronous 
bottom-up parallel parsing model and estimated, by 
simulation, the speedup obtainable for Pascal-like languages. 
If the processors are connected as a linear array (as they 
have implied), their model may not always parse a string 
successfully, because a processor goes into inactive mode 
when its stack becomes empty, and a processor can 
directly communicate only with its adjacent processors. For 
example, let $P_1$, $P_2$, and $P_3$ be three processors, arranged 
in that order, parsing a string. If the stack of $P_2$ 
becomes empty before those of $P_1$ and $P_3$, then the parsing 
cannot be completed, because $P_1$ and $P_3$ can no longer 
communicate. 

We modify the model proposed in [5], to run on a 
linearly connected array of processors, as well as propose a 
new model that yields higher speedup. Both these models 
can be realized on a shared-memory machine as well as 
on a fixed—connection machine. 

In Section II related previous work is surveyed briefly. 
In Section III two parallel parsing models are proposed. A 
technique to estimate speedup in asynchronous bottom-up 
parallel parsing is presented in Section IV, while in 
Section V we estimate the speedup that can be obtained 
with the two models proposed in Section III. It is shown 
that the new model gives a greater speedup than the 
model presented in [5]. Our result also shows that the 
conjecture made about the average speedup for bottom-up 
parallel parsing in [4] is true for one of the models 
proposed here. 


II. PREVIOUS WORK 


For parallel computer architecture and programming, 
the readers are referred to Hwang and Briggs [11] and 
Kuck [13]. Ellis [9] presented two algorithms for 
compilation which can be implemented on a parallel machine with 
ILLIAC IV-type cellular architecture. Various techniques 
for lexical analysis and parsing, using the CDC-STAR-100 
vector instruction set, were introduced by Donegan 
and Katzke [8]. Krohn [13] also exploited the vector 
processing capability of CDC-STAR-100 to generate object 
code in parallel for Fortran-like languages. The program 
slicing mechanism discussed in [16] can also be used for 
parallel compilation. 


Baer and Ellis [3] have shown that by modelling an 
existing sequential compiler we get an understanding of 
modifications necessary to transform the sequential struc- 
ture into a pipeline of processes. They have evaluated a 
pipelined compiler through measurement and simulation. 
But the pipeline of processes has a limitation that only 
a fixed and small number of processors can be used, and 
that the speedup obtainable is not large. 


Dekel and Sahni [7] considered the translation of 
infix arithmetic expressions into their postfix or syntax tree 
form using synchronous parallel processors sharing a 
common global memory. The speedup obtainable is high. 
Mickunas and Schell [15] extended the LR parsing 
technique [1,2] for a multiprocessor environment so that many 
processors can be used to parse a given string starting at 
different tokens in the string. Fischer [10] has proposed 
algorithms for bottom-up synchronous parallel parsing and 
has shown through simulation that appreciable speedups 
can be achieved with these algorithms. Ligett et al. [14] 
extended the LR technique to allow arbitrarily many 
processors to build the parse tree simultaneously and measured 
the performance of the algorithm experimentally. 


Cohen and Hickey [4] made the first attempt to 
compute upper and lower bounds for the speedup that can be 
obtained by asynchronous, bottom-up, non-backtracking 
parsing of strings generated by a context-free grammar. 
They analytically found upper and lower bounds for 
speedup in parallel parsing. To develop these bounds they 
ignored the processor coordination and communication 
time. A practical asynchronous parallel parsing model 
to parse Pascal-like languages was proposed in [5]. The 
simulation result presented in [5] was compared with the 
bounds in [4]. The comparison showed a wide gap 
between the two. 


III. MODELS 

In this section two models for asynchronous bottom-up 
parallel parsing are presented. These models are used 
in later sections to estimate speedup. 

As pointed out in Section I the asynchronous paral- 
lel parsing model proposed in [5] may not, under certain 
situations, complete parsing successfully. We modify this 
model to run on a linearly connected processor array and 
call it Model A, to distinguish it from a second model, 
called Model B, presented later in this section. 

Model A: Let there be q processors $P_1, P_2, \ldots, P_q$ 
arranged in a linear array, such that processor $P_i$ is 
directly connected with processors $P_{i+1}$ and $P_{i-1}$ (see Fig. 
1(a)). A processor $P_j$ is called a predecessor of the 
processor $P_k$ if $j < k$, and processor $P_k$ is referred to as a 
successor of $P_j$. Thus, processor $P_i$ has $(i - 1)$ predecessors 
and $(q - i)$ successors. Processor $P_1$ has no predecessor 
and processor $P_q$ has no successor. Every processor $P_i$ 
has a stack, which is referred to as $STK_i$. 

Fig. 1(a) A Linear Array of Processors 

The given input string of length $L$ is divided into
approximately $q$ equal parts. The $i$th processor $P_i$,
starting at token $\lfloor (i - 1)L/q \rfloor$, scans to the right for the
next synchronizing token (e.g., semicolon, end, etc.) and
initiates parsing from the next token. A processor can be
in one of four states: active, wait, merge-only, and
inactive.
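The partitioning step can be illustrated with a minimal sketch. It assumes 0-based token indices, a hypothetical set of synchronizing tokens (`;` and `end`, taken from the examples in the text), and that the first processor starts at the beginning of the input without scanning; none of these details are fixed by the model description.

```python
# Hypothetical synchronizing tokens, following the examples given in the text.
SYNC_TOKENS = {";", "end"}

def start_positions(tokens, q, sync_tokens=SYNC_TOKENS):
    """Return, for each of q processors, the index of the first token it parses."""
    L = len(tokens)
    starts = []
    for i in range(1, q + 1):
        pos = (i - 1) * L // q          # nominal start of the i-th part
        if i > 1:
            # Scan right for the next synchronizing token ...
            while pos < L and tokens[pos] not in sync_tokens:
                pos += 1
            pos += 1                    # ... and begin parsing just after it.
        starts.append(min(pos, L))
    return starts

tokens = "a := b ; c := d ; e := f ; g := h".split()
print(start_positions(tokens, 3))
```

With three processors on this 15-token input, the second and third processors skip forward to the token following the nearest semicolon, so each begins at a statement boundary.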

A processor remains in the active state if it is able
to perform either of the two parse steps, namely shift
or reduce, or if it is performing the stack-merge operation.
Stack-merge is the process in which a processor
$P_i$ transfers the contents of its stack, from the bottom, to
the stack of another processor $P_j$ until $P_i$ encounters
a stack separator or its stack becomes empty. (A stack
separator is a special symbol used as a marker to separate
the contents of the stack of a processor.) When a
processor cannot reduce due to insufficient information
in its stack, but has received the next synchronizing token,
it places a stack separator on the top of the stack
and continues parsing. By placing a stack separator a
new stack is simulated.

When the end processor $P_q$ has completed parsing
its part and is left with a nonempty stack, it enters the
merge-only state. When any other processor $P_i$, $1 \le i < q$,
has completed parsing its part and is left with a
nonempty stack, it sends a merge request to its successor
$P_{i+1}$ and enters the wait state. In the wait state processor
$P_i$ may be acknowledged by processor $P_{i+1}$ or may get a
merge request from processor $P_{i-1}$. In the former case
the state of $P_i$ is changed to active and $P_i$ receives tokens
from $P_{i+1}$, while in the latter case processor $P_i$ cancels its
merge request to processor $P_{i+1}$. If a processor $P_i$ is
not in the wait state and receives a merge request from
$P_{i-1}$, then $P_i$ sends an acknowledgement to $P_{i-1}$. After
sending the acknowledgement, processor $P_i$ starts a
stack-merge with $P_{i-1}$. Processor $P_i$, $1 < i < q$, with a
nonempty stack goes to the merge-only state when its
successor processor $P_{i+1}$ is inactive. In the merge-only
state a processor $P_i$ waits for a merge request from its
predecessor $P_{i-1}$. Processor $P_i$ becomes inactive if its
stack is empty and $P_{i+1}$ is inactive. A processor $P_i$ in
the wait state with an empty stack does not acknowledge a
merge request immediately but waits for the contents of
the stack of $P_{i+1}$ and transfers them to $P_{i-1}$.


Model B: In this model every processor can communicate
directly with every other processor. As expected,
the extra cost of interconnection provides an
enhancement in parsing speed by reducing the interprocessor
coordination and communication time. In such
a completely connected system, although there are no
predecessors or successors of a processor in the strict sense,
to identify each processor and its substring we number
them as in Model A and use the same terms predecessor
and successor. The processors of this model also have
four states (as in Model A), but the state transitions
differ in a few cases, as discussed below.


Fig. 1(b) A Completely Connected Five Processor System.


Processor $P_{i+1}$ is called the immediate successor of
$P_i$. As soon as the stack of a processor becomes empty,
it enters the inactive state irrespective of the state of its
immediate successor. A processor $P_i$ knows which processor
$P_j$ is its immediate active successor. Before a processor
$P_j$ enters the inactive state it informs its immediate
active predecessor $P_i$ of the index of its immediate active
successor $P_k$. For example, let the immediate active
successor of $P_3$ be $P_5$ and that of $P_5$ be $P_{10}$. Consider the
situation where $P_5$ becomes inactive before $P_3$ and $P_{10}$.
In this situation, before $P_5$ becomes inactive it informs
$P_3$ that $P_{10}$ is henceforth the immediate active successor
of $P_3$.
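The bookkeeping of immediate active successors in Model B resembles deletion from a linked list. The sketch below is only illustrative: the names `make_chain` and `go_inactive` are hypothetical, and the symmetric update of the predecessor table is an added assumption (the text only describes the message sent to the predecessor).

```python
def make_chain(q):
    """succ[i] / pred[i]: immediate active successor / predecessor of P_i."""
    succ = {i: (i + 1 if i < q else None) for i in range(1, q + 1)}
    pred = {i: (i - 1 if i > 1 else None) for i in range(1, q + 1)}
    return succ, pred

def go_inactive(j, succ, pred):
    """P_j becomes inactive and hands its active successor to its predecessor."""
    i, k = pred.pop(j), succ.pop(j)
    if i is not None:
        succ[i] = k      # the message P_j sends to P_i before leaving
    if k is not None:
        pred[k] = i      # assumed symmetric update, not stated in the text

succ, pred = make_chain(10)
for j in (4, 6, 7, 8, 9):          # leave P_3 -> P_5 -> P_10 as active neighbours
    go_inactive(j, succ, pred)
go_inactive(5, succ, pred)         # P_5 informs P_3 that P_10 is its new successor
print(succ[3])
```

After the last call, the immediate active successor recorded for processor 3 is processor 10, matching the example with three active processors indexed 3, 5, and 10.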

A method to estimate the speedup for asynchronous 
bottom-up parallel parsing is presented in the next sec- 
tion. 


IV. ESTIMATING THE SPEEDUP 
In this section we develop a technique to estimate 
speedup for parallel parsing. We divide the parsing time 
into three parts as in [5]. 
Let $T_{pq}$ be the total time of parsing a string of tokens
in parallel using $q$ processors. In parallel parsing,
time is spent on processor coordination and communication,
besides the reduction time $T_{rq}$ and shift time $T_{sq}$.
Let $T_{cq}$ be the time spent on processor coordination and
communication. (The second suffix $q$ is the number of
processors used.) We assume that the parse tree has a
level $h$ such that for levels 1 to $h$ the number of nodes
at each level is smaller than the number of processors,
so that not all processors are utilized. The nodes at
levels $(h + 1)$ and below are sufficient for all $q$ processors
to work simultaneously and independently with negligible
coordination and communication. From level $h$ to
level 1, processor coordination and communication time
is significant. Therefore, we consider the coordination and
communication time, $T_{cq}$, for this part only.

The total time to parse in parallel, $T_{pq}$, can be
expressed as

$$T_{pq} = T_{sq} + T_{rq} + T_{cq}$$

Since the coordination and communication time is
zero in the case of a single processor, the parse time $T_{p1}$
with a single processor is

$$T_{p1} = T_{r1} + T_{s1}$$

Therefore, the speedup obtained with $q$ processors is

$$SP(q) = \frac{T_{r1} + T_{s1}}{T_{rq} + T_{sq} + T_{cq}} \tag{1}$$

Now, let $L$ be the length of the input, let the estimated
number of nodes in its parse tree be $N$, and let the number of
internal nodes in levels 1 to $h$ be $N_c$. The number of shift
operations is the length $L$ of the string, which is also the
number of leaves in the parse tree. The number of reduce
operations is the number of internal nodes in the parse
tree. If $t_r$ and $t_s$ are the average reduce and shift times,
respectively (for one operation), then we can express the
total parse time $T_{p1}$ with a single processor as

$$T_{p1} = (N - L) \cdot t_r + L \cdot t_s$$


In parallel parsing shift operations are executed in
parallel. The reduction operations corresponding to the
internal nodes below level $h$ are executed almost independently
and in parallel (as the number of nodes at each
level exceeds the number of processors). The internal
nodes at levels 1 to $h$ require $h$ units of reduction time
[4], apart from the processor coordination and communication
time. Therefore,

$$T_{pq} = \frac{(N - L - N_c) \cdot t_r + L \cdot t_s}{q} + h \cdot t_r + T_{cq}$$


which gives the speedup with $q$ processors as

$$SP(q, L) = \frac{(N - L) \cdot t_r + L \cdot t_s}{\frac{(N - L - N_c) \cdot t_r + L \cdot t_s}{q} + h \cdot t_r + T_{cq}} \tag{2}$$

The number of nodes, $N$, depends on the length of
the string and the language. The level $h$ depends on the
number of processors and the language. The processor
coordination and communication time, $T_{cq}$, depends on
the number of processors as well as on the connection
topology. First we determine $N$, $N_c$, and $h$. Then $T_{cq}$
will be estimated for both models in the next section.

Consider a deterministic, context-free language with
$m$ production rules and $v$ nonterminals. Let the production
rules be numbered as $1, 2, \ldots, m$. In the derivation of a
string of length $L$, let the $i$th production rule be used $r_i$
times. Then we can express the number of internal nodes
in the parse tree as

$$N - L = \sum_{i=1}^{m} r_i \tag{3}$$


In [6] a method was developed for evaluating $r_i$ in
terms of the occurrences of $(m - v)$ terminals $T_1, T_2, \ldots, T_a$,
where $a = (m - v)$. Using this method, the number of
uses of each production in terms of the occurrences of the
terminals if, else, case, while, repeat, ;, +, *, > and ()
for a Pascal-like language (as in [5]) is shown in Table 1,
where $n_{T_i}$ is the number of times terminal $T_i$ occurs.


Rule No.  Rule                              Number of uses in a successful parse

 1        P := S                            1
 2        S := S ; I                        n_;
 3        S := I                            n_if + n_else + n_while + n_repeat + n_case + 1
 4        I := id ← E                       n_; + n_else + 1
 5        I := if B then S fi               n_if - n_else
 6        I := if B then S else S fi        n_else
 7        I := while B do S od              n_while
 8        I := repeat S until B             n_repeat
 9        I := case E of S end              n_case
10        E := E + T                        n_+
11        E := T                            n_; + n_else + n_case + n_() + 2n_> + 1
12        T := T * F                        n_*
13        T := F                            n_+ + n_() + n_; + n_else + n_case + 2n_> + 1
14        F := id                           n_+ + n_* + n_; + n_else + n_case + 2n_> + 1
15        F := (E)                          n_()
16        B := E [> / ≥ / < / ≤ / = / ≠] E  n_>

TABLE 1 Syntax of Pascal-Like Language and
Count Relations Between Terminals

If the frequency of occurrence of the terminals is
known, we can estimate $N - L$. Two methods are discussed
in [6] to determine the average frequency of the
terminals. Using these techniques we can approximate
the number of internal nodes as a multiple of $L$, as
$N - L = k' \cdot L$, where $k'$ depends on the language. Therefore,
we can write

$$N = k \cdot L, \quad \text{where } k = k' + 1. \tag{4}$$

The average number of sons of an internal node is

$$d = \frac{N - 1}{N - L} = \frac{k \cdot L - 1}{k' \cdot L}$$

The average number of sons for internal nodes at
levels 1 to $h$ is also assumed to be $d$. Assuming that at
level $h$ there are exactly $q$ internal nodes, we can write

$$q = d^{h-1}$$

from which we get

$$h = \log_d(q) + 1 \tag{5}$$

Similarly, $N_c$, the number of nodes at levels 1 to $h$,
can be expressed in terms of $q$ and $d$ as follows:

$$N_c = \frac{d^h - 1}{d - 1}, \quad \text{i.e.,} \quad N_c = \frac{d \cdot q - 1}{d - 1} \tag{6}$$
Substitution of (4), (5), and (6) in (2) gives

$$SP(L, q) = \frac{k' \cdot L \cdot t_r + L \cdot t_s}{\frac{(k' \cdot L - (d \cdot q - 1)/(d - 1)) t_r + L \cdot t_s}{q} + (\log_d(q) + 1) t_r + T_{cq}} \tag{7}$$
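As a numerical check, the general speedup (2) and its substituted form (7) can be written as two small functions. This is a sketch only: the function names are hypothetical, and the parameter values in the usage line ($k' = 1.4175$, $d = 1.70547$, $T_{cq} = 0$) are chosen purely for illustration.

```python
import math

def speedup_eq2(N, L, N_c, h, q, t_r, t_s, T_cq):
    """Speedup (2): single-processor parse time over parallel parse time."""
    serial = (N - L) * t_r + L * t_s
    parallel = ((N - L - N_c) * t_r + L * t_s) / q + h * t_r + T_cq
    return serial / parallel

def speedup_eq7(L, q, k_prime, d, t_r, t_s, T_cq):
    """Speedup (7): (2) with N = (k'+1)L, h from (5) and N_c from (6)."""
    N = (k_prime + 1) * L
    h = math.log(q, d) + 1              # equation (5)
    N_c = (d * q - 1) / (d - 1)         # equation (6)
    return speedup_eq2(N, L, N_c, h, q, t_r, t_s, T_cq)

# Illustrative values only; T_cq = 0 ignores coordination and communication.
print(speedup_eq7(L=1000, q=10, k_prime=1.4175, d=1.70547,
                  t_r=1.0, t_s=1.0, T_cq=0.0))
```

Passing a positive `T_cq` lowers the result, which is the effect the two models in the next section quantify.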


In the next section we estimate the coordination and
communication time $T_{cq}$ for Model A and Model B and
derive expressions for speedup for a Pascal-like language.


V. SPEEDUP FOR MODELS A AND B 


In [4] it was shown that the upper bound for speedup 
for parallel parsing increases monotonically with the num- 
ber of processors and reaches a limiting value. Beyond 
this critical number of processors no further speedup is 
obtained. In obtaining this result Cohen and Hickey [4] 
neglected the processor coordination and communication 
time. Furthermore, they conjectured that the average 
speedup curve for strings of a given length would be of 
the same shape as their maximum speedup curve. 

In this section we show that the Cohen-Hickey conjecture
holds for Model B; but for Model A we get an
expression for speedup which is close to the simulation
result obtained in [5], but quite different from the speedup
conjectured in [4].

The processor coordination and communication time
depends on the model for the parallel parsing. We determine
the value of $T_{cq}$ for each of the two models and then
substitute them to get the expression for the speedup.

Model A: In Model A the average coordination and
communication time $T_{cq}$ is determined by the average
number of tokens left in $STK_q$ and the number of processors,
$q$. A merge request takes $(q - 1)$ units of time to
travel to $P_q$ from $P_1$, and the first token takes $(q - 1)$
units of time to reach $P_1$, where the unit of time is the
period required by two adjacent processors to exchange
a merge request or a datum. If $k_l$ is the average number of
tokens in processor $P_q$'s stack (when it enters the merge-only
state), then the next $(k_l - 1)$ tokens can be passed to $P_1$ in
the next $(k_l - 1)$ units of time using pipelining. This gives

$$T_{cq} = 2(q - 1) + (k_l - 1)$$


Model B: In this model every processor can communicate
with every other processor directly. Hence, to
collect all irreducible tokens from $P_q$ in $h$ reduction steps,
$P_1$ may need $(h - 1)$ requests and $k_l$ data transfers. Hence,

$$T_{cq} = (h - 1) + k_l = \log_d(q) + k_l$$

Next we estimate $k_l$, the average number of unparsed
tokens on a stack.


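Estimating $k_l$, the Average Number of Tokens: Let
$\sigma_i$ be the number of tokens left on $STK_i$ when $P_i$
completes parsing its part of the input. We define

$$\sigma = \max(\sigma_i), \quad i \in \{1, 2, \ldots, q\}$$

Then,

$$k_l = \sum_{i=1}^{\sigma} i \cdot pr(i) \tag{8}$$

where $pr(i)$ is the probability that at least one stack
has $i$ tokens and no stack has more than $i$ tokens.

Assuming that a processor has any number of tokens
between one and $\sigma$ with equal probability $1/\sigma$, we derive
an expression for $pr(i)$.

The probability that $STK_1$ has $\sigma$ tokens when $P_1$
completes parsing its input is $1/\sigma$. The probability that $STK_1$
has fewer than $\sigma$ tokens but $STK_2$ has $\sigma$ tokens is
$(1 - \frac{1}{\sigma}) \cdot \frac{1}{\sigma}$. Similarly, the probability that $(q - 1)$ stacks
have fewer than $\sigma$ tokens and $STK_q$ has $\sigma$ tokens is given by

$$\left(1 - \frac{1}{\sigma}\right)^{q-1} \cdot \frac{1}{\sigma}$$

The probability that at least one stack has $\sigma$ tokens is therefore

$$pr(\sigma) = \frac{1}{\sigma} + \left(1 - \frac{1}{\sigma}\right) \cdot \frac{1}{\sigma} + \cdots + \left(1 - \frac{1}{\sigma}\right)^{q-1} \cdot \frac{1}{\sigma} = 1 - \left(1 - \frac{1}{\sigma}\right)^{q}$$

Similarly we get

$$pr(\sigma - 1) = \left(1 - \frac{1}{\sigma}\right)^{q} \left(1 - \left(1 - \frac{1}{\sigma}\right)^{q}\right)$$

In general,

$$pr(\sigma - i) = \left(1 - \frac{1}{\sigma}\right)^{i q} \left(1 - \left(1 - \frac{1}{\sigma}\right)^{q}\right), \quad \text{for } i \in \{1, 2, \ldots, \sigma - 2\}$$

and

$$pr(1) = 1 - \sum_{i=2}^{\sigma} pr(i) = \left(1 - \frac{1}{\sigma}\right)^{(\sigma - 1) q}$$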

t=2 
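Substitution of $pr(i)$ in equation (8) gives

$$k_l = \sum_{i=0}^{\sigma-2} (\sigma - i) \left(1 - \frac{1}{\sigma}\right)^{i q} \left(1 - \left(1 - \frac{1}{\sigma}\right)^{q}\right) + \left(1 - \frac{1}{\sigma}\right)^{(\sigma-1) q}$$

$$= \left(1 - \left(1 - \frac{1}{\sigma}\right)^{q}\right) \sum_{i=0}^{\sigma-2} (\sigma - i) \left(1 - \frac{1}{\sigma}\right)^{i q} + \left(1 - \frac{1}{\sigma}\right)^{(\sigma-1) q}$$

$$= \frac{\sigma - (\sigma + 1)\left(1 - \frac{1}{\sigma}\right)^{q} + \left(1 - \frac{1}{\sigma}\right)^{\sigma q}}{1 - \left(1 - \frac{1}{\sigma}\right)^{q}}$$

In practical situations $\sigma \gg 2$, and if the number of
processors is large then we can approximate $k_l$ by

$$k_l \approx \sigma - \frac{\left(1 - \frac{1}{\sigma}\right)^{q}}{1 - \left(1 - \frac{1}{\sigma}\right)^{q}}$$

When $\sigma = L/q$ we get

$$k_l \approx \frac{L}{q} - \frac{(1 - q/L)^{q}}{1 - (1 - q/L)^{q}}$$

In the rest of the paper this expression for $k_l$ is used.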
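The quality of the large-$q$ approximation can be checked numerically. The sketch below evaluates the sum in (8) directly, using the probability model stated in the text with $x = (1 - 1/\sigma)^q$, and compares it with the approximation $\sigma - x/(1 - x)$. Function names are hypothetical.

```python
def pr(i, sigma, q):
    """pr(i) under the text's model, where x = (1 - 1/sigma)**q."""
    x = (1 - 1 / sigma) ** q
    if i == 1:
        return x ** (sigma - 1)             # remaining probability mass
    return x ** (sigma - i) * (1 - x)       # geometric term for sigma - i steps down

def k_l_exact(sigma, q):
    """Equation (8): expected value of the largest stack size."""
    return sum(i * pr(i, sigma, q) for i in range(1, sigma + 1))

def k_l_approx(sigma, q):
    """Large-q approximation: sigma - x / (1 - x)."""
    x = (1 - 1 / sigma) ** q
    return sigma - x / (1 - x)

print(k_l_exact(100, 50), k_l_approx(100, 50))
```

For $\sigma = 100$ and $q = 50$ the two values agree to many decimal places, because the neglected term is of order $x^{\sigma}$.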


Nature of Speedup Function: The expression for average
speedup for Model A is

$$SP1(L, q) = \frac{k' \cdot L \cdot t_r + L \cdot t_s}{\frac{(k' \cdot L - (d \cdot q - 1)/(d - 1)) t_r + L \cdot t_s}{q} + (\log_d(q) + 1) t_r + 2(q - 1) + k_l - 1} \tag{9}$$

The general shape of the speedup curve for a given
length of the string with varying number of processors
can be obtained as follows.

The numerator in $SP1(L, q)$ does not depend on the
number of processors. Hence, we consider only the
denominator. Let the denominator be denoted by $DSP1$.
Taking the first derivative of $DSP1$ with respect to $q$
(with $\sigma = L/q$, so that $k_l$ contributes the leading term $L/q$)
and equating it to zero, after removing those terms that
asymptotically go to zero, gives

$$q_0 = \left((k' \cdot t_r + t_s + 1) L / 2\right)^{1/2}$$

Substituting $q_0$ in the second derivative of $DSP1$
(with respect to $q$), an expression with a positive value is
obtained. Therefore, $q_0$ is the number of processors which
parse a string of length $L$ in minimum time. The
speedup increases with the number of processors up to
a maximum and then decreases.
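The location of the optimum can be checked numerically. The following sketch assumes $\sigma = L/q$ and the large-$q$ approximation of $k_l$ derived above, uses illustrative values for $k'$, $d$, $t_r$, and $t_s$, and compares a brute-force search over integer $q$ with the closed-form estimate $q_0$. All function names are hypothetical.

```python
import math

def k_l(L, q):
    """Approximate k_l with sigma = L/q (an assumption of this sketch)."""
    sigma = L / q
    x = (1 - 1 / sigma) ** q
    return sigma - x / (1 - x)

def dsp1(L, q, k_prime, d, t_r, t_s):
    """Denominator of the Model A speedup: parallel work plus T_cq."""
    work = ((k_prime * L - (d * q - 1) / (d - 1)) * t_r + L * t_s) / q
    t_cq = 2 * (q - 1) + k_l(L, q) - 1          # Model A coordination time
    return work + (math.log(q, d) + 1) * t_r + t_cq

def q0(L, k_prime, t_r, t_s):
    """Closed-form estimate of the processor count minimizing DSP1."""
    return math.sqrt((k_prime * t_r + t_s + 1) * L / 2)

L, k_prime, d, t_r, t_s = 1000, 1.4175, 1.70547, 1.0, 1.0
best_q = min(range(2, 200), key=lambda q: dsp1(L, q, k_prime, d, t_r, t_s))
print(best_q, q0(L, k_prime, t_r, t_s))
```

The brute-force minimum lands within a few processors of $q_0$; the small gap comes from the terms dropped as asymptotically negligible.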


Similarly, the expression for speedup for Model B is
given by

$$SP2(L, q) = \frac{k' \cdot L \cdot t_r + L \cdot t_s}{\frac{(k' \cdot L - (d \cdot q - 1)/(d - 1)) t_r + L \cdot t_s}{q} + (\log_d(q) + 1) \cdot t_r + \log_d(q) + k_l} \tag{10}$$

It can be shown that $SP2(L, q)$ increases monotonically
to a maximum value and then remains constant.
Unlike $SP1(L, q)$, $SP2(L, q)$ does not decrease as the
number of processors increases beyond the critical number.


Speedup for Pascal-like Languages: To find the
speedup for Pascal-like languages we calculate $N$, the
estimated number of nodes, and $d$, the estimated degree in the
parse tree, as follows.

From Table 1 we get

$$N - L = \sum_{i=1}^{16} r_i = 5n_{;} + 2n_{if} + 5n_{else} + 2n_{while} + 2n_{repeat} + 5n_{case} + 3n_{+} + 2n_{*} + 7n_{>} + 3n_{()} + 6 \tag{11}$$
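The sum over the sixteen rows of Table 1 can be carried out mechanically. The sketch below encodes each row's count as read from the table and checks the total against a closed form obtained by collecting like terms; the dictionary keys and function names are hypothetical, and the coefficients are those that follow from the table rows as transcribed here.

```python
def rule_uses(n):
    """Uses of each production of Table 1, given terminal counts n."""
    base = n[";"] + n["else"] + n["case"] + 2 * n[">"] + 1
    return {
        1: 1,
        2: n[";"],
        3: n["if"] + n["else"] + n["while"] + n["repeat"] + n["case"] + 1,
        4: n[";"] + n["else"] + 1,
        5: n["if"] - n["else"],
        6: n["else"],
        7: n["while"],
        8: n["repeat"],
        9: n["case"],
        10: n["+"],
        11: base + n["()"],
        12: n["*"],
        13: base + n["+"] + n["()"],
        14: base + n["+"] + n["*"],
        15: n["()"],
        16: n[">"],
    }

def internal_nodes(n):
    """N - L: one internal node per production use."""
    return sum(rule_uses(n).values())

def closed_form(n):
    """The same total after collecting like terms."""
    return (5 * n[";"] + 2 * n["if"] + 5 * n["else"] + 2 * n["while"]
            + 2 * n["repeat"] + 5 * n["case"] + 3 * n["+"] + 2 * n["*"]
            + 7 * n[">"] + 3 * n["()"] + 6)

# A single statement of the form  id <- id + id : one "+", nothing else.
single = dict.fromkeys(["if", "else", "while", "repeat", "case", ";", "+", "*", ">", "()"], 0)
single["+"] = 1
print(internal_nodes(single), closed_form(single))
```

For the single assignment with one addition, the parse tree has nine internal nodes (P, S, I, two E, two T, two F), which both functions reproduce.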


Terminals    Frequency of occurrence
             (every 100 terminals)

id           60
←            6
if           2
else         0.9
while        0.1
repeat       0.05
case         0.15
()           6.6
;            12
+            4.6
*            4.6
>            4.6

TABLE 2 Average occurrence of some terminals
in Pascal-like languages (Ref. [5])


Using expression (11) and the frequency of occurrence
of each terminal shown in Table 2 we get

$$N = 2.4175L$$

$$d = 1.70547$$

Substituting these values of $N$ and $d$ in (9) and (10) gives

$$SP1(L, q) = \frac{1.4175 L \cdot t_r + L \cdot t_s}{\frac{(1.4175 L - (1.70547 q - 1)/0.70547) t_r + L \cdot t_s}{q} + \log_{1.70547}(q) \cdot t_r + t_r + 2(q - 1) + k_l - 1}$$

and

$$SP2(L, q) = \frac{1.4175 L \cdot t_r + L \cdot t_s}{\frac{(1.4175 L - (1.70547 q - 1)/0.70547) t_r + L \cdot t_s}{q} + (\log_{1.70547}(q) + 1) t_r + \log_{1.70547}(q) + k_l}$$

In Fig. 2 the speedup curves are given with $t_r = t_s = 1$
and $L = 1000$. The dotted curve presents the speedup
obtained by simulation in [5].
Fig. 2. Number of Processors vs Speedup Curves
for Pascal-like Languages, with $k_l$ approximated as in Section V


VI. CONCLUSION 
We have presented a method for estimating the speedup
for asynchronous, bottom-up, parallel parsing. To
develop this result we have made a few simplifying
assumptions about the nature of the parse tree. Hence the
expressions may not give the exact speedup, but the
closeness of the estimated speedup using the method
developed here and the simulation result in [5] indicates that
the assumptions are realistic, and that the significant
parameters have been taken into account.

Next, it may be useful to investigate how the structure
of a language determines the speedup in parallel
compilation. Parallelization of the existing sequential
code-generation techniques may be useful for speedup and
machine utilization.

Model A is restrictive in that a processor can
communicate directly with its immediate left and immediate
right neighbors only. This increases the processor
coordination and interprocessor communication time. On
the other hand, Model B is expensive to construct if the
number of processors is large. A model which is a combination
of the two, for example Cm* [11], should also be
investigated.
Abstract


We present a new parallel programming assistant, called 
PTOOL, that identifies loops suitable for parallel execution. It 
employs a theory of dependence to determine which loops in a 
FORTRAN program will produce the same answers when itera- 
tions are scheduled in an arbitrary order. When a loop fails the 
test, an explanation is provided in the form of a report of the 
dependences that can be violated by some schedule. PTOOL 
also provides assistance in the allocation of variables between 
local and shared global memory. Interprocedural data flow 
analysis techniques extend its scope of applicability to whole pro- 
grams. PTOOL, which is in use at Los Alamos National Labora- 
tory, represents a first step toward a sophisticated interactive 
programming system based on incremental dependence analysis. 


1. Introduction 


Humans and machines bring very different strengths to bear 
on the problem-solving process. Humans make effective use of 
abstraction to develop broad strategies and are adept at 
simplification of complex situations and at generalization from 
simple principles. Unfortunately, they are not good at exhaustive
search and seem unable to pay rigorous attention to details,
areas where the computer excels. As a result, programmers often
fail to precisely specify the details of their intended solutions to
the machine, inevitably leading to "bugs" in their programs.


The advent of parallel machine architectures has added a
new level of complexity to the debugging process. Parallelism
implies nondeterminism; nondeterminism, in turn, implies
nonrepeatability. In other words, program sections that are executed
in parallel do not necessarily follow the same execution path
every time the program is run, even when the program is run on
the same data. This additional complexity requires additional
precision from the programmer. Not only must he ensure that
the computations performed by the sequential sections are those
that he intended, but he must also ensure that the computations
indicated in parallel regions perform the desired calculations
regardless of the order in which they are executed. This
increased complexity permits some very elusive bugs.


Given the difficulties that can arise in parallel programming,
it seems clear that new methods and tools will be needed
[18]. There are several possible lines of attack. At one extreme, a
parallel program might be repeatedly run on the same test data
to ensure that it computes the same answers on each execution.
Although this approach is easy to implement, it is inelegant and
probably would not be effective, since the conditions required to
produce an error need not arise during the test. At the opposite
extreme, formal correctness proofs could be applied to a parallel
program. In spite of significant levels of activity on verification
over the past decade, complete automatic proofs for realistic
programs appear impractical for the time being.


A third approach would be to develop an automatic system
to convert implicit parallelism within a sequential program into
an explicit form. The advantage of this method results from the
ability to debug the program using standard techniques on
sequential machines prior to parallel conversion. Since the
conversion involves provably correct transformations, the parallel
version should execute correctly, thereby avoiding the need for a
complex parallel debugging phase. Additionally, this approach
provides a mechanism for converting "dusty deck" programs.


In spite of these advantages, fully automatic conversion to
parallelism is not practical using current language processing
technology. Although many promising techniques have been
developed [10,22,21,6], the programmer is still an essential
ingredient in the exploitation of parallelism. There are several
reasons for this. First, most programs contain a large number of
potential parallel regions. Searching all regions for parallelism
can be an extremely expensive task. In addition, synchronization
overhead on many machines can easily overwhelm the advantages
of parallel execution [12], unless the granularity of the
parallel regions is very large. Often, this determination depends
upon values that are known only at run-time. Hence, the
programmer's judgement is an important part of the parallel
programming process.


This paper describes a new parallel programming assistant, 
called PTOOL, that combines the advantages of the manual and 
automatic approaches. When using PTOOL, a programmer 
develops and debugs a program on a sequential machine using 
traditional techniques. Once the program is debugged, the pro- 
grammer identifies the regions, usually separate iterations of 
loops, that should be executed in parallel. PTOOL then reports 
whether the results of execution depend in any way on the order
of execution of the regions. If not, the regions can be safely exe- 
cuted in parallel. Otherwise, the programmer can ask PTOOL 
for diagnostic information to help identify problems. This infor- 
mation is presented as reports on potential conflicts in the use of 
variables shared by different parallel regions. 


The discussion of PTOOL begins in Section 2 with an over- 
view of the parallel programming approach it supports. Section 
3 gives an introductory treatment of dependence theory, the pri- 
mary tool for analyzing programs. The method used by PTOOL 
to determine loops that can be executed in parallel is described 
in Section 4, while details of the system design are presented in 
Section 5. The user interface is discussed in Section 6. Finally, 
Section 7 summarizes the work and suggests future directions. 


2. Parallel Programming Process 


This section introduces a model for the parallel programming
process that PTOOL is intended to support. Fundamentally,
PTOOL is a system designed to help programmers
transform sequential loops in FORTRAN to parallel loops.
Although this model is not universal, it covers many interesting
cases.


Multiprocessors are most efficient when each processor
is able to compute at full speed, without requiring synchronization
with other processors. To achieve this ideal, we not only
must identify a collection of program fragments that can be
executed in parallel, but we must also schedule the computation in a
manner that balances the execution load across processors. In
FORTRAN, the lingua franca of scientific computation, the
construct most likely to give rise to a large number of parallel
regions with comparable execution times is the DO loop. There
are two reasons why. First, if they can be made to execute
correctly in parallel, the separate iterations of a DO loop can
provide enough computation to keep a significant number of
processors busy. Second, since each iteration of a DO loop should
take roughly the same time to execute, the load will be automatically
balanced across the processors. Experience with actual
multiprocessors strongly supports the importance of loops in
parallel programming [19].


In the light of these observations, we have designed 
PTOOL to determine the conditions that inhibit parallel execu- 
tion of loops. In doing so, we have been careful not to restrict 
the analysis to DO loops, because iterative regions in FORTRAN 
are often coded using backward GOTO statements. 


In deciding whether or not the iterations of a given loop 
can be executed in parallel, it is important to decide which vari- 
ables mentioned in the loop body must be shared by parallel 
tasks and which can be allocated to storage local to the proces- 
sor. This decision is important because shared variables can be 
accessed in different orders by different processors. If we wish to 
compute the same results every time the program is run, all pos- 
sible schedules for a set of parallel tasks must lead to a function- 
ally equivalent access sequence for shared variables. For exam- 
ple, it should not be possible to have two schedules in which the 
order of a load and a store of a shared variable is reversed. If we
are to avoid the insertion of explicit synchronization between 
loop iterations, we must establish that, without synchronization, 
such a reversal can never happen. In performing the requisite 
analysis, we need not concern ourselves with local variables, since 
the relevant loads and stores happen on the same processor. 


Since PTOOL was developed to support parallel programming
research at Los Alamos National Laboratory, where FORTRAN
code was being converted to the Denelcor HEP and the
Cray X-MP, the methods for expressing parallelism on those
machines strongly influenced its design. The standard method on
those machines is to express a parallel body of code as a subroutine,
because that expression provides a natural encapsulation of
the code to be broadcast to the different processors.
Furthermore, reentrant code is straightforward to generate.


In the subroutine model, the loop body is replaced by a
parallel subroutine invocation that takes as parameters an indication
of the loop iteration to execute and variables that have to be
globally available to all processors. PTOOL assumes that subroutine
parameters and variables in COMMON are shared, but
all other variables used in the loop body are assigned to storage
local to each processor. As we shall see, PTOOL offers a suggestion,
based upon dependence analysis, of the variables which
should be parameters. The programmer is free to ignore this
suggestion and select his own parameters. PTOOL will accept
this modification and tailor its advice to the set of shared
variables specified by the user.


Under the model we have introduced, the principal requirement
that must be preserved when converting a program for
parallel execution is the access order to shared variables. The
main tool for analyzing patterns of loads and stores in a program
is a powerful theory of dependence, the subject of the next
section.


3. Data Dependence 


In a sequential language such as FORTRAN, the execution
order of statements is well defined. Therefore, the behavior of
the program under sequential execution order can be used as a
basis for evaluating other execution orders. Specifically, it is
possible to examine alternative execution orders to see whether they
will also produce the same results. Parallel execution implies
that the statements in a collection of parallel regions (iterations
of a loop, for our purposes) can be executed in any order so long
as the statements within a particular region are executed in
sequence. Hence, for a collection of parallel regions to be suitable
for parallel execution, there must be no dependences
between statements of different parallel regions. In our model,
there must be no dependences that cross loop iterations.


To determine whether this condition holds, we must identify
the pairs of statements whose relative execution order must
be preserved under any program transformation if the results are
to be preserved. This relationship is represented by a collection
of dependences among the statements of the program. A statement
S2 is said to depend upon a statement S1 if S2 follows S1 in
dynamic execution of the sequential program and it must follow
S1 in any reordering that preserves the correct results.


There are two ways for a statement S2 to depend on statement
S1. First, S1 can cause a change in the control flow that
determines whether S2 is executed, creating a control dependence
of S2 on S1. Second, the two statements may access the same
variable in a way that requires that their order be preserved. A
dependence created to prevent incorrect access order to a variable
is called a data dependence. Although both types of dependence
must be considered when rearranging statements, programs
can always be transformed so that all control dependences
become data dependences [1]. Therefore, we restrict the
remainder of this discussion to data dependences.


Data dependence arises most naturally when one statement
defines a variable that is later used by a second statement. More
generally, Kuck has identified three types of data dependence [17]:

(1) true dependence: S1 stores into a variable that S2 later
    uses.

(2) antidependence: S1 fetches from a variable that S2 later
    stores into.

(3) output dependence: two statements both store into the
    same variable.

All three types of data dependences must be considered to safely
reorder a program.
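The three kinds of data dependence can be illustrated with a small classifier over the sets of variables each statement reads and writes. This is a deliberately conservative sketch under assumed inputs (plain variable names, S1 executing before S2); it ignores intervening redefinitions and array subscripts, and the function name is hypothetical.

```python
def dependence_types(s1_reads, s1_writes, s2_reads, s2_writes):
    """Kinds of data dependence of S2 on S1, where S1 executes first."""
    kinds = set()
    if s1_writes & s2_reads:
        kinds.add("true")      # S1 stores, S2 later uses
    if s1_reads & s2_writes:
        kinds.add("anti")      # S1 fetches, S2 later stores
    if s1_writes & s2_writes:
        kinds.add("output")    # both store into the same variable
    return kinds

# S1: A = B + C      S2: D = A * 2
print(dependence_types({"B", "C"}, {"A"}, {"A"}, {"D"}))
```

An empty result means the two statements may be reordered freely as far as these variables are concerned.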


It is also useful to distinguish dependences that cross loop
iteration boundaries from those that do not. To see this,
consider the following example.

      DO 100 I = 1, N
 S1      A(I) = ...
 S2      ... = A(I)
  100 CONTINUE


S2 quite obviously has a true dependence upon S1, implying that
S1 must be executed before S2 in order for the computation to be
correct. However, this dependence in no way precludes the
separate iterations of the loop from executing in parallel, because
values created within the loop are used on the same iteration,
and need not be saved for later iterations. Such a dependence is
said to be loop independent [2]. On the other hand, consider a
slightly different loop.


      DO 100 I = 1, N
 S1      A(I) = ...
 S2      ... = A(I-1)
  100 CONTINUE

In this case, we may not safely transform the loop to execute in
parallel, because the dependence crosses loop iterations; that is,
values created on one iteration are used on a later iteration. A
dependence of this sort is said to be loop carried, because it
exists only by virtue of the iteration of the loop: if the loop
body is executed only once, the dependence ceases to exist.
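For the simple subscript forms used in these two loops, the distinction can be computed from the subscript offsets. The sketch below assumes a single loop over I, one statement writing A(I + write_offset) and a later statement reading A(I + read_offset); the function name and the tuple it returns are hypothetical, and real dependence testing handles far more general subscripts.

```python
def classify(write_offset, read_offset, n_iterations):
    """Dependence between  A(I+write_offset) = ...  and  ... = A(I+read_offset)."""
    # The element written on iteration i is read on iteration i + distance.
    distance = write_offset - read_offset
    if abs(distance) >= n_iterations:
        return None                          # the two references never overlap
    if distance == 0:
        return ("loop independent", 0)       # same iteration: safe in parallel
    kind = "true" if distance > 0 else "anti"
    return (f"loop carried {kind}", abs(distance))

print(classify(0, 0, 100))     # A(I) = ... ; ... = A(I)
print(classify(0, -1, 100))    # A(I) = ... ; ... = A(I-1)
```

Under this model the iterations can be run in parallel exactly when no reference pair yields a loop carried result.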


It is useful to observe that a loop carried dependence arises
because of the iteration of a particular loop. For instance, in the
following nest of loops

      DO 200 I = 1, M
         DO 100 J = 1, N
 S1         A(I,J) = ...
 S2         ... = A(I-1,J)
  100    CONTINUE
  200 CONTINUE

iteration of the outer loop (on I) gives rise to the dependence. So
long as the outer loop is iterated sequentially, the dependence
will be satisfied. In the light of this observation, we classify
carried dependences by indicating the particular loop that creates
the dependence.
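One common way to express which loop carries a dependence is a distance vector with one entry per loop, outermost first; the carrying loop is the first level with a nonzero entry. This representation and the function names below are assumptions of the sketch, not something the text prescribes.

```python
def carrier_level(distance):
    """Level (1 = outermost) of the loop that carries the dependence, or None."""
    for level, dist in enumerate(distance, start=1):
        if dist != 0:
            return level
    return None                 # all zero: loop independent

def parallel_levels(distances, depth):
    """Loop levels that carry none of the given dependences."""
    carried = {carrier_level(d) for d in distances}
    return [level for level in range(1, depth + 1) if level not in carried]

# A(I,J) written, A(I-1,J) read: distance 1 in I, 0 in J.
print(carrier_level((1, 0)), parallel_levels([(1, 0)], depth=2))
```

For the nest above the dependence is carried at level 1, so only the inner loop on J is a candidate for parallel execution while the outer loop runs sequentially.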


Loop carried and loop independent dependences provide a
precise characterization of the execution orders that are important
within a program. So long as these dependences are
preserved, the results of a computation will also be preserved.
This fact is extremely important in vectorization as performed in
PFC, an automatic vectorization system written at Rice [3,16].
PFC constructs a complete dependence graph for the program to
which it is applied, in order to distinguish the statements that
can be executed as vector operations from those that cannot.
This graph can also be used to determine parallel loops, as the
next section shows.


4. Identifying Parallel Loops 


From the parallel programming paradigm presented in sec- 
tion 2, it should be clear that there are two tasks to be per- 
formed in converting sequential code: 1) identifying local and glo- 
bal variables, and 2) finding loops suitable for parallel execution. 
These two tasks are highly interdependent as the following loop 
illustrates: 


       DO 100 I = 1, N
          T = ...
          ... = T
   100 CONTINUE


If T is a variable that can be kept in the local memory of
individual processors, as above, then there is no dependence that
prevents parallel execution of the loop. If, on the other hand, T 
must be globally available (as would happen if there were a use 
of the last value of T after the loop), then the loop would not 
execute correctly in parallel. Once T becomes a global variable, 
different processors may intermix intermediate computations 
involving T, with unpredictable results. 
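
The effect of sharing T can be imitated without real threads. The Python sketch below is a deterministic simulation under assumed values (T is set to 10*I and the use adds 1); both function names are hypothetical. One version gives every iteration its own T, the other forces two simulated processors to share a single T and interleaves their statements in an unlucky order.

```python
def run_private(n):
    """Each iteration owns its copy of T, so iteration order cannot matter."""
    out = [0] * n
    for i in reversed(range(n)):      # deliberately not the sequential order
        t = 10 * i                    # T = ...
        out[i] = t + 1                # ... = T
    return out


def run_shared_interleaved(n):
    """Two simulated processors share one T and interleave badly:
    both assign T before either reads it back (n is assumed even)."""
    shared = {"T": None}
    out = [0] * n
    for i in range(0, n - 1, 2):
        shared["T"] = 10 * i          # processor 0: T = ...   (iteration i)
        shared["T"] = 10 * (i + 1)    # processor 1: T = ...   (iteration i+1)
        out[i] = shared["T"] + 1      # processor 0: ... = T   (sees the wrong value)
        out[i + 1] = shared["T"] + 1  # processor 1: ... = T
    return out


expected = [10 * i + 1 for i in range(4)]
print(run_private(4) == expected)
print(run_shared_interleaved(4) == expected)
```

The private version reproduces the sequential result; the shared version does not, because iteration I reads the value written by iteration I+1.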


Identifying the variables that must be global to a parallel 
loop turns out to be a relatively straightforward process. Simply 
stated, a variable must be globally available if it holds a value 
that is used outside the loop, or, in the symmetric case, if a 
definition from outside the loop reaches into the loop. Given the 
dependence graph described in the previous section, recognition 
of variables that must be globally allocated becomes a fairly 
trivial task. The system need only analyze the true dependences 
coming into and going out of a loop; any variables that give rise 
to such a dependence must be made globally available. In actual 
practice, the problem is somewhat more complicated, because full 
dependence graphs are normally constructed only over loop 
bodies. However, definition-use chains [15] are quite commonly 
constructed over an entire program (as they are in PFC), and can 
be used in the absence of stronger dependence information. 
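
The rule just stated can be sketched over definition-use chains. The Python below is a minimal illustration, not PTOOL's implementation: the triple encoding `(variable, def_line, use_line)` and the name `global_variables` are assumptions, and the line numbers in the sample data are chosen only to resemble a small test program.

```python
def global_variables(loop_lines, def_use_chains):
    """Return the variables that must be globally available for a loop.

    loop_lines     -- (first, last) source lines of the candidate loop
    def_use_chains -- iterable of (variable, def_line, use_line) triples
    """
    first, last = loop_lines

    def inside(line):
        return first <= line <= last

    needed = set()
    for variable, def_line, use_line in def_use_chains:
        # A chain that crosses the loop boundary in either direction
        # forces the variable into global storage.
        if inside(def_line) != inside(use_line):
            needed.add(variable)
    return needed


chains = [
    ("N", 4, 12),    # defined before the loop, used in its bounds
    ("T", 14, 15),   # defined and used inside the loop
    ("T", 14, 32),   # defined inside the loop, used after it
]
print(sorted(global_variables((12, 20), chains)))
```

A chain that stays entirely inside the loop, such as the second one, never makes its variable global on its own.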


The second aspect of parallel programming is the 
identification of loops suitable for parallel execution or, more 
correctly, determination of loops which cannot be executed in 
parallel. Recall that parallel execution, as used in this paper, 
means execution of different iterations of a loop on different pro- 
cessors. We can achieve this without synchronization only so 
long as one processor does not compute a value needed by 
another processor (or one processor does not store on top of a 
value needed by another processor, etc.). Accordingly, the key 
restraints to parallel execution are dependences carried by the 
candidate loop that involve a global variable. For instance, in 
the following example


       COMMON A(100,80)

       DO 100 I = 1, 80
          DO 100 J = 1, 100
             A(J,I) = ...
             ... = A(J-1,I) + ...
   100 CONTINUE


the I loop carries no dependences, and may be safely executed in 
parallel. The J loop, however, is a different matter. If it is exe- 
cuted on multiple processors, each processor will create values 
needed by a different processor. Hence, a loop may be executed 
correctly in parallel if and only if that loop carries no
dependences based on a global variable [4].
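
As a rough sketch of this test, assume each loop carried edge is recorded as a `(variable, carrying_level, kind)` triple; that encoding and the name `blocking_dependences` are assumptions for illustration. A loop is then safe exactly when the list returned for its level is empty.

```python
def blocking_dependences(loop_level, global_vars, carried_edges):
    """Return the edges that prevent parallel execution of the loop at
    nesting level loop_level: those it carries on a global variable.

    carried_edges -- iterable of (variable, carrying_level, kind) triples
    """
    return [
        edge for edge in carried_edges
        if edge[1] == loop_level and edge[0] in global_vars
    ]


# A(J,I) = ... ; ... = A(J-1,I): A is in COMMON, and the dependence is
# carried by the inner J loop (level 2), not by the outer I loop (level 1).
edges = [("A", 2, "true")]
print(blocking_dependences(1, {"A"}, edges))   # I loop
print(blocking_dependences(2, {"A"}, edges))   # J loop
```

The I loop reports no blocking edges, while the J loop reports the true dependence on A.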


These two principles are the fundamental considerations in 
converting a sequential code for parallel execution. Because both 
conditions are tedious to verify, a human converting a code for 
parallel execution may easily overlook an important problem. 
More specifically, a human programmer might not notice that a 
variable must be globally allocated or might miss a loop carried 
dependence. Since the resulting faults may manifest themselves 
as errors only under specific schedules, they can be extremely 
difficult to locate. 


5. PTOOL Design 


5.1. Overview 


PTOOL divides quite naturally into two subcomponents: 
one to construct the dependence graph and the other to answer 
queries about potential parallelism in the program. The first of 
these components, called PSERVE, is a modified version of PFC 
running on an IBM 370. When PTOOL is invoked, the user sub- 
mits the program source, including all the subroutines, to 
PSERVE for analysis. PSERVE constructs the dependence 
graph and saves it in a database of two files. This database is 
shipped back to the user for interactive analysis. 


The user then conducts an interactive dialog with the 
display facility, called PQUERY. PQUERY shows the actual 
program source on the screen and permits the user to select a 
loop for analysis. It uses the database constructed by PSERVE 
to identify the global variables in the proposed parallel region 
and it presents any dependences that might impede parallel exe- 
cution. 


While the division of PTOOL’s functions into two relatively
independent components prohibits incremental reanalysis
of the program as the user eliminates problems, it does permit
PQUERY to be run on a variety of machines. In fact, we have
already implemented versions for an IBM PC and a SUN
workstation.


5.2. PSERVE 


5.2.1. PFC Modifications 


Before computing a dependence graph, PFC employs a 
number of preliminary transformations to enhance the precision 
with which it can determine dependences. These transformations 
can radically change the structure of a program. Since the pur- 
pose of PTOOL was to display the dependences as they exist in 
the user’s source program (and not in a transformed program), it 
was necessary to carefully alter the transformations so that 
dependences could be accurately calculated for the original
program.


The two main preliminary transformations in PFC are 
“induction variable substitution” [24, 3] and “IF conversion” [1]. 
Induction variable substitution is the process of replacing auxili- 
ary induction variables in a loop with a direct expression based 
on the loop induction variable. For instance, in the following 
example 


       DO 100 I = 1, 100
          IX = IX - 1
          A(IX) = A(IX) - 1
   100 CONTINUE


IX is used as an auxiliary induction variable to run the loop
'backwards'. The difficulty with such variables is also
illustrated above: an automatic system cannot easily determine
whether the references to A are independent. This difficulty can
be removed as follows:


       DO 100 I = 1, 100
          A(IX - I) = A(IX - I) - 1
   100 CONTINUE
       IX = IX - 100


This process, which is essentially the inverse of the classic
optimization strength reduction [15], is normally performed in PFC so
that the array dependences of A may be accurately calculated.
The problem with this transformation is that it eliminates some
scalar dependences that inhibit parallel execution.
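
The equivalence of the two loops can be checked by running both. The Python sketch below mirrors them with the substituted subscript written in terms of the loop index I; the function names and the test data are assumptions, and Python's zero-based lists stand in for the Fortran array.

```python
def original(a, ix):
    """Loop with the auxiliary induction variable IX updated each iteration."""
    a = list(a)
    for _ in range(1, 101):          # DO 100 I = 1, 100
        ix = ix - 1
        a[ix] = a[ix] - 1
    return a, ix


def substituted(a, ix):
    """Same loop after induction variable substitution."""
    a = list(a)
    for i in range(1, 101):
        a[ix - i] = a[ix - i] - 1    # subscript depends only on I
    ix = ix - 100                    # final value of IX restored after the loop
    return a, ix


data = list(range(200))
print(original(data, 150) == substituted(data, 150))
```

In the original form every iteration reads the IX written by the previous one, a scalar loop carried dependence. The substituted form has no such dependence, which is why the dependences on IX have to be recorded before the substitution is applied.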


In order to correctly handle induction variables, PTOOL 
must use multiple passes to add dependences to the graph. 
Dependences for induction variables must be added before induc- 
tion variable substitution is performed. After induction variable 
substitution, array dependences can be accurately calculated and 
added to the graph. As PTOOL adds dependences for the induc- 
tion variables, it flags the edges so that it can inform the user 
that these dependences can be removed by substitution. 


Control dependences are transformed into data dependences
by a process known as “IF conversion” [1]. The basic
idea is to transform a statement whose execution is affected by a
transfer of control to a conditional statement, controlled by a
Boolean “guard” that exactly represents the conditions under
which control flow would have reached the statement.
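
A small Python sketch may make the idea of a guard concrete. It is only an analogy: Python has no GOTO, so `continue` plays the role of a forward branch to the end of the loop body, and the function names and data are assumptions for this example.

```python
def with_branch(h, psi):
    """Loop body with a forward branch that skips the remaining statement."""
    e = [None] * len(h)
    for j in range(len(h)):
        if h[j] == 0:
            continue                 # forward branch to the end of the body
        e[j] = h[j] * psi[j]
    return e


def if_converted(h, psi):
    """The same body after IF conversion: no branch, the statement carries a guard."""
    e = [None] * len(h)
    for j in range(len(h)):
        guard = not (h[j] == 0)      # true exactly when control reaches the statement
        e[j] = h[j] * psi[j] if guard else e[j]
    return e


print(with_branch([1, 0, 3], [2, 2, 2]) == if_converted([1, 0, 3], [2, 2, 2]))
```

After conversion the control dependence on the branch has become a data dependence on `guard`, which the ordinary dependence machinery can then treat like any other.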


GOTO’s can be quite naturally classified into two
categories: forward GOTO’s and backward GOTO’s. Each of
these categories may possess the further property of being an exit
branch, meaning that the branch exits one or more loops.
Although non-exit forward branches cannot directly cause any
problems with parallelization, they can cause indirect problems
by forcing some variables under their control to be global. Thus,
while not strictly necessary, forward branches are converted
because of this problem.

Backward branches are converted via IF conversion into 
DO WHILE loops, which permit more effective analysis by PFC. 
Because of the conversion, the user can select a backward GOTO 
when analyzing the program in exactly the same manner that he 
selects a DO loop. IF conversion as implemented in PFC is 
powerful enough to convert any sequence of branches into an 
equivalent branchless program. 


Transformations are only one of the ways in which PFC
attempts to produce a precise dependence graph. Another
technique that has proved extremely successful is interprocedural
analysis.


5.2.2. Interprocedural Analysis 


Most automatic vectorizers make no attempt to trace 
dependence edges across procedure boundaries. As a result, they 
either ignore loops containing procedure calls (declaring them 
unvectorizable), or assume that all possible variables (i.e.,
parameters and COMMON) are modified. PFC employs a more
sophisticated approach: it uses iterative data flow analysis on the
program call graph to determine the side effects of procedure calls
[5]. This additional information permits a substantially more 
precise dependence graph in the presence of subroutines. 
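
The flavor of such an analysis can be shown with a simplified fixed-point computation of the variables a call may modify. This Python sketch is not the algorithm of [5]: it ignores parameter binding, assumes every procedure appears as a key of the call graph, and uses hypothetical procedure names.

```python
def modified_by_call(call_graph, local_mods):
    """Compute, for every procedure, the set of COMMON variables that a call
    to it may modify, directly or through the procedures it calls.

    call_graph -- dict: procedure -> list of procedures it calls
    local_mods -- dict: procedure -> set of variables it assigns itself
    """
    mod = {proc: set(local_mods.get(proc, ())) for proc in call_graph}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for proc, callees in call_graph.items():
            for callee in callees:
                extra = mod.get(callee, set()) - mod[proc]
                if extra:
                    mod[proc] |= extra
                    changed = True
    return mod


graph = {"MAIN": ["SWEEP"], "SWEEP": ["DERIV", "SWEEP"], "DERIV": []}
mods = modified_by_call(graph, {"SWEEP": {"H"}})
print(sorted(mods["MAIN"]), sorted(mods["DERIV"]))
```

A call to DERIV is found to modify nothing, so a loop containing that call need not be charged with dependences on every COMMON variable.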


The most significant advantage of this approach is the
reduction in size of the dependence graph. On GAMTEB, a 504
line program written at Los Alamos, the dependence graph com- 
puted without interprocedural analysis contained 22,998 edges. 
With the application of interprocedural analysis, the graph was 
reduced to only 2,605 edges. The smaller graph is beneficial to 
PTOOL’s user both directly and indirectly. Directly, the user no
longer sees spurious edges associated with variables in
COMMON; hence he is able to narrow his focus to real problems.
Indirectly, the user experiences a performance improvement in 
PQUERY because the database is substantially smaller. 


A second benefit of the interprocedural information has 
been the identification of COMMON variables actually used in 
calls. A fairly standard programming practice for large programs 
is to have the same COMMON blocks across all subprograms, 
thereby avoiding the problems of determining which variables 
have to be passed as parameters and of determining how actual 
parameters line up with formal parameters. Since all COMMON 
variables have to be in global memory, it is possible that a tem- 
porary value placed in such a variable at the beginning of a loop 
iteration will prevent parallel execution. Additionally, access to 
global memory is usually slower than access to local memory. 
Because PTOOL is able to pick out precisely the variables that
are used and modified across procedures, it can aid a user in
moving these variables to local memory.


In those situations where a procedure’s source code is not 
available, PFC assumes that all parameters and COMMON vari- 
ables are both used and modified. 


5.2.3. Constructing the Graph 


Recall that the original goal of PFC was to vectorize FOR- 
TRAN programs. It proceeds by generating a dependence graph 
and then producing a vector FORTRAN equivalent of the input 
FORTRAN program. For PSERVE, PFC has been modified to 
execute up to the point that the dependence graph and 
definition-use chains are assembled. Loop independent edges are
then filtered out of the dependence graph and each loop carried
edge is annotated with additional information (such as nesting
level, whether the variable is in COMMON, etc.) before being
added to PSERVE’s dependence graph file. In addition, a second
file, containing information about loop structure and
definition-use chains, is built. This information allows the display process
to discern which lines make up a loop and identify the
parameters of a given loop.


5.3. PQUERY 


The primary novelty of PTOOL is the interactive display 
of a dependence graph in PQUERY. This section illustrates the 
power of PTOOL’s display through the example in Figure 1, 
which captures a number of important characteristics from 
scientific applications. The code itself can be viewed as solving a 
wave equation, or computing an unknown function at a mesh of 
points given only its partial derivative in one direction.


      PROGRAM MAIN
C
      COMMON H(10,10),E(10,10),P(10,10)
      DATA N/10/
C
      DO 10 K=1,N
         H(K,1) = FUNC(K,1)
         H(1,K) = FUNC(1,K)
   10 CONTINUE
C
      DO 30 K=2,N-1
         DO 20 J=2,N-1
            T = DERIV(H(J-1,K))
            H(J,K) = H(J-1,K) * T
            IF (H(J,K).EQ.0) GOTO 20
            E(J,K) = H(J,K) * PSI(J,K)
            P(J,K) = (E(J,K)**2)/(2*M)
   20    CONTINUE
   30 CONTINUE
C
      WRITE(6,*)'The resulting H array is ', ((H(J,K),J=1,N),K=1,N)
      WRITE(6,*)'The resulting E array is ', ((E(J,K),J=1,N),K=1,N)
      WRITE(6,*)'The resulting P array is ', ((P(J,K),J=1,N),K=1,N)
      WRITE(6,*)'The last DERIV is', T
      STOP
      END


Figure 1: Test Program


The first loop initializes boundary conditions at the border of the mesh.
The next loop nest sweeps the calculation across the array. The 
inner loop moves up a column, calculating the value at any
particular location J by using the value calculated at location J-1. It
also performs some auxiliary computations, based on the value at 
the location. The outer loop sweeps the computation across all 
the columns. The calculations moving up the columns are all 
dependent upon previous values, and must proceed sequentially. 
However, nothing prevents parallel computation of different 
columns. 
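
This claim about the test program can be checked by replaying the sweep with the columns visited in different orders. The Python sketch below is a loose transcription of the H computation only; `func` and `deriv` are made-up pure stand-ins for FUNC and DERIV, whose source is not shown, and T is kept private to each iteration.

```python
def func(j, k):
    """Stand-in for FUNC: any pure function of the mesh position."""
    return float(j + 2 * k)


def deriv(x):
    """Stand-in for DERIV: any pure function of one value."""
    return 1.0 / (1.0 + abs(x))


def sweep(n, column_order):
    # 1-based indexing to mirror the Fortran; row and column 0 are unused.
    h = [[0.0] * (n + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):              # boundary conditions (DO 10)
        h[k][1] = func(k, 1)
        h[1][k] = func(1, k)
    for k in column_order:                 # DO 30: columns, in any order
        for j in range(2, n):              # DO 20: must run in increasing J
            t = deriv(h[j - 1][k])         # T is private to the iteration
            h[j][k] = h[j - 1][k] * t
    return h


n = 10
forward = sweep(n, range(2, n))
backward = sweep(n, reversed(range(2, n)))
print(forward == backward)
```

Reordering the columns leaves H unchanged, whereas reordering the J loop would not, since each entry of a column is computed from the one below it.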


In addition to the code that is shown in Figure 1, the
source for all of the function and subroutine calls was included
when the code was run through PSERVE. The only important
property of these procedures was the fact that they were “pure”;
that is, they did no READs or WRITEs, and did not access
global memory.


Once PSERVE has constructed the dependence database 
and downloaded it to the PQUERY machine, the system is ready 
for an interactive session. The user begins by browsing the com- 
plete FORTRAN source file. The first task is to identify loops 
that are likely candidates for parallelization. For example, in
Figure 1, we can see that either of the two outer loops (DO 10
and DO 30) should execute correctly in parallel. A cursory
examination of the code reveals no obvious problems with this
conclusion.


The PQUERY browser provides a typical set of editor
commands for moving about the file (e.g., search, go-to-line,
page-forward). The user can thus proceed to the first loop to be
checked. Once the DO statement (or backwards GOTO) defining 
this loop is visible on the screen, the loop can be selected by 
placing the cursor on the appropriate line (using either a mouse 
or cursor keys) and hitting a selection key. 


When a loop is selected, PQUERY displays a list of vari- 
ables not in COMMON that need to be global to the loop. As 
stated earlier, these variables must be parameters to the resulting 
parallel subroutine call. Figure 2 presents the screen that a user 
would see after selecting the DO 30 loop (from the code in Figure 
1) for parallel execution. PTOOL has detected two variables 
that must be shared—T and N. T is somewhat surprising, since 
it appears to be a local temporary. 


Since it is natural for the programmer to question 
PTOOL’s decisions, we provided a simple explanation facility. 
For example, if the programmer questions the global allocation of 
T, he will be presented with the screen in Figure 3. The figure 
shows that T is used on line 32 after being defined in the selected 
loop (lines 12-20)—a use outside the loop. The variable, if 
present in the source (it is possible that COMMON variables will 
not appear at a call site), is highlighted and the reason that the 
variable must be global is given. 


After reviewing the parameters, the user can add or delete 
any parameters. This permits PTOOL to solve a number of 
problems. For instance, on one pass over a loop, a user can allow 
all of the parameters to remain and discover the scope of the 
overall problem.


   DO 30 K = 2,N-1                                          <12>


The list of parameters we think necessary for parallel execution

      T,N

Would you like to examine any of the parameters (YES, [NO])?
Would you like to DELETE from the list (YES, [NO], ALL)?
Would you like to ADD to the list (YES, [NO])?

Please wait while conflicts are located ...


Figure 2: Global variables as displayed by PTOOL


   DO 30 K = 2,N-1                                          <12>
      DO 20 J = 2,N-1
      *
      *
      *
20    CONTINUE
30 CONTINUE                                                 <20>
   *
   *
   *
   WRITE(6,*)'The last DERIV is', T                         <32>

T is used after the loop on line 32


Figure 3: PTOOL’s explanation of the parameter list


Then he can delete all parameters to
investigate dependences on variables in COMMON. If
dependences are present, the user can attempt to eliminate them by
transforming the program source. When no dependences arise 
from COMMON variables, the user can examine parameters 
singly, using transformations to eliminate each of the problems. 
PSERVE can then be invoked on the transformed source to ver- 
ify that the process has been successful. 


Figure 4 illustrates PTOOL displaying dependence problems
after the user is satisfied with the parameter list. PTOOL
has highlighted the variable T in two statements, because there
is a dependence (caused by the global allocation of T) that can 
cause incorrect parallel execution. If the dependence can be 
eliminated by forward substitution or induction variable substitu- 
tion, PTOOL will inform the user of that fact. In the example 
program, all problem dependences arise because T must be in 
global storage. If the programmer eliminates that requirement 
by changing the code (or simulates it by removing T from the 
parameter list), no conflicts will appear. 


6. An Assessment 


A number of studies have observed that there are two gen- 
eral classes of errors introduced in converting programs for paral- 
lel execution [20,7]. 


1) Errors due to unintentional data sharing or access to shared 
variables in an improper order. 


2) Errors due to incorrect synchronization code. 


PTOOL is effective in helping the programmer identify the 
causes of errors in the first class. By displaying dependences in 
an understandable manner, PTOOL can immediately pinpoint
problems that might require enormous amounts of unaided 
human time to find. In fact, in the first demonstration it found a 
problem that had consumed three man-months of effort at Los 
Alamos. Interprocedural analysis has been an essential element 
in the success of PTOOL, because it eliminates most of the 
spurious dependences caused by large COMMON blocks. Furth- 
ermore, it makes it possible to analyze complete Fortran pro- 
grams of substantial size. 


   DO 30 K = 2,N-1                                          <12>
      DO 20 J = 2,N-1
         T = DERIV(H(J-1,K))                                <14>
         H(J,K) = H(J-1,K) * T                              <15>
         IF (H(J,K).EQ.0) GOTO 20
         E(J,K) = H(J,K) * PSI(J,K)
         P(J,K) = (E(J,K)**2)/(2*M)
20    CONTINUE
30 CONTINUE                                                 <20>
C
   DO 40 K = 1,N
      DO 50 J = 1,N

Type -> Anti    T could be redefined before it is used ...


Figure 4: An anti-dependence displayed by PTOOL


In this context, it is useful to contrast PTOOL with the 
DAPP (Data Flow Analysis for Parallel Programs) system of 
Appelbe and McDowell [7]. When completed, DAPP will accept 
a Fortran program in which synchronization code has already 
been inserted and produce a report of parallel access 
anomalies: pairs of statements that can access the same location
simultaneously during a legal schedule. DAPP has the advantage 
over PTOOL of applicability to a variety of concurrent struc- 
tures. On the other hand, it uses an analysis technique that is 
potentially exponential in the size of the program (the single- 
procedure analysis of PTOOL is proportional to the square of the 
procedure size and the interprocedural analysis runs in time 
almost linear in the size of the call graph). Another drawback of 
DAPP is that it presents an exhaustive report of potential 
anomalies. Our experience with PTOOL and the experience of 
other researchers [9] is that exhaustive reporting produces an 
enormous amount of information in which the important facts 
can be lost. The selective display provided by PTOOL permits 
the programmer to focus on one problem at a time. For these 
reasons, we think that PTOOL will be more useful in analyzing 
large programs. 


On the other hand, PTOOL is of no help whatever for 
problems of the second class—errors in synchronization code. It 
is our belief that writing synchronization code is inherently 
difficult and that programmers should be able to deal with paral- 
lelism at a higher level of abstraction. There are several systems 
of language extensions, implemented through preprocessing, that 
provide a higher-level interface [8,14,11]. 


Another approach would be to extend PTOOL to support 
parallel programming in a more direct fashion. For example, 
once the programmer has dealt with all dependences that prohi- 
bit parallel execution of a given loop, he might request that the 
system generate the code required to initiate and synchronize the 
parallel execution. Such a facility would be the first step in 
evolving PTOOL from a debugging aid to a programming sys- 
tem. Dependence analysis can be a very powerful tool during the 
program development process. When displayed in an informative 
way, dependences can provide information about how effectively 
the program is making use of parallel hardware. 


One major drawback to using PTOOL as a programming 
support system is that it is not sufficiently interactive, since it 
cannot redisplay dependences immediately after a change. 
Redisplay requires that PSERVE be invoked again on the whole 
program, an expensive process. If PTOOL is to be a truly 
effective programming support tool, it must be converted to an 
interactive system. But an interactive programming system must 
also support other functions, such as editing, compilation, and 
execution. In short, the system must be a complete programming
environment.


An interactive programming environment is appealing for
other reasons. One of the main hindrances to accurate depen- 
dence analysis is variables whose values are known only at run- 
time. A compile-time analysis must make worst case assump- 
tions about the values of such variables, resulting in dependences 
that may not be present at run-time. An interactive system can 
query the programmer regarding values of such variables, and 
can embed the information provided into the dependence graph 
and into the resulting code (in case a programmer’s assertion 
turns out to be false). Additionally, a programming environment 
makes reasonable an enhanced form of interprocedural analysis, 
in which the effects of procedures on portions of arrays and
vectors can be summarized. Such analysis can be especially
beneficial on a parallel multiprocessor, since the ability to run
procedure invocations in parallel is one of its primary
advantages [23].

For these and other reasons, we are convinced that the 
proper context in which to employ dependence analysis is a
programming environment such as R^n [13]. Such a system takes
advantage of the strengths of the system and the programmer:
the programmer is able to concentrate on developing highly
parallel algorithms, with prompting from the system when he
strays from parallelism, while the system is able to focus on the
mundane aspects of uncovering and scheduling parallelism, with
help from the programmer when some information regarding the
run-time values of variables is necessary. The resulting system
should provide a highly effective mechanism for developing paral- 
lel programs. 


Although PTOOL is still a fairly primitive system, it has 
demonstrated that loop carried dependence, properly presented, 
can be an effective debugging tool. We see it as a first step 
toward more powerful parallel programming tools based on 
interactive computation of loop carried dependence.


Abstract. The KAP/ST-100 is a Fortran source translator that
simplifies the task of programming the Star Technologies’ ST-100
attached processor. The ST-100 can execute certain vector operations
very quickly. It is programmed using a Fortran-like language, called
APCL; APCL routines are invoked from the host through system
library calls. The KAP/ST-100 locates parts of Fortran programs that
will execute efficiently on the ST-100. An APCL routine is created for
each of these regions, and the original code is replaced by system
library calls to invoke these routines. This paper explains the
complications which were encountered in translating for the ST-100 and
the methods which were used to overcome them.


1. Introduction


The Star Technologies’ ST-100 attached processor [1] is designed
for high speed execution of single precision (32-bit) signal processing
kernels. It also has a wide range of application areas, due to the high
speed of its arithmetic unit. The ST-100 is attached to a host
computer, such as a VAX, IBM or Control Data mainframe, to which it
looks like an I/O device. Since the ST-100 has its own main memory,
data items are transferred from the host memory to the ST-100 main
memory. The ST-100 executes asynchronously from the host; routines
for the ST-100 are written in a Fortran-like language called APCL
(Array Processor Control Language) [2]. A program running on the
host calls system library routines to schedule execution and acquire
ST-100 resources, to send data from the host to the ST-100 main
memory, to load the APCL routine to the ST-100, and to initiate
execution of that APCL routine. The host program may continue
execution or may wait for completion of the APCL routine. The host
program then calls other system library routines to bring data back from
the ST-100 to the host for further processing, printing, and so on. As
with all array processors, writing a program for the ST-100 is a
nontrivial project. Because of this, array processors are commonly
restricted to special applications. Now with the KAP/ST-100, the power of
the ST-100 is available to Fortran programmers.


The KAP/ST-100 is a Fortran translator that eases the task of
programming the ST-100. The KAP/ST-100 generates APCL routines
for the regions of programs which execute efficiently on the ST-100.
Four output files are generated by the KAP/ST-100: an APCL file, a
host program file, an auxiliary file to assist in linking the APCL
routines, and a listing. The advantages of using the KAP/ST-100 are that
the Fortran program remains portable and the user need not learn
APCL.


The next section of this paper describes the ST-100 attached
processor, focusing on those aspects of the machine that require special
handling in the KAP/ST-100. The third section gives an overview of
the KAP/ST-100. The fourth section focuses on how the KAP/ST-100
generates APCL code from the Fortran program. The fifth section
explains how the KAP/ST-100 decides what sections of code to leave
on the host and what sections to move to the ST-100. The sixth
section relates some of the problems we encountered with the user
interface of the KAP/ST-100. The last section gives an example of the use
of the KAP/ST-100 on a simple matrix multiply routine.


[Block diagram: the Control Processor feeds the SMP Queue and the
ACP Queue, which serve the Storage Move Processor and the Arithmetic
Control Processor. Shaded arrows represent data paths; solid arrows
represent control paths.]


Figure 1. Structure of the ST-100 Array Processor.


2. The ST-100 Attached Processor


The ST-100 is composed of three asynchronous processors and 
three memories. The Control Processor (CP) executes the user’s APCL 
routine. The CP controls the operation of the other two processors, the
Storage Move Processor (SMP) and the Arithmetic Control Processor
(ACP). The SMP and ACP are fast processors, with a 40 nanosecond 
clock period; the CP is a standard microprocessor with no floating 
point unit. The APCL code resides in Local Memory (LM), which is 
directly accessible by the CP. Data is sent from the host to the large 
Main Memory (MM). The ACP, which is the arithmetic engine, per- 
forms operations on data in a small, fast, reconnectable private 
memory, called the Cache Memory (CM). The SMP moves data 
between the MM and CM. The ACP and SMP are microprogrammed
processors; the macro library for each processor can be customized for
particular applications and even modified by users.


A block diagram of the ST-100 appears in Figure 1. The CP 
executes an APCL routine which issues SMP and ACP macros. These 
macros are loaded into the SMP and ACP queues; the queues allow the 
CP to continue execution without waiting for the SMP or ACP. Two 
types of synchronization are available. In the first type, the CP can 
wait until the queues are empty and activity is complete; this is useful 
when the CP needs to test a result in order to decide which branch to 
take in a program, or to end the program. Alternatively, the SMP and 
ACP synchronize with each other to reconnect the CM; the CM is 
divided into several banks, each of which is connected to either the 
SMP or ACP. 


2.1 Speeds 


In order to perform vector operations, data must be moved from 
the MM to the CM. The SMP can move one word in either direction 
per 40 ns clock. The ACP is pipelined, and can produce two floating 
point multiply and two floating point add results in each 40 ns clock. 


For simple vector operations, the ST-100 is memory bandwidth 
limited; for each result produced by the ACP, the SMP must move two 
operands to the CM, and the result back to the MM. The speed payoff 
for this machine comes when there are many operations being per- 
formed on each datum, such as in a Fast Fourier Transform. 
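
The arithmetic behind this observation is short enough to spell out. The Python sketch below uses only the figures given in the text (a 40 ns clock, one word moved per clock by the SMP, two multiplies and two adds per clock from the ACP); the function names are hypothetical and startup costs are ignored.

```python
CLOCK_NS = 40.0                      # SMP and ACP clock period (from the text)


def smp_bound_ns(words_moved_per_result):
    """Time per result when the SMP, at one word per clock, is the bottleneck."""
    return words_moved_per_result * CLOCK_NS


def acp_peak_results_per_second():
    """Peak ACP rate: two multiplies and two adds per clock."""
    return 4 / (CLOCK_NS * 1e-9)


# Z(I) = X(I) * Y(I): two operands in, one result out, so three word moves.
vector_multiply_ns = smp_bound_ns(3)
print(vector_multiply_ns)                    # 120.0 ns per result
print(1e3 / vector_multiply_ns)              # millions of results per second
print(acp_peak_results_per_second() / 1e6)   # millions of results per second
```

Under these assumptions a simple vector multiply is limited to roughly 8 million results per second by data movement, against an arithmetic peak of about 100 million, which is why kernels that reuse each datum many times fare so much better.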


2.2 Programming Considerations 


Since the CM is relatively small, strip mining [3] is required for 
long vector operations. The size of the strips is not fixed, as would be 
the case for vector register machines [4], but depends on the number of 
variables kept in the CM at one time; fewer variables allow longer 
strips. 
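
A minimal Python sketch of strip mining with a variable strip length follows. The cache capacity is a free parameter here because the text does not give the size of the CM, and the names `strip_mine` and `vector_multiply` are assumptions for illustration; list slices stand in for the moves between MM and CM.

```python
def strip_mine(n, cache_words, vectors_in_cache):
    """Yield (start, length) strips covering elements 0..n-1.

    The strip length is whatever fits when vectors_in_cache vectors must
    share cache_words words of cache at the same time.
    """
    strip = cache_words // vectors_in_cache
    if strip < 1:
        raise ValueError("cache too small for even one element per vector")
    for start in range(0, n, strip):
        yield start, min(strip, n - start)


def vector_multiply(x, y, cache_words):
    """Z = X * Y computed one strip at a time."""
    z = [0.0] * len(x)
    for start, length in strip_mine(len(x), cache_words, vectors_in_cache=3):
        cx = x[start:start + length]            # move operands into the cache
        cy = y[start:start + length]
        cz = [a * b for a, b in zip(cx, cy)]    # vector operation on one strip
        z[start:start + length] = cz            # move the result back
    return z


print(list(strip_mine(10, 12, 3)))
print(list(strip_mine(10, 12, 2)))
```

With three vectors resident the strips are four elements long; with only two resident they grow to six, which is the sense in which fewer variables allow longer strips.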


To ensure the best machine utilization, the SMP and ACP should
be kept as busy as possible. This means that some banks of the CM 
should be connected to the SMP which is loading operands for the next 
operation or storing results from the previous operation, while other 
banks of the CM are connected to the ACP. The ACP has three 
bidirectional connections to the CM and can perform several operations 
simultaneously. Since the SMP has only one, the number of ACP 
operations per SMP operand must average at least three in order for the 
ACP and SMP busy times to be balanced. Temporary results which do 
not need to be saved in the MM can help increase this 
operation/operand ratio. 


2.3 APCL 


APCL is syntactically similar to Fortran, but there are special 
constructs and restrictions which have to do with the way the ST-100 
works. For instance, there is no DOUBLE PRECISION data type, 
since the ST-100 works with only 32-bit floating point numbers. Also, 
all variables must be explicitly declared as residing in one of the three 
memories: LM, CM, or MM. 


A single APCL routine is called a PROCESS. The APCL pro- 
cess is executed by the CP, just as a Fortran subroutine would be exe- 
cuted by a standard computer. To invoke ACP or SMP macros, the 
APCL programmer codes macro CALLs; these look like ordinary 
CALL statements. The APCL compiler compiles macro CALLs into 
the proper code to build a parameter block and issue the macro to the 
SMP or ACP queue. 


Figure 2 shows a simple Fortran DO loop written in APCL with 
SMP and ACP macro calls. Figure 2(a) shows a vector multiply writ- 
ten as a Fortran subroutine. Figure 2(b) shows how this would be 
translated into APCL to use the fast ACP unit. 


SUBROUTINE MULT(X,Y,Z) 
REAL X(1000),Y(1000),Z(1000) 
DO 10 I = 1,1000 
10 Z(I) = X(I) * Y(I) 
RETURN 
END 

(a) 


PROCESS MULT (X,Y,Z) 
MAIN MEMORY 
REAL X(1000),Y(1000),Z(1000) 
CACHE MEMORY 
REAL (C1,CZ(1000)) 
REAL (C2,CX(1000)) 
REAL (C3,CY(1000)) 
CALL STSYNC (000000) 
CALL SMM2C (X(1),1,4,0, CX(1),1, 1000) 
CALL SMM2C (Y(1),1,4,0, CY(1),1, 1000) 
CALL STSYNC (111111) 
CALL AVMUL (CX(1),1, CY(1),1, CZ(1),1, 1000) 
CALL STSYNC (000000) 
CALL SMC2M (Z(1),1,4,0, CZ(1),1, 1000) 
CALL STWAP 
END 
(b) 


Figure 2. Vector multiply. 


3. The KAP/ST-100 


The KAP/ST-100 translator automatically translates parts of For- 
tran subroutines, functions or main programs into APCL code. This 
translation can proceed with or without help from the user. The 
KAP/ST-100 will decide what portions of the subroutine to translate, 
translate the code into APCL, and insert system library calls to move 
data to and from the ST-100. User directives can be used to tune the 
generated code in order to improve overall program performance. 


The KAP/ST-100 uses the most sophisticated automatic Fortran 
vectorization technology currently available. The KAP/ST-100 vectorizer 
includes such advanced transformations as loop interchanging, loop 
distribution, vector optimization and recognition of many reduction 
operations. The KAP/ST-100 performs both exact data dependence 
tests for the most common types of loops and symbolic tests when 
appropriate. The rest of this section gives an overview of how the 
KAP/ST-100 works. The actual translation to APCL is explained in 
detail in the next section. 


3.1 Scanner 


The KAP/ST-100 accepts the Fortran-77 language [5]. Future 
versions of the KAP/ST-100 will accept certain Fortran extensions, 
such as VAX/VMS Fortran extensions [6] for VAX hosts. Acceptance 
of extensions is largely for the convenience of the user, and will not 
affect the quality of translation. 


3.2 Candidate Finder 


The Candidate Finder finds and marks DO loops which are can- 
didates for vectorization. Notice that the KAP/ST-100 also converts IF 
loops to DO loops when appropriate. This pass is very machine 
specific, since different vector machines have very different vector 
instruction sets. For example, the ST-100 can not do DOUBLE PRE- 
CISION arithmetic, so the KAP/ST-100 candidate finder rejects any 
statement containing DOUBLE PRECISION constructs. Each DO loop 
is examined; if a DO loop has no candidate statements, then the whole 
loop is marked as unsuitable and vectorization will not be attempted. 


Another pass is made that examines all statements, not just statements 
in DO loops. The KAP/ST-100 does more than convert vector 
DO loops to vector code; it converts scalar Fortran to scalar APCL. 
Thus, the KAP/ST-100 includes a pass that examines each statement to 
see if it can be converted to APCL. Any statement that cannot be con- 
verted to APCL is left on the host. This affects how the program gets 
translated into APCL later. 


3.3 Preparation Phase 


In order to improve vectorization, the KAP performs such common 
transformations as induction variable recognition [7] and promotion 
of scalars to arrays [8]. In addition, certain IF constructs are 
recognized, such as IFs which save the maximum of a vector, or which 
find the index of such a maximum. 


3.4 Vectorization 


The KAP/ST-100 finds the best method to vectorize each DO 
loop nest in the program for the ST-100. The vectorizer builds a high 
quality data dependence graph for each loop nest, which finds where 
data is generated and used [9,10]. The vectorizer then performs loop 
transformations such as loop interchanging and loop distribution to find 
the best loop to vectorize. The data dependence graph is used to check 
for legality of these operations. Preference is given to long vector 
lengths. 


3.5 Process Finder 


Up to this point, the KAP/ST-100 has not yet decided what part 
of the Fortran subroutine to translate to APCL. The process finder 
selects regions of code that will be converted to APCL PROCESSes. 
Statements that cannot be converted to APCL are left on the host, but 
not all of the rest of the program will necessarily be translated. 
Unvectorizable DO loop nests are left on the host computer, unless 
they appear between two vectorized regions. However, even some vec- 
torized DO loops may be left on the host. For example, a DO loop 
that does just a vector multiply is not a good candidate for translation 
to APCL, since the translated program would most likely execute 
slower than the original program due to the cost of moving data to and 
from the ST-100. This subject is taken up in more detail in Section 5. 


3.6 Cleanup 


Certain minor cleanup passes over the program are performed at 
this time. These include a unique pass called expanded array shrink- 
ing. Since the KAP performs extensive loop interchanging, the loop 
which will finally be vectorized is not known at the time that scalars 
are promoted to arrays. To allow the most loop interchanging, the 
scalars are promoted into multidimensional arrays. After loop inter- 
changing and vectorization, some of the loops will be left scalar, and 
some of the dimensions are not strictly necessary. The cleanup pass 
expanded array shrinking removes the unnecessary dimensions of the 
temporary arrays, reducing their size. 


3.7 Translation to APCL 


Finally, the parts of the subroutine which have been chosen for 
translation into APCL are examined and converted into APCL 
PROCESSes. The program segment is replaced with system library 
calls to properly invoke the APCL PROCESS. This is discussed in 
more detail in the next section. 


4. Translation of Fortran into APCL 


At this point in the KAP/ST-100, one or more segments of the 
Fortran subroutine may have been selected for conversion into APCL. 
Each such segment of the subroutine is converted into a different 
APCL PROCESS. This conversion takes place in two steps. 


4.1 Host Interface 


First, the segment is scanned to find all variables or arrays that 
are used or changed. Any variables or arrays whose values are used 
must be sent from the host to the ST-100, and any variables or arrays 
whose values are changed must be brought back to the host after the 
APCL PROCESS completes. Arrays which are assigned but not used 
in the segment are also sent to the ST-100, because the assignment 
may be conditional or may only assign part of the array. When the 
values are brought back from the ST-100, they will write over the 
values currently in the host. 


At this point, the segment of Fortran code is removed from the 
subroutine and is replaced by library routine calls. A call to the library 
routine KSTSCH is inserted to reserve a block of ST-100 Main 
Memory and set time limits for the PROCESS. For each array that 
must be sent to the ST-100, a call to the library routine KSTARY is 
inserted to declare that array in MM. Any scalar parameters are 
packed into two arrays (REAL and INTEGER), and sent over all at 
once. Then each array is transferred to the ST-100 using a call to the 
library routine KSTWR. Then a call to KSTLGO is inserted to load 
and invoke the APCL PROCESS. After KSTLGO, any arrays (or vari- 
ables) which were modified by the segment of code are brought back 
from the ST-100 using KSTRD calls. A final call to KSTPRG clears 
the ST-100 Main Memory for the next process; ST-100 resources are 
not released, so that they can be reused by the next KAP-generated 
process. 
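

The order of the inserted library calls can be summarized with a small Python sketch. It models only the sequence described above as a list of strings; the function name and its arguments are hypothetical, the real argument lists of the library routines are not reproduced, and the packing of scalar parameters is omitted.

```python
def host_call_sequence(used, modified, process_name):
    """Return the ordered library calls that replace one Fortran segment.

    `used` and `modified` are lists of array names.  Arrays that are only
    assigned are still sent, because the assignment may be partial.
    """
    sent = list(dict.fromkeys(list(used) + list(modified)))  # ordered union
    calls = ["KSTSCH"]                          # reserve ST-100 Main Memory
    calls += [f"KSTARY {a}" for a in sent]      # declare each array in MM
    calls += [f"KSTWR {a}" for a in sent]       # host -> ST-100
    calls.append(f"KSTLGO {process_name}")      # load and invoke the PROCESS
    calls += [f"KSTRD {a}" for a in modified]   # ST-100 -> host
    calls.append("KSTPRG")                      # clear MM for the next process
    return calls
```

For a segment that reads `X` and `Y` and assigns `Z`, the sketch sends all three arrays and reads back only `Z`.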


One problem encountered with many Fortran subroutines is that 
formal parameter arrays are often declared with an assumed size for the 
last dimension: 


REAL A(100,*), or 
REAL A(100,1). 


In Fortran, the upper bound of the last dimension of formal parameter 
arrays is usually ignored, and may in fact be left unspecified. Thus, 
the true size of formal parameter arrays may not be known. For the 
KAP/ST-100 to work properly, all arrays must be explicitly declared to 
the proper size. A future version will examine loop bounds and array 
references to find the maximum size used for all arrays, rather than the 
declared size. 


4.2 Process Creation 


Next the KAP/ST-100 creates the APCL PROCESS out of the 
Fortran subroutine segment. Because of restrictions and limitations in 
the APCL compiler, certain scalar optimizations must be performed in 
the APCL source to get the best speed. 


There are no implicitly declared variables in APCL, so all vari- 
ables that are used in the PROCESS must be explicitly declared, and 
placed in one of the three memories (MM, LM or CM). The 
KAP/ST-100 will perform all floating point arithmetic on the ACP, 
rather than in software on the CP, since the ACP is so much faster; this 
includes any scalar floating point arithmetic in the "tissue code" 
between vectorized DO loops. All the code is translated into the 
appropriate APCL; the control structure of the PROCESS (scalar IF 
tests, serial DO loops) will be similar to the structure of the Fortran 
subroutine. All floating point arithmetic causes SMP and ACP macros 
to be generated, to move values to the CM, compute some results, and 
occasionally bring the result back to the CP for testing. 


Vectorized DO loops generate a sequence of macro CALLs to 
load the operands into CM, perform the computation, and store the 
results. Intermediate results are kept in the CM, so temporary MM 
arrays are not needed. If the DO loop bound is not known at compile 
time, or if it is very large, then the loop has to be strip mined, just as 
would be necessary for a vector register machine. 


Since the CP is so much slower than the SMP and ACP, the CP 
code must be heavily optimized. The KAP/ST-100 helps in this matter 
by performing source-level optimization, similar to what an optimizing 
compiler might do. All arrays in the APCL PROCESS are linearized, 
that is, converted to single dimensional arrays by fully expanding the 
subscript expressions. This allows the KAP/ST-100 to remove most of 
the subscript evaluation through common subexpression discovery and 
code floating. Strength reduction is also important since multiplication 
on the CP is very slow, even for integers. 
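

A minimal Python sketch of the two source-level optimizations follows. It assumes Fortran's column-major storage with 1-based subscripts; the function names are illustrative, and the running variable `iv` only mirrors the role of the IV variables in the generated code.

```python
def linear_index(i, j, ld):
    """1-based column-major offset of element (i, j) with leading dimension ld."""
    return i + (j - 1) * ld


def column_starts_multiply(n, ld):
    """Start offset of each column, using one multiply per column."""
    return [linear_index(1, j, ld) for j in range(1, n + 1)]


def column_starts_reduced(n, ld):
    """Same offsets with the multiply replaced by a running addition."""
    starts = []
    iv = 1                # induction variable holding the current offset
    for _ in range(n):
        starts.append(iv)
        iv += ld          # strength reduction: add instead of multiply
    return starts
```

Both functions produce the same offsets; the second form avoids integer multiplication inside the loop, which is the point of strength reduction on a processor with slow multiplies.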


The cost of enqueueing a parameter block on the SMP or ACP 
queues is relatively large; much of this relates to the time taken by the 
software to build the parameter block. The KAP/ST-100 has no control 
over this, but it can help by reducing the total number of macro 
calls. Use of multioperation macros and elimination of unnecessary 
synchronization macros are two ways of doing this. 


5. Process Selection 


The ST-100 gives the best performance when the program per- 
forms many operations on each datum. While a simple vector multiply 
will execute very quickly in the ACP, the cost of moving the vector 
multiply from the host to the ST-100 is large: 


1. Move the three vectors from host memory to MM 
2. Invoke the APCL PROCESS 
3. Move the output vector from MM to the host memory. 


The APCL PROCESS itself must incur some overhead because it must 
move the vectors from MM to CM and the result back. Unless floating 
point multiplication is very slow on the host machine, this process will 
run much slower than performing the operation on the host, regardless 
of the vector length. 


The KAP/ST-100 computes the costs and benefits of using the 
ST-100 in terms of the number of operations per operand. The number of 
operands is the number of data items that must be moved between the 
host and the ST-100. In the case of a vector multiply (with a vector 
length of $N$), $4N$ data items must be moved ($2N$ for the input vectors, 
$2N$ for the output vector going over and back). In the case of a matrix 
multiply (with a matrix size of $N \times N$), $4N^2$ data items must be moved 
(two input matrices, and one output matrix of size $N \times N$ going over 
and back). The number of operations for a vector multiply is $N$ ($N$ 
multiplies); for a matrix multiply, it is $2N^3$. Thus, the 
operation/operand ratio for a vector multiply is $N/4N$, or $1/4$; for a 
matrix multiply, it is $2N^3/4N^2$, or $N/2$. This 
is a very coarse measure of the amount of benefit relative to the overhead 
cost of moving a set of operations to the ST-100. It is somewhat 
oversimplified, in that it assumes that all operations are of equal cost. 
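

The two ratios can be checked with exact arithmetic. The short Python sketch below simply restates the counts given in the text; the function names are chosen for the example.

```python
from fractions import Fraction


def vector_multiply_ratio(n):
    """N multiplies over 4N transferred items (2N in, 2N for the result)."""
    return Fraction(n, 4 * n)


def matrix_multiply_ratio(n):
    """2N^3 operations over 4N^2 transferred items."""
    return Fraction(2 * n**3, 4 * n**2)
```

The vector ratio stays at 1/4 for every length, while the matrix ratio grows linearly with the matrix size, reaching 50 for a 100 by 100 matrix.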


However, even this coarse measure may be used as a starting 
point in deciding whether or not a set of operations should be moved 
to the ST-100. Certainly, a large operation/operand ratio will produce 
more speedup than a small one. A segment of code with a ratio of less 
than one will never run faster on the ST-100. A segment of code with 
a ratio that grows with the size of the problem will get even more 
speedup with large problems than with small; these segments of code 
are especially suitable for moving to the ST-100. 


Adjacent segments of code should be moved together, if at all 
possible, instead of making two or more adjacent APCL PROCESSes. 
The decision of whether to move a segment of code to the ST-100 
must be made with consideration of the surrounding code. For 
instance, by itself the vector multiply above is not suitable to move to 
the ST-100; however, if it were surrounded by two operations that were 
very suitable for the ST-100, such as matrix multiplies, then the total 
cost of a PROCESS containing both matrix multiplies and the vector 
multiply may be less than two PROCESSes each containing one matrix 
multiply. 


The KAP/ST-100 tries to estimate the speedup of moving each 
independent segment of code to the ST-100. It then looks at adjacent 
segments of code and combines them as long as there is still speedup. 
The goal is to generate a program which has the most potential 
speedup. This goal is made difficult since usually some important 
parameters, such as the loop bounds, are unknown at compile time. 
Thus, some loops which look particularly suitable may at run-time turn 
out to have very short vector lengths, resulting in poor speedup. 
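

One way to picture the combining step is the following Python sketch. It is an assumed model, not the KAP/ST-100 algorithm: each segment is described by an operation count and the number of data items moved per array, an array shared by two merged segments is assumed to be moved once, and neighbours are merged greedily while the merged operation/operand ratio stays at or above a threshold.

```python
def segment_ratio(seg):
    """Operation/operand ratio of one segment."""
    return seg["ops"] / sum(seg["moved"].values())


def merge(seg_a, seg_b):
    """Merge two adjacent segments; a shared array is moved only once."""
    moved = dict(seg_a["moved"])
    for name, items in seg_b["moved"].items():
        moved[name] = max(moved.get(name, 0), items)
    return {"ops": seg_a["ops"] + seg_b["ops"], "moved": moved}


def combine_adjacent(segments, threshold=1.0):
    """Greedily merge neighbours while the merged ratio stays >= threshold."""
    result = []
    for seg in segments:
        if result:
            candidate = merge(result[-1], seg)
            if segment_ratio(candidate) >= threshold:
                result[-1] = candidate
                continue
        result.append(seg)
    return result
```

Under this model, a vector multiply with ratio 1/4 placed between two matrix multiplies is absorbed into a single process, matching the example given in the text. The sketch does not decide whether an unmerged low-ratio segment stays on the host.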


6. User Interface 


The KAP/ST-100 must explain to the user just what 
happened to the program. The KAP/ST-100 listing shows the user's program 
with the sections that are ported to the ST-100 marked. Those 
statements that are executed in vector mode are also marked. 


One of the problems facing KAP/ST-100 users is how to figure 
out why one section of code was ported and not another. A variety of 
conditions affect this decision, and user directives may change the 
result. One of two DO loops may not be vectorizable or may not fit 
any known macro set. In this case, it may be faster to leave that loop 
on the host, and still move the other loop. Another consideration is the 
amount of data to be moved. One DO loop may use many arrays; the 
KAP/ST-100 may decide that the operation/operand ratio was too low, 
and leave that loop on the host. The KAP/ST-100 leaves the last word 
with the user; KAP directives can force the conversion of a particular 
segment of code, as long as that segment can be fully converted to 
APCL. 


Another KAP directive modifies the ratio at which code is con- 
sidered suitable for conversion to APCL. Since the ST-100 is faster 
relative to a small host (such as a VAX 11/730) than it is to a larger 
host (like a VAX 8600), some segments of code may be faster on the 
ST-100 with one host and faster on the host in another configuration. 
This KAP directive can even be used to force the KAP/ST-100 to con- 
vert almost all code that can be converted; this situation may be useful 
in shops where the critical resource is not time but money, and host 
CPU time is expensive where ST-100 time is inexpensive. 


7. Examples and Speeds 


The KAP/ST-100 output from a matrix multiply subroutine is 
shown here, including execution times. The input subroutine is shown 
in Figure 3(a); the KAP-generated host subroutine is shown in Figure 
3(b) and the APCL PROCESS is shown in Figure 3(c). The host program 
proceeds in seven steps: 


1. Call KSTSCH to reserve $3N^2+1$ words of ST-100 MM; 
2. Pack all scalars into a temporary array (just N here); 
3. Call KSTARY to declare arrays in ST-100 MM; 
4. Call KSTWR and KSTWRW to write data values to the ST-100; 
5. Call KSTLGO to load and execute the PROCESS MATMU0; 
6. Call KSTRDW to read results from ST-100; 
7. Call KSTPRG to clear ST-100 MM. 


For each array declared in ST-100 MM, a three word descriptor is kept 
in the host program. This is used to pass that array as a parameter to 
the APCL PROCESS. The parameters to the KSTLGO subroutine are 
the PROCESS name, the number of PROCESS parameters, and a 
"type field" and the descriptor for each of the PROCESS parameters. 


The APCL PROCESS bears little resemblance to the original 
subroutine, but some things do remain similar. The DO J and DO K 
loops are still recognizable; notice that the DO I loop was interchanged 
and vectorized for the ST-100. This was done to get stride-one MM 
accesses; non-one strides on MM moves will not run at full speed unless 
the stride is relatively prime to the number of MM banks (8). The 
KAP/ST-100 tries for stride-one MM accesses when it can, which are 
guaranteed to run at full speed. Note also that the parameter arrays are 
declared as single dimensional arrays with very large bounds (to bypass 
any bounds checking). All array accesses are done via subscript arith- 
metic; the IV variables in the code are used for strength reduction and 
code floating. 
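

The effect of stride on an interleaved memory can be illustrated in Python. The sketch assumes simple low-order interleaving (bank number equals word address modulo the number of banks), which the paper does not state; only the bank count of 8 comes from the text, and the function names are illustrative.

```python
from math import gcd


def banks_touched(stride, banks=8):
    """Number of distinct banks visited by a constant-stride access sequence."""
    return banks // gcd(stride, banks)


def conflict_free(stride, banks=8):
    """True when consecutive accesses cycle through every bank."""
    return gcd(stride, banks) == 1
```

Under this assumption, stride one and every odd stride visit all 8 banks, a stride of 2 visits only 4, and a stride of 8 returns to the same bank on every access.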


In this experiment, the host was an unloaded VAX 11/780 with a 
floating point accelerator running the VMS operating system. The 
times for running SUBROUTINE MATMUL on the VAX alone com- 
pared with the times for running the KAP/ST-100 processed subroutine 
are shown in Figure 4. The CPU times shown are the times charged to 
the batch job for this test; the real times shown are the elapsed times 
for the batch job and the page faults are also from the batch job log. 
Notice that the CPU time is the VAX CPU time; there is no way to 
find the actual time spent in execution on the ST-100, except to infer it 
from the elapsed time for the job. For matrix size of 100x100, the 
ST-100 version already runs at least as fast as the VAX version (com- 
paring VAX CPU time to VAX/ST-100 real time). The ST-100 ver- 
sion gets better as the matrix gets larger. Notice also the page fault 
figure; the VAX has a paged virtual memory and the program begins to 
thrash when the matrix size gets large. This also seriously affects real 
time. The SUBROUTINE MATMUL could be optimized for the VAX 
by interchanging loops to make the array accesses all stride-one. 
When this is done, the VAX CPU times improve by a factor of two, 
and the matrix size at which page thrashing begins goes up to somewhere 
between 300 and 400. Even with that improvement, the ST-100 
version will run several times faster than the VAX version, with that 
ratio increasing as the matrices get larger. 


The ST-100 attached array processor offers programmers a low 
cost high performance computational engine. The KAP/ST-100 gives 
Fortran programmers a convenient method to access this power. 


SUBROUTINE MATMUL (A,B,C,N) 
INTEGER N 
REAL A(N,N),B(N,N),C(N,N) 
DO 100 I = 1,N 
DO 100 J = 1,N 
DO 100 K = 1,N 
100 C(I,J) = C(I,J) + A(I,K) * B(K,J) 
END 


Figure 3(a). Matrix multiply. 


SUBROUTINE MATMUL (A,B,C,N) 
INTEGER N 
REAL A(N,N),B(N,N),C(N,N) 
INTEGER ISCLRS(1), ISCLR0(3), C0(3), B0(3), A0(3) 
CALL KSTSCH (3 * (N * N) + 1) 
ISCLRS(1) = N 
CALL KSTARY (ISCLR0,1,1) 
CALL KSTARY (C0,N * N,2) 
CALL KSTARY (B0,N * N,2) 
CALL KSTARY (A0,N * N,2) 
CALL KSTWR (C,N * N,C0) 
CALL KSTWR (B,N * N,B0) 
CALL KSTWR (A,N * N,A0) 
CALL KSTWRW (ISCLRS,1,ISCLR0) 
CALL KSTLGO ('(MATMU0)',4,2,ISCLR0,3,C0,3,B0,3,A0) 
CALL KSTRDW (C,N * N,C0) 
CALL KSTPRG 
END 


Figure 3(b). Host program. 


PROCESS MATMU0 (ISCLRS, C, B, A) 
MAIN MEMORY 
INTEGER ISCLRS(16384) 
REAL A(16384), B(16384), C(16384) 
LOCAL MEMORY 
INTEGER IV3, IV2, IV1, IV0, CTMP0, K, J, N, LMISCL(1) 
EQUIVALENCE (LMISCL(1), N) 
CACHE MEMORY 
INTEGER (C1, IC1(16384)), (C2, IC2(16384)), (C3, IC3(16384)) 
REAL (C1, C1(16384)), (C2, C2(16384)), (C3, C3(16384)) 
EQUIVALENCE (C1, IC1), (C2, IC2), (C3, IC3) 
CALL STSYNC (000000) 
CALL STRDMA (ISCLRS, LMISCL, 1) 
IV0 = 1 + N 
IV2 = 1 
DO 101 J=1,N 
IV1 = 1 
IV3 = IV2 
DO 100 K=1,N 
CTMP0 = IV0 
CALL SMM2C (A(IV1),1,4,0,C1(1),1,N) 
CALL SMM2C (B(IV3),0,4,0,C2(1),1,1) 
CALL SMM2C (C(IV2),1,4,0,C1(CTMP0),1,N) 
CALL STSYNC (111111) 
CALL AVMUL (C1(1),1,C2(1),1,C3(1),1,N) 
CALL AVADD (C1(CTMP0),1,C3(1),1,C2(CTMP0),1,N) 
CALL STSYNC (000000) 
CALL SMC2M (C(IV2),1,4,0,C2(CTMP0),1,N) 
IV1 = IV1 + N 
IV3 = IV3 + 1 
100 CONTINUE 
IV2 = IV2 + N 
101 CONTINUE 
CALL STWAP 
END 


Figure 3(c). Created process. 
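

The loop ordering of the created process (J outermost, K next, with the I loop turned into a vector operation on one column) can be mimicked in plain Python to confirm that it computes the same result as the original I, J, K nest. This is a sketch of the loop structure only; it uses nested lists instead of linearized arrays and does not model the cache banks or the macro calls.

```python
def matmul_reference(a, b, c0, n):
    """Original I, J, K ordering: C(I,J) = C(I,J) + A(I,K) * B(K,J)."""
    c = [row[:] for row in c0]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_column_vector(a, b, c0, n):
    """J, K ordering with the I loop expressed as whole-column operations."""
    c = [row[:] for row in c0]
    for j in range(n):
        for k in range(n):
            scalar = b[k][j]                       # single element of B
            col_a = [a[i][k] for i in range(n)]    # column K of A (stride one)
            prod = [x * scalar for x in col_a]     # vector multiply
            for i in range(n):                     # vector add into column J of C
                c[i][j] += prod[i]
    return c
```

Each element of C accumulates its terms in the same order of K in both versions, so the two functions return identical values.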


Figure 4. Matrix Multiply Results (VAX 11/780 versus ST-100: CPU time, real time, page faults). 


ABSTRACT 


This paper first classifies controls used in 
concurrent programming languages within 
nondeterministic constructs. These controls 
are classified as private control, consensus 
control and hybrid control. We then 
illustrate the need for new and separate 
control, which we introduce and classify as 
preference control. Implementations of 
preference control using pre-existing 
primitives are illustrated and analyzed with 
respect to software engineering principles. 
The differences between preference control and 
priorities are discussed. Other issues and 
extensions are also discussed. The 
classification of these controls will help 
programmers to better understand and control 
nondeterministic constructs. The addition of 
a preference control primitive to a 
concurrent programming language will increase 
the language’s expressive power if it could 
not previously implement it; otherwise, the 
inclusion of preference control as a 
primitive will simplify the language and 
satisfy software engineering principles. 


Key Words and Phrases: Nondeterminism, 
Concurrency, control classifications, private 
control, consensus control, hybrid control, 
preference control, Ada, CSP, Concurrent C, 
priorities, language design and the 
implementation of the "pref" primitive. 


1. INTRODUCTION 


Many concepts and notations for distributed 
programming languages have been discussed in 
the past [1] [5] [6] [12] [15] [16]. This 
paper suggests and implements the addition of 
"preference control" to distributed 
programming languages with nondeterministic 
constructs. The paper is structured as 
follows: 


Section 2 defines our notations and 
classifications of controls used to restrict 
nondeterminism in concurrent programming 
languages, which we call: private control, 
consensus control and hybrid control. 


Section 3 describes: the need for a 
primitive to implement "preference control" 
to constrain nondeterministic constructs; 
some alternative implementations with the use 
of pre-existing control primitives; and the 
introduction and implementation of the 
"preference control" primitive for further 
control of nondeterminism in concurrent 
distributed programming languages. 


Section 4 then discusses further extensions 
and issues raised with the implementation of 
preference control such as: fairness, 
priorities and conflicting concepts. 


Examples are used from the following 
concurrent programming languages: CSP [8,9], 
Ada [11], Concurrent C [7], and Concurrent 
Prolog [13]. These examples are then analyzed 
with respect to software engineering 
standards [3] [10]. 


2. CONTROL CLASSIFICATIONS AND NOTATIONS 


2.1  NONDETERMINISM 


Pure Nondeterminism: an unconstrained choice 
from a finite number of alternatives. The 
select statement in Ada implements 
nondeterminism. Concurrent C also implements 
nondeterminism using the select statement. 
CSP uses the symbols "[" and "]" to surround 
the nondeterministic choices. Pure 
nondeterministic constructs are not 
sufficiently structured; hence, various 
programming languages have introduced the 
following controls for nondeterminism, which 
we have classified as: 


2.2 PRIVATE CONTROL 


Private control: nondeterminism restricted 
by boolean constraints, which we classify as 
private. Private control is considered open 
if the boolean expression is true; it is 
closed otherwise. Example 1a below 
illustrates private control in CSP: 


Example 1a 


[  B1 ==> S1 
[] B1 ==> S2 
[] B2 ==> S3 
] 


Example 1b - 
illustrates private control in Concurrent C: 


select { 
B1: immediate: S1; 
or 
B1: immediate: S2; 
or 
B2: immediate: S3; 
} 


Example 1c below illustrates how private 
control can be indirectly implemented in Ada 
(although private control was not meant to 
be allowed in Ada, it can be implemented 
using the delay construct with 0.0 as the 
time parameter). 


Example 1c 


select 
when B1 => delay 0.0; S1; 
or 
when B1 => delay 0.0; S2; 
or 
when B2 => delay 0.0; S3; 
end select; 


In these examples, statements S1 and S2 can 
only be chosen if the boolean expression B1 
is true, and S3 can only be chosen if B2 is 
true. The choice is nondeterministic, yet 
restricted by the private control. 


2.3 CONSENSUS CONTROL 


Consensus control: nondeterminism restricted 
by environmental/communication constraints, 
which we classify as consensus. The choices 
are restricted to only those for which the 
communication is available from the rest of 
the system. Consensus control is considered 
established if the rendezvous can be 
established. Example 2a illustrates 
consensus control in CSP: 


Example 2a 


[  P1?x ==> S1 
[] P1!y ==> S2 
[] P2?x ==> S3 
] 


Example 2b - 
illustrates consensus control in Concurrent C: 


select { 
accept entry1(x); S1; 
or 
accept entry2(y); S2; 
or 
accept entry3(z); S3; 
} 


Example 2c - 
illustrates consensus control in Ada: 


select 
accept entry1(x:type_x) do S1; 
or 
accept entry2(y:type_y) do S2; 
or 
accept entry3(z:type_z) do S3; 
end select; 


In example 2a, S1 can only be chosen if 
process P1 is sending a value for x; S2 can 
only be chosen if process P1 is ready to 
receive a value y; and S3 can only be chosen 
if P2 is sending a value for x. In examples 
2b and 2c, S1 will only be executed if 
another task is calling entry1 and 
sending the value x. Likewise, S2 will only 
be executed if another task is calling for 
entry2 and sending a value of type y. The 
choice is made nondeterministically, yet it 
is restricted by the consensus control. 


2.4 HYBRID CONTROL 


Hybrid control: nondeterminism restricted by 
both boolean and environmental constraints 
(private control and consensus control), 
which we classify as hybrid control. Hybrid 
control is available if private control is 
open and consensus control is established. 


Example 3a - 
illustrates hybrid control in CSP: 


[  B1, P1?x ==> S1 
[] B2, P1!y ==> S2 
[] B2, P1?z ==> S3 
[] B1, P2!z ==> S4 
] 


In example 3a: S1 can only be chosen if B1 is 
true and P1 is sending a value for x; S2 can 
only be chosen if B2 is true and P1 is ready 
to receive a value y; S3 can only be chosen 
if B2 is true and P1 is sending a value for 
z; and S4 can only be chosen if B1 is true 
and P2 is ready to receive a value z. The 
choice is made nondeterministically, yet it 
is restricted by private control and 
consensus control. 


Example 3b - 
also illustrates hybrid control, in Ada: 


select 
when B1 => accept entry1(); S1; end entry1; 
or 
when B2 => accept entry2(); S2; end entry2; 
or 
when B2 => accept entry1(); S3; end entry1; 
or 
when B1 => accept entry2(); S4; end entry2; 
end select; 


Example 3c - 
illustrates hybrid control in Concurrent C: 
select { 
B1 : accept entry1(); S1; 
or 
B2 : accept entry2(); S2; 
or 
B2 : accept entry1(); S3; 
or 
B1 : accept entry2(); S4; 
} 


In examples 3b and 3c: S1 can be chosen if 
B1 is true and if entry1 is being called; S2 
can be chosen if B2 is true and entry2 is being 
called; and so on. Again, the choice is 
made nondeterministically, yet controlled by 
private and consensus controls. 


An alternative is ready if all of the 
controls have been satisfied; that is: if the 
private control is open and/or if the 
consensus control is established or if the 
hybrid control is available. The 
nondeterministic construct will check all of 
the controls in each alternative and then 
choose one of the ready alternatives. The 
programmer can include one, all or any 
combination of the controls allowed within 
the language’s nondeterministic structure to 
attain the desired expressive programming 
power. 
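

The notion of a ready alternative can be made concrete with a small Python model. It is a sketch under assumptions: an alternative is a dictionary with an optional boolean `guard` (private control), an optional `entry` name (consensus control) and a `body` label; `pending` maps entry names to the number of waiting callers; and the nondeterministic choice is modeled with a random pick. None of these names come from Ada, CSP or Concurrent C.

```python
import random


def is_ready(alt, pending):
    """An alternative is ready when its guard is open and its entry is callable."""
    guard_open = alt.get("guard", True)             # no guard counts as open
    entry = alt.get("entry")
    established = entry is None or pending.get(entry, 0) > 0
    return guard_open and established


def select(alternatives, pending, rng=random):
    """Pick one ready alternative at random; None if nothing is ready."""
    ready = [alt for alt in alternatives if is_ready(alt, pending)]
    return rng.choice(ready)["body"] if ready else None
```

With the four alternatives of example 3b, B1 true, B2 false, and a caller waiting only on entry1, the model leaves S1 as the single ready alternative.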


3. PREFERENCE CONTROL 


We will classify preference control as 
nondeterminism restricted by preferences; 
preference control gives the programmer the 
power to assign preferences to the choices 
within the nondeterministic construct. Each 
entry inside a nondeterministic construct may 
(but need not) have a preference. A lower 
value indicates a lower degree of urgency; 
the range of preferences is implementation 
defined. For example, a ready entry with a 
preference of 2 is chosen before a ready entry 
with a preference of 1. 
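

The selection rule can be sketched in Python as follows. The representation of an alternative as a `(ready, pref, body)` tuple is an assumption made for the example, as is the treatment of an alternative without a preference (here it ranks below every numbered one, which the text does not specify). Ties among equally preferred ready alternatives are resolved with a random pick to keep the choice nondeterministic.

```python
import random


def select_by_preference(alternatives, rng=random):
    """Choose a ready alternative with the highest preference.

    `alternatives` is a list of (ready, pref, body) tuples; `pref` may be
    None for an alternative that carries no preference.
    """
    ready = [(pref if pref is not None else float("-inf"), body)
             for is_ready, pref, body in alternatives if is_ready]
    if not ready:
        return None
    best = max(pref for pref, _ in ready)
    return rng.choice([body for pref, body in ready if pref == best])
```

An alternative with a high preference that is not ready never blocks a ready alternative with a lower preference; preference only orders the alternatives that have already passed the other controls.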


3.1 OTHER EXISTING MECHANISMS FOR 
CONTROLLING NONDETERMINISM: 


Many languages contain primitives for the 
nondeterministic construct with private 
control, consensus control and/or hybrid 
control; but none contain the primitive for 
what we will classify as "preference 
control." The "by" construct was introduced 
by Andrews [2] and Concurrent C [7] later 
implemented the "by" construct and the "such 
that" construct. These constructs allow the 
entry to selectively choose an entry call 
from the entry queue instead of receiving the 
entry calls from the queue in FIFO order, 
which is the default. The "by" and the "such 
that" constructs differ from preference 
control in that they control choices from 
within an entry’s queue; "pref" controls the 
choice of alternatives within the select 
statement. The "cell" concept [14] allows 
explicit preference control amongst the 
entries within the select statement by 
attaching labels to entries and ordering the 
labels statically. Some languages can use 
other primitives to simulate preference 
control, but we will argue that these 
implementations are very complicated and 
unacceptable by software engineering 
standards. Preferences should allow some 
choices, within a nondeterministic construct, 
to be chosen before others. Concurrent C and 
Ada have implicitly defined preferences built 
into the language; in Concurrent C, an 
available accept alternative is chosen before 
an open immediate alternative. None of these 
languages contain a primitive to control 
preferences explicitly; this led us to the 
implementation of the "pref" primitive, which 
is necessary for preference control. Take 
for example, the resource manager that 
continuously releases and regains resources. 
Since the manager must have a resource in 
order to release it, a wise manager should 
give preference to the regaining of 
resources. The availability of the resource 
can be maintained by private control; the 
requests to release or return a resource can 
be checked by consensus control; but no 
primitive exists to express preference 
control, which is necessary in this case to 
give preference to the return of a resource. 
Therefore, we are forced to use other 
primitives to implement preference control, 
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which we feel should be included as a 
primitive. These implementations require 
some programming effort and result in 
implementations that do not support software 
engineering principles. Some programming 
languages implement priorities; priorities 
will be discussed in a later section, since 
they are not part of what we have classified 
as controls for nondeterminism within unique 
tasks. Preference control can also be used 
to implement "fairness" and avoid 
process/task starvation; this will also be 
covered in section 4.1. 


3.2 IMPLEMENTATION OF PREFERENCES WITH 
EXISTING PRIMITIVES 


Suppose we have a process/task that is
required to perform one of four services
(service1, service2, service3 and service4),
and that we wish to give the services
preferences; that is, we wish to perform
service4 if possible, else service3, else
service2, with service1 the least preferred.
Services 1 - 4 are all different. This
cannot be implemented in CSP within the
nondeterministic construct. It can be
simulated in Ada in a couple of ways. (You
might try to come up with some solutions
yourself before reading the following known
solutions.)


1. Using low level system defined queues


Example 4 

task CONTROLLER is
   entry SERVICE4(D:DATA); -- very urgent
   entry SERVICE3(D:DATA); -- urgent
   entry SERVICE2(D:DATA); -- medium urgency
   entry SERVICE1(D:DATA); -- low urgency
end CONTROLLER;


task body CONTROLLER is
begin
   select
      when B4 =>
         accept SERVICE4(D:DATA) do
            ACTION4(D);
         end SERVICE4;
   or
      when B3 and (not(B4) or SERVICE4'COUNT = 0) =>
         accept SERVICE3(D:DATA) do
            ACTION3(D);
         end SERVICE3;
   or
      when B2 and (not(B4) or SERVICE4'COUNT = 0)
              and (not(B3) or SERVICE3'COUNT = 0) =>
         accept SERVICE2(D:DATA) do
            ACTION2(D);
         end SERVICE2;
   or
      when B1 and (not(B4) or SERVICE4'COUNT = 0)
              and (not(B3) or SERVICE3'COUNT = 0)
              and (not(B2) or SERVICE2'COUNT = 0) =>
         accept SERVICE1(D:DATA) do
            ACTION1(D);
         end SERVICE1;
   end select;
end CONTROLLER;
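
The guards of Example 4 can be read as one rule: a service is open when its own condition holds and no more-preferred service is both enabled and has a caller waiting. The following Python sketch only illustrates that rule; it is not Ada, and the function name and the dictionaries standing in for the B conditions and the COUNT attribute are assumptions made for the illustration.

```python
def open_alternatives(b, count):
    """Return the services whose Example 4 guards are true.

    b[k]     -- private condition Bk for service k (1..4)
    count[k] -- callers queued on SERVICEk (stand-in for SERVICEk'COUNT)
    """
    open_set = []
    for k in (4, 3, 2, 1):
        # "not(Bj) or SERVICEj'COUNT = 0" for every j > k is the same as
        # "no j > k is enabled and has a caller waiting".
        blocked = any(b[j] and count[j] > 0 for j in range(k + 1, 5))
        if b[k] and not blocked:
            open_set.append(k)
    return open_set


all_true = {1: True, 2: True, 3: True, 4: True}
# A caller waits on SERVICE3 and SERVICE1, none on SERVICE4 or SERVICE2:
print(open_alternatives(all_true, {1: 2, 2: 0, 3: 1, 4: 0}))
```

Here SERVICE4 and SERVICE3 stay open while SERVICE2 and SERVICE1 are closed by the waiting SERVICE3 caller, which is the preference ordering the guards are meant to encode.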


1. This implementation is not very
readable; it is not clear why the
queues for service4, service3 and
service2 are being checked if service1
is to be accepted.


2. All of those checks on the queues are
expensive and make the implementation
very complex.


3. Using the system defined low level
queues forces the programmer to use low
level commands, which ideally should be
hidden from the user and are not
desirable in a high level language like
Ada.


4. This implementation requires a lot of
modification if the user wishes to
change the preferences or add entries
with different preferences.


5. This implementation uses a "loop hole"
in Ada (i.e. the COUNT attribute) to
implement consensus control using
private control. The use of the COUNT
attribute as a condition is
syntactically private control, and yet
semantically consensus control. This
inconsistency adds ambiguity to this
problem.


6. This implementation is dangerous
because it can cause the raising of an
unnecessary exception. The semantics
of the select statement specify that if
all conditions (the private control)
are false and there is no else
alternative or terminate alternative
included, then an exception should be
raised. This rule was probably made
assuming that the conditions covered by
private control were static and could
not change at a later time. But you
can see by this example that if the
boolean conditions are true, but the
queues are empty, an unnecessary
exception will be raised, although
entry calls may be in progress. This
faulty behavior can be caused by the
inconsistent use of private control
instead of consensus control (as
described in the previous item).


2. Using nested select statements


Example 5


task CONTROLLER is
   entry SERVICE4(D:DATA);
   entry SERVICE3(D:DATA);
   entry SERVICE2(D:DATA);
   entry SERVICE1(D:DATA);
end CONTROLLER;


task body CONTROLLER is
   DONE : BOOLEAN := FALSE;
begin
   while not DONE loop
      select
         when not(DONE) and B4 =>
            accept SERVICE4(D:DATA) do
               ACTION4(D);
            end SERVICE4;
            DONE := TRUE;
      else
         select
            when not(DONE) and (not(B4) and B3) =>
               accept SERVICE3(D:DATA) do
                  ACTION3(D);
               end SERVICE3;
               DONE := TRUE;
         else
            select
               when not(DONE) and (not(B4)
                     and not(B3) and B2) =>
                  accept SERVICE2(D:DATA) do
                     ACTION2(D);
                  end SERVICE2;
                  DONE := TRUE;
            else
               select
                  when not(DONE) and
                        (not(B4) and not(B3)
                        and not(B2) and B1) =>
                     accept SERVICE1(D:DATA) do
                        ACTION1(D);
                     end SERVICE1;
                     DONE := TRUE;
               else
                  null;
               end select;
            end select;
         end select;
      end select;
   end loop;
end CONTROLLER;

(NOTE: Could not use array/family of entries
since all four entries are different)


1. This implementation is ambiguous; it is 
not clear why all of the nesting is 
necessary or how the nondeterminism 
will work in this case. 


2. The nesting is very complex and will 
require further nesting if entries of 
different preferences were to be added! 


3. This example requires an extra boolean 
variable (private control) and a loop 
(although the repetitive behavior of a 
loop is not desired) structure to make 
sure one and only one entry is 
accepted; this is extremely inefficient 
and confusing to a reader. 


4. This example also implements "Busy 
Waiting", which is very expensive and 
inefficient. 


5. This implementation is also very
difficult to modify; changing the
preferences of the entries would
require renesting the entire select
statement.


Both of these simulations are ambiguous and
do not support software engineering
standards. The need is then evident for a
primitive to implement preference control
within a nondeterministic construct.


3.3 INTRODUCTION TO THE PREFERENCE CONTROL
PRIMITIVE CONSTRUCT:


We propose the syntax: "pref -> <constant>"
to assign a static preference control value
within a nondeterministic construct. Dynamic
preference control, in which we could pass
"pref" the value of an expression, is much
more powerful and expensive (and out of the
scope of this paper). Static preference
control is simple, easy to implement, and can
be evaluated at compile-time. For the Ada
nondeterministic construct, we suggest the
following syntax for the select alternative:


Example 6


Syntax for the select alternative:


pref -> <constant>: when <condition> => select_alternative


Example 7


selective_wait ::=
   select
      [ pref -> constant : ]
      [ when condition => ]
         select_alternative
   { or
      [ pref -> constant : ]
      [ when condition => ]
         select_alternative }
   [ else
      sequence of statements ]
   end select;


select_alternative ::=
     accept_statement [ sequence of statements ]
   | delay_statement [ sequence of statements ]
   | terminate;


All of the nondeterminism constraints are
listed before the entry call: first, the
preference control (pref -> constant);
followed by the private control (when
<boolean expression>); and finally, the
consensus control (=> accept <entry>). (We
considered placing the preference control
declaration within the task entries
declaration, but concluded it belonged within
the select statement, with the other
nondeterminism controls.) If a preference is
not specified, it should default to the
lowest value (i.e. if negative values for
preferences are not allowed, then default to
0). The same preference value can be
assigned to different entries within the same
select statement to allow a greater amount of
nondeterminism.


For example, using the nondeterministic
construct "select" in Ada, combined with our
proposed preference control construct, the
problem described above can be solved by the
following:


Example 8


task CONTROLLER is
   entry SERVICE4(D:DATA); -- very urgent
   entry SERVICE3(D:DATA); -- urgent
   entry SERVICE2(D:DATA); -- medium urgency
   entry SERVICE1(D:DATA); -- low urgency
end CONTROLLER;


task body CONTROLLER is
begin
   select
      pref -> 4: when B4 => accept SERVICE4(D:DATA) do
                    ACTION4(D); end SERVICE4;
   or
      pref -> 3: when B3 => accept SERVICE3(D:DATA) do
                    ACTION3(D); end SERVICE3;
   or
      pref -> 2: when B2 => accept SERVICE2(D:DATA) do
                    ACTION2(D); end SERVICE2;
   or
      pref -> 1: when B1 => accept SERVICE1(D:DATA) do
                    ACTION1(D); end SERVICE1;
   end select;
end CONTROLLER;


Semantics of preference control construct:
Select one of the available alternatives with
the highest pref value. More than one entry
can have the same pref value; this will
increase the nondeterminism of the choice.


1. This example is much more readable than
the previous two and is much simpler
and easier to understand.


2. Since nested select statements, loops
and low level queues are not needed,
this is also a more efficient
implementation that supports software
engineering standards.


3. Different entries could have the same
preference and possibly differ in the
private control or consensus control.


4. This construct adds expressive power to
languages, like CSP, that cannot
otherwise implement preferences.


5. A smart compiler could save some
unnecessary evaluations by first
evaluating the entries with the highest
preferences.


6. This implementation is easy to modify;
changes of preferences or the addition
of entries would require minimal change
of code.


7. This implementation does not contain
busy waiting; nor does it include the
danger of causing an exception
to be raised unnecessarily, because it
makes a clear distinction between
private control, consensus control and
preference control.
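
The selection rule stated above (choose among the available alternatives with the highest pref value, leaving ties nondeterministic) can be sketched in a few lines. This Python fragment is only an illustration of the rule, not an implementation of the proposed Ada construct; the tuple layout and the function name are assumptions, and "available" is taken to mean that the guard is true and a caller is waiting.

```python
import random


def select_alternative(alternatives, rng=random):
    """Pick one available alternative with the highest pref value.

    Each alternative is (pref, guard_is_true, caller_is_waiting, name).
    Returns None when nothing is available; a real task would wait
    instead of returning.
    """
    available = [a for a in alternatives if a[1] and a[2]]
    if not available:
        return None
    best = max(a[0] for a in available)
    # Alternatives sharing the highest pref value remain a free choice.
    return rng.choice([a[3] for a in available if a[0] == best])


alternatives = [
    (4, True, False, "SERVICE4"),   # open, but nobody is calling
    (3, True, True, "SERVICE3"),
    (2, True, True, "SERVICE2"),
    (1, False, True, "SERVICE1"),   # guard B1 is false
]
print(select_alternative(alternatives))
```

With these inputs SERVICE3 is chosen, since SERVICE4 has no caller. Giving two available entries the same pref value turns the final step back into a nondeterministic choice between them.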


The preference control primitive can directly
be used to give an elegant solution to the
resource manager problem by simply giving a
higher preference to the return (regaining)
of resources. It should be noted that assigning
preferences to the entries sequentially as
they are ordered within the select statement
is not a valid implementation of preference
control, since the "pref" value of each entry
would be unique; preference control requires
that more than one entry can have the same
"pref" value.


4. EXTENSIONS & ISSUES OF PREFERENCE
CONTROL:


4.1 FURTHER USE OF PREFERENCE CONTROL TO
IMPLEMENT FAIRNESS:


We propose extending the implementation of
task types in Ada to allow different
preferences to be specified for each task of
the type. A task type could be defined once
(possibly with default preferences) and
later, when tasks of this type are declared
dynamically, the preferences could be listed
as a parameter to the declaration. We could,
for example, define


Example 9


task type CONTROLLERS( prefs -> a,b,c,d ) is
   entry entry1 ...
   entry entry2 ...
   entry entry3 ...
   entry entry4 ...
end CONTROLLERS;


task body CONTROLLERS( prefs -> a,b,c,d ) is
   select
      pref -> a: ... accept entry1 ...
   or
      pref -> b: ... accept entry2 ...
   or
      pref -> c: ... accept entry3 ...
   or
      pref -> d: ... accept entry4 ...
   end select;
end CONTROLLERS;


And then declare:

CONTROLLER1 : CONTROLLERS(1,2,3,4);
CONTROLLER2 : CONTROLLERS(4,3,2,1);
CONTROLLER3 : CONTROLLERS(2,3,3,1);

CONTROLLER1 is of type CONTROLLERS and was
assigned a preference of 1 to entry1, a
preference of 2 to entry2, a preference of 3
to entry3 and a preference of 4 to entry4.
Conversely, CONTROLLER2 was assigned a
preference of 4 to entry1, a preference of 3
to entry2, a preference of 2 to entry3 and a
preference of 1 to entry4. CONTROLLER3
illustrates that more than one entry can have
the same pref value. Task types with
preference control parameters would allow
the programmer to assign different
preferences to different instances of the
same task without having to rewrite a copy of
the task. The use of preferences can help
avoid starvation of processes or tasks and
could then implement fairness. Concurrent
Prolog [13] implements fairness using the
"stable machine", but the stable machine does
not appear to implement nondeterminism as
needed in our examples.
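
The effect of parameterizing one task type with different preferences can be illustrated by a small model. This Python sketch is an analogy only: the class, its method and the set of pending entry calls are assumptions introduced here, while the three preference vectors are those of Example 9.

```python
class Controllers:
    """Toy model of the task type CONTROLLERS(prefs -> a,b,c,d)."""

    ENTRIES = ("entry1", "entry2", "entry3", "entry4")

    def __init__(self, a, b, c, d):
        # One preference per entry, fixed when the instance is declared.
        self.pref = dict(zip(self.ENTRIES, (a, b, c, d)))

    def candidates(self, pending):
        """Return the pending entries that share the highest preference."""
        waiting = [e for e in self.ENTRIES if e in pending]
        if not waiting:
            return []
        best = max(self.pref[e] for e in waiting)
        return [e for e in waiting if self.pref[e] == best]


controller1 = Controllers(1, 2, 3, 4)
controller2 = Controllers(4, 3, 2, 1)
controller3 = Controllers(2, 3, 3, 1)

pending = {"entry1", "entry2", "entry3"}
for c in (controller1, controller2, controller3):
    print(c.candidates(pending))
```

For the same pending calls the three instances favour entry3, entry1, and either of entry2 or entry3 respectively, without the task body being rewritten.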


4.2 PRIORITIES VERSUS PREFERENCE CONTROL:


We distinguish between preference control and
priority control. Priority refers to
processes or actions to be chosen at the
system level; preferences are defined at the
user level. Priorities cannot be considered
to be included in our classification of
controls restricting nondeterminism within a
task. Priority is a function from a set of
tasks to a set of values, whereas preference
is a function from a set of entries within a
task to a set of values. Priority controls
the race among different tasks at the system
level. Preferences control the race among
entry calls within a task at the user level.
Giving a process or task a priority would not
solve the example above, since the same
process can call the manager to request
or release a resource. Preferences give
control to the server, whereas priorities
give priority to the clients. Priority
control is implemented in both Ada and
Concurrent C.


In Ada, for example, the user can explicitly 
define the priority of a task using the 
"PRIORITY pragma", which is a 
suggestion/pragma to the compiler. 


"Each task may be assigned a priority that 
overrides the default priority assigned to a 
task by the implementation. Tasks can be 
assigned a priority by using the PRIORITY 
pragma which is of the form: pragma 
PRIORITY(P); and is included in the 
specification of the task. P is a static 
expression of the implementation defined 
integer subtype PRIORITY. The higher the 
value of P, the higher the priority of the 
task." 


Like Ada, Concurrent C implements process
priority assignment. Concurrent C also
implements a different version of priorities,
using the "BY" construct, which allows the
user to define/prioritize the tasks that
could be served by an individual entry. The
"BY" construct is considered a priority
control because it controls the race among
different tasks. Concurrent C controls this
at the user level as an additional language
construct within a nondeterministic construct,
not at the system level as Ada does. The "BY"
construct is not powerful enough to solve the
resource manager problem, since we need to
prefer one entry over other entries, not one
task over another at the entry level.


Concurrent Prolog defines a stable machine
as: "A Concurrent Prolog machine is
'stable' if it always chooses the first
unifiable clause for reduction, if several
such clauses exist." Concurrent Prolog
implements implicit preference control, but
it does not do so within a nondeterministic
construct. Therefore, we cannot classify
Concurrent Prolog's stable machine as a
priority, nor as preference control, because
it does not control a nondeterministic
construct. Preferences are implicitly
defined (and forced) by the ordering of the
statements. Similar to preference control,
the stable machine in Concurrent Prolog can
implement fairness and decrease the chance of
process/task starvation.


4.3 NEW ISSUES RAISED BY PREFERENCE 
CONTROL : 


The additional expressive power of 
implementing preferences may bring about 
other difficulties such as: difficulty of 
implementation; and difficulty of use. 
Consider the following program in CSP with 
the addition of preferences: 


Example 10

[ P1 || P2 ]

P1 :: [
   [] pref->2, P2?x ==> S1;
   [] pref->1, P2!0 ==> S2;
]

P2 :: [
   [] pref->2, P1?y ==> S3;
   [] pref->1, P1!0 ==> S4;
]

The above conflict arises when P1 and P2 are
executed in parallel and are in essence
"deadlocked"; this new type of error can be
introduced by preferences. The programming
methodologies of Communication Closed Layers
[5] for constructing well-structured
distributed programs may help avoid such
errors, but cannot ensure their discovery or
elimination before run time. Ada and
Concurrent C avoid this conflict by the
asymmetric use of the accept alternatives and
the call alternatives within separate select
alternatives.
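
The conflict in Example 10 can be reproduced with a small model. The sketch below is an illustration under a stated assumption: each process offers only its highest-preference alternative when preferences are enforced. That reading, the tuple encoding and the function names are introduced here and are not taken from CSP itself.

```python
# Each alternative is (pref, partner, direction); "?" is input, "!" is output.
P1 = [(2, "P2", "?"), (1, "P2", "!")]
P2 = [(2, "P1", "?"), (1, "P1", "!")]


def matches(a, b):
    """A communication needs an input on one side and an output on the other."""
    return {a[2], b[2]} == {"?", "!"}


def possible_exchanges(p, q, strict_pref):
    """List the pairs of alternatives that could communicate."""

    def offered(alts):
        if not strict_pref:
            return alts
        top = max(a[0] for a in alts)
        # Assumption: only the most preferred alternatives are offered.
        return [a for a in alts if a[0] == top]

    return [(a, b) for a in offered(p) for b in offered(q) if matches(a, b)]


print(len(possible_exchanges(P1, P2, strict_pref=False)))
print(possible_exchanges(P1, P2, strict_pref=True))
```

Without preferences two exchanges are possible. With strict preferences both processes insist on their input alternative, no pair matches, and the list is empty, which is the deadlock described above.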


These issues, and others that may arise from 
the combination of preference control and 
different language constructs, are out of the 
scope of this paper. We simply wish to 
introduce the concept of "preference control" 
restricting nondeterminism for concurrent 
programming languages. The actual 
implementation of preference control will 
have to vary for each individual programming 
language, depending on the language’s 
programming philosophies and methodologies. 


5. CONCLUSION: 


Nondeterminism is a central concept and issue
in modern concurrent programming languages
that exploit explicit parallelism.
Controlling nondeterminism should therefore
be an important issue in distributed
programming language design. In this paper,
we first classified the currently available
primitives that can be used for controlling
nondeterminism: private control, consensus
control and hybrid control. We then
introduced and suggested the addition of a
new language independent primitive to control
nondeterminism, which we call preference
control. Our examples have illustrated the
expressive programming power of controlling
nondeterminism with the above mentioned
controls. We strongly suggest the use of
preference control in real time applications,
where it is urgently needed.


Abstract: This paper presents the application
of the refined-language methodology to
ANSI FORTRAN 77. The resulting language,
RF77, permits:


• Users to write code which differs from
standard ANSI FORTRAN 77 code only in
that data access rights are more precisely
specified


• Compilers, using well-known flow-analysis
techniques, to generate consistently good,
highly-parallel, race-free code for virtually
any machine architecture.


In modifying ANSI FORTRAN 77, our goal is 
not merely to find parallelism where none was 
envisioned by the programmer, but to provide a 
more general way of expressing algorithms for 
parallel computers of any type (MIMD, VLIW, 
SIMD, etc.). 


Introduction 


In our paper, “Refined C: A Sequential
Language For Parallel Programming,” which
appeared in the proceedings of ICPP 1985 [Die
85], we presented both a definition of the
language RC¹ and a discussion of what the
refined-language methodology is. We also
claimed that the methodology could be applied
to other languages, and everyone asked “Can
you do it with FORTRAN?”


In this paper, we present both a review of
the problems encountered when flow analysis is
applied to unmodified ANSI FORTRAN 77
(toward concurrency-detection), and the
specification of RF77: refined ANSI FORTRAN 77.


As might be expected, the substantially
different natures of C and FORTRAN are reflected
in significantly different-looking refinements
being made to FORTRAN as compared to C.
However, the methodology remains intact.


1 The syntax of RC has been slightly altered to
better conform to the draft specification of ANSI C. A
revised definition of RC is available from the authors
upon request.


0190-3918/86/0000/0184 $01.00 © 1986 IEEE


The Methodology


The refined-language approach begins with 
a conventional HLL — nearly any HLL: in this 
paper, ANSI FORTRAN 77. We have chosen 
this particular dialect of FORTRAN because it is 
a superset of most dialects in popular use, yet its 
specification is readily available. Because the 
ANSI FORTRAN 8x specification includes vec- 
tor operations, it might seem, at first, that we are 
attempting to “re-invent the wheel” — but vector 
operations do not make generation of good 
MIMD code significantly easier [Die 86]. In 
addition, the refinements made to ANSI FOR- 
TRAN 77 are compatible with the vector nota- 
tion of ANSI FORTRAN 8x; hence, RF77 can 
be easily transformed into RF8x. 


Since ANSI FORTRAN 77 contains no
explicit parallelism or synchronization
constructs, it is naturally impossible to write a race
condition in the language. Likewise, a
flow-analyzing compiler re-structuring code into a
parallel form by using only correctness-preserving
transformations will be incapable of
generating a race condition. Therefore, if such a
compiler is given pure ANSI FORTRAN 77
code, the programmer is guaranteed that the
parallelized program will produce the same
result as the sequential program: the program
will be debuggable. Perhaps even more important,
using a compiler with a “back-end”
appropriate for each parallel or sequential
machine, the program will be completely portable.


Unfortunately, the amount of useful parallelism
found by a flow-analyzing compiler examining
a typical (unrefined) ANSI FORTRAN
77 program is not necessarily all (or even a large
fraction of all) that is present in the program
[Kuc 72]. This discrepancy is caused by certain
language constructs obscuring (from the
compiler’s flow analysis) the fact that some
operations could be parallelized. In refining
ANSI FORTRAN 77, constructs which obscure
the needed information are removed from the
language and are replaced by modified versions
which do not inhibit the analysis. These
replacements can be made to look much like the
original constructs and can provide all, or nearly all,
of their expressive power. All the other language
constructs remain as they were.


The resulting language, RF77, looks and 
“feels” like ANSI FORTRAN 77, but, unlike the 
latter, can be compiled into reasonably efficient 
race-free code for any kind of machine, parallel 
or sequential (assuming a compiler has been con- 
structed for the machine in question). 


What’s Wrong With FORTRAN? 


For the purpose of automatically generating 
parallel code, it is convenient that typical (unre- 
fined) FORTRAN programs are far simpler and 
more static than programs written in most other 
languages. Programs are simpler because FOR- 
TRAN is a relatively spartan language — there 
are not as many different ways to say the same 
thing as in most other languages. Programs are 
more static in that more information than usual 
about the run-time behavior of a program can be 
determined at compile-time. For example, the 
amount of data space required by a typical 
FORTRAN program is known at compile time; 
in most languages, the ability to perform recur- 
sive calls and to dynamically request chunks of 
memory makes determining the run-time size 
equivalent to solving the halting problem. 


In these respects, FORTRAN, in any of its 
major dialects, is an ideal language for a com- 
piler to analyze. This fact is evidenced by the 
multitude of parallelizing compilers for dusty 
deck FORTRAN [All 82] [Fis] [Kuc 84] [Nic 85] 
[KAI 85] and the lack of parallelizing compilers 
for almost any other language. 


Although the simple, static nature of FORTRAN
programs aids the compiler in its
analysis, there are FORTRAN constructs which
severely impede analysis for automatic
parallelism-detection. Any construct which blurs
the compiler’s picture of which data items might
be accessed by any particular reference will
result in the compiler making a “safe” assumption.
For example, in:


C For this and the following examples, assume
C that labels 10 and 20 are never referenced;
C they appear only to relate each example to
C the discussion in the text.
10 A=B*C
20 D=E*F


it is obvious that the statements labeled 10 and 
20 can be executed either sequentially or in 
parallel, but if and only if the answers to the fol- 
lowing questions are “No”: 


(1) Is B or C an alias for D? Is E or F an alias 
for A? 
(2) Are A and D aliases for each other? 


If any part of (1) was answered “Yes,” executing
statements 10 and 20 in parallel could produce
an incorrect result: a write/read race
condition would exist. If (2) was answered
“Yes,” executing statements 10 and 20 in parallel
would produce unreliable results: a
write/write race condition would exist.² If static
analysis could not answer either (1) or (2), or if
either answer was “Sometimes Yes,” the only
“safe” assumption is that parallel code for
statements 10 and 20 cannot be generated.
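
The two questions amount to a test on the storage each statement reads and writes. The Python sketch below is a simplified illustration, not a description of any particular compiler: aliasing is modelled by a hypothetical table that maps variable names to storage locations, and the function name is an assumption.

```python
def race_conditions(stmt1, stmt2, storage):
    """Return the kinds of race that parallel execution could cause.

    A statement is (target, operands); storage maps each name to a
    location, so two names are aliases when they share a location.
    """
    target1, operands1 = stmt1
    target2, operands2 = stmt2
    write1, write2 = storage[target1], storage[target2]
    reads1 = {storage[v] for v in operands1}
    reads2 = {storage[v] for v in operands2}

    found = set()
    if write1 in reads2 or write2 in reads1:   # question (1)
        found.add("write/read")
    if write1 == write2:                       # question (2)
        found.add("write/write")
    return found


s10 = ("A", ("B", "C"))   # 10 A=B*C
s20 = ("D", ("E", "F"))   # 20 D=E*F

distinct = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5}
d_aliases_b = dict(distinct, D=1)
d_aliases_a = dict(distinct, D=0)

print(race_conditions(s10, s20, distinct))
print(race_conditions(s10, s20, d_aliases_b))
print(race_conditions(s10, s20, d_aliases_a))
```

An empty result means the statements may run in parallel. When the table itself is unknown at compile time, the compiler has to behave as if a race had been found.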


The previous example is somewhat con- 
trived, because it is (usually) trivial for a flow- 
analyzing compiler to determine the answers to 
each of the above questions — any of the 
“dusty-deck” parallelizing compilers mentioned 
above could answer them. However, we will use 
minor variations on this example to demonstrate 
the analysis problems caused by each of the 
ANSI FORTRAN 77 constructs which must be 
refined. 


References to Global Data (COMMONS) 


Suppose that A, B, and C are all defined as
independent variables which reside in a COMMON
and that the statement labeled 20 is
changed so that we now have:


10 A=B*C 
20 CALL SUBR 


2 It is interesting to note that, in general, if an 
answer is known to be ‘Yes’ then a sequential code 
optimization is possible. For example, if A and D are 
aliases for each other, statement 10 is a dead computa- 
tion and can be eliminated. 


Also assume that SUBR is a SUBROUTINE
which is defined in another file. The questions
the compiler must answer are essentially the
same:


(1) Does SUBR (or any subprogram invoked
by SUBR) contain any stores into B or C,
or any loads of A?

(2) Does SUBR (or any subprogram invoked
by SUBR) contain any stores into A?


In order for the compiler to generate code which
can safely execute statements 10 and 20 in parallel,
the answers to both questions must be
known to be “No.”


The time complexity of some flow-analysis
techniques used to attempt to answer these questions
forces flow-analysis to be localized to small
regions of a program (for example, a SUBROUTINE
or FUNCTION at a time); analysis of
larger regions would take an unacceptably long
time. Since the above example may require this
analysis to be performed on the entire program,³
the compiler would probably be unable to
answer these questions. This, in turn, would
force the compiler to make the “safe” assumption
that every SUBROUTINE or FUNCTION
call might affect every variable that appears in
any COMMON. Sequential code results.


A less obvious, but very similar, situation
exists relative to the use of I/O channels by
subprograms. The I/O channel numbers act like
global variables. Consider:


10 WRITE(6,*) A
20 CALL SUBR


It is impossible to tell if SUBR uses I/O channel 
6 without looking, at the very least, at the code 
for SUBR. 
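
The “safe” assumption for calls can be written down directly. In this Python sketch the set of globals, the optional table of known effects and the function names are all assumptions made for the illustration; I/O unit 6 is modelled as one more global name.

```python
GLOBALS = {"A", "B", "C", "UNIT6"}   # COMMON variables plus I/O unit 6


def assumed_effects(subprogram, known=None):
    """Return the (reads, writes) a compiler must assume for a CALL."""
    if known is not None and subprogram in known:
        return known[subprogram]
    # Nothing is known about the callee: it may touch every global.
    return set(GLOBALS), set(GLOBALS)


def may_run_in_parallel(stmt_reads, stmt_writes, subprogram, known=None):
    call_reads, call_writes = assumed_effects(subprogram, known)
    touched_by_call = call_reads | call_writes
    return (not (stmt_writes & touched_by_call)
            and not (stmt_reads & call_writes))


# 10 A=B*C followed by 20 CALL SUBR, with SUBR in another file:
print(may_run_in_parallel({"B", "C"}, {"A"}, "SUBR"))

# The same pair when SUBR is known to load B only:
print(may_run_in_parallel({"B", "C"}, {"A"}, "SUBR",
                          known={"SUBR": ({"B"}, set())}))

# 10 WRITE(6,*) A followed by 20 CALL SUBR, nothing known about SUBR:
print(may_run_in_parallel({"A"}, {"UNIT6"}, "SUBR"))
```

Only the second case permits parallel execution. In the other two the lack of information about SUBR forces sequential code.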


References using Pointers (Call-by-Address) 


Let us now assume that A, B, and C are all
defined as independent variables local to the
SUBROUTINE in which the following code
appears:


10 A=B*C
20 CALL SUBR(A, B)


3 Recently, substantial advances have been made toward
limiting the scope of analysis by constructing
“programming environments” which incrementally collect
the needed information. A good example of this is
[Coo 85]. However, this mechanism alone is not capable
of solving the problems described in the section on
References to Indexed Data Structures.


As before, SUBR is assumed to be defined in
another file; but, since we have already considered
the problem, we will assume that there
are no COMMONs in SUBR. Although FORTRAN
does not explicitly support pointers, it
does use call-by-address in passing arguments to
FUNCTIONs and SUBROUTINEs. The compiler
must now find answers to:


(1) Does SUBR (or any subprogram invoked
by SUBR) store into B or load from A?


(2) Does SUBR (or any subprogram invoked 
by SUBR) store into A? 


which, of course, would normally be too expen- 
sive for the compiler to answer. 


References to Indexed Data Structures 


Let us return to our original example, 
again, slightly modified: 


IMPLICIT INTEGER A-Z 
DIMENSION G(100) 

10 G(A) =G(B) * G(C) 

20 G(D) =G(E) * G(F) 


Again, we must answer nearly the same ques- 
tions as for the original, trivial, example: 


(1) Is G(B) or G(C) the same element as G(D)? 
Is G(E) or G(F) the same element as G(A)? 


(2) Is G(A) the same element as G(D)? 


However, these questions are much harder to 
answer. In fact, they cannot be answered at com- 
pile time if any of A, B, C, D, E, and F is given 
a value at run-time which is not related to all the 
other variables’ values in an obvious way. For 
example, suppose a value is READ for A in the 
statement which is executed just before 10. It is 
not merely difficult to find answers to these 
questions by analysis; it is theoretically impossi- 
ble in such a case. In other cases, answering 
these questions requires theorem proving. 


If the value of any of A, B, C, D, E, and F
is affected by a parameter entering the SUBROUTINE
which contains the above example,
then the corresponding argument to every CALL
of that SUBROUTINE must be examined. This
may be theoretically possible, but it certainly is
not practical.


Since it can be very difficult to determine
which element(s) of a data structure an indexed
reference can access, it is often necessary to
assume that such a reference potentially affects
any (every) element of the array.
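
The conservative treatment of indexed references can be sketched as follows. This Python fragment is an illustration only: an index value of None stands for a subscript the compiler cannot determine (for example, one that is READ at run time), and the function names are assumptions.

```python
def may_overlap(i, j):
    """Two references to G may touch the same element unless both
    subscripts are known and different."""
    if i is None or j is None:
        return True
    return i == j


def can_parallelize(a, b, c, d, e, f):
    """Check 10 G(A)=G(B)*G(C) against 20 G(D)=G(E)*G(F)."""
    write_read = (may_overlap(d, b) or may_overlap(d, c)
                  or may_overlap(a, e) or may_overlap(a, f))
    write_write = may_overlap(a, d)
    return not (write_read or write_write)


print(can_parallelize(1, 2, 3, 4, 5, 6))      # all subscripts known, distinct
print(can_parallelize(None, 2, 3, 4, 5, 6))   # A is read at run time
print(can_parallelize(1, 2, 3, 2, 5, 6))      # G(D) is G(B)
```

A single unknown subscript is enough to make the reference look as if it could touch the whole array.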


Several attempts have been made to resolve 
these analysis problems by having the program- 
mer insert assertions. Some of these assertions 
have been of the form “Trust Me... it’s safe to 
do this in parallel,” which is very thinly- 
disguised explicit parallelism — with all the 
disadvantages thereof. The best approach to 
adding assertions is described in [Fis] and [Fis 
84]. These assertions take the form of equations 
which can actually be checked for validity at 
run-time, but solving the equations is a complex 
task for the compiler and the assertions them- 
selves are quite alien to the average FORTRAN 
programmer. 


The RF77 equivalent to assertions (the 
concept of partitioning) is both efficient and safe, 
but, most importantly, it makes sense in terms of 
expressing an algorithm. 


The Refinements 


In each of the situations described above, 
the compiler’s inability to resolve exactly which 
data items might be accessed by a particular 
reference must result in the “safe” assumption — 
that all possibly touched items are not “avail- 
able” across such a reference — which typically 
forces the generation of sequential code. If we 
can enable the programmer to specify what 
really happens in these cases, more computations 
can be known to the compiler to be available, 
fewer precedence constraints need be artificially 
imposed by the compiler, and less of the gen- 
erated code will be forced to be sequential. 


Each refinement can be viewed as simply 
providing the programmer with a language con- 
struct which, while being intuitive and natural to 
the programmer, allows him to provide exactly 
the extra information the compiler needs. 


Access Permissions for Globals 


As pointed out in the discussion above,
FORTRAN supports two basic kinds of global
data:


• COMMONs, which are used to group sets
of variables together by name and to give
FUNCTIONs and SUBROUTINEs access
to the variables by these group (COMMON)
names.


• I/O channel (logical unit, logical file)
numbers, each of which is used to give the
set of data within a file a name (number)
and to give FUNCTIONs and SUBROUTINEs
access to the variables by that name.


The two are very similar in concept.


The kind of language structure needed to 
solve the problems associated with global data 
references closely resembles a COMMON. 
Unfortunately, although COMMON statements 
provide the naming features needed, they are 
scattered throughout the source program and 
potentially across a large number of files. To 
make compilation speed acceptable, the informa- 
tion must be available without having to scan the 
entire source program. 


The required information about use of glo- 
bals in an RF77 program is placed in a separate 
interface specification file, which is 
#INCLUDEd by all files that constitute the 
source program — much as C programs #include 
“header files.” The interface specification file 
simply contains the definitions of global vari- 
ables and the access permission each FUNC- 
TION and SUBROUTINE has for each global. 


Borrowing the terminology of Ada [Ada 
80], there are two primitive kinds of access per- 
mission relevant in performing concurrency 
detection and generation of efficient parallel 
code: 


IN      For a variable, permission for the variable’s
        rvalue to flow into the FUNCTION or
        SUBROUTINE; for a file, permission only to
        READ from the file.

OUT     For a variable, permission for the rvalue of the
        variable to be different as the variable flows out
        of the FUNCTION or SUBROUTINE from
        what it was at entry; for a file, permission to
        WRITE to the file.


Also, as in Ada, these access permissions may be
combined:


IN OUT  For a variable, permission for the variable’s
        rvalue both to flow into the FUNCTION or
        SUBROUTINE and to be different as the
        variable flows out; for a file, permission both to
        READ from and to WRITE to the file.


4 A software tool currently under development,
called PREFINE, will automatically convert a FORTRAN
program into its RF77 equivalent, including
the creation of an appropriate interface specification file.


COMMON Permissions. Each individual
entry within an RF77 interface specification takes
one of the following forms:


      IN /global_name/ subprogram_list
      OUT /global_name/ subprogram_list
      IN OUT /global_name/ subprogram_list

where IN and OUT are as indicated above and 
IN OUT is equivalent to both IN and OUT per- 
mission. Note that global_name may be either 
the name of a COMMON or it can be blank, 
representing the unnamed COMMON. 


The following is a skeletal example of the 
use of IN, OUT, and IN OUT: 


C     this would be the interface spec., "TEST.H"
      IN /A/ SUBR1,FUNC2
C     SUBR1 and FUNC2 both can examine any var in A
      OUT /A/ SUBR2
C     SUBR2 can change any var in COMMON A
      IN OUT /A/ FUNC1
C     FUNC1 can examine and change any item in A


C     the following appears in TEST1.RF
#INCLUDE TEST.H
      FUNCTION FUNC1(...)
      COMMON /A/ X,Y,Z

      END


C     the following appears in TEST2.RF
#INCLUDE TEST.H
      FUNCTION FUNC2(...)
      COMMON /A/ X,Y,Z

      END


C     the following appears in TEST3.RF
#INCLUDE TEST.H
      SUBROUTINE SUBR1(...)
      COMMON /A/ X,Y,Z

      END


C     the following appears in TEST4.RF
#INCLUDE TEST.H
      SUBROUTINE SUBR2(...)
      COMMON /A/ X,Y,Z

      END


It is important to note that the RF77 compiler
will flag any attempt to reference a global
for which permission was not explicitly or
implicitly granted: if any subprogram attempts to
exceed the access rights granted by the interface
specification, a fatal compile-time error will
result. By the interface specification given in the
previous example, the following definition of
SUBR1 is an error because only IN rights were
granted for members of the COMMON A:


      SUBROUTINE SUBR1(B)
      COMMON /A/ X,Y,Z
      X = 50
      END


By the same principle, any attempt to CALL a 
SUBROUTINE or FUNCTION which has 
access privileges beyond those of the caller also 
constitutes a fatal compile-time error: 


SUBROUTINE SUBR1(B) 
COMMON /A/ X,Y,Z 
CALL SUBR2 

END 
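
The two compile-time checks described above (a reference must be covered by a granted permission, and a callee may not hold rights its caller lacks) can be sketched as follows. This is a minimal illustration in Python, not the RF77 compiler's implementation; the table layout and the names `PERMS`, `granted`, `check_reference`, and `check_call` are assumptions introduced here, and implicit grants are not modelled.

```python
# Hypothetical model of the interface spec "TEST.H": for each COMMON,
# the set of permissions ("IN", "OUT") granted to each subprogram.
PERMS = {
    "A": {
        "SUBR1": {"IN"},
        "FUNC2": {"IN"},
        "SUBR2": {"OUT"},
        "FUNC1": {"IN", "OUT"},
    }
}


def granted(common, subprogram):
    """Permissions the spec grants `subprogram` on `common` (may be empty)."""
    return PERMS.get(common, {}).get(subprogram, set())


def check_reference(subprogram, common, needed):
    """True if a reference needing permission `needed` is allowed."""
    return needed in granted(common, subprogram)


def check_call(caller, callee):
    """True if the callee holds no permission that the caller lacks."""
    return all(
        granted(common, callee) <= granted(common, caller)
        for common in PERMS
    )
```

Under this sketch, `check_reference("SUBR1", "A", "OUT")` is false (the assignment to X is rejected) and `check_call("SUBR1", "SUBR2")` is false (SUBR2 holds OUT rights that SUBR1 does not).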


I/O Channel Permissions. As we pointed
out at the start of this section, I/O channels are
also a form of global, named by the INTEGER
channel number (which is considered to be a
pre-defined COMMON name). For example, the
ability to READ from I/O channel 7 would be
granted to subroutine SUBR3 by:


IN /7/ SUBR3 


The parallelization of I/O operations is 
traditionally one of the most difficult and unreli- 
able language features, as evidenced by the lack 
of I/O operations in many new  parallel- 
processing languages. In creating RF77, this 
problem is even more difficult because much of 
the flavor of FORTRAN is its style of I/O — 
drastically changing the I/O would have made 
RF77 drastically different from FORTRAN. We 
have chosen to maintain the FORTRAN I/O 
style, at a slight cost in reliability of paralleliza- 
tion. 


To be parallelized reliably, I/O operations
would have to be based on naming files, but
FORTRAN I/O, and hence RF77 I/O, is based on
naming channels. RF77 compilers assume that
operations performed using different I/O channels
are always independent of each other. Strictly
speaking, this is not always true: a single file
may be associated with several I/O channels, and if
one of them is WRITEing, the compiler might
accidentally create a race condition by assuming that
operations on two different channels can
proceed simultaneously. In fact, this can also
cause unpredictable results on some single-processor
machines, due to I/O buffering problems.


Argument Passing & Parameter Definition 


Each FORTRAN FUNCTION or SUBROUTINE
is able to accept any number of call-by-address
arguments. Since call-by-address is
used, each argument passed to a subprogram
could be carrying IN, OUT, or IN OUT
permissions to the rvalue: for the same reasons that a
parallelism-detecting compiler must know the
access permissions that subprograms have to
COMMONs and I/O channels, the compiler must
know which of these permissions each argument
carries. As with global information, the same
specification must be available to the compiler
during compilation of both the CALLer and the
CALLed subprogram. The CALLed routine cannot
require access privileges not granted by the
CALLer.


The access privileges granted by the caller
for each argument should generally match those
rights required by the called subprogram of its
parameters (see footnote 5). By placing this information in the
interface specification, it need be given only once
for each FUNCTION or SUBROUTINE. RF77
uses the following syntax to state which access
rights are carried by each argument of a subprogram:


function_spec   ::= ARGUMENT FORTRAN_type arg_spec
subroutine_spec ::= ARGUMENT arg_spec
arg_spec        ::= sub_name ( a_p_list )
a_p_list        ::= a_p_list , perm type_size
                  | perm type_size
perm            ::= IN | OUT | IN OUT | null_string
type_size       ::= FORTRAN_type ( dim_list )
                  | FORTRAN_type
                  | null_string


where FORTRAN_type is any data type available
in ANSI FORTRAN 77.


Normally, ARGUMENT specifications will
state the access permission carried by each
argument; however, if the access permission carried
by an argument is not specified then the
permission is assumed to be IN. The optional
type_size specifications, if given, allow the RF77
translator to perform type checking on
arguments, a feature not related to parallel
execution but desirable for other reasons.


5 In fact, these permissions often do not match. An
individual call might pass more restricted rights than
the subprogram normally requires, but which are known
to be sufficient for that call. Capitalizing on this was
deemed too risky.


As an example of access permission specification,
a subroutine to add its first two arguments
and store the result in the third might be
written as:


C     this would be the interface spec., "TEST2.H"
      ARGUMENT ADDSUB(IN, IN, OUT)


C     the following appears in TEST2.RF
#INCLUDE TEST2.H
      SUBROUTINE ADDSUB(A, B, C)
      C = A + B
      RETURN
      END


And ADDSUB would be called as, for example: 


CALL ADDSUB(5, D, E) 


A more concrete (but still artificial) exam- 
ple of the use of these constructs is the following 
matrix multiplication code: 


C     this would be the interface spec., "MATMUL.H"
C     MATMUL and DOTPRO examine COMMON AB
      IN /AB/ MATMUL, DOTPRO
C     DOTPRO can examine,examine,&change its args
      ARGUMENT DOTPRO(IN,IN,OUT)
C     MATMUL can change its arg
      ARGUMENT MATMUL(OUT)


C     the following appears in some other file
#INCLUDE MATMUL.H
      SUBROUTINE MATMUL(C)
      REAL C(100,100)
      COMMON /AB/ A(100,100),B(100,100)
      DO 10 I=1,100
      DO 20 J=1,100
      CALL DOTPRO(I, J, C(I,J))
   20 CONTINUE
   10 CONTINUE
      STOP
      END


C     the following appears in yet another file
#INCLUDE MATMUL.H
      SUBROUTINE DOTPRO(I, J, C)
      COMMON /AB/ A(100,100),B(100,100)
      SUM = 0.0
      DO 10 K=1,100
      SUM = SUM + A(I,K) * B(K,J)
   10 CONTINUE
      C = SUM
      RETURN
      END


The constructs presented in this and the previ- 
ous section enable the RF77 compiler to deter- 
mine which subprograms can be executed in 
parallel with one another without requiring 
expensive global analysis. The constructs 
presented in the next section permit the pro- 
grammer to express information beyond that 
which can be expressed in a “dusty-deck” 
language, hence greatly increasing parallelism. 
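
One plausible way a compiler could use the IN and OUT permissions to decide whether two subprogram invocations may run concurrently is the standard read/write conflict test sketched below. This is an assumption offered for illustration, not the procedure the paper describes; the function name `may_run_in_parallel` and the dictionary layout are hypothetical.

```python
def may_run_in_parallel(a, b):
    """Return True if subprograms a and b have no read/write conflict.

    a, b: dicts with keys "IN" and "OUT", each a set of names of the
    globals or arguments the subprogram may examine or change.
    """
    # Anything a may change must not be examined or changed by b ...
    a_conflicts = a["OUT"] & (b["IN"] | b["OUT"])
    # ... and anything b may change must not be examined by a.
    b_conflicts = b["OUT"] & a["IN"]
    return not (a_conflicts or b_conflicts)


# Hypothetical DOTPRO calls: each reads COMMON AB and writes one element of C.
call_1 = {"IN": {"AB"}, "OUT": {"C(1,1)"}}
call_2 = {"IN": {"AB"}, "OUT": {"C(1,2)"}}
call_3 = {"IN": {"AB"}, "OUT": {"C(1,1)"}}
```

Here `call_1` and `call_2` are independent because they only share data that both merely examine, while `call_1` and `call_3` conflict because both may change the same element.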


Indexing and Partitioning 


Partitioning is the technique used by 
refined languages to create new names for arbi- 
trary portions of a data structure. Once these 
names exist, it is a trivial matter to indepen- 
dently control the access permissions of each 
piece (called a partition-element). 


C-Style Partitions. Partitioning can be
viewed as a way of more precisely specifying
what pointers have permission to point at, which
may imply a conceptual rearrangement of the
object data structure such that elements which
are not accessible appear to be removed from
the structure. This is the primary function of
partitions in languages like C [Die 84] [Die 85].


Partitioning a matrix by an arbitrary formula
specifying which partition-element each
datum belongs to can produce partition-elements
which are not the same size or shape as the original
array, perhaps not even rectangular. For
example, each “row” might be of a different
length. This style of partitioning is what we call
partitioning with dynamic indexing; that is, the
shape and indexing structure may change
dynamically according to the partitioning
specification.


FORTRAN Partitions. FORTRAN doesn’t 
support pointers. FORTRAN programmers 
don’t want to think in terms of non-rectangular 
structures because conventional FORTRAN 
doesn’t provide any way of building such struc- 
tures. 


In RF77, the shape and indexing structure
of the original data structure must be preserved
independently of the partitioning specification;
we call this partitioning with static indexing.


Static indexing means that, for example,
when a matrix is partitioned into partition-elements
above and below the diagonal, each
partition-element has the same indexing formulas
and shape as the original matrix, but some of the
data of the original matrix are inaccessible by
each partition-element name: a reference to a
datum by indexing the “above” partition-element,
where the datum is conceptually contained
in the “below” partition-element, is an
error. In effect, static indexing means that
partitioning is a zero-cost operation and each
partition-element has a membership test formula
associated with it (see footnote 6).


Partitioning is accomplished by the
(apparently executable) PARTITION statement:


partstat ::= PARTITION ( structure , partlist part )
partlist ::= part ( condition ) , partlist
           | null_string


where structure is the name of the structure 
being partitioned (including dummy variables 
naming each subscript) and part is a name for a 
partition-element. 


The conditions within a PARTITION statement
are evaluated left-to-right on only those
data which have not yet been placed in a
partition-element; therefore, all partition-elements
are guaranteed to be mutually exclusive: each
datum can belong to only one at a time.


The following code illustrates the definition 
and use of a PARTITION which creates 
partition-elements naming the portions of the 
array A which are, respectively, above, on, and 
below, the diagonal: 


      REAL A(100,100)
      PARTITION(A(I,J), AUPPER(J .GT. I),
                ADIAG(I .EQ. J), ALOWER)
C     the following reference is valid
      AUPPER(5,7) = 3.14159265
C     the following reference is not valid
C     and would cause a fatal compile-time error
      ALOWER(5,7) = 3.14159265


The user may explicitly test whether a
particular datum is a member of a partition-element
by using the unary prefix operator .MEMBER.,
which simply applies the membership test formula
for the following subscripted reference and
returns a logical value of .TRUE. if that reference
is valid. Thus:


C     the condition below is obviously .TRUE.
      IF (.MEMBER. AUPPER(1,99)) GOTO 30
C     this might or might not be .TRUE.
      IF (.MEMBER. AUPPER(4*K-1,L)) GOTO 30
C     the condition below is obviously .FALSE.
      IF (.MEMBER. AUPPER(I,I)) GOTO 30
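
The left-to-right evaluation of PARTITION conditions and the resulting membership test can be sketched as below. This is a minimal Python illustration under stated assumptions, not RF77 semantics in full: the names `PARTS`, `DEFAULT`, `owner_of`, and `member` are hypothetical, and subscript bounds checking is omitted.

```python
# Hypothetical encoding of
#   PARTITION(A(I,J), AUPPER(J .GT. I), ADIAG(I .EQ. J), ALOWER)
# as an ordered list of (name, condition) pairs plus the final,
# unconditioned partition-element.
PARTS = [
    ("AUPPER", lambda i, j: j > i),
    ("ADIAG", lambda i, j: i == j),
]
DEFAULT = "ALOWER"


def owner_of(i, j):
    """Name of the partition-element that contains datum (i, j).

    Conditions are tried left to right and the first one that holds
    wins, so every datum belongs to exactly one partition-element.
    """
    for name, condition in PARTS:
        if condition(i, j):
            return name
    return DEFAULT


def member(name, i, j):
    """Membership test: is a reference name(i, j) valid?"""
    return owner_of(i, j) == name
```

With this encoding, `member("AUPPER", 5, 7)` holds while `member("ALOWER", 5, 7)` does not, mirroring the valid and invalid references in the example above. No data are copied or moved; only a formula is evaluated, which is one way to read the claim that partitioning is a zero-cost operation.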


Compilation Techniques 


Fundamentally, compilation of RF77 into
efficient parallel code is identical to compilation
of FORTRAN into efficient parallel code. Any
of the techniques of [All 82] [Fis] [Kuc 84] [Nic
85] [KAI 85] could be used and the result would
be at least as good as, and probably much better
than, that obtained from “pure” FORTRAN.


6 The concept of membership also applies to other
data structures. For example, an array membership test
formula simply checks that subscripts are within
bounds.


Of course, just as we have our own parallel 
computer design [Par 86] [Die 85/2] in mind for 
the execution of these programs, we also have 
our own way of constructing refined-language 
compilers. The most unusual features of our 
approach are: 


•   Use of conventional (sequential) optimizing
    compiler flow-analysis concepts and techniques,
    as per [Coc 70] and [Aho 79], to
    build an access-flow graph.


•   Use of non-deterministic process-packaging.
    The kind of parallel code generated is
    determined by a pruned search for
    target-machine-dependent structures within the
    access-flow graph. The basic technique is
    similar to that used by [Kru 82] for sequential
    code generation.


Conclusions 


In this paper, we have given a detailed 
presentation of the application of the refined- 
language methodology to ANSI FORTRAN 77, 
and we have given a reasonably precise defini- 
tion of the resulting language, RF77, including 
many small examples. 


Throughout our modifications, we have not 
changed the FORTRAN flavor of the language 
nor have we imposed any particular view of 
parallel processing. We have, however, built the 
language in such a way that it can be easily compiled
into efficient code for nearly any kind of
parallel computer. Further, since these refine- 
ments aid any compiler in building a more accu- 
rate flow-graph, RF77 is completely compatible 
with, and efficiently usable by, compilers for 
single-processor machines. In fact, by applying 
conventional optimization techniques to the 
better quality flow graph, RF77 may actually be 
a more efficient language for SISD machines 
than ANSI FORTRAN 77. 


References 


[Ada 80]   Military Standard, Ada Programming Language,
           MIL-STD-1815 (the green book), Dec. 1980.

[All 82]   J. R. Allen and K. Kennedy, “PFC: a Program to
           Convert Fortran to Parallel Form,” Report MASC
           TR 82-6, Dept. of Math. Sciences, Rice University,
           Houston, TX, Mar. 1982.

[Aho 79]   Alfred V. Aho and Jeffrey D. Ullman, Principles
           of Compiler Design, Reading, Massachusetts;
           Addison-Wesley, pages 262-265, Apr. 1979.

[Coc 70]   John Cocke and J. T. Schwartz, Programming
           Languages and Their Compilers, New York; New
           York University Courant Institute, pp. 320-334,
           1970.

[Coo 85]   Keith D. Cooper, Ken Kennedy, and Linda
           Torczon, “The Impact of Interprocedural
           Analysis and Optimization on the Design of a
           Software Development Environment,” ACM
           0-89791-165-2/85/006/0107, 1985.

[Die 84]   Henry Dietz and David Klappholz, “Refining A
           Conventional Language For Race-Free Specification
           Of Parallel Algorithms,” IEEE Proceedings
           of ICPP 1984.

[Die 85]   Henry Dietz and David Klappholz, “Refined C:
           A Sequential Language For Parallel Programming,”
           IEEE Proceedings of ICPP 1985.

[Die 85/2] Henry Dietz and David Klappholz, “RISC CPU
           Design for MIMDs,” presented at The Second
           SIAM Conference on Parallel Processing for
           Scientific Computing, Nov. 1985.

[Die 86]   Henry Dietz, “The Case for Sequential Languages
           for Parallel Machines: The Myth of Machine
           Independence,” document in preparation.

[Fis]      Joseph A. Fisher, “Parallel Processing: A Smart
           Compiler and a Dumb Machine,” Yale University.

[Fis 84]   Joseph A. Fisher, “The VLIW Machine: A Multiprocessor
           for Compiling Scientific Code,” IEEE
           Computer, pp. 45-53, July 1984.

[KAI 85]   KAI, Mini-KAP/AF, Kuck and Associates Inc.,
           Professional Commerce Center, 1808 Woodfield
           Dr., Savoy, IL 61874, Aug. 1985.

[Kru 82]   D. W. Krumme and D. H. Ackley, “A Practical
           Method for Code Generation Based on Exhaustive
           Search,” Proc. of the SIGPLAN ’82 Symposium
           on Compiler Construction, pages 185-196,
           June 1982.

[Kuc 72]   D. J. Kuck, Y. Muraoka, and S. C. Chen, “On the
           Number of Operations Simultaneously Executable
           in FORTRAN-Like Programs and Their Resulting
           Speed-Up,” IEEE Trans. on Computers, Vol.
           C-21, No. 12, pages 1293-1310, Dec. 1972.

[Kuc 84]   David J. Kuck et al, “The Effects of Program
           Restructuring, Algorithm Change, and Architecture
           Choice on Program Performance,” IEEE
           Proc. of ICPP 1984, pages 129-138, Aug. 1984.

[Nic 85]   Alexandru Nicolau, “Uniform Parallelism Exploitation
           In Ordinary Programs,” IEEE Proc. of ICPP
           1985, pages 614-618, Aug. 1985.

[Par 86]   H-C Park and David Klappholz, “Single-Stage
           MIMD Design With Smart Nodes,” document in
           preparation.


A VLSI COMPARISON OF SWITCH-RECURSIVE
BANYAN AND CROSSBAR INTERCONNECTION
NETWORKS *


T.H. Szymanski


Department of Electrical Engineering 
University of Toronto 
Toronto, Ontario, Canada 


Abstract: A class of multistage interconnection networks
(MINs) that minimizes the number of wire crossovers is presented.
The basic idea is to use a MIN (made with smaller MINs) as the
basic switch in the realization of a larger MIN. The resulting
switch-recursive MIN has far fewer wire crossovers when compared
to other MINs, such as the SW-banyan, Baseline, Delta and
Omega networks. More importantly, switch-recursive MINs have
very distinct clusters in the topology, with heavy communication
within clusters and little communication between clusters. In a
discrete component realization, such clustering makes the
partitioning of these networks onto printed circuit boards, in a way that
minimizes edge connector and board costs, relatively straightforward.
The network’s physical size is reduced, and hence delays
through the network are also reduced. In a VLSI realization,
where the entire MIN is realized in a single integrated circuit, such
clustering reduces the area requirements of these networks by up
to 50% when compared with the SW-banyan layout.


1. Introduction 


With the emergence of VLSI technology and wafer-scale 
integration, the prospect of one day integrating an entire system 
on one wafer is growing. An important component in many sys- 
tems, such as multiple processor architectures, is the interconnec- 
tion network. A large number of interconnection networks have 
been proposed, the most predominant ones being the crossbar net- 
work, numerous multistage interconnection networks (MINs), and 
the single stage shuffle exchange network (SEN). Optimal layouts 
for SENs have been presented in [7], and the VLSI aspects of SW- 
banyans [6] and crossbars have been analysed in [4]. This paper 
extends the latter by introducing a class of banyan networks, 
called switch-recursive banyan networks (or SR-banyans), with a 
topology that tends to minimize the “crossovers” of wires. Such a
topology is very useful in the VLSI layout and in the discrete
component layout (i.e., with printed circuit boards) of MINs.


An analysis by Franklin on the VLSI requirements of
crossbar and SW-banyan networks indicated that “for reasonable
values of [network size] the two have roughly comparable space-time
performance” [4]. A strong result of Franklin’s work is that
the area of both networks grows as $O(N^2)$, where $N$ is the
number of input or output ports. While there is little that can be
done to improve the layout or performance of a crossbar, a large 
number of parameters may be selected to improve the layout or 
performance of a banyan network. Switch-recursive banyan net- 
works have such a topology. In addition, other aspects of the sys- 
tem such as the use of multiple metal layers for routing are exam- 
ined in this paper. We present much simpler and more general 
models for VLSI area comparisons than presented in [4], and use 
these to calculate the area requirements of conventional SW- 
banyan and switch-recursive banyans. 


One area of research identified by Franklin was the general
question of which MIN, out of many topologically equivalent ones
(i.e., the SW-banyan [6], baseline [2], delta [10], omega [8], indirect
binary n-cube [11]), to use in a discrete component environment.
“The printed circuit boards and the wire interconnecting the
boards are much larger contributors to system delays [compared
with chip delays], primarily because the distances the signals must
travel are orders of magnitude longer” [13]. Our analysis indicates
that switch-recursive banyan networks are preferable to other
MINs in a discrete component environment by virtue of their
minimized wire crossovers, which will result in increased packaging
densities and thus fewer boards and connectors to realize the system.
Hence, the delays through interconnect in the network will
also be reduced.


The use of recursion to realize large switches is not a new 
concept. However, the topological considerations of switch- 
recursive networks have gone largely unnoticed. A detailed exami- 
nation of these issues is presented in this paper. 


This paper is organized as follows. Definitions are presented
in Section 2. Section 3 presents a simple VLSI area model for
crossbars and SW-banyans. This model will be used to calculate
the area requirements of the switch-recursive MINs introduced in a
later section. Section 4 summarizes Franklin’s VLSI delay models
for SW-banyan and crossbar networks. Section 5 presents a comparison
of SW-banyans and crossbars in a VLSI environment. Our
assumptions and analysis are different from Franklin’s, and hence
our results are significantly different. Section 6 introduces the
switch-recursion used to realize MINs with low wire crossover
counts. Section 7 presents a comparison of switch-recursive
banyans and crossbars in a VLSI environment. Section 8 contains
some concluding remarks.


2. Definitions 


Many of our definitions have been previously presented by 
Franklin [4]. We present a brief overview. 


An $N \times N$ crossbar connects $N$ processors to $N$ memories,
as shown in fig. 1a. The basic switch used in these crossbars has 2
states, which are shown in fig. 1b.


An $N \times N$ SW-banyan network, built with $2 \times 2$ switches,
connects $N$ processors to $N$ memories, as shown in fig. 2a [6].
The basic switch used in these networks has 6 states, which are
shown in fig. 2b.


Network Cost: In a discrete component environment, the cost
$C$ of a network is usually the number of switches required to realize
it. The cost of the crossbar $C_{cb}$ is $N^2$, and the cost of the
banyan $C_{sw}$ is $(N/2) \log_2 N$.

Network Delay: The average network delay $D$, experienced
by any individual connection request, can be modelled as $1/(1-pb)$
times the average path length through the network, where $pb$ is the
blocking probability of a network under the assumed operating
mode [10].


Fig. 1: a) an $8 \times 8$ crossbar network. b) the 2 states
of a basic switch.


In a crossbar network, a request must travel through $N$
switches on average, hence the delay of a crossbar $D_{cb}$ is given by

$$D_{cb} \approx N \cdot \frac{1}{1 - pb_{cb}} \qquad (1)$$

where $pb_{cb}$ is the blocking probability of the crossbar network.

In a banyan network, a request must travel through $\log_2 N$
switches, hence the delay of a banyan $D_{sw}$ is given by

$$D_{sw} \approx \log_2 N \cdot \frac{1}{1 - pb_{sw}} \qquad (2)$$

where $pb_{sw}$ is the blocking probability of the banyan network.


The blocking probabilities of these networks are more 
difficult to arrive at. For simplicity assume that the network is 
synchronously circuit switched and all N input ports are occupied 
in each cycle, (i.e., the network is fully saturated). Franklin used 
the pb when the set of N requests submitted during each cycle 
was a randomly selected permutation in the equations above. In 
this case the pb of a crossbar is 0 and the pb for the banyan was 
obtained by simulations. However, this permutation model biases 
the results in favour of the crossbar, since the banyan has large 
pb ’s under this assumption, even for very small N. 


A more realistic, general performance comparison is to use 
the blocking probability when the requests are random and 
independent during each cycle. This model would correspond to 
general MIMD operating environments. In this case, the difference 
between the pb ’s for both networks is reduced considerably. 


We assume that the requests submitted during each cycle are 
random and independent. Under this assumption, the pb of an 
N XN crossbar is simply [10] 


(3) 


where $u_0$ is the probability that each processor issues a
request in every cycle.


Fig. 2: a) a $16 \times 16$ SW-banyan network. b) the 6
states of a basic switch.


Under these assumptions, the blocking probability of an
$N \times N$ banyan, built with $2 \times 2$ switches and with $n = \log_2 N$
stages, is simply [10]


(4) 


and $u_0$ is as above.


Cost-Delay Products: A general figure of merit that has been
used to compare networks is the cost-delay product [4]. Using
these equations, the cost-delay figure of merit $C \cdot D$ for each of
these networks, in a discrete component environment, can be computed:

$$C_{cb} \cdot D_{cb} = \frac{N^3}{1 - pb_{cb}} \qquad (5)$$

$$C_{sw} \cdot D_{sw} = \frac{N (\log_2 N)^2}{2 (1 - pb_{sw})} \qquad (6)$$


The remainder of this paper examines how this figure of 
merit changes when the entire network is realized in a single 
integrated circuit. As will be shown later, the area requirements of 
the SW-banyan are largely due to routing, and a topology that 
minimizes the wire crossover count will be introduced. 
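
As a worked illustration of the discrete-component figure of merit, the sketch below multiplies the costs given earlier ($N^2$ switches for the crossbar, $(N/2) \log_2 N$ for the banyan) by the delays of equations (1) and (2). The function names are hypothetical, and the blocking probability is taken as an input rather than computed, because its closed form is not reproduced here.

```python
import math


def crossbar_cost_delay(n, pb):
    """Cost-delay product of an n x n crossbar with blocking probability pb.

    cost = n**2 switches, delay = n / (1 - pb).
    """
    return n**3 / (1.0 - pb)


def banyan_cost_delay(n, pb):
    """Cost-delay product of an n x n banyan built from 2x2 switches.

    cost = (n / 2) * log2(n) switches, delay = log2(n) / (1 - pb).
    """
    stages = math.log2(n)
    return n * stages**2 / (2.0 * (1.0 - pb))
```

For example, with $N = 16$ and no blocking the crossbar product is 4096 and the banyan product is 128; the comparison narrows or widens depending on the blocking probabilities supplied for each network.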


3. VLSI Area Models 


We now present a simple and general model for the VLSI 
area requirements of SW-banyan networks. This model includes 
the effects of the use of multiple layers of metal for the routing, 
and is easily extensible to other topologically different MINs. In a 
later section we introduce the class of switch-recursive banyan net- 
works, which require about 50% less area than the SW-banyan for 
large network sizes. The area models developed here will be
adapted to calculate the area requirements of these recursive
MINs.


3.1. VLSI Area Model For Crossbar Networks 


An accurate model for the crossbar is difficult without undertaking
a detailed design in a particular technology. To simplify
the analysis, we assume the existence of a basic 2-input 2-output
switch, with $w$ bits in each input and output, as shown in fig. 3.
We assume that metal lines are $3\lambda$ wide and must be spaced $3\lambda$
apart, where $\lambda$ is a basic dimension [9]. Assume that the basic
switch is square and has dimensions given by Franklin’s equation


$$L = 6\sqrt{K(\gamma + w^2)}$$


where $K \approx 1.5$ and $\gamma \approx 20$ [4]. We assume that the switch is
square for simplicity; however, we note that non-square switches
would reduce the area of the banyan network considerably.


Fig. 3: the basic switch with $w$ bits in each input
and output.


Assuming the crossbar topology as shown in fig. 1a, the area
of an $N \times N$ crossbar is approximately $(N \cdot L_{cb})^2$, where
$L_{cb} = L + 3$ (the 3 is due to spacing between switches).


3.2. VLSI Area Model for SW-Banyan Networks 


We assume that the SW-banyan network is made with basic
$2 \times 2$ switches, with $w$ bits in each input and output, as shown in
fig. 3, and that these switches have the same area as the basic
switches used in the crossbar [4]. This assumption is valid if the
control logic associated with each switch is not significant [4] and
if the basic switch dimensions are limited by the minimum possible
pitch for a transistor in the fabrication technology.


We currently ignore the increased area required by the line 
drivers used in the switches in the banyan network; this will be 
examined later. The SW-banyan network topology we are assum- 
ing is shown in fig. 2a. 


A $2 \times 2$ switch requires height $L_{cb}$, since we assume the basic
switches have comparable complexity, as in [4]. The wires into
and out of the sides of the basic switch must be bent to go vertically,
which will increase the horizontal space required by each
switch. We assume, as Franklin did, that each switch can be realized
in width $W_{sw} \approx L_{cb} + 6w$, as shown in fig. 4a. Note that
$W_{sw}$ may be decreased if more layers of metal are used for routing,
and the area of the banyan will be reduced proportionately.

Assume the technology used has at least 2 layers of metal
used for routing. By implementing all horizontal inputs in one
layer and all horizontal outputs in another, the layouts in figures 4a
and 4b are realizable. A small area above or below each switch
may have to be used to convert from one layer to another (for
channel routing).


Consider the $N \times N$ SW-banyan network shown in fig. 2a.
This network clearly has a recursive structure; it consists of one
stage of $2 \times 2$ switches, connected to two smaller networks of size
$N/2$ each. Between the first and second stages (from the top),
exactly $N/2$ links, with $w$ bits each, will cross an imaginary line
drawn vertically through the middle of the network. Assuming
that the vertical components of links are routed on one layer and
the horizontal components of links are routed on another, the
vertical height required for routing these links is then $(N/2) \cdot 6w$.
Note that this is true regardless of the basic switch size used in 
the banyan network. 


If we have more than 2 layers of metal, then assume any 
extra layers are used to route the horizontal links between stages. 
If we let s denote the number of metal layers used for horizontal 
routing, then the vertical area required is reduced by a factor of s . 


Fig. 4: a) minimum horizontal spacing for switches.
b) minimum vertical spacing for switches.


(The effects of extra layers on the switches must be reflected in the
basic switch size $L$, but it appears that $L_{cb}$ will not change much
for a crossbar. In the case of the banyan, $W_{sw}$ may be reduced,
and the network’s area requirements are reduced proportionally.)


The height of the network, H,,, (N ), is determined by the the 
height of the first stage of 2X2 switches, the height required for 
routing between the first stage and the two smaller networks of 


size N /2, and the height of the smaller network of size N /2: 
6wN 
How (N }= 2s 


gy (2) ~ Lee 


+ Hew (2) + How (N /2) 


Assuming that a 4×4 banyan has the layout shown in fig.
4b, then

$$H_{sw}(4) \approx 2L$$
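
The height recursion can be evaluated numerically. The following Python sketch is only an illustration under stated assumptions: it ignores drivers, takes the height of a 2×2 switch as a parameter (`switch_height`, a hypothetical name), and uses the base cases H(2) = switch height and H(4) = twice the switch height read from the text.

```python
def banyan_height(n, w, s, switch_height):
    """Height of an n x n SW-banyan, ignoring drivers (illustrative sketch).

    n             -- network size, a power of two >= 2
    w             -- bits per link
    s             -- metal layers used for horizontal routing
    switch_height -- assumed height of one 2x2 switch
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    if n == 2:
        return switch_height
    if n == 4:
        # compact layout of fig. 4b: two rows of switches, no routing channel
        return 2 * switch_height
    # channel height between the first stage and the two half-size networks
    routing = 6 * w * n / (2 * s)
    return (routing
            + banyan_height(2, w, s, switch_height)
            + banyan_height(n // 2, w, s, switch_height))
```

For example, with w = 1, s = 1 and a switch height of 10 units, an 8×8 network needs 24 units of routing plus 10 + 20 units of switches. Doubling s halves only the routing term.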


3.3. Driver Considerations 


The preceding development has ignored the area required for 
drivers in each stage of the banyan network. The propagation 
delay through a long line is minimized by having a sequence of 
successively larger drivers driving the line [9]. Franklin has shown
that the area of a driver will require between 10 and 30% of the
area of the line being driven (assuming NMOS fabrication) [4], and
assumed the value 25%, as we will do (i.e., v = 0.25). We now
must derive an expression for the line length.


Assume the line length must be greater than some minimum
value, 2/(0.75α), in order for a driver to be necessary. The reason
for this choice will become apparent in the section on delays.


Consider switches in the first stage of the N×N SW-banyan.
Each switch will have two types of drivers: w drivers
that drive a line with a vertical component only, and w drivers
that drive a line with both vertical and horizontal components [4].
The vertical component of a line after the first stage is
≈ $6wN/2s$. The sum of the vertical and horizontal components of
a line with both components is ≈ $(6wN/2s) + (N/2)\cdot(W_{sw}/2)$, of
which ≈ $W_{sw}/2$ is used to reach the drivers (if any), and the
remainder is the length of the line being driven.


We assume these drivers are implemented in a way to 


izontal [4]. Each switch may have w drivers of each type, which
must be realized within the width of the switch $W_{sw}$ (note that we
differ from Franklin here, and this difference is not insignificant).
Assume each line contributes height $dr_{sw}(p)$ to the height of the
switch, where p is the line length:


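$$dr_{sw}(p) = \begin{cases} \dfrac{3vp}{W_{sw}} & \text{if } p > \dfrac{2}{0.75\alpha} \\ 0 & \text{otherwise} \end{cases}$$

The added height due to drivers after the first stage, $DR_{sw}(N)$, is
then

$$DR_{sw}(N) = w \cdot dr_{sw}\!\left(\frac{6wN}{2s}\right) + w \cdot dr_{sw}\!\left(\frac{6wN}{2s} + \frac{N}{2}\cdot\frac{W_{sw}}{2} - \frac{W_{sw}}{2}\right)$$

$$DR_{sw}(2) = 0$$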


Assume that a 4×4 banyan has the layout shown in fig. 4b and
that no drivers are used:

$$DR_{sw}(4) = 0$$

The height of an N×N SW-banyan network is then

$$H_{sw}(N) = \frac{6wN}{2s} + DR_{sw}(N) + H_{sw}(2) + H_{sw}(N/2)$$


The width of the banyan network is determined by the
(N/2) 2×2 switches, and is given by ≈ $(N/2)W_{sw}$. Hence, the
total area required for an N×N SW-banyan, made with 2×2
switches, with w bits per input/output, and with s+1 metal
layers for routing, is given by

$$\text{Area} \approx H_{sw}(N) \cdot \frac{N}{2} \cdot W_{sw} \qquad (7)$$


4. VLSI Delay Models 


We use the NMOS delay models developed by Franklin [4], 
but with a few modifications. Recall that Franklin assumed that 
the pb used in equations (5) and (6) was the observed blocking 
probability when the request pattern submitted during each cycle 
was a randomly selected permutation. We assume that the indivi- 
dual requests are random and independent. Hence, in our delay 
model the crossbar also has a non-zero blocking probability. 


In addition, Franklin assumed that all lines being driven in 
the banyan required drivers. This is not true in the last few stages 
where the line lengths are relatively small. 


4.1. General VLSI Delay Models 


A very brief summary of Franklin's NMOS delay models is
presented [4]. Assume each switch is implemented with NOR
gates in NMOS technology. If each gate has a fanout of f, and
each switch has m levels of logic, then the switch delay is
typically [4]

$$\approx 2.5\,m f \tau$$

where τ is the transit time through the gate's active region.


If the gate is driving a line of length p whose total
capacitance is comparable to the gate capacitance, then the delay
associated with driving such lines is [4]

$$\tau\,(1 + 0.75\alpha p)$$


If the line to be driven has a total capacitance that is much
larger than a minimum sized gate's capacitance, then the delay is
minimized by having a sequence of successively larger drivers
driving the line [4,9]. Assuming each gate is 4 times the size of its
predecessor [4], then the delay associated with driving a line of
length p is

$$\approx 2\tau \log_2(0.75\alpha p)$$

If the line is very long, then a resistance factor should be taken
into account and a transmission line model should be used. We
assume that a simple capacitance based model is sufficient.

Franklin assumed a value of α = 0.1. However, an examination
of SPICE circuit parameters for NMOS and CMOS indicates
that α = 0.03 is typical today. The value α = 0.03 will be used
throughout this paper. An implication of this assumption is that
lines can be much longer before a driver is required.

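Let d(p) be the delay due to a line of length p:

$$d(p) \approx \begin{cases} 2\tau \log_2(0.75\alpha p) & \text{if } p > \dfrac{2}{0.75\alpha} \\ \tau\,(1 + 0.75\alpha p) & \text{if } 0 < p \le \dfrac{2}{0.75\alpha} \\ 0 & \text{if } p = 0 \end{cases}$$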
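
As a rough numerical illustration of the line delay model, the Python sketch below evaluates d(p) in units of τ. It assumes the driver-chain delay is 2τ·log2(0.75αp), which follows from a chain of drivers each 4 times larger than its predecessor (4τ·log4 x = 2τ·log2 x); the function and parameter names are hypothetical.

```python
import math


def driver_threshold(alpha):
    """Line length above which a driver chain is assumed to be used."""
    return 2.0 / (0.75 * alpha)


def line_delay(p, alpha=0.03, tau=1.0):
    """Delay of a line of length p under the capacitive model (sketch)."""
    if p < 0:
        raise ValueError("line length must be non-negative")
    if p == 0:
        return 0.0
    x = 0.75 * alpha * p  # line capacitance relative to a minimum gate
    if p > driver_threshold(alpha):
        # long line: cascade of drivers, each 4x the size of the previous one
        return 2.0 * tau * math.log2(x)
    # short line: driven directly by the gate
    return tau * (1.0 + x)
```

With α = 0.1 the threshold is about 26.7 length units, while with α = 0.03 it is about 88.9, which is the sense in which lines can be much longer before a driver is required.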


4.2. Crossbar Delay Model 
The delay of an N XN crossbar is given by [4] 


D 


(2. 5Nmf +(N-1)(1+0. 7503) } (8) 
We will use m = 2, f = 2, and α = 0.03 throughout the paper, and
the crossbar blocking probability pb can be calculated from
equation (3). The parameter τ will cancel out later.


eb = t= a 


4.3. SW-Banyan Delay Model 


Let D(N) be the average path delay through an N XN 
SW-banyan: 


D(N)*= at a) + (140.75aW,, /2)r + (9) 
D(N /2) + D(2) 


where D(2) = 2.5mfτ. Within the large brackets, the first term
is the delay of a line with a vertical component only, the second
term is the delay to reach the line drivers for lines leaving the side
of the switch, and the third term is the delay over a line with a
vertical and horizontal component.


If we assume that 4×4 banyans have the layout shown in fig.
4b and have no drivers, then

$$D(4) \approx (1 + 0.75\alpha W_{sw})\,\tau + 5 m f \tau$$

Hence, the delay of an N×N SW-banyan is given by


Du = —— (NV) (10) 


hk 


where pb,, can be calculated from eq. (4). 
5. VLSI Comparisons 


5.1. SW-Banyan and Crossbar Comparisons 


In fig. 5, the area, delay and space-time comparisons of SW- 
banyan and crossbar networks are presented. The space-time 
ratios in the VLSI environment are similar to the cost-delay ratios 
in the discrete component environment. 


Our area results for the SW-banyan are lower than 
Franklin’s by about 30% [4], which can be explained as follows. 
While Franklin’s analysis was basically correct, his equation (16) 
increased the area of the banyan by one extra stage of drivers. 
(This alone accounted for an error of 25%). Also, Franklin’s 
analysis assumed that the area required by drivers was realized in 
such a way that the width of the switch did not increase (only its 
height was allowed to increase). Hence, the vertical contribution
to a switch of width L due to a driver requiring area A was given
by A/L. However, while the true switch width is L, switches are
spaced 6w apart, so each switch requires a width of $L + 6w = W_{sw}$. Hence,
our analysis assumed that the drivers can be realized over the
width $W_{sw}$, and the added contribution to a switch's height is then
$A/W_{sw}$.


Fig. 5: a) ratio of SW-banyan to crossbar areas. b)
ratio of SW-banyan to crossbar time delays. c) ratio
of SW-banyan to crossbar space-time products
(dashed curves are from traditional analysis). d)
fraction of SW-banyan area used for routing.


Our delay curves are significantly different from Franklin's
since we used the blocking probability of these networks under
random request distributions in equations (8) and (10), and we
used the value α = 0.03. The effectiveness of each network is
indicated by the space-time ratios in fig. 5c. Our space-time ratio
drops off more rapidly than Franklin's, indicating that banyans
become more effective as N increases.


Fig. 5d illustrates the area required by routing in the SW- 
banyan networks. As the network size N increases, this area 
rapidly approaches an asymptotic limit. This limit can be 
explained as follows. As N increases, the area required by 
switches in a banyan approaches 0, and the majority of the area is 
due to routing and drivers. The area of actual metal routing (not 
including the spacing between metal lines) is about 50% of the 
banyan's area, and the drivers require 25% of this area (by
assumption). Hence, the drivers require about 12.5% of the area
of a banyan, for large N.


6. Recursive MINs 


The preceding graphs indicate that in a VLSI realization, the 
majority of a large SW-banyan’s area is devoted to interconnect 
between stages. This area can be reduced by selecting a topology 
that minimizes the wire crossover count. In a discrete component 
realization, minimizing the wire crossovers will result in increased 
packaging densities, and hence minimized cost and delays. In this 
section, we present such topologies. 


The SW-banyan network shown in fig. 2a (in section 2)
clearly has a recursive structure; a network of size N consists of
one row of 2×2 switches that are connected to two smaller
networks of size (N/2) each. We introduce another type of recursion,
which we call switch-recursion. The basic idea is to use a MIN
(made with smaller MINs) as the basic switch in the realization of
a larger MIN. We illustrate the concept by examples on various
MINs, which we call SR-MINs.


6.1. Switch-Recursive Delta Networks 


Consider a 64×64 delta network [10] made with 2×2
switches, as shown in fig. 6a. This network can be reorganized into a
network where each recursive switch is a 4×4 delta network, as
shown in fig. 6b, and into a network where each recursive switch is
an 8×8 delta network, as shown in fig. 6c. An advantage of the
latter realizations is the decreased wiring complexity. In a
printed-circuit board environment, each recursive switch should be
realized on the same board. This would translate into an increased
component density per board (since less board area is used for
routing to edge connectors), fewer boards to realize the network,
and decreased interboard communications and delays (through
edge connectors and ribbon cables). In a VLSI environment, the
clustering and high connectivity within each recursive switch will
result in reduced wiring area. (Many automatic placement and
routing programs look for such clustering in the "net list" of a
circuit, for improved layout [12].)


6.2. Switch-Recursive Banyan Networks 


The previous figures illustrated the recursive process when
applied to delta networks. In a VLSI environment, the SW-banyan
network has significantly lower area requirements than the
delta. The recursive process can also be applied to SW-banyan
networks, as shown in fig. 7. (Note that "delta" networks are
actually a subclass of "banyan" networks, as defined in [6]; we use
the term "SW-banyan" to denote the topology illustrated in fig. 2a
and fig. 7a.)


Fig. 6: a) a 64×64 delta network. b) a 64×64 delta
network with 4×4 recursive switches. c) a 64×64
delta network with 8×8 recursive switches.


6.3. Generalized Switch-Recursive Banyans 


A large number of possible topologies exist when switch level
recursion is used. A conventional N×N network can be realized
by an m stage network, where each stage is made of $k_i \times k_i$
crossbars [3]. The integers $\langle k_1, k_2, \ldots, k_m \rangle$ correspond to the
factors of N, i.e., $k_1 k_2 \cdots k_m = N$. Given a basic switch size,
such as a 2×2 crossbar, we can synthesize each $k_i \times k_i$ crossbar
with a MIN, provided the crossbar size is a power of 2. After
applying switch level recursion corresponding to some factorization
of N, we are left with a $\log_2 N$ stage network with N/2 switches
per stage, as before. However, the topology is different and
generally has more localized communications.
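
To make the space of candidate topologies concrete, the Python sketch below enumerates the ordered factorizations of N = 2^n into stage sizes that are powers of 2. It is an illustration only (the function name is hypothetical), and it covers a single level of recursion, not the nested case.

```python
def power_of_two_factorizations(n_bits):
    """Yield ordered tuples (k_1, ..., k_m) of powers of two >= 2
    whose product is 2**n_bits."""
    if n_bits == 0:
        yield ()
        return
    for first in range(1, n_bits + 1):
        # the first stage uses 2**first x 2**first switches
        for rest in power_of_two_factorizations(n_bits - first):
            yield (2 ** first,) + rest
```

For N = 8 this yields (2, 2, 2), (2, 4), (4, 2) and (8,). In general there are 2^(n-1) ordered factorizations, so exhaustive search over one level of recursion grows quickly with n.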


Clearly, this recursion can be extended to many levels, so
that a k×k switch could be realized by recursive switches, which
in turn use recursive switches. We have not yet proven results on
an optimal factorization of N. However, a heuristic that seems to
give optimal results is as follows (let $n = \log_2 N$). To realize an
abstract N×N network, we use a 2 stage recursive SW-banyan,
with recursive switches of size $2^{\lfloor n/2 \rfloor}$ in one stage, and recursive
switches of size $2^{\lceil n/2 \rceil}$ in the other stage. One recursive banyan
network derived from this heuristic is shown in fig. 7b. This
particular network requires only one level of recursion.
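
The two-stage heuristic can be sketched in a few lines of Python. This is an illustration under the assumption that the two recursive switch sizes are 2 raised to the floor and ceiling of n/2; the function name is hypothetical.

```python
def heuristic_switch_sizes(n_ports):
    """Return (smaller, larger) recursive switch sizes for an
    n_ports x n_ports network, n_ports a power of two >= 4."""
    n = n_ports.bit_length() - 1  # n = log2(n_ports) when n_ports is a power of two
    if n_ports < 4 or 2 ** n != n_ports:
        raise ValueError("n_ports must be a power of two >= 4")
    # split the n address bits as evenly as possible between the two stages
    return 2 ** (n // 2), 2 ** (n - n // 2)
```

For N = 64 this gives two stages of 8×8 recursive switches, matching fig. 7b; for N = 128 it gives 8×8 and 16×16 switches.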


The recursion process only improves the area requirements 
when the network size is >8. We have written a branch and 
bound routine to try all possible factorizations for a particular N, 
and the heuristic always yields the optimal result. 


7. VLSI Comparison of SR-Banyans and Crossbars 


An analysis of the area requirements of SR-banyans is now
presented. Let $H_{sr}(N)$ represent the height required for an
N×N SR-banyan. The SR-banyan consists of 2 stages, with
recursive switches of size $2^{\lfloor n/2 \rfloor}$ in one stage, and recursive
switches of size $2^{\lceil n/2 \rceil}$ in the other stage, where $n = \log_2 N$.


Fig. 7: a) 64×64 SW-banyan network, made with
2×2 switches. b) 64×64 SW-banyan network,
made with recursive 8×8 switches, in turn made
with 2×2 switches.


Our analysis indicates that the area and delays are minimized
when the smaller switches are in the first stage, so let each
recursive switch in the top stage be of size $2^{\lfloor n/2 \rfloor}$. The recursive
switches in the top stage have width $T_{sr} = 2^{\lfloor n/2 \rfloor - 1} \cdot W_{sw}$, and
the recursive switches in the bottom stage will have width
$B_{sr} = 2^{\lceil n/2 \rceil - 1} \cdot W_{sw}$, where $W_{sw}$ is the width of a 2×2 switch as
described in section 3.


Since N/2 links, with w bits each, will cross an imaginary
line drawn through the middle of this network, the vertical height
required for routing between these two stages is $6wN/2s$, where s
is the number of metal layers used for horizontal routing.


The height required by drivers for a recursive switch in the
top stage must be calculated. Consider a recursive switch nearest
the middle of the row in the top stage. At this point, the channel
used for routing the wires is most dense and any height required
for drivers will contribute to the height of the network (at the
ends, the drivers can extend into the channel since very few wires
exist there). One line leaving this switch will have a vertical
component of ≈ $6wN/2s$ only, two lines will each have this vertical
component and a horizontal component of ≈ $B_{sr}$, another pair of
lines (if any) will have this vertical component and a horizontal
component of ≈ $2B_{sr}$, etc. Assuming that the total driver area
can be realized over the width of the switch $T_{sr}$, then each line of
length p contributes height $dr_{sr}(p)$ to the height of the switch
(note that $T_{sr}$ changes according to N):


$$dr_{sr}(p) = \begin{cases} \dfrac{3vp}{T_{sr}} & \text{if } p > \dfrac{2}{0.75\alpha} \\ 0 & \text{otherwise} \end{cases}$$

The total height added to a switch due to drivers is then

$$DR_{sr}(N) = \sum_{i=1}^{2^{\lfloor n/2 \rfloor}} w \cdot dr_{sr}\!\left(\frac{6wN}{2s} + \left|2^{\lfloor n/2 \rfloor - 1} - i\right| \cdot B_{sr}\right)$$

where $n = \log_2 N$. Hence, the total height required for the
SR-banyan is then

$$H_{sr}(N) = \frac{6wN}{2s} + DR_{sr}(N) + H_{sr}\!\left(2^{\lfloor n/2 \rfloor}\right) + H_{sr}\!\left(2^{\lceil n/2 \rceil}\right)$$

where $H_{sr}(2)$ and $H_{sr}(4)$ are as before.
The total area of an N×N SR-banyan is then

$$\text{Area} = \frac{N}{2} \cdot H_{sr}(N) \cdot W_{sw} \qquad (11)$$


Fig. 8: a) ratio of SR-banyan to crossbar areas. b)
ratio of SR-banyan to crossbar time delays. c) ratio
of SR-banyan to crossbar space-time products
(dashed curves are from traditional analysis).
The delay equations from section 4 can be adapted in a similar
manner to yield the delay of SR-banyans. (We have used the
average delay over a recursive switch at the end of each row and
a recursive switch nearest the middle of each row.)


Fig. 8 illustrates the area, delay, and space-time comparisons
between SR-banyans (given by the heuristic) and crossbars in a
VLSI environment. The area requirements of the SR-banyans are
about 50% less than those of SW-banyans for large N. For
N = 1024, with b = 16 and s = 1, the SR-banyan requires 45% less
area than the SW-banyan, and has 19% less delay. For
N = 16,384, the savings are 49% and 25% respectively. (Note that
Delta [10] and Omega [8] networks require $O(N^2 \log_2 N)$ area, and
are not even considered here.)


Fig. 9 illustrates the area and space-time comparisons when 2 
and 4 layers of metal are used for horizontal routing. Clearly, as 


more layers of metal are used, the banyan network becomes more 
attractive. 


8. Conclusions 

A class of multistage interconnection networks, called
SR-banyans, that minimizes the number of wire crossovers has been
presented. The basic idea is to use a MIN (made with smaller
MINs) as the basic switch in the realization of a larger MIN. We
have presented a heuristic that appears to yield the optimal
network, in terms of minimized wire crossovers, for a given network
size N and a given basic switch size k. The resulting
switch-recursive banyan has far fewer wire crossovers when compared to
other MINs, such as the SW-banyan, Baseline, Delta and Omega
networks.


More importantly, switch-recursive MINs have very distinct 
clusters in the topology, with heavy communications within clus- 
ters, and little communications between clusters. It has been 
estimated that printed circuit boards and wire interconnect are 
much bigger contributors to system delays than gates, in a discrete 
component realization [13]. In [2] it is estimated that the round 
trip network delay is dominated by cable delay, which is irreduci- 
ble because it is imposed by the physical size of the assemblage. 
The clustering in switch-recursive MINs makes the partitioning of 
these networks onto printed circuit boards, in a way to minimize 
edge connector and board costs, relatively straightforward. The
delay through a switch-recursive banyan realized in a discrete com- 
ponent environment should be reduced significantly, since fewer 
edge connectors and ribbon cables must be traversed, and since the 
network size should be reduced significantly. 


In a VLSI realization, where the entire MIN is realized in a
single integrated circuit (as a $\log_2 N$ stage network with N/2
switches per stage), such clustering reduces the area requirements
of these networks by up to 50% (for large N) when compared with
the SW-banyan layout. The average delay through the network is
reduced by up to 25% (for large N) when compared with the
SW-banyan layout. (The area required for an 8×8 banyan can be
improved further; see [14].)


However, even better layouts for banyans exist if we assume
that the links into the network are not necessary. This could be
true if a) processors are realized on the same substrate and have
negligible area, as in [7], or b) each switch actually contains a
processor. Asymptotically up to 50% of the area of the SR-banyan or
up to 75% of the area of the SW-banyan can be saved. The savings
are achieved by laying out the top row of a SW-banyan as two
horizontal rows, one above the other, and each half as long as the
original. The two subnetworks (of size N/2 × N/2) are placed one
above and one below this "row" (using the SR-banyan layout), for
a total area about equal to that for the two subnetworks. (For
N = 1024, b = 16, and s = 1, this layout requires 48% less area
than that for the SR-banyan, and 70% less than that for the
SW-banyan.)


While this paper has focused on SW-banyans and SR-banyans,
the results are more general. It has been proved that all
"strict-buddy type MINs" (i.e., SR-banyan, SW-banyan, baseline,
delta, indirect binary n-cube, and omega networks) are isomorphic
[1]. Hence, if we are able to change the labels assigned to the
processors connected to an SR-banyan, then any strict-buddy type
MIN can be realized. This labelling can be accomplished by having
a writable register associated with each processor that contains
its label. This result has an important implication. The omega
network is very useful in array processor architectures [8], but has
a very high cost due to the number of link crossovers. However,
we can transform the omega network into an SR-banyan, and
obtain the exact same behaviour at a much lower cost. Similarly,
any other strict-buddy type MIN can be realized simply by changing
the processor labels.


Fig. 9: a) ratio of SR-banyan to crossbar areas when
2 layers of metal are used for horizontal routing. b)
ratio of SR-banyan to crossbar space-time products
when 2 layers of metal are used for horizontal routing.
c) ratio of SR-banyan to crossbar areas when 4
layers of metal are used for horizontal routing. d)
ratio of SR-banyan to crossbar space-time products
when 4 layers of metal are used for horizontal routing.
ABSTRACT


This paper evaluates three types of static interconnection
networks for VLSI implementation. First, a realistic VLSI
computational model is proposed, taking into account several
aspects of VLSI technology, such as chip embedding,
interconnection delay, chip yield, device dissipation, device
failure, etc. The criteria of evaluation have been selected from
three orthogonal aspects: physical (chip area and dissipation),
computational speed (message delay and message density) and
cost (chip yield, operational reliability and layout cost). The
result of the evaluation reveals that Cellular Networks
similar to two dimensional meshes are most suitable for on-chip
parallel processing in VLSI. The salient feature of this paper is
that it augments the selection criteria for interconnection
networks beyond the classical $AT^2$ metric and provides results
pertaining to realistic VLSI implementation.


1. INTRODUCTION


Over the past two decades many interconnection networks
have been proposed in the literature for SIMD architectures.
Extensive accounts of these networks and their performance
evaluation have been reported in [1,2,3,4]. Now, with the
advent of submicron silicon technology, more than ten million
devices can be integrated in a VLSI circuit [5], which has
opened a new vista in parallel processing. On-chip multiprocessing
by several processors executing special algorithms is
envisioned to be the major application goal of future generation
computers using massive parallelism.


In order to justify the need for re-evaluation of intercon- 
nection networks, it should be noted that in the design of earlier 
networks adapted for non-VLSI environment 


a. spatial distribution of the processors is not a con- 
straint on the design, 


b. signal propagation time is exclusively determined by 
the velocity of electromagnetic wave in the resistive 
medium and is negligibly small compared to the speed 
of operation. Thus the length of the interconnecting 
wire is not a constraint on design, 


c. cost of the system is directly proportional to the aver- 
age number of links per node, 


d. fault-tolerance capability of such networks is merely
a topological property (i.e., whether alternate message
transmission routes exist or not).


On the contrary, under a VLSI environment


1 This research was supported by the Semiconductor Research Corporation
(SRC) under contract 84-06-049.


a. the spatial distribution of the processors plays an
important role in the total chip area,


b. the interconnecting wire behaves as transmission line 
having both resistive and capacitive components and 
signal propagation time is largely dependent on these 
values which are directly proportional to the length of 
the interconnection, 


c. the number of links per node does not have any direct
relevance to design cost. The regularity of the network's
topology decides the layout cost and the size of the chip
decides the fabrication cost,


d. fault-tolerance has an additional role to play. To 
improve the device yield and thereby to reduce the 
overall cost, it is necessary to introduce redundant 
processors. The existence of long interconnects 
increases the chip failure probability, both at the time 
of fabrication and in normal operation. 


This paper, which was written as a term project in an
architecture course, evaluates the performance of
three types of static interconnection networks and determines
how these new constraints modify the performance of these
interconnection networks in VLSI implementation. The choice
of the networks has been motivated from the facts that these 
networks have been widely pursued in the literature for design- 
ing many algorithms [6, 7] and all of them have optimal VLSI 
layouts. Also, these provide the insights to three topologically 
different classes of networks. The networks discussed are: two 
dimensional meshes, binary trees and Cube Connected Cycles 
(CCC). These three networks conceptually belong to separate 
classes in the sense that if the interprocessor link is constrained 
to have constant length, their distribution in three dimensional 
space reveals somewhat planar, conical and spherical surfaces, 
respectively. Thus the results of this evaluation can be easily 
extended to assess the performance of other networks, because 
most networks have one of these three topologies in three 
dimensional space. Networks with wraparound connections like
the ILLIAC IV describe a toroidal surface and are not suitable
for VLSI implementation because the end-around interconnection
introduces large delay and also expands the area by a factor of
two. Otherwise, the overall performance of these networks is
similar to two dimensional meshes [3].


First, a computational model for the VLSI technology is
proposed and the results of the model are employed to evaluate
the performance of the networks. The model is similar to
Thompson’s [8] model and additionally accounts for device 
faults and chip yield. The criteria of evaluation are enumerated 
from three orthogonal viewpoints, viz. area, speed and cost. 
The results of the analysis are compared and it is concluded 
that the regular structured Cellular Networks having short 
interconnects similar to the two dimensional meshes are the 
most suitable candidates for VLSI implementation with two 
layered interconnects. It may be pointed out that the conclusion 


made here is on the basis of asymptotic performance and may 
not hold good for small sized networks. Some of the results dis- 
cussed here have been reported by other researchers. But the 
modest objective of this paper is to identify the relevant aspects 
of VLSI technology and to demonstrate that under the new con- 
straints (like chip yield, layout regularity, failure probability 
of long interconnects, reliability improvement per redundant 
processor, etc.), Cellular Networks indicate overall better per- 
formance than H-tree, CCC and other fast networks. This is in 
direct contrast to the results of [3,9] where it was concluded 
that fast networks like CCC, PSN, dual bus hypercubes, etc., 
have better overall performance. The paper is organized as fol- 
lows: Section 2 discusses the VLSI model, Section 3 discusses the 
criteria of evaluation, Section 4 provides the performance 
analysis, Section 5 compares the performance of these three net- 
works and Section 6 concludes the paper. 


2. VLSI MODEL OF COMPUTATION 


The model is based on MOS and C(omplementary)MOS
technologies and assumes a two layered model for interconnects.
The two layered model uses one layer of polysilicon and one
layer of metal for interconnection networks. A two layered
model is easy to fabricate and provides a high yield. Composite
two layered metal interconnects like Al-Pt-Ti-Au improve
the device performance, but are difficult to fabricate and are not
assumed in this model. The model makes four types of
assumptions, as discussed below:


2.1. Embedding Assumptions:

Assumption 2.1.1: All processors are identical in shape
and size.


Since the processors have identical computational power,
they need equal computational area and hence can be identical
in shape and size.

Assumption 2.1.2: Each processor occupies O(1) area and
can be represented as a square of unity area.


The natural layout of any computational circuit has a rec- 
tangular topology. Any rectangular topology can be converted 
to a square topology by increasing the overall area by at most a 
factor of three [10, 11]. 


Assumption 2.1.3: Wires always run either in vertical or 
in horizontal direction in two different layers. 


This scheme is called the Manhattan interconnection tech- 
nique [12]. Since the processors are aligned parallel to the 
Cartesian co-ordinates, interconnection wires being perpendicu- 
lar to the sides of the processors run along the horizontal and 
vertical directions. 


Assumption 2.1.4: At most two wires may cross over each 
other at any point in the plane. 


In MOS technology metal wires can cross over either 
polysilicon or diffusion without making contact. Whenever 
polysilicon runs over diffusion automatic contact is established 
resulting in a transistor at the overlapping surface. 


Assumption 2.1.5: Wires can bypass the faulty nodes.


Switches can be placed over the processor to isolate them 
[13] and built in self test circuit can be incorporated to test 
whether the processor is faulty. 


Assumption 2.1.6: Wire of length l behaves as a
transmission line having O(l) resistance and O(l) capacitance.


Since the interconnection wire has distributed resistance
and capacitance all along the length of the wire, the O(l)
assumption is justified.
2.2. Timing Assumptions:


Assumption 2.2.1: Processors have O(1) computational 
delay. 


Since the processors are identical in shape and occupy O(1) 
area, the delay is constant. 


Proposition 1: Wire of length l introduces O(l²)
delay and this is the upper bound of propagation
delay.

Proof: see [14].


For metal wire, the delay can be reduced to O(log l) by
introducing a cascade of drivers as stated in the following
proposition.


Proposition 2: The metal wire of length l has a
minimum of O(log l) delay and it needs O(log l) stages
of drivers having a combined area of O(l).

Proof: see [15].


Mead & Conway [14] have shown that the quadratic propagation
delay in a polysilicon (or diffusion) wire of length l can
be minimized to O(l) by introducing l/k drivers at regular
intervals such that the driver delay is equal to the signal propagation
time over polysilicon (or diffusion) of length k. Thus

Proposition 3: The poly/diffusion wire of length l
has a minimum of O(l) delay and it needs l/k stages
of drivers having a combined area of O(l).

Proof: see [14].
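
The difference between quadratic and linear wire delay can be illustrated with a small Python sketch. It assumes a simple distributed-RC estimate of 0.5·r·c·l² for an undriven wire, which is a common approximation and not a formula taken from this paper; `r`, `c`, `segment` and `driver_delay` are hypothetical parameters, and a final partial segment is counted as a full one.

```python
import math


def unbuffered_delay(length, r=1.0, c=1.0):
    """Delay of an undriven RC wire, quadratic in its length (assumed model)."""
    return 0.5 * r * c * length ** 2


def repeated_delay(length, segment, driver_delay, r=1.0, c=1.0):
    """Delay when a driver is inserted every `segment` units of wire."""
    stages = math.ceil(length / segment)
    # each stage costs one driver delay plus the delay of a short wire
    return stages * (unbuffered_delay(segment, r, c) + driver_delay)
```

Doubling the length quadruples the undriven delay but only doubles the delay with regularly spaced drivers. Setting `driver_delay` equal to `unbuffered_delay(segment)` corresponds to the spacing rule described above.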

Ramachandran [16] has shown that if many parallel long 
wires exist and space available externally to the processors is 
not sufficient to lay down all the drivers, then the delay can be 
minimized at the cost of increasing the size of the processors. In 
the worst case, this results in quadrupling the area of the pro- 
cessor. 


2.3. Energy Dissipation Assumption:- 


Assumption 2.3.1 Each processor of size O(1) consumes 
O (1) power in unit time. 


Since the power consumption in CMOS occurs due to 
charging and discharging of nodal capacitors, the switching 
power consumption at a node is equal to 0.5SCV7f , where f is 
the frequency of operation and C is the nodal capacitance. For a 
static CMOS circuit, the total switching power dissipation 
depends on the input to the processor and the internal circuit 
configuration of the processor. On the average, the dissipation 
will be a constant fraction of the total processor area and is 
bounded by O(1). For dynamic CMOS circuit, since each gate 
will be charged by the refreshing clock, an O(1) power con- 
sumption automatically follows. 


Proposition 4: A wire of length $l$ consumes at most
$O(l)$ power.

Proof: The power consumed by the resistor $R(l)$ in charging
the capacitor $C(l)$ is equal to $0.5C(l)V^2 f$. The capacitor stores
charge only and does not dissipate any energy. Hence the upper
bound is $O(l)$.


2.4. Failure Assumptions:-


Failures in VLSI can be classified into two categories:
chip-related defects, which lower the IC yield, and field
(operational) failures, which are a function of time.


Assumption 2.4.1: Defects in the fabrication of an IC
(viz., pinhole defects in oxide, defects in photoresist, implant
defects, etc.) are randomly distributed and statistically
independent.


Generally, gross imperfections causing large areas of the
chip to be bad (i.e., area defects) are detected at slice test,
and line defects (like scratches) do not occur in a
well-controlled process [17]. The most commonly encountered
chip-related flaws are random isolated spot defects. Clusters of spot
defects also occur, but they can be treated as a single defect
since usually the size of the processor is large enough to
encompass the whole cluster. Thus the probability of failure of
processors is independent and the same for all the processors.


Assumption 2.4.2: The yield of an IC decreases inversely 
with the size of the chip. 


Since the defects occur randomly as a Poisson process, the
yield (i.e., the probability of having no defect in a chip) is given
by

$$Y_0 = e^{-AD_0} \qquad (2.4)$$


Practical results show that this slightly under-estimates the
yield for larger chips [18], and the Poisson distribution of
random defects represents the yield pessimistically [19]. A
better fit to practical results has been obtained by
employing Bose-Einstein statistics, and it is shown [20] that the
yield of defect-free chips is given by

$$Y_1 = \frac{1}{1 + AD_0} \qquad (2.5)$$

where $AD_0$ is the average number of defects per chip.
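
The two yield models can be compared numerically. The sketch below assumes the forms exp(-A*D0) for the Poisson model and 1/(1 + A*D0) for the Bose-Einstein model; the function names and the sample values of chip area and defect density are illustrative only.

```python
import math


def poisson_yield(area, defect_density):
    """Yield under Poisson statistics: exp(-A * D0)."""
    return math.exp(-area * defect_density)


def bose_einstein_yield(area, defect_density):
    """Yield under Bose-Einstein statistics: 1 / (1 + A * D0)."""
    return 1.0 / (1.0 + area * defect_density)


for area in (0.5, 1.0, 2.0, 4.0):
    print(area, poisson_yield(area, 1.0), bose_einstein_yield(area, 1.0))
```

For every positive value of A*D0 the Poisson estimate lies below the Bose-Einstein estimate, and the gap widens with chip area, which matches the remark that the Poisson model is pessimistic for larger chips.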


The random distribution of defects can be assumed to
increase linearly with size for small sized defects like pinholes
and to decrease as the cube of defect size for large sized defects
like those occurring in diffusion patterns [21]. Using this
random defect distribution, it can be shown that the effect of these
defects on interconnection is very drastic and the failure rate is
related to the aspect ratio of the interconnection wire. Formally,


Proposition 5: A wire of length $l$ and width
$\lambda_w$ can fail with a probability proportional to $O(l/\lambda_w)$.


Proof: Let $L$ be a long wire of length $l$ and width $\lambda_w$. Let a
circular defect of diameter $\eta$ occur randomly on the conducting
wire, resulting in a hole as shown in Figure 1. If the width of
the conducting material available for current conduction is
sufficient to carry current in normal operation without causing
any catastrophic failure, then the pinhole caused by the defect
will not have any effect on the wire. Let $\delta$ be the minimum
width of the wire required for normal operation at a particular
current density; then defects of diameter $\eta < (\lambda_w - \delta)$ will not
cause any failure, while defects of diameter $\eta \ge (\lambda_w - \delta)$
will cause failures if they occur such that they do not leave a
width $\delta$ of conducting wire. The locus of the centers of defects
that lead to failure is called the critical area, $A(\eta)$, and is given by

$$A(\eta) = \begin{cases} 0 & \text{for } 0 \le \eta < (\lambda_w - \delta), \\ (\eta - \lambda_w)\, l & \text{for } (\lambda_w - \delta) \le \eta < \infty. \end{cases}$$

If $f(\eta)$ denotes the distribution of defect density, then the
average value of the critical area with respect to the defect size
distribution is given by

$$\bar{A} = \int_0^{\infty} A(\eta) f(\eta)\, d\eta.$$

Very small defects can be assumed to increase linearly with
defect size up to a certain value (say $\eta_0$) and large defects can
be assumed to vary inversely as the cube of defect size [21].
Thus

$$f(\eta) = \begin{cases} \eta/\eta_0^2 & \text{for } 0 \le \eta \le \eta_0, \\ \eta_0^2/\eta^3 & \text{for } \eta_0 \le \eta < \infty. \end{cases}$$

Hence,

$$\bar{A} = \int_{\lambda_w}^{\infty} (\eta - \lambda_w)\, l\, (\eta_0^2/\eta^3)\, d\eta = O(l/\lambda_w), \quad \text{assuming } \delta \ll \lambda_w.$$

The probability of failure of the wire is directly proportional to
its critical area and is $O(l/\lambda_w)$.
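
The scaling in Proposition 5 can be checked numerically. The sketch below is only an illustration under stated assumptions: the critical area is taken as (eta - width) * length for eta >= width (the minimum width delta is neglected), only the large-defect branch eta0**2 / eta**3 of the size distribution is used (which presumes eta0 <= width), and the function name is hypothetical. Under these assumptions the integral evaluates to length * eta0**2 / (2 * width).

```python
def mean_critical_area(length, width, eta0, steps=10000):
    """Midpoint-rule estimate of the average critical area.

    Integrates (eta - width) * length * eta0**2 / eta**3 over
    eta in [width, infinity). The substitution u = 1/eta maps the
    infinite range onto (0, 1/width], with d(eta) = eta**2 du.
    """
    du = (1.0 / width) / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du
        eta = 1.0 / u
        integrand = (eta - width) * length * eta0 ** 2 / eta ** 3
        total += integrand * eta ** 2 * du
    return total


for width in (2.0, 4.0, 8.0):
    print(width, mean_critical_area(100.0, width, 1.0))
```

Doubling the wire width halves the estimate and doubling the length doubles it, in line with the O(l / lambda_w) behaviour stated in the proposition.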
3. CRITERIA FOR EVALUATION


The interconnection networks have been modeled to study 
three aspects - a) the physical aspects, b) the computational 
aspects and c) the cost aspects. The overall performance of the 
networks has been estimated from these viewpoints which form 
three orthogonal classes. 


3.1. Physical Aspects:

The physical aspects relate to the chip area and the power
consumed by the chip.


3.1.1. Chip Area


The chip area refers to the total area required to lay the 
processors and the communication links. The area occupied by 
the processors is the computation area. The ratio of the compu- 
tation area to the total area is the performance metric and is 
defined as area efficiency. 


3.1.2. Power Consumption 


Like the chip area, the total power consumed by the chip 
can be divided into computational power which is equal to the 
power dissipated by the processors and the power required to 
drive the communication links. The ratio of the computational 
power to the total power consumed by the chip is defined as the 
power efficiency. The average power consumption inside the chip 
is an important parameter because in VLSI the chip computa- 
tional limitation is largely due to finite device dissipation. The 
standard commercial epoxy and ceramic packages allow about 
2-5 watts of steady state power dissipation. 


3.2. Computational Aspects: 


The computational aspects refer to the speed of computa- 
tion and the message flow within the networks. The speed is 
estimated by computing the delay in interprocessor communica- 
tions and the message density in the interprocessor links reveals 
the message flow. 


3.2.1. Delay 


The delay refers to the time needed in interprocessor
communication to execute a computational task. This is a
property of both the topology of the network and the length of the
interconnects. The average energy dissipation is measured as the
product of the total chip area and the average delay.


3.2.2. Message Traffic Density 


An important aspect of the computational power of the
networks is the distribution of data flow within the networks.
An efficient network should avoid message traffic congestion
at the links and should distribute the data (message) flow
uniformly across all the available links. The message traffic density
at the links with respect to the network size is a good
measure of the computational power of the network. The average
rate at which each node originates messages is assumed to be
fixed at one message per time unit regardless of network size $N$.
Also it has been assumed that the average rate at which any
node $i$ within the network transmits messages to another node
$j \ne i$ in the network is constant. The average message traffic
density is defined as the product of the number of processors ($N$)
and the ratio of the average number of nodes visited by a message
to the total number of links in the network, and is denoted by $M$.


3.3. Cost Aspects: 


The cost aspects consider the fabrication cost and the
replacement cost due to poor reliability of the networks. The
manufacturing cost of the IC is related to the total chip area
(Assumption 2.4.2) and the regularity of the layout [22]. The
reliability of the networks largely depends on the presence of
long interconnects (Proposition 5) in the embedding and the
topology of the networks. Depending on the existence of
alternate message routes within the networks, the networks can fail
completely or partially.


3.3.1. Yield


The yield of the IC is largely dependent on the total chip
area. The occurrence of random spot defects will reduce the
yield by a factor inversely proportional to the area, $A$, of the
chip. This $O(1/A)$ factor is called the yield factor and is a
measure of the manufacturing cost of the IC. Defective
processors can be replaced by redundant processors, but chips with
defects on the interconnects cannot be salvaged. The size of the
interconnect is highly relevant for the chip yield (Proposition 5),
and the presence of long wires will be enumerated for the
evaluation of the manufacturing cost.


3.3.2. Regularity 


The regularity of the network largely decides the layout
cost. Since all processors are identical, an O(1) layout cost can
be assumed for laying out a processor. Since the links between the
processors can also be laid out hierarchically, the actual cost to
lay out the links may be much less than the actual number of
links. The regularity factor is a measure of layout cost and is
defined as the ratio of the total number of interconnections to the
number of interconnections actually laid.


3.3.3. Fault-tolerance 


The fault-tolerance capability largely decides the reliability
of a working chip. Due to a host of causes, like electromigration,
Kirkendall effects, hot electron effects, etc., a processor or
a link may fail during the normal use of the chip. Depending on
the topology of the networks, the failure of a single
processor or a link will adversely affect the operation of the
network. Normally the level of masking and processing associated
with the interconnect is far simpler than that of the processors,
and the reliability of the interconnect is higher than the
reliability of the processor. So the reliability due to processor
failure and that due to interconnect failure are separated, and
different measures are used here. Since, by Proposition 5, the
probability of interconnect failure is directly proportional to its
length, the total length of the interconnects in the network is used
as the measure of fault-tolerance due to interconnect failure
and is denoted by $R_i$. The failure of a single processor will
result in performance degradation because it may isolate one or
more processors. The degradation in computing will be determined
by the maximum number of connected processors (say
$N'$) in the network after the occurrence of a single processor
failure. The value of $N'$ depends on the topology and the location
of the failed processor. The ratio of the maximum value of $N'$ to
the original size of the network is defined as the degradation
factor, $\delta$, and is used as a measure of fault-tolerance. The
reliability of the network can be improved by introducing redundant
processors. The ratio of the reliability of the redundant network
to that of the non-redundant (original) network is defined
as the reliability improvement factor (RIF). Since the addition of
redundant processors increases the chip area, the ratio of RIF to
the number of redundant processors (denoted by $\rho$) is also used
here as a measure of fault tolerance. The overall fault-tolerance
capability of the three networks will be graded as High, Medium
and Low by making a relative comparison of these three measures.


4. EVALUATION OF NETWORKS 


4.1. Two Dimensional Meshes

The two dimensional mesh is shown in Figure 2. By
Assumption 2.1.2, each processor is represented as a unit square
and the interconnect length can be ignored. Thus both the area
efficiency and the power efficiency of two dimensional meshes
are approximately equal to 1.


In order to compute the delay in a $\sqrt{N} \times \sqrt{N}$ mesh, it
should be noted that the message path length between two
arbitrarily located processors at $(i, j)$ and $(k, l)$ within the square
grid is given by the city block distance $d = |i-k| + |j-l|$.
Thus the average message path length from a source processor is
a function of its location $(i, j)$, assuming the lower leftmost
processor is at $(0, 0)$. If the source processor is at any of the
locations $(0, 0)$, $(n, 0)$, $(0, n)$ or $(n, n)$ (where $n = \sqrt{N}$),
then the average message path length is

$$d_1 = n^{-2}\,[\,2 \cdot 1 + 3 \cdot 2 + \cdots + n(n-1) + (n-1)n + \cdots + (2n-2)\,] = O(n) = O(\sqrt{N}).$$

If the source processor is at $(n/2, n/2)$, then the average message
path length is

$$d_2 = 4n^{-2}\,[\,2 \cdot 1 + 3 \cdot 2 + \cdots + \tfrac{n}{2}(\tfrac{n}{2}-1) + (\tfrac{n}{2}-1)\tfrac{n}{2} + \cdots + (n-1)\,] = O(n) = O(\sqrt{N}).$$

For the source processor in any other position it can be shown that
the average message path, $d$, is $O(\sqrt{N})$ and satisfies the
inequality $d_2 < d < d_1$.


Assuming an O(1) delay time (Assumption 2.2.1) associated
with each processor, the average delay between the processors
is $D = O(\sqrt{N})$.
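
The linear growth of the average path length with the side of the mesh can be confirmed by brute force. The following sketch assumes an n x n grid indexed from 0 and averages the city block distance over all n*n grid positions (including the source itself); the function name is hypothetical.

```python
def average_path_length(n, source):
    """Average city block distance from `source` to every node of an n x n mesh."""
    si, sj = source
    total = sum(abs(si - i) + abs(sj - j) for i in range(n) for j in range(n))
    return total / (n * n)


for n in (8, 16, 32):
    corner = average_path_length(n, (0, 0))
    centre = average_path_length(n, (n // 2, n // 2))
    print(n, corner, centre, corner / n, centre / n)
```

For a corner source the average is n - 1 and for a central source it is roughly n / 2, so both grow in proportion to n, that is, to the square root of the number of processors.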


The total number of links in the square mesh is equal to
$2\sqrt{N}(\sqrt{N} - 1) = O(N)$ and the average message path is
$O(\sqrt{N})$. Assuming all the $N$ nodes issue messages
simultaneously, the average message traffic density is then

$$M = N \cdot O(\sqrt{N})/O(N) = O(\sqrt{N}).$$


Since the average delay is $O(\sqrt{N})$ and the chip size is
$O(N)$, the average chip dissipation is $O(N^{3/2})$. Also the area
efficiency is 1 and, by Assumption 2.4.2, the yield is $O(1/N)$.


The layout can be constructed hierarchically, and a block
of $4^k$ processors can be laid in the $k$-th step paying $2^{k+1}$ cost.
Thus a network of size $N$ needs

$$\sum_{k=1}^{\log_4 N} 2^{k+1} = 2^{\log_4 N + 2} - 4 = O(\sqrt{N})$$

cost. The regularity factor is thus $O(N)/O(\sqrt{N}) = O(\sqrt{N})$.


Since the network consists of nearest-neighbor connections,
the interconnect reliability is $R_i \approx 1$. Due to the presence
of many parallel paths in the square grid, the failure of a single
processor results in the isolation of the failed processor only and
does not impair the performance of the network drastically. The
degradation factor is thus $\delta = 1/N$. If $R_p = e^{-\lambda t}$ is the
functional reliability of each processor in the mesh, then the
overall reliability of the network is $R_M(t) = R_p^N$. This reliability
can be significantly improved by adding a redundant row, and the
overall network can be made $(\sqrt{N} - 1)$ fault-tolerant. The
reliability of the redundant mesh network is
$R_{RM}(t) = (R_p^n + n(1 - R_p)R_p^{n-1})^n$, where $n^2 = N$. The
reliability improvement factor, $RIF_M$, due to the redundant
processors is given by $RIF_M = R_{RM}/R_M$, such that
$\rho = (1/(2n-1))(1 + n(R_p^{-1} - 1))^n$.
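
The reliability improvement factor of the redundant mesh can be evaluated directly. The sketch below assumes the expressions R_p**N for the plain mesh and (R_p**n + n*(1 - R_p)*R_p**(n-1))**n for the mesh with a redundant row, with n*n = N; the function names and the sample processor reliability are illustrative.

```python
def mesh_reliability(rp, n):
    """Reliability of an n x n mesh in which every processor must work."""
    return rp ** (n * n)


def redundant_mesh_reliability(rp, n):
    """Reliability with a redundant row, using the expression assumed above."""
    return (rp ** n + n * (1 - rp) * rp ** (n - 1)) ** n


def rif_closed_form(rp, n):
    """Ratio of the two reliabilities, simplified: (1 + n*(1/rp - 1))**n."""
    return (1 + n * (1 / rp - 1)) ** n


rp, n = 0.99, 4
rif = redundant_mesh_reliability(rp, n) / mesh_reliability(rp, n)
print(rif, rif_closed_form(rp, n), rif / (2 * n - 1))
```

The last printed value is the improvement factor divided by (2n - 1), following the normalisation used for the per-processor measure in the text.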


The delay can be improved for mesh networks if the
processors belonging to each column and each row are connected
hierarchically as binary trees. Such networks are known in the
literature as orthogonal tree networks [23], mesh of trees
[24, 25] and orthogonal forests [26]. The average delay for such
networks reduces to $O(\log N)$ but the chip area increases to
$O(N \log^2 N)$. The overall performance thus does not improve.
On the contrary, the presence of long interconnects of length
$O(\sqrt{N})$ actually increases the average delay to $O(\log^2 N)$ (by
Proposition 2). Moreover, these networks suffer from many
practical limitations such as poor yield (due to large chip size), poor
regularity (due to the presence of mesh and trees combined),
$O(N \log N)$ cross-overs, long interconnects, etc.


4.2. Binary Tree Networks: 


The complete binary tree network of $N$ processors is
shown in Figure 3; it needs at most $O(N \log N)$ layout area,
corresponding to $O(N)$ leaves and $O(\log N)$ height of the tree.
A better layout, which needs the optimal $O(N)$ area, can be
constructed using the concept of an H-diagram, originally proposed
by Marihugh and Anderson [27] as a graphical approach to logic
design. Horowitz and Zorat [28] have constructed the algorithms
for the generation of such a layout, and the modified network
is henceforth referred to as an H-tree. The H-tree layout of
the binary tree is shown in Figure 4. The total area for a
complete binary tree of $N$ processors can be computed from the
following recursive relationship

$$A(N) = \left[\, 2\sqrt{A(\lfloor N/4 \rfloor)} + 1 \,\right]^2 \quad \text{with } A(1) = 1.$$

Assuming $N = 2 \cdot 4^k - 1$, it can be shown that

$$A(N) = 4^{k+1} - 2^{k+2} + 1 = 2N - 2.82\sqrt{N+1} + 3.$$

Thus, the area efficiency is better than 0.5. The longest
wire in the layout is of size $\sqrt{N}/2$ and the total length of wires
in the layout is given by the recurrence relation:

$$L(N) = 2L(\lfloor N/4 \rfloor) + \sqrt{N} \quad \text{with } L(7) = 1$$

which gives a solution $L(N) = O(\sqrt{N} \log N)$.
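
The H-tree area recurrence can be checked against its closed form. The sketch below works with the side of the square layout, assuming side(N) = 2 * side(N // 4) + 1 with side(1) = 1, so that the area is the square of the side; the function names are hypothetical.

```python
def htree_side(n):
    """Side of the square H-tree layout for a complete binary tree of n nodes."""
    if n <= 1:
        return 1
    return 2 * htree_side(n // 4) + 1


def htree_area(n):
    """Layout area: the square of the side length."""
    return htree_side(n) ** 2


for k in range(0, 6):
    n = 2 * 4 ** k - 1
    closed_form = 4 ** (k + 1) - 2 ** (k + 2) + 1
    print(n, htree_area(n), closed_form, n / htree_area(n))
```

For tree sizes of the form 2 * 4**k - 1 the recurrence and the closed form agree, and the last column (processor area divided by layout area) stays above 0.5.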


The power efficiency depends on $L(N)$ and the width of
the wires and is more than 0.5.


The worst case delay occurs when a message is propagated
between the leaves through the root of the tree. Assuming $O(l)$
delay for both metal wire (without driver) and polysilicon wire
(with interspersed drivers), the worst case delay can be given by

$$D_{max}(N) = 2\left[\, 2\sum_{j=1}^{k} (2^{k-j} - 1) \right] + 2k + 1 = 2\sqrt{2(N+1)} - \log(N+1) - 2 = O(\sqrt{N}).$$

But this delay is smaller than in the mesh network,
since only $2\log N$ intermediate processors are visited as opposed
to $2\sqrt{N} - 1$ in the mesh network. The average message delay
between the root and the other processors, obtained by summing the
wire and processor delays over all levels of the tree, is
$D = O(\sqrt{N})$.

To calculate the message traffic density on a link, consider
$N-1$ time units during which $N(N-1)$ messages are generated
and each node on the average will have sent a message to each of
the others. Let $h = \log(N+1)$ be the height of the tree network,
denoted as $T_h$, such that $|T_h| = N$. A subtree of $T_h$ at level $k$
from the leaves is shown as $T_k$ such that $|T_k| = 2^k - 1$. A link
between level $k$ and level $k+1 \le h$ will be used to transmit
messages between i) all the nodes of the left and the right
subtrees, each of size $|T_k|$, and ii) one subtree of size $|T_k|$ connected
by the link and the $(N - 2|T_k|)$ remaining nodes of the tree $T_h$ (Figure 5).
Thus, the message density per unit time at a link between level
$k$ and level $k+1$ is

$$M(k) = \frac{2}{N-1}\left[\, |T_k|^2 + (N - 2|T_k|)\,|T_k| \,\right] = \frac{2^k - 1}{2^{h-1} - 1}\,(2^h - 2^k).$$

Since $M(k)$ is a monotonically increasing function of $k$, the
maximum congestion occurs at the link between the root and its
sons (i.e., $k = h-1$), which can be obtained by solving
$\partial M(k)/\partial k = 0$. The total number of messages
that pass through these links is

$$M(h-1) = 2^{h-1} \approx N/2 = O(N).$$
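
The congestion at the root links can be verified by direct counting on a small tree. The sketch below assumes heap-style numbering (root 1, children of v at 2v and 2v + 1) and counts, for the link above node v, the ordered source/destination pairs that lie on opposite sides of that link, divided by the N - 1 time units considered in the text. The function names and the closed form used for comparison are this sketch's own reading of the counting argument.

```python
def in_subtree(x, v):
    """True if node x lies in the subtree rooted at v (heap numbering)."""
    while x > v:
        x //= 2
    return x == v


def link_density(h, v):
    """Messages per time unit crossing the link between v and its parent."""
    n = 2 ** h - 1
    crossings = sum(
        1
        for s in range(1, n + 1)
        for t in range(1, n + 1)
        if s != t and in_subtree(s, v) != in_subtree(t, v)
    )
    return crossings / (n - 1)


def closed_form(h, k):
    """Density for a subtree of 2**k - 1 nodes in a tree of height h."""
    return (2 ** k - 1) * (2 ** h - 2 ** k) / (2 ** (h - 1) - 1)


h = 4
for k in range(1, h):
    v = 2 ** (h - k)  # leftmost node whose subtree has 2**k - 1 nodes
    print(k, link_density(h, v), closed_form(h, k))
```

For h = 4 the density rises with k and reaches 2**(h - 1) = 8 on the links below the root, in agreement with the value of M(h - 1) stated above.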


Since the area efficiency is less than 1, the yield is not
likely to be as high as that of the meshes, but it is asymptotically
equal to $O(1/N)$. The average energy dissipation of the chip is
$O(N^{3/2})$.

The layout can be constructed hierarchically; each
level of embedding needs $7 \times O(1)$ cost, and the overall
connection cost for a network of $N$ processors is equal to $O(\log N)$.
The total number of links in a binary tree is equal to $N-1$.
Thus the regularity factor is $O(N/\log N)$.


The fault-tolerance capability of the network under the
failure of a single processor depends on the location of the
processor. If the external communication is done solely through
the root, its failure will have a totally disastrous effect,
invalidating the usability of the IC. Since there is no parallel path
for message flow, the failure of any link will truncate the
operability of the chip. The failure of any processor other than the
root will also reduce the performance of the IC by an amount
depending on its location in the tree. The computational degradation
that occurs due to a faulty processor or interconnect at level $i$
from the leaf nodes can be defined as the number of processors which
are eliminated due to the fault at level $i$, and is equal to $2^i - 1$.
If the external communication is made through the leaf nodes, then
$\delta = (N-1)/(2N)$. The network can be restored to normal
function by replacing the defective processor by a redundant
processor. Redundant processors can be placed in the extra space
available within the chip, and re-routing can be done by an
electrically programmable routing technique. Since only
$N - 2.82\sqrt{N+1} + 3$ space is available for laying out the redundant
processors, redundancy can be added for nodes down to level 2 from
the leaf nodes, i.e., level $h-2$ from the root. Thus the leaf
processors and their fathers are not replicated, and all other nodes
in the tree are replicated at the locations shown by * in Figure 4. If
$R_p$ is the reliability of each processor, then the overall reliability
of the tree network without any redundancy is $R_T = R_p^N$.
Clearly, $R_i = O(\sqrt{N} \log N)$. If the redundancy is added as
described above, then the reliability of the redundant network
is $R_{RT} = R_p^{3N/4} \times (2R_p - R_p^2)^{N/4}$. The reliability improvement
factor, $RIF_T$, is given by $RIF_T = R_{RT}/R_T = (2 - R_p)^{N/4}$ and
$\rho = (4/N)(2 - R_p)^{N/4}$. Thus the reliability of the tree network
is poorer compared to the meshes.


4.3. Cube Connected Cycles: 


An $m$ dimensional Cube Connected Cycle (CCC) is a
network which can be derived from a boolean hypercube of $2^m$
vertices by replacing each vertex with a cycle of $m$ vertices.
This was originally proposed by Preparata and Vuillemin [7] to
ensure that the degree of each vertex is bounded by 3 and not by
$m$ as in an $m$-cube network. The topology of a 3-dimensional
CCC with $3 \cdot 2^3 = 24$ processors is shown in Figure 6 and its
optimal VLSI layout is given in Figure 7. It should be
noted from Figure 7 that the layout of an m dimensional CCC 
can be made by laying out m C; routes in an m-cube [29] on a 
grid graph and replacing the vertices of the $m$-cube by $2^m$ cycles of
size $m$. Clearly, the maximum height of the cycle is
$\sum_{i=0}^{m-1} 2^i \approx 2^m$ and there are $2^m$ cycles in total. Thus
the total area required by an $m$ dimensional CCC is $2^{2m+1}$,
assuming that the width of the edge is equal to the square root of
the processor area. Since $N = m2^m$, then
$2^m = N/(\log N - \log m)$, i.e., $m \approx \log(N/\log N)$. Thus the
total chip area is approximately $O(N^2/\log^2 N)$. Thus both the
area efficiency and the power efficiency are $O((\log^2 N)/N)$.
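
The decline of the area efficiency with network size can be tabulated for small dimensions. The sketch below assumes N = m * 2**m processors of unit area and a layout of 2**m cycles of height about 2**m, taken here as an area of 2**(2m + 1); both the area expression and the function names are assumptions of this sketch.

```python
import math


def ccc_size(m):
    """Number of processors in an m-dimensional CCC: N = m * 2**m."""
    return m * 2 ** m


def ccc_area(m):
    """Layout area under the stated assumption: 2**(2m + 1)."""
    return 2 ** (2 * m + 1)


def area_efficiency(m):
    """Processor area divided by total layout area."""
    return ccc_size(m) / ccc_area(m)


for m in (3, 6, 9, 12):
    n = ccc_size(m)
    predicted = math.log2(n) ** 2 / n
    print(m, n, area_efficiency(m), predicted, area_efficiency(m) / predicted)
```

The last column stays within a constant factor as m grows, which is consistent with an efficiency of order (log N)**2 / N.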


Let the interprocessor link in the cycle be called the ring link
(vertical lines in Figure 7) and the interprocessor link between
two adjacent cycles be called the vertex link (horizontal lines in
Figure 7). Using Wittie's algorithm for message routing, it can
be shown that on the average a message traverses $m/2$ vertex
links and $(5m/4) - 2 + 2^{1-m}$ ring links if $m$ is even (and an
additional $1/(4m)$ ring links if $m$ is odd). It may be noted that
Wittie's average path length analysis for his routing algorithm
was incorrect, and the correct result is stated above. Wires are
either horizontal or vertical, and both cannot be metal
(Assumption 2.1.4). Thus, in order to reduce the propagation delay,
the ring links should be made of metal and the
vertex links should be made of polysilicon (or diffusion). It may
be noted that in an $m$ dimensional CCC, there are $m2^{m-1}$ vertex
links having a total interconnection length of $2^{m-1}\sum_{i=1}^{m} 2^i$, so that
the average vertex link length is $(2^{m+1} - 2)/m$. Thus the average vertex
delay is $4(2^m - 1)$, i.e., $O(N/\log N)$. The asymptotic average
delay over the metal ring links is $O(\log N)$. The worst case delay is
due to message transmission between two processors at the
opposite edges of the chip and is proportional to the perimeter of
the chip, i.e., $O(N/\log N)$. Since each node has degree 3, the total
number of links is equal to $3N$. If $N$ messages are generated on
the average in unit time, then the average message density is
equal to $M = (N/3N) \times O(m) = O(\log N)$.


Since the chip area is very large, the yield is
$O(\log^2 N/N^2)$, which is very poor compared to the mesh and
the tree networks. The average energy dissipation of the chip is
$O(N^3/\log^3 N)$.


The layout can be partially constructed hierarchically. An
$N$ ($= m2^m$)-node CCC is composed of two $(m-1)2^{m-1}$-node CCCs
and $2^{m-1}$ connections, as is evident from Figure 7. Thus the
layout cost to construct an $N$-node CCC is
$2^m + m - 1 = O(N/\log N)$. Since the total number of connections
in the layout is equal to $1.5N$, the regularity factor is $O(\log N)$
and is poor compared to the mesh and the tree networks.


The fault tolerance capability of the CCC is good because there
is always an alternate path to re-route the message, as in the
mesh. Thus the failure of a single processor will not have any
drastic effect on the performance of the network, and $\delta = 1/N$.
But due to the presence of many long interconnects and the large chip
area, the reliability is not as good as that of the meshes. The total
length of interconnect within the chip is $O(N^2/\log^2 N)$ due to the
presence of $N/\log N$ cycles of average height $N/\log N$. Thus
$R_i = O(N^2/\log^2 N)$. The reliability of the network due to
processor failure can be given by $R_{CCC} = R_p^N$. If one redundant
processor is added to each cycle, then the reliability improves to
$R_{RCCC} = (R_p^m + m(1 - R_p)R_p^{m-1})^{N/\log N}$. The reliability
improvement factor, $RIF_{CCC}$, is given by
$RIF_{CCC} = R_{RCCC}/R_{CCC}$ and $\rho = (1/m)(1 + m(1/R_p - 1))^{N/\log N}$.


5. COMPARISON OF THREE CLASSES OF NETWORKS 


From the analyses done in the previous section, it is
evident that each network has certain strong aspects and certain
weak aspects. It is difficult to relate all these aspects by a
compact formula which can be utilized as a performance metric. A
weak effort in this respect was originally made by Mead and
Rem [14], relating the area ($A$) and the speed ($T^{-1}$) and
proposing the rental time of the chip, $AT$, as a metric. Thompson [8]
has extended the concept to any arbitrary network by showing
that $AT^2$ is a better performance metric. He has related
the area and the speed of computation in a network through its
minimal bipartition width, $\omega$, and has shown that a
computational problem can be solved by exchanging information over $\omega$
such that the speed of computation is directly proportional to
the cardinality of $\omega$ while the area of a planar implementation of
the network is directly proportional to the square of the size of
$\omega$. But this metric accounts only for the lower bound of the chip
area. Savage [30] has contended that $A^2T$ reflects a better
evaluation for certain computational problems like binary sorting.
From all these contradictory claims, it is evident that it is
virtually impossible to correlate all the criteria discussed in
section 3. An alternative strategy has been adopted here. The
networks have been given credit points for each criterion
depending on their relative merits, and the total points have been
used as a performance index for the network. The results of the
evaluation with respect to different criteria are shown in
Table 1. From the values of total
points, it can be seen that the two dimensional mesh networks
indicate better overall performance than the H-tree and the
CCC. This is in direct contrast to the results of Wittie [3], Siegel
[31, 9], etc., who have concluded that fast networks like CCC,
PSN, spanning bus hypercubes, etc., have better overall
performance. It may be argued whether it is appropriate to give the same
weight to all the criteria. But it can be easily seen that the
conclusion remains true even if different weights are ascribed to the
three orthogonal aspects (i.e., only criteria belonging to the same
aspect have the same weight).


6. CONCLUSION 


The basic conclusion from this paper is that the Cellular
Networks, which have a structure similar to the mesh, can be
implemented cost effectively in VLSI and are
highly suitable for VLSI parallel processing. The penalty in
delay can be offset by the gains in several criteria discussed in
this paper. The faster topologies like CCC, PSN, tree, etc., do not
provide good overall performance because of long interconnects,
which introduce high delay, and large chip area, which
reduces the chip efficiency and the chip yield. A multilayered
model with more than one layer of metal will improve the
performance of these topologies, and a judicious layout is
necessary to reduce the chip area under the multilayered model.


REFERENCES 


K. J. Thurber, “Interconnection networks - A survey and 
assessment,” AFIPS Conference Proceedings, vol. 43, pp. 
909-919, 1974. 


G. A. Andersen and E. D. Jensen, “Computer intercon- 
nection structures: taxonomy, characteristics and exam- 
ples,” ACM Computer Survey, vol. 7, pp. 197-213, 
December 1975. 


L. D. Wittie, “Communications structures for large net- 
works of microcomputers,” [EEE Transaction on Comput- 
ers, vol. c-30, pp. 264-273, April 1981. | 


Tse-yun Feng, ““A survey of Interconnection Networks,” 
IEEE Computer, vol. 14, pp. 12-27, December 1981. 


D. F. Barbe, Very Large Scale Integration: Fundamentals 
and Applications. Springer-Verlag, 1980. 

M. J. Atallah, Algorithms for VLSI networks of proces- 
sors. PhD Thesis, The Johns Hopkins University, 1983. 


F. P. Preparata and J. Vuillemin , “The Cube Connected 
Cycles: A versatile network for parallel computation,” 
Proceedings of 20th Annual IEEE Symposium on Founda- 
tions of Computer Science , pp. 140-147, 1979. 

C. D. Thompson, A complexity theory for VLSI . PhD 
Thesis, Carnegie-Mellon University. 1980. 

H. J. Siegel, ““A model of SIMD machines and a com- 
parison of various interconnection networks,” IEEE 
Transaction on Computers, vol. c-28, pp. 907-917, 
December 1979. 

R. Aleliunas and A. L. Rosenberg, “On embedding rec- 
tangular grids in square grids,” JEEE Transaction on 
Computers, vol. C-31. pp. 907-913, September 1982. 


C. E. Leiserson, Area-Efficient grph layouts (for VLSI). 
PhD Thesis, Carnegie-Mellon University, 1981. 

C. Y. Lee, “An algorithm for path connection and its 
applications,” JRE Transactions on Electronic Computers, 
vol. EC-10, pp. 346-365, September 1961. 


[1] 


[2] 


[3] 


[4] 
[5] 
[6] 


[7] 


[8] 


[9] 


[10] 


[11] 


[12] 


[13] 


[14] 


[15] 
[16] 
[17] 


[18] 


[19] 


[20] 


[21] 


[22] 


[23] 


A. L. Rosenberg, “The Diogenes approach to testable 
fault-tolerant arrays of processors,” [EEE Transaction on 
Computers, vol. C-32, pp. 902-910, October 1983. 


C. A. Mead and L. A. Conway, Introduction to VLSI sys- 
tems. Addison, 1980. 


C. Mead and M. Rem, “Minimum propagation delays in 
VLSI,” IEEE Journal of Solid State Circuits, vol. SC-17, 
pp. 773-775, August 1982. 


V. Ramachandran, “Driving many long parallel wires ,” 
Proceedings of 23th Annual IEEE Symposium on Founda- 
tions of Computer Science, pp. 369-378, 1982. 


B. T. Murphy, ““Cost-size optima of monolithic integrated 
circuits,” Proceedings of IEEE, pp. 1537-1545, December 
1964. 


G. E. Moore, “What level of integration is best for you?,” 
Electronics, pp. 126-130, February 1970. 


R. B. Seeds, “Yield, economic and logistic models for com- 
plex digital arrays,’ [EEE Int. Convention Rec., pp. 60- 
61, March 1967. 


T. E. Mangir, Fault-tolerant design for VLSI design: 
Effect of interconnect requirements on yield improvement 
of VLSI design. PhD Thesis, University of California. 
Los Angeles, 1981. 


C. H. Stapper, “Modeling of integrated circuit defect sen- 
sitivities,”” IBM Journal of research and development, vol. 
27, pp. 549-557, June 1983. 


C. H. Sequin, “Managing VLSI complexity: an outlook,” 
Proceedings of the IEEE, vol. 71, pp. 149-166, 1983. 


D. Nath, S. N. Maheswari, and P. C. P. Bhat, “Efficient 


[24] 


[25] 
[26] 
[27] 


[28] 


[29] 


[30] 


[31] 


VLSI networks for parallel processing based on orthogo- 
nal trees,” IEEE Transaction on Computers, vol. c-32, 
pp. 569-581, June 1983. 


J. D. Ullman, Computational Aspects of VLSI. Computer 
Science Press, 1984. 


F. T. Leighton , “A layout strategy for VLSI which is 
provably good,” Proceedings of ‘the 14th Symposium on 
Theory of Computation, pp. 85-98, 1982. 


P. R. Cappello and K. Steiglitz, ““Area-efficient VLSI 
Structures for multiplying at clock rate,’ Technical 
Report #289, Princeton University , 1981. 


G. E. Marihugh and R. E. Anderson, “The H diagram: a 
graphical approach to logic design,” IEEE Transaction on 
Computers, pp. 1192-1196, 1971: 


E. Horowitz and A. Zorat, “The binary tree as an inter- 
connection network: applications to multiprocessor sys- 
tems and VLSI,” IEEE Transaction on Computers. vol. c- 
30, pp. 247-253, April 1981. 


K. Hwang and F. A. Briggs, Computer Architecture and 
Parallel Processing. New York: McGraw-Hill Book Com- 
pany, 1984. 


J. E. Savage, “Planar circuit complexity and the perfor- 
mance of VLSI algorithms,” Proceedings of the CMU 
Conference on VLSI Systems and Computations, pp. 61- 
67, 1981. 


H. J. Siegel, “Analysis techniques for SIMD machine
interconnection networks and the effects of processor
address masks,” IEEE Transactions on Computers, vol.
C-26, pp. 153-161, February 1977.


Table 1: Evaluation of Three Static Networks
(Evaluation of Meshes, H-Tree and CCC for VLSI application with
respect to physical, [illegible] and cost aspects; table body not
recoverable.)


Figure 5: Message Flow through a
level k node in a Tree Network


Figure 2: A 4x4 Mesh Network

Figure 3: Layout of a Binary Tree


Figure 6: Cube Connected Cycle
topology for 3·2^3 processors

Figure 4: H-Tree Layout


Figure 7: Layout of CCC with 3·2^3 processors


ABSTRACT

This paper considers various physical constraints which influence
the design of VLSI based interconnection networks used in
multiprocessor systems. Design expressions are presented for
implementing an N log N packet passing interconnection network
composed of circuit switched crossbar chip modules. Expressions
reflecting chip level and board level pin and area constraints are
derived and used to determine the network delay expected at a
given clock frequency. Logic and memory delay, signal path
delay, clock skew and clock tree delay parameters are defined
and used to determine the maximum frequency which can be
obtained with a given design. An example 2048x2048 network
design is considered.


1. Introduction and Overview 


The design of effective multiprocessor systems involves
numerous interacting elements ranging from parallel
algorithms to programming languages to computer
architecture. This paper focuses on the computer architecture
question and, in particular, on the design of VLSI based electrical
interconnection networks for use in multiprocessor systems. Due
to their potentially critical effect on overall
multiprocessor performance, interconnection networks have been
widely studied. Various studies have focused on their
functional properties (permutations, control algorithms)
[2,9,14,18], their complexity and performance [5,16,17,23], and
their actual design [7,20].


One way of characterizing interconnection networks relates 
to the style of multiprocessor system in which they are 
used. At one end of the spectrum are systems where the 
interconnection network is the central communications
component between the processors (as in a message passing
system) or between processors and the main memory (as in a
shared memory system). The NYU Ultracomputer [10], the
related IBM RP3 [19] computer and the BBN Butterfly [3] are
examples of this style of design. It is convenient to refer to
these systems as NETWORK CENTERED multiprocessors
since the network is a central resource which has a major effect
on overall system performance. This system style encourages
viewing memory access and communications interchange in a
uniform manner with costs associated with access and
communications being roughly independent of physical location.
(We ignore questions of cache and processor local memory here. If 
most accesses are local, then the need for a large and complex 
interconnection network is questionable). Algorithm development 
in this environment tends to encourage the use of large data 
structures which span the memory space and develop processor 
access patterns which are relatively uniform in nature. 


At the other end of the spectrum are systems where the 
processors are embedded within the network itself with direct 
communications taking place primarily on a physically local 
basis. The Intel Cosmic Cube [22], the Blue Chip computer and 
various tree machines [24,4,11] are examples of this style. These 
systems are PROCESSOR CENTERED in that processor
performance is a main determining factor in overall performance.
This system style encourages viewing the world as being made up
of local processing niches with data exchange between processors
becoming more costly as one reaches further from the local niche.
Algorithms which can take advantage of memory access and
communications locality associated with a given interconnection
structure can operate effectively in this environment.


In the middle of these two ends of the multiprocessor 
spectrum are machines where some (typically small) portion of 
the processing task is allocated to the interconnection network. 
The idea here is that since data must pass through the
interconnection network (and be delayed) anyway on its way to
and from memory, why not do some processing on the way?


This paper considers the design of large (several thousand
inputs and outputs) VLSI based interconnection networks for the
case of NETWORK CENTERED multiprocessor systems. The
emphasis is on the physical design aspects of the network, and 
the implications of design constraints on network implementation 
and performance. A simple modified packet switched N log N 
multistage interconnection network topology is assumed, with 
each node being implemented as a crossbar switch having a 
limited amount of packet buffering at its input. Interconnection 
network designs using optical techniques are not 
considered though they may well serve as the basis for networks 
of the next generation. 


The next section presents overall network topology and 
operation. Section 3 considers pin and area limitations at both 
the chip and board levels. Section 4 presents simple models of 
overall network delay as a function of clock frequency. Section 5 
derives data rate equations for the network in terms of various
design parameters such as logic, memory, and data path delays.
Clock skew and distribution factors are included in the model.
The model presented can be used to determine the maximum
clock rate achievable and hence the expected delay through the
network. Section 6 explores an example design of a 2048
input/output switch. The conclusion indicates that, in the current
design environment, a large NETWORK CENTERED shared
memory multiprocessor is likely to encounter a large performance
penalty when accessing memory on the other side of the switch.

2. The Overall Network 


The overall network topology considered is of the Boolean
hypercube variety as shown in Figure 1. All switch modules are
the same and are designed to be crossbar (CB) switches sized so
that each one fits entirely on a single chip. Earlier studies have
indicated that from an area-time performance viewpoint there is
little difference between using an N log N versus a CB network
within a chip [7]. Across the network as a whole, however, use of
a Boolean hypercube structure is significantly less costly in terms
of the total number of chips required [8].


While each switch module (being a CB) is nonblocking, the
network as a whole is a blocking network. The number of stages
in the network is $\log_N N'$ where N' is the size of the overall
network and N is the size of the CB on a chip (see Table 1 for
various definitions). Since the blocking probabilities of the
network decrease as the number of stages decreases, it is
advantageous to place as large a CB as possible in each switch
module. This is shown in Figure 2 which contains a plot of
blocking probability versus number of stages for a network of size
4096 (based on the formula derived in [17]). Note that reducing
the number of stages from 5 to 3 decreases the blocking
probability by about 10%.
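
As a small illustration of how the stage count depends on the on-chip crossbar size, the sketch below evaluates the number of stages for a network of size N' = 4096. The function name `num_stages` is ours, and we assume the count is rounded up to a whole number of stages (the ceiling that appears in the delay expressions of Section 4). Integer arithmetic is used to avoid floating point rounding in the logarithm.

```python
def num_stages(n_total: int, n_chip: int) -> int:
    """Smallest number of stages s such that n_chip**s >= n_total.

    This is ceil(log_N N') computed with integers only.
    """
    if n_chip < 2 or n_total < 1:
        raise ValueError("need n_chip >= 2 and n_total >= 1")
    stages, reach = 0, 1
    while reach < n_total:
        reach *= n_chip  # one more stage multiplies the reachable ports by N
        stages += 1
    return stages


if __name__ == "__main__":
    for n_chip in (2, 4, 8, 16, 64):
        print(f"N = {n_chip:2d}: {num_stages(4096, n_chip)} stages")
```

For N' = 4096 this gives 12, 6, 4, 3 and 2 stages for N = 2, 4, 8, 16 and 64, which is the sense in which a larger on-chip CB shortens the network.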


Note also that the length of off-chip signal lines generally 
increases as the number of stages increases. Thus, for example, 
the maximum line lengths between the first and second stages in 
the network of Figure 1 are shorter than those between the 
second and third stages. This is important because of the delays 
associated with driving these lines. Thus, in general, reducing the 
number of stages reduces the off-chip delays. 


Overall network throughput can be increased by using a 
modified packet switched rather than a pure circuit switched 
design. A modified packet switched design places a limited
number of packet buffers at the input to each of the switch 
modules. However, within the module circuit switching occurs. 
Thus, a packet holds an entire path within each switch module as 
it passes through that module. On leaving the module the path is 
released for use by some other packet. Increases in throughput 
which result from a packet switching approach have been 
discussed elsewhere [5]. These studies also indicate that most of 
the potential gain from buffering is achieved with a limited 
number of buffers (about 4) on each switch module input. For 
simplicity, in our discussion here we assume a single packet 
buffer. However, this is not critical to the analysis which follows. 
We also assume fixed size packets of length 100 bits which is 
about what is needed to include sufficient bits for data, memory 
module address, intra-memory module address, and return 
processor address. 


Figure 3 shows an example path through a three stage 
network. In addition to the input packet buffer, a single bit buffer 
has been placed at each CB output to allow for a limited 
pipelining capability. Notice that a dotted line is present in the
diagram around the input buffers. This indicates the presence of a
cut-through mechanism which permits packets to stream through
the switch module without going through a buffer filling process if
the downstream switch module it requires has an empty input
buffer. Under light loading conditions this will allow packets to 
pass from one switch module to another without being slowed 
down by buffer fill times [12]. 


The broad design philosophy taken is to keep the switch as
simple as possible since simplicity generally leads to speed. In this
spirit, combining networks and network information processing
other than routing are omitted. Hot spots [20] and other problems
are assumed either to be dealt with by the operating system, or
are accepted and result in performance penalties which hopefully
are partially overcome by having a fast network. The network is
thus of the RISC (Reduce Interconnection System and
Complexity) style of design.


2.1. Network Control


The general design methodology employed utilizes clocked 
as opposed to asynchronous control. Other studies have shown 
that given today’s design environment, system sizes and data 
rates, clocked designs adequately meet most performance 
requirements [6,25]. At very high clock rates, multiple clocked
approaches will need to be considered; however, they are not
treated here [1]. We assume that a two phase clock is used and
that two pins on each chip are allocated for this purpose. 


A complete interconnection network requires control
provision for:

• path establishment and data transfer
• detection of a blocked path
• indication of end of transmission
• path clearing


Packets are self routing, moving from CB chip to CB chip 
according to address bits in the header portion of the packet. 
Within each CB the entire path is held from chip input to output. 
Feeding back from each input buffer to the associated output of 
the preceding chip is a buffer full line. This indicates whether the 
buffer is full and thus the path is blocked at that point. Packets 
are backed up and held in prior buffers until the buffer full line 
indicates transfer can proceed. For each NxN CB chip there are 
thus 2N buffer full control lines, N of these indicating whether its 
buffers are full, and N indicating whether the buffers downstream 
are full. All packets proceed in lock step from stage to stage in 
the network. 
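
The buffer full handshake described above can be sketched as a toy simulation of a single path through the network. This is only an illustration under simplifying assumptions of ours: one packet buffer per stage, no contention from other inputs, and a buffer full line that is sampled at the start of each lock-step cycle. The function name `step` and the list representation are hypothetical.

```python
from typing import List, Optional


def step(buffers: List[Optional[str]], sink_ready: bool = True) -> List[Optional[str]]:
    """Advance one lock-step cycle along a single path.

    buffers[i] holds the packet waiting at the input buffer of stage i,
    or None if that buffer is empty. A packet moves forward only if the
    buffer full line of the next stage was low at the start of the cycle.
    """
    full = [b is not None for b in buffers]  # buffer full lines, sampled once
    nxt = list(buffers)
    last = len(buffers) - 1
    for i in range(last, -1, -1):
        if buffers[i] is None:
            continue
        if i == last:
            if sink_ready:  # packet leaves the network
                nxt[i] = None
        elif not full[i + 1]:  # downstream buffer empty: transfer proceeds
            nxt[i + 1] = buffers[i]
            nxt[i] = None
        # otherwise the packet is held in its current buffer
    return nxt


if __name__ == "__main__":
    state: List[Optional[str]] = ["a", "b", None]
    for cycle in range(4):
        print(cycle, state)
        state = step(state)
```

In the run above packet "a" is held for one cycle because the buffer ahead of it is full, then follows "b" through the remaining stages.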


Given fixed length packets, an on-chip counter can 
determine when packets have completed transmission. Path 
clearing will occasionally be necessary when certain error 
conditions occur. We assume that such clearing operations 
constitute a type of network reset and that such a drastic action 
will be initiated by some master processor for the network as a 
whole. One pin per chip is allocated for this purpose. Thus, 
ignoring power and ground for the moment, 2N+3 control lines 
are needed per chip. 


2.2. Crossbar (CB) Design 


There are many ways of designing CB switches. In this 
paper we consider two approaches (see Figures 4a and 4b). The 
first is referred to as the MESH CONNECTED CROSSBAR 
(MCC) design [16]. In this design, $N^2$ two by two crosspoint
switches are placed on the chip, with each switch having a packet
routing capability and one bit of buffering to allow for limited
pipelining. Packet routing is thus completely local, the layout is
planar and the distance between adjacent switches constant. The
design is modular and as technology improves larger on-chip CBs
can be easily implemented by replicating the basic switch
crosspoints. The area of the entire CB grows as $O(N^2)$ while the
time delay grows as $O(N)$.


The second design approach is referred to as the 
DMUX/MUX CROSSBAR (DMC) design [13]. In this case, after
$\log_2 N$ bits have arrived at an input port, an input port controller
(IPC) (i.e. a demultiplexer) determines on which output line to
route the arriving message. The IPC also signals an output port
controller (OPC) (i.e. a multiplexer) which selects between multiple
input packets that request the same output port. An NxN
DMUX/MUX crossbar would require $O(N^2)$ two input gates and
have a time delay of $O(\log N)$ gate delays. In addition to gate
delays, however, there are the line delays associated with the
potentially long on-chip lines between the input and output ports
and controllers. This is due to the fact that path topology from
input to output represents a complete bipartite graph whose
layout area grows as $O(N^4)$. Certain results show that the
overall delay with this type of CB grows as $O(N^2)$ [16].


3. Pin and Area Constraints 


A serious constraint in the physical design of large networks 
is imposed by the limited number of signal pins available both at 
the chip and board levels. In this section these pin limitations are 
explored assuming a Pin Grid Array (PGA) packaging 
technology that is aggressive, but currently realizable, and board 
edge connectors of standard high performance design. 


3.1. Chip Pin Constraints 


Pin usage can be broadly grouped into three categories:
data pins ($N_{p.d}$), control pins ($N_{p.c}$), and ground and power pins
($N_{p.g}$). The total number of pins on a chip, $N_p$, is thus given by:


$$N_p = N_{p.d} + N_{p.c} + N_{p.g} \qquad (3.1)$$


Since the size of the subnetwork on a chip is NxN and W is the
data path width, the number of data pins is:


$$N_{p.d} = 2WN \qquad (3.2)$$


The control information required for setting up paths in the 
network is obtained as part of the data and hence requires no 
extra signal pins. However, for each input or output port, a 
control signal indicating the state of the buffer (full or not) is 
required. This necessitates 2N pins for the buffer full signals. In 
addition, we allocate two pins for a two phase non-overlapping 
clock and a pin for system reset. Hence: 


$$N_{p.c} = 2N + 3 \qquad (3.3)$$

Normally, when small circuits are considered, especially 
those that have a small number of signal pins, the use of a single 
pin for power and a single pin for ground is sufficient. However, 
for large chips, especially those that have a large number of 
signals all of which can switch at the same time, it may be 
necessary to allocate more pins for power and ground in order to 
maintain ground and power voltage variations to within 
acceptable limits. 


Each signal pin has an associated inductance. When a 
signal switches between its high and low voltage states, the 
change in current through this inductance causes a voltage to 
appear across the pin. As the number of signal pins increases, in 
the worst case, all of them can switch in the same clock cycle, 
thus causing a large voltage change of either the ground or the 
power net. Given the impedance of the pin driver circuit and 
assuming that this circuit itself is driven by an exponential 
driver, the rate of change of supply current with respect to time 
for a single pin can be obtained (by analytic or simulation 
methods). For a 50 ohm driver driving a 30 inch metal pc board
line with typical characteristics, the rate of change of current 
with respect to time, di/dt, is about 7*10° amps/second. The 


following expression can be used to determine the number of
power and ground pins, $N_{p.g}$, needed for a given number of
output signal pins, $N(W+1)$, maximum permissible voltage
variation, $\Delta V_{max}$, and pin inductance, $L$.


$$N_{p.g} = \frac{2 N (W+1) L}{\Delta V_{max}} \frac{di}{dt} \qquad (3.4)$$

The expressions 3.2, 3.3 and 3.4 can now be substituted into
3.1 and the overall number of pins, $N_p$, determined. Using the
parameter values indicated in Table 1, Table 2 indicates $N_p$ as a
function of subnetwork size and path width. Assuming that the
maximum number of pins available on a chip is 240, the portion
of the table to the left and top of the heavy lines indicates designs
that satisfy the pin constraints.
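
The pin budget can be sketched in a few lines. The data and control pin counts follow (3.2) and (3.3) directly. For power and ground pins we assume the form $N_{p.g} = 2N(W+1)L(di/dt)/\Delta V_{max}$, which is our reading of (3.4), and we round up to a whole number of pins, which is also our assumption. The electrical values in the example are placeholders chosen only to exercise the function; they are not the Table 1 parameters.

```python
import math


def data_pins(n: int, w: int) -> int:
    """Equation (3.2): W data lines on each of N inputs and N outputs."""
    return 2 * w * n


def control_pins(n: int) -> int:
    """Equation (3.3): 2N buffer full lines, two clock phases, one reset."""
    return 2 * n + 3


def power_ground_pins(n: int, w: int, inductance: float,
                      dv_max: float, di_dt: float) -> int:
    """Assumed reading of (3.4), rounded up to a whole pin count."""
    return math.ceil(2 * n * (w + 1) * inductance * di_dt / dv_max)


def total_pins(n: int, w: int, inductance: float,
               dv_max: float, di_dt: float) -> int:
    """Equation (3.1): sum of the three pin categories."""
    return (data_pins(n, w) + control_pins(n)
            + power_ground_pins(n, w, inductance, dv_max, di_dt))


if __name__ == "__main__":
    # Placeholder electrical values (hypothetical, not from Table 1).
    L_PIN, DV_MAX, DI_DT = 2e-9, 0.5, 1e7
    for n in (16, 18, 20, 22, 24):
        print(n, total_pins(n, 4, L_PIN, DV_MAX, DI_DT))
```

Signal pins alone already account for 163 pins for a 16x16 chip with a 4 bit data path, so the power and ground term decides how close a design comes to the 240 pin limit.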


3.2. Chip Area Constraints 


We now consider constraints on chip area and obtain the 
size of the largest network that can be implemented in a single
chip. As mentioned earlier, maximizing the size of the CB
subnetwork (residing entirely on a single chip) reduces blocking 
and hence is desirable. The two designs introduced earlier (i.e. the 
mesh connected CB, MCC, and the demultiplexer/multiplexer CB, 
DMC) are explored. The development follows that of 
Padmanabhan [16]. 


For the case of the MCC design the key element in
estimating the area lies in estimating the area occupied by a two
input, two output switch. These switches can then be connected
in a mesh to form the CB network. Padmanabhan gives a PLA
implementation of this switch with a one bit wide data path. He
shows that the implementation would occupy a rectangular area
of dimensions approximately 100 lambda by 100 lambda. Assume
that the above implementation gives the area of the control logic
of a switch with a W bit data path. The area occupied by the
data path must now be estimated. The data path consists of W
lines traversing the switch from left to right and from top to
bottom. In addition, W control lines for each set of data lines
must be routed. Assuming that separation between lines,
including area for driving and control buffers, is 10 lambda, the
dimensions of the rectangular area occupied by the switch
become 100 + 20W lambda on a side. Hence the area of the MCC
realization consisting of $N^2$ switches is:


$$A_{MCC} = N^2 (100 + 20W)^2 \qquad (3.5)$$


Table 2: The number of pins per chip necessary, $N_p$, for different
values of subnetwork size N and data path width W.

        N (crossbar subnetwork size)
  W  |  16 |  18 |  20 |  22 |  24
  1  |  72 |  80 |  89 |  97 | [illegible]
  2  | 106 | 119 | 132 | 144 | 157
  4  | 174 | 196 | 217 | 239 | [illegible]
  8  | 311 | 350 | 388 | 427 | 465


Next consider the area occupied by the MUX/DMUX
realization. The area occupied by such an implementation can be
broadly divided into the area occupied by the N 1-to-N
demultiplexers and N N-to-1 multiplexers, and the area occupied
by the $WN^2$ wires which must be routed from the multiplexers to
the demultiplexers. The routing of the wires from the
demultiplexers to the multiplexers will be done according to the
routing presented by Wise [26] which results in identical wire
lengths. Let the minimum separation between wires be d and the
vertical separation between consecutive wire origins and endings
be h. The area occupied by such a routing can be shown to be
given by:


$$A_{wire} = (N-1)^4 \frac{h^2 W^2 d}{\sqrt{4h^2 - d^2}} \qquad (3.6)$$

The minimum area is occupied when h = d which results in:


$$A_{wire} = (N-1)^4 \frac{W^2 d^2}{\sqrt{3}} \qquad (3.7)$$


Next estimate the area occupied by the N demultiplexers and N
multiplexers. Assume that the demultiplexers are implemented as
a binary tree of 1-to-2 demultiplexers. A 1-to-2 demultiplexer
with W data bits occupies a rectangular area of dimensions 30W
by 24 [16]. A tree realization will have a maximum of N/2
demultiplexers in the first stage with $\log_2 N$ stages. Thus, the
bounding box of this tree is given by $360 W N \log N$. The area
occupied by N demultiplexers and N multiplexers (assuming a
multiplexer occupies the same area as a demultiplexer) is given
by:


$$A_{dmux/mux} = 360 W N^2 \log N \qquad (3.8)$$


The total area of the MUX/DMUX design is given by:


$$A_{DMC} = A_{wire} + A_{dmux/mux} \qquad (3.9)$$


The area expressions of (3.5) and (3.9) are lower bound estimates. 


Table 3 gives the maximum size network that can be 
implemented in a chip satisfying the above area constraints 
assuming that these estimates are increased by a third (to handle 
line drivers, etc.). The maximum chip dimensions assumed are
1 cm by 1 cm with λ = 1.5 µm.
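
To make the different growth rates concrete, the sketch below evaluates the MCC area, taken as $N^2(100+20W)^2$ square lambda, and the minimum DMC wire area, taken as $(N-1)^4 W^2 d^2/\sqrt{3}$. Both forms are our reading of the area expressions above. The wire separation `d = 10` lambda used in the loop is a hypothetical value for illustration only, and the function names are ours.

```python
import math


def area_mcc(n: int, w: int) -> int:
    """MCC crossbar area in square lambda: N^2 switches of side 100 + 20W."""
    return n**2 * (100 + 20 * w) ** 2


def area_wire_min(n: int, w: int, d: float) -> float:
    """Minimum DMC wire routing area (case h = d), in units of d squared."""
    return (n - 1) ** 4 * w**2 * d**2 / math.sqrt(3)


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(f"N = {n:2d}  MCC = {area_mcc(n, 4):>12,d}  "
              f"DMC wires = {area_wire_min(n, 4, d=10.0):>16,.0f}")
```

Doubling N multiplies the MCC area by four, while the wire area of the DMC design grows roughly sixteenfold, which is why the DMC design reaches the chip area limit at a smaller N.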


Examination of Table 2 shows that the largest network
that can be implemented in a chip satisfying the pin constraints
is 22x22 with a 4 bit data path. Table 3 indicates that the area
constraints limit the network size to about an 18x18 network
with the DMC design and a 25x25 network with the MCC design
(assume a 4 bit data path). However, we choose a convenient
power of 2 network size of 16x16 with a 4 bit data path as our
basic subnetwork to be implemented in a chip, satisfying both pin
and chip area constraints. Though not detailed here, a CMOS
power analysis of such a chip indicates that about 2.5 watts will
be dissipated at 50 Megahertz. We next investigate the layout of
the network at the board level.


Table 3: The largest subnetwork that can
be implemented on a 1cm x 1cm chip.
(Rows: data path width W; entries: subnetwork size N. Table body
not recoverable.)


3.3. Board Area Constraints 


Assume that the pin layout on the chip package uses a pin 
grid array with three rows of pins having a separation between 
adjacent pins of 100 mils. The size of a package with at least 
175 pins is about 2 inches on a side. The use of a 16x16 
subnetwork makes a 256x256 network reasonable for 
implementation on a single board. This network has two stages,
each stage consisting of 16 chips. If a stage of 16 chips is lined up
on a single side along a board edge then the board length will be
about 32 inches. The routing of wires between these two stages 
will determine the width of the board occupied by the 256x256 
network. 


To estimate the area occupied by the routing assume that 
the routing strategy is similar to that adopted for the chip level 
DMC implementation. The routing in this case is identical to the 
DMC implementation of a 16x16 crossbar network. The 
parameters h and d must, however, be estimated differently. d
still remains the minimum separation between wires at the board
level with a typical value of 50 mils. Assume the board has two
signal layers. The total number of wires to be routed is
$N^2(W+1) = 1280$. Each layer then has to route half the total
number of wires, that is, 640 wires. Hence the vertical separation
between wires is h = 32000/640 = 50 mil, the minimum. The
area occupied by the wires is then obtained by evaluating (3.7) as
73 square inches. Thus, the width of the layout of the wires is 
about 3 inches and the longest path on the board is then no more 
than 32 + 3 = 35 inches. This layout will be used in Section 6 in 
examining the design of a 2048x2048 network. 
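
The board level arithmetic above can be checked with a short sketch. The wire count and separation follow the text directly. For the routing area we assume the minimum wire area expression has the form $(N-1)^4 W^2 d^2/\sqrt{3}$ and, as our own reading, that it is applied with a unit width per chip-to-chip link (W = 1), since that choice reproduces the 73 square inch figure. Variable names are ours.

```python
import math

N = 16                   # on-chip crossbar size
W = 4                    # data path width in bits
SIGNAL_LAYERS = 2        # board signal layers
BOARD_LENGTH_MILS = 32_000   # 16 packages of about 2 inches each
D_MILS = 50              # minimum wire separation on the board

# N^2 links between the two stages, each W data lines plus one buffer full line.
wires_total = N * N * (W + 1)
wires_per_layer = wires_total // SIGNAL_LAYERS

# Vertical separation available between wires along the board edge.
h_mils = BOARD_LENGTH_MILS / wires_per_layer

# Assumed form of the minimum routing area (h = d), with unit link width.
d_inches = D_MILS / 1000
area_sq_in = (N - 1) ** 4 * d_inches**2 / math.sqrt(3)

if __name__ == "__main__":
    print(f"wires to route      : {wires_total}")
    print(f"wires per layer     : {wires_per_layer}")
    print(f"separation h (mils) : {h_mils:.0f}")
    print(f"routing area (sq in): {area_sq_in:.1f}")
```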


3.4. Board Pin Constraints 


At the board level, consideration must be given to the 
routing of signal wires from one board to the next. Implementing 
a 256x256 network with a 4 bit data path requires the routing of
1280 wires on the input and output sides. In the last section it
was shown that the layout of the chips required each edge of the
board to be about 32 inches long. If the wires were brought out to
the edge of the board on two layers, it was shown that the
separation between wires would have to be 50 mil. This is about
the minimum separation between wires on the board that keeps
crosstalk among wires to acceptable levels. Commercially
available connectors are able to connect up to 100 lines from one 
side of a board and are no more than 4 inches long. Thus with 
connectors using both sides of the board, eight connectors can be 
used for the entire 1280 lines which can be lined up on one edge 
of the board. Thus the pin constraints at the board level can be 
satisfied. 


4. Network Delay 


In this section expressions are presented for the time to 
transfer a packet through the network (T). This is a one way 
network delay time and doesn’t include memory access time. A 
best case model is presented in which a lightly loaded network is 
assumed with no blocking of packets. Packets, in this situation, 
can thus stream through the entire network from the input to the 
output being delayed only by chip module setup and pipeline fill 
times. 


For the case of the MCC design, network delay is composed 
of two components. The first is the time to fill the pipe of 
crosspoint switches from input to output of the network. The 
second is the time it takes to transfer the packet once it has 
arrived at the end of the network. Thus: 


$$T_{MCC} = \text{pipeline fill time} + \text{packet transfer time} \qquad (4.1)$$


In this design the average number of crosspoint switches per chip
that a packet passes through is N. The number of stages the
packet traverses is $\lceil \log_N N' \rceil$. Thus, the pipeline fill time is
$N \lceil \log_N N' \rceil$. The number of bit times associated with a packet
transfer is the packet size (P) divided by the path width (W).
Thus, the packet delay time is:


$$T_{MCC} = \left( N \lceil \log_N N' \rceil + P/W \right) (1/F_{MCC}) \qquad (4.2)$$

In the DMC design, associated with routing the packet
within each chip is a setup time. This time is dependent on the
size of the on-chip CB in that at least $\lceil \log_2 N \rceil$ bits must be
received by the chip before the path can be established. Given a
path width of W, the number of clock periods, $M_{su}$, associated
with this setup time for a single chip (or stage) through which the
message passes is thus:


$$M_{su} = \lceil \log_2 N / W \rceil \qquad (4.3)$$

To break up long path delays, this design also assumes that a
1-bit buffer is present at the output of each chip. This acts as a
$\lceil \log_N N' \rceil$ stage pipeline through the network. As in the MCC case, a
packet transfer time is also present to account for the time it
takes the packet to leave the network when its first bit starts to
leave an output port. The resulting overall time is given by:


$$T_{DMC} = \text{setup time} + \text{pipe fill time} + \text{packet transfer time} \qquad (4.4)$$


$$T_{DMC} = \left( (M_{su} + 1) \lceil \log_N N' \rceil + P/W \right) (1/F_{DMC}) \qquad (4.5)$$

These network delay expressions have been evaluated and 
are presented in Table 2. Notice that the results indicate that, 
even at fairly high clock frequencies, the one way delay through 
the network is substantial when compared to typical memory 
cycle times. For example, in the DMC model operating at 40 MHz
with a path width of 2, the one way network delay is 1.48
µseconds. Round trip delay, including 200 nanoseconds for
memory access, would be 3.16 µsec. That is, the remote (through
the network) memory access is more than an order of magnitude
greater than local memory access. Note also that the delay
expressions and associated tables do not indicate whether or not
the frequencies suggested are achievable. This is dependent on the
logic and path delays, and on clock distribution characteristics
(e.g. clock skew). These design parameters are explored in the
next section.
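
The delay expressions can be evaluated with a short sketch. We take the MCC delay as $(N\lceil\log_N N'\rceil + P/W)/F$ and the DMC delay as $((M_{su}+1)\lceil\log_N N'\rceil + P/W)/F$ with $M_{su} = \lceil \log_2 N / W \rceil$; this is our reading of (4.2), (4.3) and (4.5), and it reproduces the 1.48 µsec DMC figure quoted above. Function and parameter names are ours.

```python
def num_stages(n_total: int, n_chip: int) -> int:
    """ceil(log_N N') using integer arithmetic."""
    stages, reach = 0, 1
    while reach < n_total:
        reach *= n_chip
        stages += 1
    return stages


def t_mcc(n_chip: int, n_total: int, width: int,
          packet_bits: int, freq_hz: float) -> float:
    """One way delay (seconds) for the mesh connected crossbar design."""
    cycles = n_chip * num_stages(n_total, n_chip) + packet_bits / width
    return cycles / freq_hz


def t_dmc(n_chip: int, n_total: int, width: int,
          packet_bits: int, freq_hz: float) -> float:
    """One way delay (seconds) for the DMUX/MUX crossbar design."""
    addr_bits = (n_chip - 1).bit_length()   # ceil(log2 N) routing bits per chip
    m_su = -(-addr_bits // width)           # setup clocks per stage, rounded up
    cycles = (m_su + 1) * num_stages(n_total, n_chip) + packet_bits / width
    return cycles / freq_hz


if __name__ == "__main__":
    one_way = t_dmc(n_chip=16, n_total=2048, width=2,
                    packet_bits=100, freq_hz=40e6)
    round_trip = 2 * one_way + 200e-9       # 200 ns memory access
    print(f"DMC one way   : {one_way * 1e6:.3f} usec")
    print(f"DMC round trip: {round_trip * 1e6:.3f} usec")
    print(f"MCC one way   : {t_mcc(16, 2048, 2, 100, 40e6) * 1e6:.3f} usec")
```

With N = 16, W = 2 and P = 100 the DMC path takes 59 clock periods, or 1.475 µsec at 40 MHz. The round trip computed from the unrounded value is 3.15 µsec; the 3.16 µsec in the text follows from using the rounded 1.48 µsec.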


5. Clock Frequency and Data Rate Expressions 


The clock frequency at which a given system can run is
determined by a host of factors ranging from effectiveness of the
logic design, to chip layout and signal delay consequences of using
a particular packaging technology. The final result of these
factors is a set of delay parameters which determine the rate at
which data can pass from chip to chip. This maximum data rate
corresponds to the maximum clock frequency which can be used
without errors occurring.


Table 2: Time Through Network (µsec)
(P=100, 512 ≤ N' ≤ 4096, N=16)
(DMC (Demux/Mux Crossbar) model; rows: path width W; columns:
clock frequency in MHz. Table body not recoverable.)


The data rate can be expressed in a general form as follows: 


$$DR = \frac{1}{\max[\text{information signal delays},\ \text{clock signal delays}]} \qquad (5.1)$$


Information signal delays consist of logic and memory delays
($D_l$), information signal path delays ($D_p$) and delays due to clock
skew ($\delta$). The sum of these delays can be associated with the time
for a signal to pass between communicating modules (chips or
crosspoint switches). Since this must occur within a single clock
cycle, a constraint is placed on minimum clock cycle duration
(and thus maximum frequency). Taking the worst case (largest)
sum of these delays over all communicating modules leads to an
overall constraint on data rate and clock frequency [25].


In a similar manner, delays associated with propagation of 
the clock signal form another basic limitation on clock frequency. 
Two types of clock distribution schemes are considered here. In 
the first (the STANDARD CLOCK SCHEME), the entire clock
tree is viewed as an equipotential surface which must achieve a
single final voltage in each half of the clock cycle. That is, the
entire clock tree must be charged and discharged during the clock
cycle. Given that such clock trees may present a large capacitive
load, and that this load grows as systems become larger, this
constraint can be a limiting factor in achieving high frequencies
in physically large systems. If $\tau$ is the time required to charge or
discharge the clock tree, then the data rate for this clocking
scheme is constrained to be less than $1/2\tau$. Based on the above,
the data rate for the case of a Standard Clocked Scheme is as
follows:


$$DR_{sc} = \frac{1}{\max[D_l + D_p + \delta,\ 2\tau]} \qquad (5.2)$$


Notice also that a relationship exists between $\tau$ and $\delta$. That
is, as the clock line increases in length, both $\tau$ and $\delta$ increase with
$\tau$ being an upper bound on $\delta$. One model that relates $\tau$ and $\delta$ was
developed by Wann and Franklin [25]. The model assumes simple
exponential rise times to and from the power supply voltage,
$V_{dd}$. Variations in material properties can result in variations in
rise time which can be expressed in terms of maximum and
minimum $\tau$ values, $\tau_{max}$ and $\tau_{min}$. Variations in processing can
result in variations in FET threshold voltages which can also be
expressed in terms of maximum and minimum threshold voltage
values, $V_{T,max}$ and $V_{T,min}$. The resultant expression is:


$$\delta = \tau_{min} \ln\left(1 - \frac{V_{T,min}}{V_{dd}}\right) - \tau_{max} \ln\left(1 - \frac{V_{T,max}}{V_{dd}}\right) \qquad (5.3)$$
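
A minimal sketch of the standard clock scheme follows. It assumes the data rate is $1/\max[D_l + D_p + \delta,\ 2\tau]$ and that the skew is the difference between the latest and earliest threshold crossing times of an exponential rise, $\delta = \tau_{min}\ln(1 - V_{T,min}/V_{dd}) - \tau_{max}\ln(1 - V_{T,max}/V_{dd})$; both are our reading of (5.2) and (5.3). All numeric values in the example are hypothetical except the 14 nsec logic and memory delay, which is the figure used in Section 6.

```python
import math


def clock_skew(tau_min: float, tau_max: float,
               vt_min: float, vt_max: float, vdd: float) -> float:
    """Skew between the slowest and fastest threshold crossings.

    An exponential rise V(t) = vdd * (1 - exp(-t / tau)) reaches a threshold
    vt at t = -tau * ln(1 - vt / vdd); the skew is the spread of that time.
    """
    return (tau_min * math.log(1 - vt_min / vdd)
            - tau_max * math.log(1 - vt_max / vdd))


def data_rate_standard(d_logic: float, d_path: float,
                       skew: float, tau: float) -> float:
    """Data rate (Hz) limited by signal delays or by clock tree charging."""
    return 1.0 / max(d_logic + d_path + skew, 2 * tau)


if __name__ == "__main__":
    # Hypothetical clock tree and device parameters, for illustration only.
    skew = clock_skew(tau_min=8e-9, tau_max=10e-9,
                      vt_min=0.8, vt_max=1.2, vdd=5.0)
    rate = data_rate_standard(d_logic=14e-9, d_path=6e-9,
                              skew=skew, tau=10e-9)
    print(f"skew      : {skew * 1e9:.2f} ns")
    print(f"data rate : {rate / 1e6:.1f} MHz")
```

Whichever term of the maximum is larger sets the clock period, so lengthening the clock tree can limit the data rate even when logic and path delays are unchanged.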


When dealing with long clock lines, the charge/discharge
constraint and the clock skew can severely limit
performance. The charge/discharge time constraint can be
partially overcome by treating the clock lines as transmission
lines and, using the memory properties of the line, placing
multiple pulses on the line at the same time instant thus reducing
the delay between clock pulses (MULTIPLE PULSE SCHEME).
Clock skew can be reduced by using a global clock and employing
phase locked loop techniques (called “dynamic delay adjustment”
in reference 1) to perform phase synchronization. These
techniques, however, add complexity to the design and are not
pursued here.


6. An Example 


6.1. Physical Design 


Based on the design constraint and data rate expressions 
developed in prior sections, we now consider the design of an 
interconnection network with 2048 inputs and outputs. As stated 
in section 3, pin constraints on the chip limit the maximum size 
of the crossbar that can be implemented in a chip. A 16x16 
network having path widths of 4 bits appears to be reasonable
and is used in this example.


The implementation of a 256x256 network on a single board
was shown to be possible, satisfying area and pin constraints on
the board. Larger size networks can then be implemented from
these boards by racking the boards in three dimensional space to
reduce the distances over which wires must be routed. A 2048
input, 2048 output network implementation is shown in Figure 5.
The first two stages of the network are implemented from eight
256x256 network boards; the last stage consists of eight boards
with each board implementing one eighth of the last stage. If the
boards are stacked as shown in Figure 5, the longest wire between 
any two chips is one that traverses the diagonal of a board. In 
section 3 we showed that this distance is 35 inches. 


6.2. Clock Frequency 


Expressions for the clock frequency (data rate) were
derived in Section 5 based on the logic and memory delays and
signal path delays and skews in clock distribution. In this section
we estimate the information signal delays and clock delays from
a knowledge of the physical layout of the network. A
STANDARD CLOCK SCHEME is used.


Information signal delays consist of delay in the logic and in
the memory of a finite state machine implementation. Estimates
given in [16] indicate that the logic delay would be about 12
nanoseconds whereas the memory delay can be restricted to about 2
nanoseconds. This results in D_l = 14 nsec.


Consider the path delay. Since the network is pipelined,
only the largest path delay is of significance. This corresponds to
the path that has to go off-chip and traverse the longest distance
(35 inches) on the board. This results in off-chip delays much
larger than the signal rise times, thus off-chip paths must be
viewed as transmission lines with on-chip drivers matched to their
impedance (50 ohms). The off-chip path delay is determined
largely by the speed of signals on the board and this is typically
0.15 nsec/inch. The delay in driving the driver is about 6 nsec.
Hence D_p = 6 + 0.15*35 = 11.3 nsec.


The clock distribution essentially consists of two parts: one
totally internal to a chip, consisting of a tree distributing the
clock to the 16x16 network, and the second external to the chips
consisting of the clock distribution on the board. Using an H-tree
distribution to minimize clock skew for the clock distribution
internal to a chip, the clock delay is given by [27]:

$$\tau_{chip} = (10N^3 - 3)(3 - 2/N) R_0 C_0 / 7 \qquad (6.1)$$


where R_0 and C_0 represent the resistance and capacitance of the
last branch of the tree distributing the clock to a switch. In this
design (a 16x16 network occupying a 1cm x 1cm chip) we have
R_0C_0 = 0.244 pico sec. and we get τ_chip = 4.1 nano sec. The delay
in the clock distribution on the board is determined in a similar
manner as the path delay. It consists of the delay in driving a
driver with a 50 ohm output impedance and the delay in driving
a line of maximum length of 35 inches. Thus τ_board = 11.3
nanoseconds. Thus, the total clock delay is τ = 15.4 nanoseconds.
An expression for the clock skew is given in Section 5. Assuming
a 20% variation over the nominal value in both the clock delay
and transistor threshold voltages, the clock skew is obtained as:


$$\delta = 0.8\tau \ln(1 - 2/5) - 1.2\tau \ln(1 - 3/5) = 0.7\tau = 10.8 \text{ nano sec.} \qquad (6.2)$$


We can now estimate the clock frequencies. Note that due
to the pipeline design of the network, only the largest delays are
significant and both the MCC and DMC designs resulted in equal
clock frequencies. Since D_l + D_p + δ > 2τ, the clock frequency is:

$$f = \frac{1}{D_l + D_p + \delta} = 27.7 \text{ MHz} \qquad (6.3)$$

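The arithmetic of this section can be rechecked with a short sketch. It is only an illustration: the variable names are ours, the exponent in the H-tree expression is taken to be a cube (an assumption, chosen because it reproduces the quoted 4.1 nsec on-chip delay), and the skew uses the 20% variation stated above. Because it carries unrounded intermediate values, it lands slightly above the 27.7 MHz obtained from the rounded figures.

```python
import math

# Delays in nanoseconds, taken from the figures quoted in Section 6.2.
D_LOGIC = 12.0 + 2.0           # logic delay plus memory delay
D_PATH = 6.0 + 0.15 * 35       # driver delay plus 0.15 nsec/inch over 35 inches
N = 16                         # 16x16 crossbar on one chip
R0C0 = 0.244e-3                # 0.244 picoseconds, expressed in nanoseconds

# On-chip H-tree delay; the cubic term is an assumption (see lead-in).
tau_chip = (10 * N**3 - 3) * (3 - 2 / N) * R0C0 / 7
tau_board = D_PATH             # board clock line: same driver and 35 inch run
tau = tau_chip + tau_board

# Clock skew with tau varying by 20% and V_T/V_dd ranging over 2/5 .. 3/5.
delta = 0.8 * tau * math.log(1 - 2 / 5) - 1.2 * tau * math.log(1 - 3 / 5)

# Standard clocked scheme: the period is the larger of the two constraints.
period = max(D_LOGIC + D_PATH + delta, 2 * tau)
freq_mhz = 1e3 / period

print(f"tau_chip = {tau_chip:.2f} ns, tau = {tau:.2f} ns, delta = {delta:.2f} ns")
print(f"clock frequency = {freq_mhz:.1f} MHz")
```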
7. Summary and Conclusions 
This paper has presented design expressions for 


implementing an N log N packet passing interconnection network 
composed of circuit switched CB chip modules. The expressions 
considered are derived from various physical constraints and 
network models. Chip level and board level pin and area 
constraints are considered and expressions are derived which 
indicate the sort of delay which can be expected through a 
network at a given frequency, and the design constraints on 
achieving that frequency. 


An example 2048x2048 network design is considered. This
example indicates that using aggressive packaging and MOS
technology, a rate of about 28 MHz is achievable. However, this
frequency, with this network design, would result in a one way
delay (ignoring blocking and hot spot delays) of about 1 μsecond.
A read operation from memory requiring a round trip would thus
require more than 2 μseconds. This represents more than an
order of magnitude slowdown when compared with accessing
strictly local memory and appears to be a major problem in the
design of VLSI based, network centered multiprocessor
architectures which utilize standard clock distribution designs.
Figure 1: A 16x16 NlogN network built from
2x2 switch modules.

Figure 2: Plot of probability of blocking
with the number of stages in a NlogN network.
Each stage consists of switches that are
crossbar networks.

Figure 3: A path from an input to an output in
a three stage network. Each input at each stage
has a buffer that can be bypassed.

Figure 4a: The mesh connected
crossbar network. (D - data, A - acknowledge)

Figure 4b: The DMUX/MUX crossbar network
implementation. Each input consists of a
1 to N demultiplexer and each output a
N to 1 multiplexer.
A GROUP THEORETIC MODEL FOR
SYMMETRIC INTERCONNECTION NETWORKS


Sheldon B. Akers 
Dept. of Electrical & Computer Engg. 
University of Massachusetts 
Amherst, MA 01003 


ABSTRACT 


Symmetric graphs, such as the ring, the n-dimensional Boolean 
hyper-cube and the cube-connected cycles, have been widely used as 
processor/communication interconnection networks. The performance of 
such networks is often measured through an analysis of their degree, 
diameter, connectivity, fault tolerance, routing algorithms, etc. In this 
paper we develop a formal group theoretic model, called the Cayley graph 
model, for designing, analyzing and improving such networks. We show 
that this model is universal and demonstrate how the networks mentioned 
above can be concisely represented in this model. 


More importantly, we show that this model enables us to design new 
networks based on representations of finite groups. We can then analyze 
these networks by interpreting the group theoretic structure graph theoret- 
ically. Using these ideas, and motivated by certain well known combina- 
torial problems, we develop two new classes of networks, called the star 
graphs and the pancake graphs. These networks are shown to have better 
performance, as measured by the parameters mentioned above, than the 
popular n-cubes.


1. INTRODUCTION 


A processor/communication interconnection network is often 
modeled as an undirected graph, in which the nodes correspond to 
processor/communication ports, and the edges correspond to communica- 
tion channels. Communication over such a network is achieved by a mes- 
sage passing protocol, and the delay in communication is measured by the 
number of edges traversed. Some of the key features of interest in such 
an interconnection network are its degree, diameter, congestion, sym- 
metry, connectivity, routing algorithms, structure, etc. 


A number of interconnection network topologies have been suggested
in the literature which address one or more of the above features ([1-7]).
These range from simple graphs, such as cycles, complete graphs and
stars to more sophisticated graphs such as shuffle-exchange graphs,
n-dimensional Boolean cubes and cube-connected cycles. Since there is no
single measure to compare these networks, each of the above examples
has been justified for one application or another.


A special class of networks, called symmetric interconnection
networks, has the property that the network viewed from any vertex of the
network looks the same. In such a network congestion problems are
minimized since the load will be distributed uniformly through all the
vertices. Moreover, this symmetry allows for identical processors at every
node with identical routing algorithms. It is also very useful in designing
algorithms that exploit the structure of the network. It is this class of
networks that we address in this paper. With the exception of the star
network and the shuffle exchange graphs, all of the networks mentioned
above are symmetric networks.


0190-3918/86/0000/0216 $01.00 © 1986 IEEE 


Balakrishnan Krishnamurthy 
Computer Research Laboratory 
Tektronix Laboratories 
Beaverton, OR 97077 


In designing symmetric interconnection networks, the overall
objective has been to construct large vertex symmetric graphs with small
degree and diameter, high connectivity and offering simple routing
algorithms.¹ One attractive network that offers all these properties together
with a good degree and diameter is the n-cube, which is a network of 2^n
vertices, with degree n and diameter n. Thus, the n-cube will be used as
a standard against which to compare many of the networks constructed in
this paper.


Specifically, we shall present a group theoretic model for designing, 
analyzing and improving symmetric interconnection networks. We show 
that most symmetric interconnection networks can be represented using 
this model, and that every symmetric interconnection network can be 
represented by a simple extension of this model. This allows us to
provide an algebraic representation of each of the symmetric networks
mentioned above. More importantly, this group theoretic model enables us to
start with an arbitrary finite group and construct a symmetric network
using that group as the algebraic model. This, in conjunction with the vast
literature on finite groups, allows us to construct a variety of new
interconnection networks.


Another advantage of analyzing such networks in this algebraic set- 
ting is that many properties of these networks can be proved for the class 
as a whole, instead of proving that property for each network indepen- 
dently. For example, all networks derived from a finite group are neces- 
sarily vertex symmetric, i.e., the network viewed from any vertex in the 
network looks the same. We prove a number of other such properties of 
these networks. Further, even for specific networks constructed using this 
algebraic model we can often derive properties algebraically and interpret 
the properties graph theoretically. We will repeatedly use this technique 
in this paper. 


Apart from the algebraic model suggested in this paper we also offer 
two specific classes of networks that are especially attractive for distri- 
buted processing. We call these the star graphs and the pancake graphs. 
We describe these networks in some detail and compare them to the n- 
cubes. 


The paper is organized into 8 sections. In the second section we
define the algebraic model, called the Cayley graph model, and provide a
number of examples as well as some new networks. In Section 3 we
prove certain general properties of Cayley graphs. In Section 4 we define
transposition trees, which are a level of abstraction beyond Cayley graphs.
A specific Cayley graph resulting from a specific transposition tree, called
the star graph, is investigated in Section 5. In Section 6 we present some
preliminary ideas on designing algorithms on another class of such
graphs, called pancake graphs. In Section 7 we compose Cayley graphs
using a variety of group theoretic operations and study the resulting
graphs. The concluding Section 8 reiterates the fertile source of new
networks offered by this group theoretic model.


1. We point out that the problem of constructing large graphs of a given degree and
diameter, known as the (d,k) graph problem, is a well studied extremal graph theory problem.
However, solutions to the (d,k) graph problem often ignore the more subjective
parameters of this problem, such as symmetry, ease of routing and the structure of the graph,
which are essential for the design of efficient algorithms on these networks.


In the remainder of this section we will state the group theoretic and 
graph theoretic terminology that we will use in this paper. We will 
assume basic knowledge of elementary group theory and graph theory. 
The reader is referred to [14] and [15] for an elementary exposition of the
relevant terminology in group theory and graph theory, respectively.
Since we will only be considering finite groups we will represent our
groups as permutation groups. We will be using a one-row representation
for permutations. Thus, the permutation whose cycle representation is
(12)(345) will be represented by us as 21453. Recall that transpositions
are permutations with exactly one cycle of length 2 and all other cycles of 
length 1. Our descriptions of permutations will be quite informal. Thus, 
we will view a transposition as swapping a pair of symbols. 


As we mentioned in the beginning of the section, we will view
interconnection networks as undirected² graphs. Thus in the remainder of
the paper we will use the term graph in place of interconnection network.
Our graphs will be finite, undirected, loop-free and devoid of multiple
edges. We will make special use of cycles, complete graphs, trees, and
stars.


2. THE CAYLEY GRAPH MODEL 


Given a set of generators for a finite group G, we can draw a graph, 
called the Cayley graph, in which the nodes correspond to the elements of 
the group G and the edges correspond to the action of the generators. 
That is, there is an edge from an element a to an element b iff there is a 
generator g such that ag=b in the group G. We require that the set of 
generators be closed under inverses so that the resulting graph can be 
viewed as being undirected. 


Let us illustrate Cayley graphs with a few examples which will make
these ideas clearer. First, we point out that since we will be considering
permutation groups the generators are themselves permutations. We will
be representing permutations using the symbols 1, 2, 3,..., n. For
example, consider the generators 1324, 2143 and 4321. Since these generators
are permutations of four symbols they must generate either S_4 (which
itself consists of 24 permutations) or a subgroup³ of S_4. In this case the
subgroup generated by these three generators consists of the 8
permutations: 1234, 2143, 2413, 4231, 4321, 3412, 3142 and 1324. The
corresponding Cayley graph arising from the above generators is shown in
Figure 1. The reader can easily convince himself that the edges of this
graph correspond to the action of the generators.
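The construction can be tried out with a small sketch that closes the identity under right multiplication by the generators. The function names are ours, and the product convention (position i of ag holds the symbol of a found at position g(i)) is an assumption chosen to agree with the one-row notation used in this paper.

```python
from collections import deque

def compose(a, g):
    """Right-multiply a by g: position i of the result holds a[g[i]]."""
    return tuple(a[j - 1] for j in g)

def cayley_graph(generators):
    """Return {vertex: set of neighbours} for the group the generators span."""
    gens = [tuple(int(c) for c in g) for g in generators]
    identity = tuple(range(1, len(gens[0]) + 1))
    adjacency = {}
    queue = deque([identity])
    while queue:
        a = queue.popleft()
        if a in adjacency:
            continue
        # One edge per generator; the generators here are their own inverses,
        # so the resulting graph is undirected.
        adjacency[a] = {compose(a, g) for g in gens}
        queue.extend(adjacency[a])
    return adjacency

graph = cayley_graph(["1324", "2143", "4321"])
vertices = sorted("".join(map(str, v)) for v in graph)
print(len(vertices), vertices)
```

Running it on the three generators above yields the eight permutations listed in the text, each with three neighbours.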


Generators: 1324, 2143, 4321

Figure 1: A simple Cayley graph.


2. The techniques of this paper can also be used for directed graphs. However, we will limit
our attention to undirected graphs.

3. The reader unfamiliar with Lagrange's theorem would find it useful to know that in a
finite group the order of any subgroup (the number of elements in it) always divides the
order of the group.
As another example, consider the generators 213456, 124356 and
123465. The group generated by these generators again contains 8
permutations. These permutations and the corresponding Cayley graph are
shown in Figure 2. We might point out that this is, of course, the familiar
3-dimensional cube. Figures 3 and 4 show two additional examples of
Cayley graphs, where we have only shown the permutations at a few
selected nodes. In Figure 4 we have also shown the generators as labels
on the edges. Again we might simply mention that Figure 3 is a
3-dimensional cube-connected cycle, i.e., a 3-dimensional cube with its
corners chopped. Figure 4 is an example of a star graph that we will
discuss later in this paper.


Generators: 213456, 124356, 123465

Figure 2: 3-Cube as a Cayley graph.

Generators: 2134, 1342, 1423

Figure 3: 3-Dimensional cube-connected cycles as a Cayley graph.


The reader might have observed that all the Cayley graphs shown so
far are vertex symmetric. In fact,

Theorem 1: Every Cayley graph is vertex symmetric.

Proof: We need to show that given any two vertices in the Cayley graph
there exists an automorphism of the graph that maps one vertex into the
other. Let a and b be the permutations corresponding to the two vertices.
Consider the transformation on the group that maps an arbitrary
permutation x into ba^{-1}x. Clearly, this maps a into b. Further, this
transformation is an automorphism of the graph. For, if two permutations x and y
were connected by an edge in the graph then there is a generator g such
that xg=y. Then the images of the two permutations, ba^{-1}x and ba^{-1}y,
are connected by an edge since ba^{-1}xg=ba^{-1}y. Hence the proof. □
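The argument of the proof can be checked mechanically on a small case. The sketch below is an illustration only: it uses the star graph generators 2134, 3214 and 4231 on four symbols (which generate all of S_4), picks two arbitrary vertices a and b, and confirms that x -> ba^{-1}x sends a to b and carries edges to edges. The helper names are ours.

```python
from itertools import permutations

def compose(a, g):
    """Product ag in one-row notation: position i holds a[g[i]]."""
    return tuple(a[j - 1] for j in g)

def inverse(a):
    """Inverse permutation of a in one-row notation."""
    inv = [0] * len(a)
    for position, symbol in enumerate(a, start=1):
        inv[symbol - 1] = position
    return tuple(inv)

GENS = [(2, 1, 3, 4), (3, 2, 1, 4), (4, 2, 3, 1)]
VERTICES = list(permutations(range(1, 5)))          # all 24 elements of S_4
EDGES = {(x, compose(x, g)) for x in VERTICES for g in GENS}

def translation(a, b):
    """Return the map x -> b a^{-1} x used in the proof of Theorem 1."""
    c = compose(b, inverse(a))
    return lambda x: compose(c, x)

a, b = (2, 4, 1, 3), (3, 1, 4, 2)                   # two arbitrary vertices
phi = translation(a, b)

maps_a_to_b = phi(a) == b
is_bijection = len({phi(x) for x in VERTICES}) == len(VERTICES)
preserves_edges = all((phi(x), phi(y)) in EDGES for x, y in EDGES)
print(maps_a_to_b, is_bijection, preserves_edges)
```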


Generators: 


2134 (1) 
3214 (2) 
4231 (3) 


Figure 4: An example of a star graph. 


As mentioned in Section 1, an attractive feature of vertex symmetric
graphs is that routing between two arbitrary nodes reduces to routing from
an arbitrary node to a special node. We can now state this more formally.
First, observe that in a Cayley graph a path from one vertex to another can
be represented by a sequence of generators g_1, g_2, ..., g_p, where each g_i
is a generator of the Cayley graph. Now if g_1, g_2, ..., g_p is a path from x
to y then it is also a path from y^{-1}x to I, the identity permutation. Thus if
we want to find a route from x to y we can instead find a route from y^{-1}x
to I. Consequently the problem of routing reduces to the problem of
sorting!


We can now describe a class of Cayley graphs, which we will call the
bubble sort graphs. Recall that a Cayley graph is completely specified by
providing the set of generators. The generators for the n-th bubble sort
graph are the n-1 transpositions of the n symbols 1,2,...,n that
transpose adjacent symbols. Thus, for n=4 the generators are: 2134,
1324 and 1243. Notice that a path in this graph is a sequence of adjacent
transpositions. Thus, finding a route from a given permutation to the
identity permutation is equivalent to sorting the given permutation using the
familiar bubble sort algorithm. The reader can verify that the group
generated by this set of generators is the symmetric group S_n. Consequently,
the corresponding Cayley graph has n! vertices. Further, it is easily
shown that the corresponding Cayley graph has degree n-1 and diameter
$\binom{n}{2} = n(n-1)/2$.
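As a sanity check on these counts, the following sketch explores the n=4 bubble sort graph by breadth-first search from the identity and also routes a permutation with the ordinary bubble sort. The function names are ours; by vertex symmetry, the largest distance from the identity is the diameter.

```python
from collections import deque

def adjacent_swaps(p):
    """Neighbours of p: swap the symbols in positions i and i+1."""
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]
        yield tuple(q)

def size_and_diameter(n):
    """Breadth-first search from the identity of the n-th bubble sort graph."""
    start = tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        for q in adjacent_swaps(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return len(dist), max(dist.values())

def bubble_sort_route(p):
    """Route p to the identity; each entry i means 'swap positions i, i+1'."""
    p, route = list(p), []
    for end in range(len(p) - 1, 0, -1):
        for i in range(end):
            if p[i] > p[i + 1]:
                p[i], p[i + 1] = p[i + 1], p[i]
                route.append(i + 1)
    return route

print(size_and_diameter(4))
print(bubble_sort_route((4, 3, 2, 1)))
```

For n=4 the search visits 24 vertices, and the reversed permutation 4321 needs six adjacent transpositions, in agreement with n(n-1)/2.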
As a second example of a family of Cayley graphs we consider an old
combinatorial problem, called the pancake flipping problem [8]. The
problem is to sort a stack of n pancakes of different sizes by repeatedly
flipping top sub-stacks with a spatula. For example, consider an
arrangement of 5 pancakes represented by the permutation 23514. Let us adopt
the convention that the left end of the permutation is the top of the stack.
Thus the sorted stack would look like the identity permutation 12345. If
we apply a 3-flip, i.e., flip the top (left) three pancakes with the spatula,
the 23514 permutation will be transformed into 53214. Notice that such a
3-flip is equivalent to multiplying on the right by 32145. We give below a
sequence of flips to sort the given permutation:

23514 -> 53214 -> 12354 -> 45321 -> 54321 -> 12345
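The flip sequence above can be replayed with a few lines of code. The flip sizes (3, 4, 5, 2, 5) are not stated in the text; they are read off from consecutive permutations in the sequence, and the helper names are ours.

```python
def flip(stack, k):
    """Reverse the top (leftmost) k pancakes of the stack."""
    return stack[:k][::-1] + stack[k:]

def compose(a, g):
    """Product ag in one-row notation: position i holds a[g[i]]."""
    return "".join(a[int(j) - 1] for j in g)

stack = "23514"
trace = [stack]
for k in (3, 4, 5, 2, 5):          # flip sizes inferred from the sequence
    stack = flip(stack, k)
    trace.append(stack)

print(" -> ".join(trace))
# A 3-flip is the same as multiplying on the right by the generator 32145.
print(flip("23514", 3) == compose("23514", "32145"))
```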


We model the pancake flipping problem as a class of Cayley graphs
using generators representing the pancake flips. Thus, flipping the top i
of n pancakes with a spatula gives rise to the generator
i (i-1) ... 3 2 1 (i+1) (i+2) ... n. Clearly, there are (n-1) generators, one for
each value of i, 1 < i ≤ n. It is easy to show that the corresponding Cayley
graph has n! vertices with degree (n-1). Finding the diameter of the
pancake graph is equivalent to finding the maximum number of pancake flips
one would need to sort an arbitrary stack of pancakes. This problem is
still open, and the best known results can be found in [13].


For this paper we will give a simple routing algorithm that routes in at
most (2n-3) steps. Recall that instead of finding a path between two
arbitrary permutations, it suffices to find a path from one permutation to the
identity, i.e., sort a given permutation. In one step we can bring the
symbol n to the left most position using an appropriate flip (generator). In
one more step, we can bring n to the n-th position using the n-flip.
Thereafter, we can ignore the symbol n and sort the remainder of the
permutation, recursively. This yields a route of 2n steps. We can slightly
improve this by observing that the last two symbols, i.e., the symbols 1
and 2, do not require two steps each. In fact, when we have moved all the
other symbols to their place, 1 and 2 would require at most one more flip.
Hence the (2n-3) result. Note that we have in effect proved a (2n-3)
upper bound on the diameter of this Cayley graph. We will point out that
in contrast to the n-cubes, whose diameter and degree grow
logarithmically as a function of the number of vertices, the n-pancake graphs have
degree and diameter that grow slower than logarithmically as a function
of their size.
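A minimal sketch of this routing scheme follows, assuming the stack is given as a tuple of the symbols 1..n with the top of the stack on the left. The function names are ours. Symbols n down to 3 cost at most two flips each, and the final pair costs at most one, which gives the (2n-3) bound.

```python
from itertools import permutations

def flip(p, k):
    """Reverse the first k entries of the tuple p."""
    return p[:k][::-1] + p[k:]

def pancake_route(p):
    """Return the list of flip sizes that sorts p, using at most 2n-3 flips."""
    p, flips = tuple(p), []
    for k in range(len(p), 2, -1):      # place symbols n, n-1, ..., 3
        pos = p.index(k) + 1
        if pos == k:
            continue                    # already in its final position
        if pos != 1:
            p = flip(p, pos)            # bring symbol k to the top
            flips.append(pos)
        p = flip(p, k)                  # send it to position k
        flips.append(k)
    if len(p) >= 2 and p[0] > p[1]:     # symbols 1 and 2: at most one flip
        flips.append(2)
    return flips

n = 5
longest = max(len(pancake_route(p)) for p in permutations(range(1, n + 1)))
print(pancake_route((2, 3, 5, 1, 4)), longest)
```

The route found this way is a valid path to the identity but not necessarily a shortest one.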


Before we offer a picture of the pancake graph for a small value of n,
we will first mention some elementary properties of Cayley graphs. This
will allow us to use the underlying group theoretic structure of these
graphs in analyzing them.


3. PROPERTIES OF CAYLEY GRAPHS 


Since a Cayley graph is completely specified by a set of d
permutations as generators, the degree of the graph will, of course, be d. We
have already shown in Theorem 1 that every Cayley graph is vertex
symmetric. As we had mentioned in Section 1, a vertex symmetric graph has
the desirable property that the communication load is uniformly
distributed on all the vertices so that there is no point of congestion. A stronger
notion of symmetry, called edge symmetry, requires that every edge in the
graph look the same. That is, given two edges of the graph there is an
automorphism of the graph that maps one edge into the other. Such a
symmetry would ensure that the communication load is uniformly
distributed over all the communication links, so that there is no congestion at
any one link. It is easy to prove the following sufficient condition for a
Cayley graph to be edge symmetric. We state it without proof:


Theorem 2: Consider a Cayley graph defined by a set of d generators on
n symbols {1,2,3,...,n}. If for every pair of generators there exists a
permutation of the symbols that maps one generator into the other then the
Cayley graph is edge symmetric.


As a consequence of the above theorem we can conclude that the bubble 
sort graphs are edge symmetric. 


Apart from the symmetry properties of these Cayley graphs there is a
very useful decomposition of these graphs that can be seen using
elementary group theory. First, we observe that one of the attractive features of
the n-cube is its recursive decomposition into smaller cubes. Thus, an
n-cube can be viewed as consisting of two (n-1)-cubes, interconnected by
edges that are said to lie in the n-th dimension. We will now show that this
property can be abstracted as a group theoretic property and is possessed
by a number of different Cayley graphs.


Consider the 4-pancake graph defined by the 3 generators 2134, 3214
and 4321 on 4 symbols. Let us examine the subgraph of the 4-pancake
graph consisting of those vertices (i.e., permutations) that fix the symbol 4
in the 4th position. Clearly, there are 3! such permutations and thus six
vertices in this subgraph. Further the only edges that connect these
vertices in this subgraph are those that correspond to the first two generators,
since the last generator will move the symbol 4 from the 4th position.
Consequently, this subgraph is identical to a 3-pancake graph (on the
symbols {1,2,3}), with the symbol 4 being affixed at the end of each
permutation.


More interestingly, let us now examine the subgraph of the 4-pancake
graph consisting of the 6 permutations that fix the symbol 3 in the last
position. Once again we will find that the subgraph is identical to the
3-pancake graph, but this time on the symbols {1,2,4}, with the symbol 3
being affixed at the end of each permutation. In this manner we can
identify 4 mutually disjoint subgraphs of the 4-pancake graph, each of size
3!=6, with each subgraph being a copy of the 3-pancake graph. These 4
copies of the 3-pancake graph are then interconnected by edges that
correspond to the 4th generator, i.e., the generator corresponding to a flip
of all four pancakes. We can now offer the informative picture of Figure
5 illustrating the 4-pancake graph.


Figure 5: The 4-pancake graph. 


Finally we make the trivial observation that the 3-pancake graph itself 
(i.e., the hexagon) is made up of 3 copies of the 2-pancake graph which is 
itself a line. More generally, an n-pancake graph can be viewed as n 
copies of (n-1)-pancake graphs that are interconnected by edges 
corresponding to the n-flip. Further, this decomposition can be carried 
out recursively, so that each of the (n—1)-pancake graphs in turn are made 
up of n—1 copies of (n—2)-pancake graphs, and so on. 
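The decomposition of the 4-pancake graph described above can be verified directly. The sketch below groups the 24 vertices by their last symbol and checks that the 2- and 3-flips stay inside each group while the 4-flip always leaves it. The names are ours; a group of six vertices that is connected under two distinct involutions is a 6-cycle, i.e., the hexagon.

```python
from itertools import permutations

def flip(p, k):
    """Reverse the first k entries of the tuple p."""
    return p[:k][::-1] + p[k:]

def reachable(start, flip_sizes):
    """All vertices reachable from start using only the given flips."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for k in flip_sizes:
            w = flip(v, k)
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

vertices = list(permutations(range(1, 5)))

# Group the vertices by the symbol held in the last position.
blocks = {}
for v in vertices:
    blocks.setdefault(v[-1], []).append(v)

inner_flips_stay = all(flip(v, k)[-1] == v[-1] for v in vertices for k in (2, 3))
outer_flip_leaves = all(flip(v, 4)[-1] != v[-1] for v in vertices)
blocks_connected = all(reachable(b[0], (2, 3)) == set(b) for b in blocks.values())

print(len(blocks), [len(b) for b in blocks.values()])
print(inner_flips_stay, outer_flip_leaves, blocks_connected)
```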


We can now ask which Cayley graphs have this recursive
decomposition property. Notice that in the 4-pancake graph we used the property
that the 4-flip cannot be obtained through any combination of 2- and
3-flips. That is, the permutation corresponding to the 4-flip, i.e., 4321, is
outside the subgroup generated by the 2- and 3-flips. In fact, that is the
only requirement that is needed for such a decomposition. Thus, for
convenience we will define a Cayley graph to be hierarchical, if its
generators can be ordered as g_1, g_2, ..., g_d, such that for each i, 1 < i ≤ d, g_i is
outside the subgroup generated by the first i-1 generators. Hierarchical
Cayley graphs then have this recursive decomposition structure.


Another important property of interconnection networks is their fault 
tolerance. The fault tolerance of a graph is better defined through the 
graph theoretic property, called connectivity. The connectivity of a graph 
is the minimum number of vertices that need to be removed to disconnect 
the graph. The fault tolerance is then one less than the connectivity and 
indicates the maximum number of vertices that can be removed and still 
have the graph remain connected. Clearly, any graph can be disconnected 
by removing all the vertices adjacent to a given vertex. Thus its
connectivity can be at most its degree. It has been shown that hierarchical
Cayley graphs (with an additional size requirement) are maximally fault
tolerant. That is, their fault tolerance is exactly one less than their degree.


The last property we will mention in this section is the universality of
this model for symmetric graphs. Can every vertex symmetric graph be
represented as a Cayley graph? For example, the n-cube can be
represented as a Cayley graph by generalizing the example shown in
Figure 2. We will offer a more formal representation of the n-cube in
Section 7. We will also provide in that section a representation of the
cube-connected cycles as Cayley graphs. While most symmetric networks
considered in the literature can be viewed as Cayley graphs, there remain
certain vertex symmetric graphs that cannot be represented
as Cayley graphs. A prime example is the Petersen graph, shown in
Figure 6.


Figure 6: Petersen graph. An example of a vertex symmetric graph that
is not a Cayley graph.


However, one can extend the Cayley graph model to the quotient
graph of two Cayley graphs. In this paper we will merely give a rough
definition of the quotient and state (without proof) a theorem that
establishes the universality of this model. To define the quotient graph of two
Cayley graphs we select a subgroup of the group generated by the given
set of generators. We then identify the subgroup and all its cosets as
subgraphs of the Cayley graph. The quotient graph is obtained by reducing
these subgraphs to vertices and connecting two such vertices iff there
existed an edge between elements of the corresponding subgraphs.


Theorem 3: Every vertex symmetric graph can be represented as the
quotient of two Cayley graphs.


Finally, we mention an interesting open conjecture. Recall that a
Hamiltonian path is a path that visits every vertex exactly once. A
Hamiltonian cycle is a cycle that visits every vertex exactly once.


Conjecture: Every Cayley graph is Hamiltonian, i.e., has a Hamiltonian 
cycle. Further, every vertex symmetric graph has a Hamiltonian path. 


This conjecture has remained open in the sense that neither has it been
proven nor has a Cayley graph been shown to violate it. For specific
Cayley graphs it is often easy to establish that they are Hamiltonian. For example,
the Hamiltonian property of the n-cubes is demonstrated by Gray codes.
The Hamiltonian property of the n-pancake graphs has been shown in
[16].


As we have mentioned, most symmetric interconnection networks
that have been suggested in the literature can be represented as Cayley
graphs. However, we offer this group theoretic model not only to capture
existing networks, but also to design new networks. It is in this vein that
we suggest the pancake graphs. We have already pointed out how their
degree and diameter (as a function of their size) are more attractive than those of the
n-cubes. Later, in Section 6 we will show how we can design some
fundamental computational algorithms on the pancake graphs. But first, we
will show how even better graphs can be designed using the Cayley graph
model.


4. TRANSPOSITION TREES 


Again we recall that a Cayley graph is completely specified by
providing a set of d permutations as generators. We do point out a
requirement that this set of permutations be closed under inverses. In the
examples of Cayley graphs given above we have not explicitly shown that
this condition is met, since the generators that we have used have often been
involutions, i.e., self-inverses. A special class of involutions are the
transpositions: permutations that swap two symbols. For example, 12435 is
a transposition that swaps the symbols 3 and 4. All the generators of the
bubble sort graphs are transpositions. In this section we provide a model
for representing a set of (n-1) transpositions as generators.


Consider a tree on n vertices. We can label the vertices of this tree
with the symbols {1,2,3,...,n} and interpret the edges as transpositions.
For example, the tree on 6 vertices shown in Figure 7, gives rise to 5 tran- 
spositions: 321456, 132456, 124356, 123546 and 123654. Thus, we can 
interpret a tree as a set of transpositions, which in turn give rise to a Cay- 
ley graph. We call such a tree, a transposition tree. As another example, 
the path on n vertices gives rise to the bubble sort graph. The following 
general theorem about Cayley graphs of transposition trees is an indica- 
tion of the symmetry and structure underlying these graphs that can be 
readily uncovered by a simple group theoretic analysis: 


Theorem 4: Let Γ be a Cayley graph of a transposition tree of order n.
Then,

1. Γ has n! vertices. (A well known result attributed to Polya.)

2. Both the edge and vertex connectivity of Γ are maximal, i.e., equal
   to its degree.

3. The chromatic index of Γ is equal to its degree.

4. Γ can be represented as an interconnection of n identical copies of a
   Cayley graph of a transposition tree of order n-1, and hence is
   hierarchical.
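
The construction of a Cayley graph from a transposition tree can be sketched in a few lines of Python. This is only an illustration, not part of the original development: the function names are our own, and the edge list is the tree read off from the five transpositions listed for Figure 7. The sketch enumerates the graph by breadth-first search so that the vertex count in part 1 of Theorem 4 can be checked directly.

```python
from collections import deque

def neighbors(perm, tree_edges):
    """Apply each tree edge (i, j) as a swap of positions i and j (1-based)."""
    result = []
    for i, j in tree_edges:
        p = list(perm)
        p[i - 1], p[j - 1] = p[j - 1], p[i - 1]
        result.append(tuple(p))
    return result

def cayley_graph(n, tree_edges):
    """Breadth-first search from the identity; maps each vertex to its distance."""
    start = tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbors(v, tree_edges):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

# Edges corresponding to 321456, 132456, 124356, 123546 and 123654.
FIGURE_7_EDGES = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6)]
dist = cayley_graph(6, FIGURE_7_EDGES)
print(len(dist))  # number of vertices reached from the identity
```

Every vertex has one neighbor per tree edge, so the degree is n-1; the search reaches all 6! arrangements because the transpositions of a tree generate the full symmetric group.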


We can view the Cayley graph of a transposition tree as the state
diagram of a puzzle. Consider the vertices of the transposition tree to be
labeled as suggested above. Now place n markers, each labeled with a
symbol from {1, 2, 3, ..., n}, at the vertices of the tree in any arbitrary way.
The puzzle is to move the markers to their appropriate positions by moves
consisting of interchanging the markers at the ends of an edge in the
transposition tree. The Cayley graph is then the state diagram of such a
puzzle. That is, the vertices of the Cayley graph are the possible
arrangements of markers at the vertices of the tree. The edges of the Cayley
graph correspond to the permissible moves in this puzzle. Finding a path in
the Cayley graph corresponds to sorting a permutation, which in turn
corresponds to solving this puzzle. It can be shown:


Figure 7: An example of a transposition tree.


Theorem 5: Let T be a transposition tree on n nodes. Given an assignment
of markers as a permutation, π, of the nodes of T, π can be sorted in at
most

\[ (\#\text{ of cycles in } \pi) - n + \sum_{i=1}^{n} d(i, \pi(i)) \]

steps, where d(i, j) is the distance between nodes i and j in T.


We remark that the above bound on the diameter of the Cayley graph
conforms to the diameter of the bubble sort graph. In fact,


Theorem 6: Of all trees on n nodes, the path yields a Cayley graph (i.e.,
the bubble sort graph) of maximum diameter.
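
As a sanity check on Theorems 5 and 6, the following Python sketch compares the exact number of moves (found by breadth-first search) with the bound, read here as (# of cycles) - n + Σ d(i, π(i)). The three trees on 5 nodes and all function names are our own illustrative choices, and the check covers only these small cases.

```python
from collections import deque

def tree_distance(n, edges):
    """All-pairs distances d[(i, j)] between nodes of the tree."""
    adj = {v: [] for v in range(1, n + 1)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    d = {}
    for s in adj:
        d[s, s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if (s, w) not in d:
                    d[s, w] = d[s, v] + 1
                    queue.append(w)
    return d

def cycle_count(perm):
    """Number of cycles of perm, fixed points (invariances) included."""
    seen, cycles = set(), 0
    for i in range(1, len(perm) + 1):
        if i not in seen:
            cycles += 1
            while i not in seen:
                seen.add(i)
                i = perm[i - 1]
    return cycles

def theorem_5_bound(perm, d):
    n = len(perm)
    return cycle_count(perm) - n + sum(d[i, perm[i - 1]] for i in range(1, n + 1))

def sorting_steps(n, edges):
    """Exact number of moves for every arrangement (BFS from the sorted one)."""
    start = tuple(range(1, n + 1))
    steps = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for i, j in edges:
            w = list(v)
            w[i - 1], w[j - 1] = w[j - 1], w[i - 1]
            w = tuple(w)
            if w not in steps:
                steps[w] = steps[v] + 1
                queue.append(w)
    return steps

TREES = {  # three hypothetical trees on 5 nodes
    "path": [(1, 2), (2, 3), (3, 4), (4, 5)],
    "star": [(1, 2), (1, 3), (1, 4), (1, 5)],
    "fork": [(1, 2), (2, 3), (3, 4), (3, 5)],
}
diameters = {}
for name, edges in TREES.items():
    d = tree_distance(5, edges)
    steps = sorting_steps(5, edges)
    # No arrangement needs more moves than the bound allows.
    assert all(s <= theorem_5_bound(p, d) for p, s in steps.items())
    diameters[name] = max(steps.values())
print(diameters)
```

Among these three trees the path gives the largest diameter, in line with Theorem 6.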


We omit proofs of the above theorems for brevity. Instead we now 
focus our attention on a specific transposition tree. 


5. THE STAR GRAPH. 


Motivated by Theorem 6, we consider the other extreme tree, namely
the star, as a transposition tree. The resulting Cayley graph is called
the star graph. Since this is an especially attractive alternative to the
n-cube, we provide an informal description of this graph. The nodes of the
graph are labeled by permutations of 1 through n. A permutation is
connected to every other permutation that can be obtained from it by
interchanging the first symbol with any of the other symbols. Thus, clearly
the degree of the graph is n-1. An illustration of the 4-star graph was
given in Figure 4.


Let us examine how we might route within this graph. Recall that
routing between two vertices in a Cayley graph is equivalent to sorting a
permutation. So we need to ask how we might sort a given permutation
by exchanging the first symbol with any of the other symbols. For
example, consider the permutation 64725831. Let us employ a greedy
algorithm where we observe that the symbol in the first position, namely 6,
can be moved to its correct position by exchanging 6 and 8. This gives us
the permutation 84725631. Again, a greedy move gives us 14725638.
Now we are stuck, since 1 is already at its own position. We now waste
one step and move 1 into any position not occupied by the correct symbol.
In this case we could interchange 1 and 4, yielding the permutation
41725638. Now following the greedy approach gives us 21745638 and
subsequently 12745638. Again we need to waste a step and insert 1 into a
position not occupied by the correct symbol. Thus, interchanging 1 with 7
gives 72145638. Reverting to the greedy step gives 32145678 and
the final move to 12345678. Notice that we took 8 moves to sort this
permutation. That means that in the Cayley graph we have shown a path of
length 8 between 64725831 and the identity.
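
The greedy routing just described can be written as a short Python sketch. The function name is ours, and one detail is an assumption: when 1 is in the first position, the sketch always picks the leftmost misplaced position for the wasted step, which happens to reproduce the choices made in the example above.

```python
def star_route(perm):
    """Sort perm using only swaps of the first symbol; returns the path taken."""
    p = list(perm)
    n = len(p)
    path = [tuple(p)]
    while p != sorted(p):
        first = p[0]
        if first != 1:
            # Greedy move: send the first symbol to its home position.
            j = first - 1
        else:
            # Wasted move: put 1 into the leftmost position holding a wrong symbol.
            j = next(k for k in range(1, n) if p[k] != k + 1)
        p[0], p[j] = p[j], p[0]
        path.append(tuple(p))
    return path

path = star_route((6, 4, 7, 2, 5, 8, 3, 1))
for step in path:
    print("".join(map(str, step)))
```

The path has nine entries, i.e., eight moves, and visits the same intermediate permutations as the worked example.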


Of course, this does not establish the diameter of the Cayley graph.
For that we must find the permutation that requires the maximum number
of steps to sort. Instead, we will derive the exact number of steps required
to sort an arbitrary permutation in the following Lemma:


Lemma 1: The number of steps required to sort a permutation π using the
generators of the star graph is given by (see footnote 4):

\[ n + (\#\text{ of cycles in } \pi) - 2\,(\#\text{ of invariances in } \pi) -
   \begin{cases} 2 & \text{if } \pi(1) \neq 1 \\ 0 & \text{otherwise} \end{cases} \]


Proof: Follows from the routing algorithm described above. □


Theorem 7: The diameter of the n-star graph is \lfloor 3(n-1)/2 \rfloor, and its
average diameter is n + 2/n + H_n - 4, where H_n is the nth Harmonic number.


Proof: The diameter follows by maximizing the formula given in Lemma 1.
To establish the average diameter we total the quantity given in
Lemma 1 for each π and divide by n!. It is well known (see [17, p.176])
that the average number of cycles in a permutation of n symbols is H_n,
the nth harmonic number. Thus, the total number of cycles over all n!
permutations is n! H_n. It is also easy to establish that the total number of
invariances over all permutations of n symbols is n!. Finally, the total
number of permutations π such that π(1) = 1 is (n-1)!. Thus, the average
diameter is:

\[ \frac{1}{n!}\Bigl[\, n\,(n!) + n!\,H_n - 2\,n! - 2\bigl(n! - (n-1)!\bigr) \Bigr]
   = n + H_n - 4 + \frac{2}{n}. \qquad \Box \]
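
Lemma 1 and Theorem 7 can be checked by brute force for small n. The Python sketch below is our own illustration: it assumes the formula of Lemma 1 reads n + (# of cycles) - 2(# of invariances), less 2 when π(1) ≠ 1, and compares it with exact breadth-first distances in the star graph.

```python
from collections import deque
from fractions import Fraction
from math import factorial

def star_distances(n):
    """Exact distance from the identity to every vertex of the n-star graph."""
    start = tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for j in range(1, n):
            w = list(v)
            w[0], w[j] = w[j], w[0]
            w = tuple(w)
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def lemma_1(perm):
    """Number of steps predicted by Lemma 1 (cycles include invariances)."""
    n = len(perm)
    seen, cycles = set(), 0
    for i in range(1, n + 1):
        if i not in seen:
            cycles += 1
            while i not in seen:
                seen.add(i)
                i = perm[i - 1]
    invariances = sum(1 for i in range(1, n + 1) if perm[i - 1] == i)
    return n + cycles - 2 * invariances - (0 if perm[0] == 1 else 2)

def check(n):
    """Returns (diameter, exact average distance, average predicted by Theorem 7)."""
    dist = star_distances(n)
    assert all(steps == lemma_1(p) for p, steps in dist.items())
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    average = Fraction(sum(dist.values()), factorial(n))
    return max(dist.values()), average, n + Fraction(2, n) + harmonic - 4

for n in range(2, 7):
    print(n, check(n))
```

Exact rational arithmetic (`Fraction`) is used so that the measured and predicted averages can be compared for equality rather than to within rounding.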


Let us compare this network against the n-cube. Recall that the n-cube
interconnects 2^n vertices with degree n and diameter n. In contrast, the
n-star graph interconnects n! vertices with degree n-1 and diameter
\lfloor 3(n-1)/2 \rfloor. Notice that both the degree and the diameter of the
star graph grow more slowly than a logarithmic function of its size. Thus,
asymptotically, the star graphs offer a network with fewer interconnecting
edges and smaller communication delays than the n-cubes. Even from a
practical point of view, it is evident from Table 1 that, purely based on the
degree and diameter requirements, the star graph is superior. Table 1
provides a comparison of various n-cubes against comparable n-star graphs.


4. The number of cycles in a permutation includes the number of invariances. 


                  Size     Degree     Diameter
n-cube            2^n      n          n
n-star graph      n!       n-1        floor(3(n-1)/2)

Table 1. A comparison of n-cubes and n-star graphs.
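
A numerical comparison of this kind can be generated from the formulas above. In the Python sketch below, the pairing rule is our own assumption: each n-star graph is matched with the smallest cube having at least n! vertices. The specific rows are therefore illustrative and are not taken from the original table.

```python
from math import factorial

def star_parameters(n):
    """(size, degree, diameter) of the n-star graph."""
    return factorial(n), n - 1, (3 * (n - 1)) // 2

def comparable_cube_dimension(n):
    """Smallest k with 2**k >= n!, computed exactly with integer arithmetic."""
    return (factorial(n) - 1).bit_length()

print("star n  size      degree  diameter | cube k  size      degree=diameter")
for n in range(3, 10):
    size, degree, diameter = star_parameters(n)
    k = comparable_cube_dimension(n)
    print(f"{n:<7} {size:<9} {degree:<7} {diameter:<8} | {k:<7} {2**k:<9} {k}")
```

For every n in this range the star graph has both a smaller degree and a smaller diameter than the cube it is paired with.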


Of course, the degree and the diameter are not the only considerations
in choosing a specific network. Let us examine some of the other
relevant properties. The connectivity of the n-cube is n, indicating that
up to n-1 vertices can fail without disrupting the network. Recall that the
connectivity is at most equal to the degree of the graph. In the case of the
n-star graph the degree is n-1 and, indeed, its connectivity is also n-1.
So it is maximally fault tolerant. This is a result from [10]. Actually,
such a worst-case fault tolerance measure does not really reflect the
practical fault tolerance of the network. Even though the n-cube has a
connectivity of n, the only way to remove n vertices and disconnect the
graph is by removing all the n neighbors of any one vertex, a very
unlikely event. Likewise, even though the connectivity of the n-star
graph is n-1, the only way to remove n-1 vertices and disconnect the
graph is to remove all the n-1 neighbors of any one vertex. In general,
both these networks can tolerate a much higher number of failures.


With regard to symmetry considerations, we recall that the n-cube is
both vertex and edge symmetric. This alleviates any congestion problems.
We know that the star graphs, being Cayley graphs, are vertex symmetric.
Further, it is easy to show that the Cayley graph of an edge symmetric
transposition tree is itself edge symmetric. Thus, the star graphs,
arising from the star as a transposition tree, are also edge symmetric.


We have already noted the recursive decomposition structure of the
n-cube. Does the star graph possess that property? It is easy to see from
the generators that the star graphs are hierarchical. We can infer from that
that they can be recursively decomposed. But, better yet, they can be
recursively decomposed into n copies of (n-1)-star graphs, each of which
is in turn further decomposable into smaller star graphs. Such a recursive
decomposition can be identified in the diagram of the 4-star graph given
in Figure 4. Since the 3-star graph is a hexagon, the 4-star graph consists
of 4 copies of the hexagon interconnected by edges corresponding to the
3rd generator. Even though a preliminary inspection of Figure 4 suggests
6 hexagons, we should isolate those hexagons whose edges are labeled
with 1 and 2 only. The reader will notice 4 such hexagons. These are
then interconnected by the edges labeled 3.


Additionally, like the n-cube, which can be decomposed along any
one of its n dimensions, the star graphs can be decomposed along any one
of their "n dimensions." Let us explain this further. Observe that an n-cube
is made up of two copies of an (n-1)-cube connected by edges in the nth
dimension. However, any one of the edges of the n-cube can be viewed
as the nth dimension. Similarly, any one of the edges in the star graph can
be viewed as the last dimension. So we can break it up into n copies of
(n-1)-star graphs along that dimension. We do this by observing that if
we consider the subgraph consisting of all the permutations that fix the
symbol i (1 < i ≤ n) in the ith position, then we are left with (n-1)! vertices
that are interconnected by edges that correspond to swapping the first
symbol with any one of the other symbols, except the symbol in the ith
position.
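
The decomposition can be observed directly on the 4-star graph. The Python sketch below is our own illustration with hypothetical names: it deletes the edges that swap the first position with a chosen position and collects the connected components that remain. Each component should be a copy of the 3-star graph (a hexagon) in which the symbol at the chosen position never changes.

```python
from itertools import permutations

def components(n, skip):
    """Components of the n-star graph after deleting the edges that swap
    position 1 with position `skip` (positions are 1-based, 2 <= skip <= n)."""
    unvisited = set(permutations(range(1, n + 1)))
    found = []
    while unvisited:
        start = unvisited.pop()
        component, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for j in range(1, n):
                if j == skip - 1:
                    continue  # this generator has been removed
                w = list(v)
                w[0], w[j] = w[j], w[0]
                w = tuple(w)
                if w in unvisited:
                    unvisited.remove(w)
                    component.add(w)
                    stack.append(w)
        found.append(component)
    return found

for skip in (2, 3, 4):
    parts = components(4, skip)
    print(skip, sorted(len(part) for part in parts))
```

The same split into four hexagons appears whichever position is chosen, which is the sense in which any generator can play the role of the "last dimension."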


We have tried to make a case for the star graphs by comparing them
to the n-cubes. We believe that the Cayley model extracts the attractive
properties of the n-cubes and formulates them in an abstract setting. This
allows us to design other networks that possess similar properties, the star
graphs being a prime example. An issue that we have not addressed in
this section is how one uses the interconnection structure of these Cayley
graphs to develop specific computational algorithms that can be executed
on these networks. We address that in the next section using the pancake
graphs as our example.


6. ALGORITHMS ON THE PANCAKE GRAPHS 


Recall that the n-pancake graphs are obtained using the n-1 possible
pancake flips, which we will denote by f_2, f_3, ..., f_n, as generators. The
ith pancake flip, f_i, is a permutation that reverses the prefix of i symbols
in the identity permutation. In Section 2 we analyzed the degree and
diameter of these graphs and also their recursive decomposition structure.


In this section we show how we might implement certain fundamental
algorithms on such a network. First, we should recount similar algorithms
on the n-cube. Suppose we wish to find the maximal element amongst 2^n
elements distributed among 2^n processors located at the vertices of an
n-cube. A straightforward algorithm involves n time steps. At the ith step
every processor communicates with the processor connected to it along
the ith dimension and compares notes on the maximal element that it has
encountered so far. We then claim that at the end of the n steps every
processor knows the value of the maximal element. The proof of the
claim goes as follows: Let d_i represent an edge along the ith dimension
and let a word of the form d_2, d_4, d_3 represent a path in the n-cube starting
from a specified vertex. Using these conventions, consider the word
d_1, d_2, ..., d_n. It is easy to see that given any two vertices on the cube,
there is a subsequence (see footnote 5) of the above word that forms a path
from the first to the second. Consequently, for every vertex a of the n-cube
there is a subsequence of the above word that forms a path from the vertex
containing the maximal element to a. This establishes the claim that the
processor at every vertex knows the maximal element.
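
The n-cube algorithm recounted above can be simulated sequentially. In the Python sketch below (an illustration with names of our own choosing), processor v is identified with the binary number v, and its neighbor along dimension i is obtained by flipping bit i. Each pass of the loop represents one synchronous time step.

```python
def hypercube_max(values):
    """Simulate the n-step maximum-finding algorithm on a cube with len(values) nodes."""
    size = len(values)
    n = size.bit_length() - 1
    assert size == 1 << n, "the number of processors must be a power of two"
    known = list(values)
    for i in range(n):
        # Every processor exchanges its current maximum with its neighbor
        # along dimension i; all exchanges in a step happen simultaneously.
        known = [max(known[v], known[v ^ (1 << i)]) for v in range(size)]
    return known

print(hypercube_max([3, 9, 4, 1, 7, 2, 8, 5]))
```

After the n steps every entry of the returned list equals the global maximum, which is the claim proved in the text.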


The reason for detailing the above algorithm is to establish the
background necessary to implement a similar algorithm on the pancake graphs.
What we need is a word on the alphabet of the generators {f_i}, such that
given any two vertices in the pancake graph there exists a subsequence of
the suggested word that forms a path between the two vertices. Actually,
it is sufficient to show that we can sort an arbitrary permutation. We
provide precisely that. First, we point out that in an algorithm such as the one
suggested above for the n-cube each processor communicates with only
one other processor at each time step. Consequently, the number of
processors that know the value of the maximal element can at most double at
each step. Thus, any such algorithm must take at least as many steps as
the logarithm (see footnote 6) of the number of vertices. In the case of the
n-pancake graphs, which contain n! vertices, this lower bound is log n!,
which is of order n log n.

Consider the following word (a single word of 21 symbols broken up
into many lines) on the generators of an 8-pancake graph:


f_7 f_4 f_2 f_8
f_6 f_3 f_2 f_7
f_5 f_3 f_2 f_6
f_4 f_2 f_5
f_3 f_2 f_4
f_2 f_3
f_2


The above word has been broken up into many lines to emphasize the 
block structure of the word, and the groupings within each block have 
been appropriately indicated. 


5. A subsequence need not necessarily be a contiguous subsequence.
6. All logarithms are in base 2.


We claim that for every permutation there is a subsequence of the above
word that forms a path to the identity, i.e., sorts the given permutation. To
sort the permutation we first bring the largest symbol, namely 8, to the
first position. This is done using a subsequence of the first three letters in
the word (grouped together to indicate that). It should be clear that we
can bring 8 to the first position no matter where 8 starts out. Having done
that we take 8 to the last position using f_8. Thus, the first block of the
word is sufficient to position 8 in the correct place. As the reader might
have guessed, the subsequent blocks of the word are each sufficient to
position each of the remaining symbols in their correct places. We have
proved the following general theorem:


Theorem 8: There is an O(n log n) algorithm to find the maximal
element amongst n! elements distributed among n! processors located at the
vertices of an n-pancake graph.


Proof: It should be clear how to generalize the word suggested above for
the 8-pancake graph into a general word for the n-pancake graph.
Further, we can also similarly prove that every permutation of n symbols
can be sorted by a subsequence of that word. The algorithm, then, is
merely to execute one letter of that word at each time step. Executing a
letter requires that every vertex communicate with the vertex connected to
it along the edge corresponding to that generator and compare notes on
the maximal element that it has encountered so far. Thus the (parallel)
time taken by this algorithm is exactly the length of the word. It is easy to
see that the length of the word is O(n log n). □
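
A word with the property used in this proof can be generated and checked mechanically. The Python sketch below uses one possible construction, which is our assumption rather than a transcription of the word printed above: for each k from n down to 2, a few "halving" flips bring symbol k to the front from any of the first k-1 positions, and then f_k sends it to position k. For n = 8 this construction also has 21 letters. The check exploits the fact that flips are their own inverses, so the permutations sortable by a subsequence of the word are exactly those reachable from the identity by a subsequence of the reversed word.

```python
from math import factorial

def flip(perm, i):
    """Pancake flip f_i: reverse the first i symbols."""
    return perm[:i][::-1] + perm[i:]

def pancake_word(n):
    """Indices i of the flips f_i, block by block, for symbols n, n-1, ..., 2."""
    word = []
    for k in range(n, 1, -1):
        m = k - 1            # symbol k is somewhere in positions 1..m
        while m > 1:
            word.append(m)   # applying f_m moves positions above ceil(m/2) into the lower half
            m = (m + 1) // 2
        word.append(k)       # move symbol k from the front to position k
    return word

def sortable(n):
    """Set of permutations that some subsequence of pancake_word(n) sorts."""
    reach = {tuple(range(1, n + 1))}
    for i in reversed(pancake_word(n)):
        reach |= {flip(p, i) for p in reach}
    return reach

print(pancake_word(8), len(pancake_word(8)))
print(all(len(sortable(n)) == factorial(n) for n in range(2, 8)))
```

Each block contributes about log k letters, so the length of the word grows as n log n, in agreement with Theorem 8.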


In fact, we can use the above technique to construct a binary tree of
depth O(n log n) over the n! vertices of the pancake graph. Every edge
of this tree is an edge of the pancake graph. Thus, we can simulate a
binary tree. Once we do that we can perform any computation for which
the solution on the n-cube employs the construction of a binary tree. For
example, as the following theorem states, we can compute prefixes over
an arbitrary associative binary operation.


Theorem 9: Given any associative binary operation * and an assignment
of values x_i to each of the n! processors located at the vertices of an
n-pancake graph, there is an O(n log n) algorithm that computes (in
parallel) the prefix x_1 * x_2 * ... * x_i at processor i. (The n! vertices must
be numbered appropriately.)


Proof: Omitted. □


The prefix computation problem is interesting because it is a general
formulation of a class of problems. For example, the problem of computing
the maximal element can be viewed as the prefix computation problem
over the associative binary operation MAX. Other trivial applications
include computing an n-ary associative operation, such as summation.
Non-trivial applications include addition of binary numbers (particularly,
computing the carry bits), simultaneous evaluation of polynomials, etc.
(see [18]). The prefix computation is merely an example of how we might
design algorithms on the pancake graphs. Clearly, there is considerable
scope for further work in this area.
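
To make the generality of the prefix formulation concrete, the Python sketch below computes prefixes sequentially with the standard library function `itertools.accumulate`. It illustrates only what is being computed, not the parallel algorithm of Theorem 9. The carry operator on (generate, propagate) pairs is the usual carry-lookahead formulation and is our own choice of example; the function names are hypothetical.

```python
from itertools import accumulate
from operator import add

def prefixes(values, op):
    """x_1, x_1*x_2, ..., x_1*...*x_i for an associative operation op."""
    return list(accumulate(values, op))

def carry_op(low, high):
    """Combine (generate, propagate) pairs; `low` covers the less significant bits."""
    g_low, p_low = low
    g_high, p_high = high
    return (g_high | (p_high & g_low), p_low & p_high)

def add_via_prefix(a, b, width):
    """Add two width-bit numbers, obtaining all carry bits from one prefix computation."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    pairs = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]
    carries = [g for g, _ in prefixes(pairs, carry_op)]  # carry out of each bit
    total = 0
    for i, (_, p) in enumerate(pairs):
        carry_in = carries[i - 1] if i > 0 else 0
        total |= (p ^ carry_in) << i
    return total | (carries[-1] << width)

print(prefixes([3, 1, 4, 1, 5], max))
print(prefixes([3, 1, 4, 1, 5], add))
print(add_via_prefix(11, 6, 4))
```

The same `prefixes` routine serves all three uses; only the associative operation changes.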


In the last two sections we have made a strong case for two new
interconnection networks: the star graphs and the pancake graphs. We have
shown how they compare against the n-cubes. We have used the n-cubes
as our basis for comparison since (we believe) they are the most popular
non-trivial interconnection network. However, the n-cubes themselves are
Cayley graphs, as we mentioned in Section 1. Even some of the variants
of the n-cubes, such as the cube-connected cycles, can be abstracted as
Cayley graphs. However, this requires that we develop an interesting
connection between group theoretic products and graph theoretic
products. Unfortunately, space limitations prohibit development of that
connection in this paper. Instead, we refer the curious reader to the more
detailed report [19].


7. CONCLUSION 


The main conclusion of this paper is that this group theoretic view of
symmetric interconnection networks not only abstracts the structure and
symmetry properties that make the n-cubes so attractive, but also offers a
fertile source of other promising topologies. We can now bring to bear
the algebraic tools in analyzing such symmetric networks (see footnote 7).
The universality of this model, indicated by Theorem 3, allows us to present
all symmetric networks in a uniform and comparable framework.


7. An extensive survey of the literature on the use of group theoretic techniques in analyzing
symmetries in graphs can be found in [20].


One of the immediate advantages of using Cayley graphs as a tool
for modeling interconnection networks is the conciseness with which a
symmetric network can be specified, i.e., by providing the set of
generators for the Cayley graph. Furthermore, casting such a network as a
Cayley graph immediately allows one to infer all the generic properties of
Cayley graphs.


Aside from representing known interconnection networks as Cayley
graphs, we have pointed out how this model offers an inexhaustible
source of new topologies. In particular, the ability to start from a finite
group and construct a Cayley graph is particularly attractive. We believe
that we have hardly made a dent in the possible topologies that one could
investigate. And even with such a limited investigation we have
uncovered many topologies that compare favorably against the n-cubes.


The n-cubes have a number of attractive features. But many of these
features can be interpreted as symmetry properties that many other Cayley 
graphs could well possess. We have indicated this at many places during 
the course of our presentation. A case in point is the recursive decompo- 
sition structure of Cayley graphs. That is really an analysis of the cosets 
of the group with respect to a given subgroup. 


Finally, we have made an initial attempt at showing how we might
design certain fundamental algorithms on such networks as the pancake
graphs. We believe that the design of algorithms on such networks could
conceivably use the algebraic properties of these graphs. There are
definitely many open problems along these lines.


There are also many other features of an interconnection network that
we have not addressed at all in this paper. For example, what are the
issues involved in laying out these graphs? Can the n-star graphs be laid
out at least as efficiently as the comparable n-cubes? We do conjecture
that the n-star graphs can be laid out on a surface of genus (n-2).


In summary, we have offered a unified view of symmetric intercon- 
nection networks and suggested some new networks — the star graphs 
and the pancake graphs. We believe that we have done little more than 
scratch the surface of an obviously fertile field. 
ABSTRACT


Parallel communication algorithms are central to large scale 
parallel computing. This paper identifies worst-case source- 
destination traffic patterns, and proposes a scheme for 
obtaining relief by means of randomized routing of packets on 
simple extensions of the well-known omega networks. The 
communication problem is considered in two separate contexts: 
the non-renewal, synchronous and the renewal, asynchronous. 
In the non-renewal context we show that our scheme performs 
as well as Valiant’s. The algorithm extends naturally from the 
non-renewal to the renewal. First, we explicitly identify the 
worst-case traffic intensities in the internal links of the 
extended omega networks over all source-destination traffic 
specifications which satisfy loose bounds. Second, the benefits 
of randomization on the stability of the network are identified. 
Third, exact results, for certain restricted models for sources
and transmission, and approximate analytic results, for quite
general models, are derived for the mean delays.


1. INTRODUCTION 


This paper addresses the concern that particular, not rare, 
source-destination traffic patterns may cause unbalanced usage 
of the internal links of an interconnection network and thereby 
severely degrade performance in large parallel computers 
[1,2]. Worst-case traffic patterns and their effects are 
identified, and a scheme for obtaining relief by means of 
randomized routing on simple extensions of the omega network 
is proposed and analyzed. Another feature of this work is that 
the main performance results are given as they depend on a 
parameter r which measures both the degree of extension of 
the omega network and the degree of randomization. The 
omega network [3] corresponds to r =0 and the maximum 
value of r corresponds to the fully extended omega network 
which has one less than twice as many stages as the omega 
network. 


The communication problem with N sources and N 
destinations is considered in two separate contexts, the non- 
renewal and the renewal. The non-renewal context was 
introduced by Valiant [4-6] and is a model for synchronous 
communications. Here the task is to complete a partial h- 
relation [4] in which initially each of the sources have at most 
h packets, with no destination occurring on more than h 
packets. The assumption is that all link queues are initially 
empty and that no packets are allowed into the network while 
the task is in progress. Valiant exhibited an algorithm on a
log N-dimensional hypercube which completed the task in time
O(log N) with overwhelming probability. Now, the hypercube
requires switches of degree log N which is unbounded. The 
algorithm that we give uses switches of fixed degree and also it 
requires no scheduling [6] — the queue discipline is throughout 
first-come-first-served. For the fully extended omega network 
and for the same non-renewal context, we prove that the 
scheme has as good delay characteristics as has been proven 
for Valiant’s scheme. 


The main results of this paper are on the renewal
context in which packets are generated, and allowed into the
network, continually and asynchronously. Here the main
interest is in the stationary throughput-delay characteristics [7]
for various traffic patterns. The packets at each source are
assumed to form stationary renewal processes. The scheme
carries over naturally from the non-renewal context. (Valiant's
scheme is not defined for this context.) It is not tied to the
omega network with 2 x 2 switches, which we have chosen to
be specific, and other pipelined networks with switches of fixed
degree can serve just as well. The randomized routing
algorithm requires each packet to be given a statistically
independent scattering ticket. A distributed switching rule
implements the routing.


The analysis in the renewal context considers partial
λ-relations in which the intensity, or average rate, at each
source is no more than λ and the intensity at each destination
is no more than λ. We first give an explicit characterization of
the extremal traffic on the internal links of the extended omega
networks. The mean delay is nonlinear in λ and rapidly goes
to ∞ as λ approaches a stability boundary. Randomization has
the effect of extending the region of stability. The stationary
mean time from source to destination is characterized for
various stochastic models of the source, link transmission time
and the degree of randomization. In particular, on the fully
extended omega network this mean time is asymptotically
proportional to log N in the region of stability.


In conclusion, the results of the present worst-case analysis 
argue strongly for maximum randomization. This paper 
reports partial results; other results and proofs may be found in 
[13]. 


2. THE EXTENDED OMEGA NETWORK AND 
RANDOMIZED ROUTING 


The extended omega network is obtained by preceding an
omega network of n stages (N = 2^n) by a scattering network
with r stages, 1 ≤ r ≤ n-1. See Figure 1 for an example
with n = 3 and r = 1. At each stage of the scattering network,
traffic entering a switch is, on the average, routed to each
output port evenly. Scattering is accomplished by assigning a
scattering ticket to each packet at the source. Each scattering
ticket is a binary r-tuple c_r, ..., c_1 in which each element is
obtained from the result of a completely independent trial with
equiprobable outcomes. A packet with scattering ticket
c_r, ..., c_1 is routed to the output port with the least
significant bit of its address given by c_i at the ith stage of the
scattering network. Thereafter the packet is routed through
the omega network as usual.


We describe an indexing system for the stages and links in
the extended omega network. Of the n+r stages, the one
closest to the sources is called the first stage. The ports on
both sides of each switch are indexed identically, with all port
addresses being n-bit binary sequences a_n, ..., a_1 with zero
at the top.

Each link is given an address with the format
(i; a_n, ..., a_1), where i denotes the stage at which it
originates and a_n, ..., a_1 is the address of the port in the
switch at the (i+1)th stage where the link terminates. The
wiring connects output port a_n, ..., a_1 at the ith stage to the
input port a_1, a_n, ..., a_2 at the (i+1)th stage. The link that
makes this connection is, by our previously stated rule,
(i; a_1, a_n, ..., a_2). The net effect is that a packet on link
(i; a_n, ..., a_1), on entering a switch in the (i+1)th stage, is
switched to output port b ∈ {0,1}, and leaves on the link
(i+1; b, a_n, ..., a_2). Each link is equipped with a queue at
its originating port, which for the purposes of this paper is of
unlimited capacity.


For each packet originating at source s,, 
destination d,,,...,d, and scattering ticket c,, . 


RAd,,.. 
Now consider a window of width n, i.e., it exposes n
consecutive numbers of R. The window therefore has n+r+1
positions. Number these positions 0 to n+r, where position 0
is rightmost and an increment in position corresponds to
sliding the window by one unit to the left. Fact: the link used
by the packet at the ith stage, 1 ≤ i ≤ n+r, has address
given by the values in R which are exposed by the window at
the ith position. This Fact is very useful in inferring the traffic
carried by a particular link.
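
The window Fact can be checked against the link-address rule of the indexing system with a small Python sketch. Two things here are assumptions of ours, since the definition of R is not fully legible in this copy: R is taken to be the destination bits d_n, ..., d_1, followed by the ticket bits c_r, ..., c_1, followed by the source bits s_n, ..., s_1; and the omega stages are taken to consume the destination bits starting from d_1. Addresses are held as integers, with a_1 as the least significant bit.

```python
from itertools import product

def route(source, dest, ticket, n):
    """Link addresses used at stages 1..n+r; ticket = [c_1, ..., c_r]."""
    # Bit selected at each stage: ticket bits first, then destination bits
    # (least significant first, by assumption).
    choices = list(ticket) + [(dest >> j) & 1 for j in range(n)]
    address, links = source, []
    for b in choices:
        # Link (i; a_n..a_1) becomes link (i+1; b, a_n..a_2).
        address = (b << (n - 1)) | (address >> 1)
        links.append(address)
    return links

def window(source, dest, ticket, n, position):
    """The n bits of R exposed by the window at the given position."""
    r = len(ticket)
    R = source
    for k, c in enumerate(ticket):
        R |= c << (n + k)
    R |= dest << (n + r)
    return (R >> position) & ((1 << n) - 1)

n, r = 3, 1
consistent = all(
    route(s, d, list(t), n) == [window(s, d, list(t), n, i) for i in range(1, n + r + 1)]
    for s in range(2 ** n) for d in range(2 ** n) for t in product((0, 1), repeat=r)
)
print(consistent)
```

Under these assumptions the last link address always equals the destination address, whatever the scattering ticket.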
The above is consistent with the following distributed
switching rule: while the packet is at the switch in the ith
stage, route it to the output port whose address has for its least
significant bit


3. TRAFFIC ANALYSIS: THE NON-RENEWAL CASE 


The probabilistic bound given here holds for partial h- 
relations [7], see Section 1. The assumption is made in this 
section that each link transmits one packet in unit time. 


The extended omega networks are not non-repeating [4]
and not non-overtaking [8]. That is, it is possible for two
packets to use a common link at stage i_1, different links at
stage i_2 and a common link at stage i_3, for some i_1, i_2, i_3
where 1 ≤ i_1 < i_2 < i_3. Note however that there is partial
non-overtaking. To see this, split the extended omega network
with (n+r) stages into two halves. From the Fact in
Section 2 it is easy to see that each isolated half is
non-overtaking.


Specifically, with

\[ \hat{\imath} \triangleq \lfloor (n+r)/2 \rfloor, \qquad (3.1) \]

each packet's transit through links in stages 1 to î is called its
Phase 1 and its transit from stage (î+1) to (n+r) is called its
Phase 2. The time to complete each phase by all packets in a
partial h-relation is, in turn, bounded below with the help of
the non-overtaking property in each phase.


Consider first Phase 1; Phase 2 is similar. The routes of 
two packets are said to intersect in Phase 1 if at least one link 
in stages 1 to i is common to both routes. Let X be a marked 
packet. The following facts are proven in [13] to hold 
regardless of the value of r: if for each distinct packet other 
than X we consider a corresponding trial in which success 
denotes that the packet intersects the route of X in Phase 1, 
then these trials are statistically independent; the expected 
number of packets which intersect X in Phase 1 is at most 
oh/2, where 


eta st ler} 


Note that in the fully extended omega network o = 7. Then 
by using standard techniques [4] we prove the following in 
[13]: 


Theorem 1: (i) For any A 2 oh/2, 


Pr(time to complete Phase 1 > (n+r)/2 + Al 


A 
ech —ah/2 
< Nh |— : 
aca eT. ad 
Pr | time to complete Phase 2 > c a +ht+aA 
a 
h _ 
Nh eo oh/2 
= 2A 


(ii) Asymptotics: for r= n—1, each phase in any partial h- 
relation is completed in time O(logN) with overwhelming 
probability. In particular, for any K > e, 


Pr[{time to complete each phase} > (n+r+1)/2+h+Khn] 


<h N7 ‘KA —1) 
For r = yn, ¥ fixed and less than unity, each phase in any 
partial h-relation is completed in time O|N“-Y”| with 
overwhelming probability. 


4. TRAFFIC ANALYSIS: THE RENEWAL CASE 


In the renewal context, at each source s_n, ..., s_1 the
packets to be delivered to destination d_n, ..., d_1 form a
stationary stochastic process with mean rate, or intensity,
λ(s_n, ..., s_1; d_n, ..., d_1). A particular case is Poisson
processes at the sources. The matrix of these traffic intensities
over all sources and destinations constitutes the
source-destination traffic specification. Unlike the treatment in
Section 3, here we allow for statistical fluctuations in the time
required by each link to transmit a packet, although special
consideration is given to the case of constant link transmission
time. We assume that when the link transmission times are
random, they are, for all links and packets, mutually
independent, independent of the traffic and picked from a
common distribution with mean 1.


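In a partial λ-relation,

\[ \sum_{d_n,\ldots,d_1} \lambda(s_n,\ldots,s_1;\, d_n,\ldots,d_1) \le \lambda \quad \forall\, s_n,\ldots,s_1, \]
\[ \sum_{s_n,\ldots,s_1} \lambda(s_n,\ldots,s_1;\, d_n,\ldots,d_1) \le \lambda \quad \forall\, d_n,\ldots,d_1. \]

In full λ-relations, the above hold with equality. Hereafter, in
this section, we exclusively consider full λ-relations since the
results on delay statistics for them are obvious bounds for the
partial λ-relations.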


4.1 Link Traffic Intensities 


Here t(i; a_n, ..., a_1) denotes the traffic intensity, i.e., the
mean number of carried packets per unit time, on link
(i; a_n, ..., a_1). Also, T(λ, n, r) denotes the extremal traffic
intensity over all links and all traffic specifications satisfying
the full λ-relations, i.e.,

\[ T(\lambda, n, r) \triangleq \max_{\text{full } \lambda\text{-relations}} \;
   \max_{i:\, 1 \le i \le n+r} \; \max_{a_n,\ldots,a_1} \; t(i;\, a_n,\ldots,a_1). \]


We say that the traffic is symmetric if the traffic intensity is 
identical in all links of the network. 


We will find useful the notion of traffic class. Each traffic
class is indexed by source, destination and scattering ticket.
Each traffic class has an associated traffic intensity and, quite
obviously, the traffic intensity of the class indexed by
$(s_1,\ldots,s_n; d_1,\ldots,d_n; c_1,\ldots,c_r)$ is
$\lambda(s_1,\ldots,s_n; d_1,\ldots,d_n)/2^r$.


The traffic intensity $t(i; a_1,\ldots,a_n)$ is obtained by
summing the traffic intensity of all classes with routes which
include the link $(i; a_1,\ldots,a_n)$. The Fact in Section 2 may
be used to calculate this quantity.


The following theorem and its corollary summarize
our results on the extremal link traffic intensities, both as seen
by a packet following a specific route and as seen by an
independent observer of traffic in the extended omega network.
The proof is in [13].


Theorem 2: (i) For full $\lambda$-relations and for all binary $n$-tuples
$(a_1,\ldots,a_n)$,

$$t(i; a_1,\ldots,a_n) = \lambda, \qquad 1 \le i \le r \ \text{ and } \ n \le i \le n+r. \tag{4.1}$$

If $r = n-1$ then the above specifies the traffic intensities on all
links of the extended omega network. If $r < n-1$,

$$t(i; a_1,\ldots,a_n) \le \lambda\, 2^{\min(i-r,\; n-i)}, \qquad r+1 \le i \le n-1. \tag{4.2}$$

(ii) If $r < n-1$ then there are source-destination traffic
specifications satisfying the full $\lambda$-relations such that the traffic
intensity on each of the links used in the route followed by a
particular class is given by (4.1) and (4.2), with equality
holding in (4.2).


Corollary: (i)

$$T(\lambda,n,r) = \lambda\, 2^{\lfloor (n-r)/2 \rfloor}.$$

(ii) The traffic is symmetric for all source-destination traffic
specifications satisfying the full $\lambda$-relations if and only if
$r = n-1$.


In [13] we construct source-destination traffic specifications 
for which the traffic intensity on links is as bad as asserted in 
statement (ii) of the Theorem. The point of the construction is 
to show that they are not rare. 


Example: Consider the extended omega network in Figure 1
in which $n=3$, $r=1$. Suppose each source transmits at the
average rate of $\lambda$ packets per unit time, and, on the average,
source $3, 55,5, sends a fraction a(s;,s,) of its packets to 
destination s,, 5, 53 and the remainder to s,, 55, 53, where the 
bar denotes bit reversal. The numbers a(0,0), a(0,1), a(1,0), 
a(1,1) are arbitrary in [0,1]. 


Note that each destination receives packets at the average
rate of $\lambda$ packets per unit time. The resulting traffic intensities
on the internal links of the network are as shown on the links
in Figure 1. The links of stage 2 either have traffic intensity
$2\lambda$ or carry no traffic. For general $n$ and $r$, with $(n+r)$ even,
there are many source-destination traffic specifications for
which $2^{(n+r)/2}$ links at stage $(n+r)/2$ carry traffic of intensity
$\lambda 2^{(n-r)/2}$ and the remaining links of the stage carry no traffic
at all.


4.2 Stability With Respect to Full \-Relations 


The stability condition of interest here is that at each link
the traffic intensity is less than the mean link transmission time
of packets. Our primary interest is in stability with respect to
full $\lambda$-relations, i.e. stability for all source-destination traffic
specifications which satisfy the full $\lambda$-relation. From the
Corollary to Theorem 2, the condition for stability with respect
to full $\lambda$-relations is

$$\lambda\, 2^{\lfloor (n-r)/2 \rfloor} < 1.$$


Figure 1: An extended omega network with $n=3$, $r=1$. The path from source
101 to destination 001 of packets with scattering ticket $c_1=1$
is shown by chained line, and of packets with $c_1=0$ by
dashed line. The inscriptions on links are traffic intensities
for Example in Section 4.1.


Figure 2: Extended omega network, $n=10$ and various values of $r$.
Poisson processes at sources, exponentially distributed
link transmission time.

For the omega network $r = 0$, and the stability condition is
$\lambda < 1/\sqrt{N}$ if $\log_2 N$ is even, and $\lambda < \sqrt{2/N}$ if $\log_2 N$ is odd.
For the fully extended omega network $r = n-1$, and the
stability condition is $\lambda < 1$.
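
The bounds (4.1) and (4.2), the Corollary, and the stability condition can be checked numerically. The following is a minimal Python sketch, assuming the reading of Theorem 2 in which the per-stage bound is $\lambda$ for stages $1 \le i \le r$ and $n \le i \le n+r$ and $\lambda 2^{\min(i-r, n-i)}$ otherwise; the function names are illustrative and do not come from the paper.

```python
def link_intensity_bound(lam, n, r, i):
    """Worst-case traffic intensity on a stage-i link (assumed reading of (4.1)-(4.2))."""
    if not (0 <= r <= n - 1 and 1 <= i <= n + r):
        raise ValueError("need 0 <= r <= n-1 and 1 <= i <= n+r")
    if i <= r or i >= n:
        return lam                            # (4.1): first r and last r+1 stages
    return lam * 2 ** min(i - r, n - i)       # (4.2): middle stages


def extremal_intensity(lam, n, r):
    """T(lam, n, r) from the Corollary: lam * 2**floor((n-r)/2)."""
    return lam * 2 ** ((n - r) // 2)


def is_stable(lam, n, r):
    """Stability with respect to full lambda-relations (unit mean transmission time)."""
    return extremal_intensity(lam, n, r) < 1
```

For example, with $n=10$ the plain omega network ($r=0$) requires $\lambda < 1/32$, whereas the fully extended network ($r=9$) only requires $\lambda < 1$.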


4.3 Mean Delay for Various Degrees of Extensions of the 
Omega Network. 


Here we make the restrictive assumption that the packets 
at the sources form Poisson processes and that the link 
transmission times are random variables with exponential 
distributions. The exponential has the advantage that exact 
delay statistics are simple to obtain. The well-known result 
which makes the problem tractable is that, in equilibrium, the 
joint distribution of packets of all classes and in all links is of 
the same form as if the arrival processes to the links are 
Poisson. Therefore, 


$$W(i; a_1,\ldots,a_n) = \text{mean queueing delay at link } (i; a_1,\ldots,a_n) = \frac{t(i; a_1,\ldots,a_n)}{1 - t(i; a_1,\ldots,a_n)}, \tag{4.3}$$


where $t(\cdot)$ is the traffic intensity on the link.
For each class, let 


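$$D(\text{class}) \triangleq \text{mean time from source to destination for class packets} = \sum_{\text{link} \,\in\, \text{class route}} \{W(\text{link}) + 1\}.$$

Finally, let

$$D \triangleq \max_{\text{full } \lambda\text{-relations}} \; \max_{\text{class}} D(\text{class}).$$
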
From statement (ii) of Theorem 2 we obtain 


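Proposition: With randomized routing in the extended omega
network with $n+r$ stages, $1 \le r \le n-1$,

$$D = \begin{cases}
\dfrac{2r+1}{1-\lambda} + 2\displaystyle\sum_{i=1}^{(n-r-1)/2} \dfrac{1}{1-\lambda 2^{i}}, & \text{if } (n+r) \text{ is odd}, \\[2ex]
\dfrac{2r+1}{1-\lambda} + \dfrac{1}{1-\lambda 2^{(n-r)/2}} + 2\displaystyle\sum_{i=1}^{(n-r-2)/2} \dfrac{1}{1-\lambda 2^{i}}, & \text{if } (n+r) \text{ is even}.
\end{cases}$$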

Figure 2 plots $D$ against $\lambda$ for various values of $r$ when
$n=10$. This figure indicates that, unless $\lambda$ can be a priori
restricted to be quite small, the worst-case analysis of this
paper argues in favour of maximum randomization.
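
The worst-case delay of the Proposition can also be evaluated by summing $W + 1$ stage by stage along the worst route, which avoids the odd/even case split. The sketch below is an illustration only: it assumes the per-stage intensities of Theorem 2 with equality in (4.2) and the M/M/1 delay (4.3), and the function name is hypothetical.

```python
def worst_case_mean_delay(lam, n, r):
    """Sum of {W(link) + 1} over the n+r stages of a worst-case route."""
    if not 1 <= r <= n - 1:
        raise ValueError("need 1 <= r <= n-1")
    total = 0.0
    for i in range(1, n + r + 1):
        if i <= r or i >= n:
            t = lam                               # (4.1)
        else:
            t = lam * 2 ** min(i - r, n - i)      # (4.2) with equality
        if t >= 1:
            raise ValueError("unstable: link intensity >= 1 at stage %d" % i)
        total += t / (1 - t) + 1                  # queueing delay (4.3) plus unit transmission
    return total
```

Since $t/(1-t) + 1 = 1/(1-t)$, this per-stage sum agrees with the closed form of the Proposition; for $r = n-1$ it reduces to $(2n-1)/(1-\lambda)$.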


4.4 Approximate Analysis for General Renewal Sources and 
General Transmission Time Distributions in the Fully 
Extended Omega Network. 


Here we depart from the preceding treatment of the 
renewal case by assuming that the packets at the sources form 
general stationary renewal processes, and the distribution of 
the link transmission time to be general. Poisson sources and 
constant link transmission time are, in particular, allowed. 
However, the analysis is approximate. The program that we 
follow is based on a proposal of Kuehn [9]. The program 
decomposes the network into GI/G/1 queues and relates the 
descriptors of the processes associated with each queue by 
using prior results [10,11] on GI/G/1 queues. The program 
undertakes (i) the approximation of the departure process from 
each link queue as a stationary renewal process, and (ii) the 
description of each stationary renewal process by two moments, 
the mean i.e. traffic intensity, and the coefficient of variation of 
the underlying interarrival distribution. In particular, the 
distribution of the link transmission time is specified by its 
mean, which is 1, and its coefficient of variation, 


$$c_s^2 \triangleq \text{variance of link transmission time}.$$


For constant transmission time $c_s = 0$ and for exponential
distributions $c_s = 1$. Also, the stationary renewal processes at
the sources are described by their mean $\lambda$, and the common
coefficient of variation $c_a$,

$$c_a^2 \triangleq \lambda^2 \times (\text{variance of interarrival time of packets at source}).$$


For Poisson sources $c_a = 1$. We mention that a large body of
validations and corroborations with analytic solutions and
simulations has been reported [9-12].


We only consider the fully extended omega network, i.e.
$r = n-1$; extensions to $r < n-1$ are easy. In this case traffic
is symmetric and we are able to obtain simple, explicit
relations, in the form of recursions, for all traffic descriptors in
the network. Specifically: (i) the departing process from a
link queue in stage $i$ is approximated by a stationary renewal
process with coefficient of variation $c_{d,i}$; (ii) this process is split
into two stationary renewal processes with means $\lambda/2$ and
coefficient of variation $c_{i,i+1}$; (iii) two processes described by
$(\lambda/2, c_{i-1,i})$ are superimposed to form the arrival process
described by $(\lambda, c_{a,i})$ at a link queue in stage $i$.


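We now give three relations to correspond to the above.
From [10] we have the approximation

$$c_{d,i}^2 = c_{a,i}^2 + 2\lambda^2 c_s^2 - \lambda^2 (c_{a,i}^2 + c_s^2)\, g(\lambda, c_{a,i}^2, c_s^2), \qquad i \ge 1, \tag{4.4}$$

where

$$g(\lambda, c_{a,i}^2, c_s^2) = \exp\left[ -\frac{2(1-\lambda)}{3\lambda} \, \frac{(1 - c_{a,i}^2)^2}{c_{a,i}^2 + c_s^2} \right] \quad \text{if } c_{a,i} \le 1,$$

$$g(\lambda, c_{a,i}^2, c_s^2) = \exp\left[ -(1-\lambda) \, \frac{c_{a,i}^2 - 1}{c_{a,i}^2 + 4 c_s^2} \right] \quad \text{if } c_{a,i} > 1.$$

Decomposition, or splitting, of renewal processes yields [9],

$$c_{i-1,i}^2 - 1 = \tfrac{1}{2}\,(c_{d,i-1}^2 - 1), \qquad i \ge 2. \tag{4.5}$$

Finally, for superposition we use the following approximation
[12],

$$c_{a,i}^2 - 1 = \frac{c_{i-1,i}^2 - 1}{1 + 4(1-\lambda)^2}, \qquad i \ge 2. \tag{4.6,i}$$

Also, there is the initial condition,

$$c_{a,i}^2 = c_a^2, \qquad i = 1. \tag{4.6,ii}$$


Equations (4.4)-(4.6) define a complete recursive system which
yields $\{c_{a,i}, c_{d,i}, c_{i,i+1} \mid i \ge 1\}$. Note that the dimension of the
network given by $n$ does not appear in the system.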


The mean queueing delay, $W_i$, for a link queue at stage $i$
(queueing delay excludes transmission time) is given by the
following exact relation for GI/G/1 [11]:

$$W_i = \frac{c_{a,i}^2 + 2\lambda^2 c_s^2 - c_{d,i}^2}{2\lambda(1-\lambda)}, \qquad i \ge 1. \tag{4.7}$$


The above together with Little's Law gives the mean queue
lengths at each link. Finally, the mean time from source to
destination, $D$, is given by

$$D = (2n-1) + \sum_{i=1}^{2n-1} W_i. \tag{4.8}$$


The first term is from the transmission time and the second is 
from the cumulative queueing delay. Note that (4.8) is the 
only point in the procedure where n is used. 
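
The recursion (4.4)-(4.8) is short enough to sketch in code. The Python below is an illustrative reading of those relations, not the authors' program: it assumes the departure approximation (4.4) with the two-branch function $g$, splitting with probability 1/2 as in (4.5), and the superposition weight $1/(1+4(1-\lambda)^2)$ in (4.6,i). All function and variable names are hypothetical.

```python
import math


def g(lam, ca2, cs2):
    """Correction factor in the departure approximation (4.4), as read here."""
    if ca2 <= 1.0:
        denom = ca2 + cs2
        if denom == 0.0:
            return 0.0  # deterministic arrivals and service: limiting value
        return math.exp(-2.0 * (1.0 - lam) * (1.0 - ca2) ** 2 / (3.0 * lam * denom))
    return math.exp(-(1.0 - lam) * (ca2 - 1.0) / (ca2 + 4.0 * cs2))


def stage_delays(lam, ca2, cs2, stages):
    """Mean queueing delays W_1..W_stages in the fully extended network."""
    if not 0.0 < lam < 1.0:
        raise ValueError("need 0 < lam < 1")
    delays = []
    ca2_i = ca2                                                    # (4.6,ii)
    for _ in range(stages):
        cd2_i = (ca2_i + 2.0 * lam ** 2 * cs2
                 - lam ** 2 * (ca2_i + cs2) * g(lam, ca2_i, cs2))  # (4.4)
        delays.append((ca2_i + 2.0 * lam ** 2 * cs2 - cd2_i)
                      / (2.0 * lam * (1.0 - lam)))                 # (4.7)
        split2 = 1.0 + 0.5 * (cd2_i - 1.0)                         # (4.5)
        ca2_i = 1.0 + (split2 - 1.0) / (1.0 + 4.0 * (1.0 - lam) ** 2)  # (4.6,i)
    return delays


def mean_source_to_destination_time(lam, ca2, cs2, n):
    """D from (4.8) for r = n-1: transmission time plus cumulative queueing delay."""
    return (2 * n - 1) + sum(stage_delays(lam, ca2, cs2, 2 * n - 1))
```

With Poisson sources and constant transmission time (`ca2 = 1`, `cs2 = 0`) the first-stage value reduces to $\lambda/(2(1-\lambda))$, the M/D/1 result, and with `ca2 = cs2 = 1` every stage gives $\lambda/(1-\lambda)$, which is a useful sanity check on this reading of the equations.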


Tables 1 and 2 give numerical values for the case of
Poisson sources and constant link transmission time. The
important observation is that for each $\lambda$, the queueing delay
shrinks at the early stages before rather quickly reaching an
asymptotic value. For any fixed, positive $\lambda$, $c_a^2$ and $c_s^2$, the
asymptotic values of $c_{a,i}^2$, $c_{d,i}^2$ and $c_{i,i+1}^2$, as $i \to \infty$, are
obtained simply as fixed points of the recursions in (4.4)-(4.6).
These may be substituted in (4.7) to obtain the asymptotic
value $W(\lambda, c_a^2, c_s^2)$, where $W_i \to W(\lambda, c_a^2, c_s^2)$. From (4.8),
as $n \to \infty$, we obtain the noteworthy relation

$$D \sim 2\{1 + W(\lambda, c_a^2, c_s^2)\}\, n.$$


The last row of Table 1 gives the computed values of
$W(\lambda, 1, 0)$ for various values of $\lambda$.


Mean Queueing Delay ($W_i$)


Stage i, i > 4 | 0.124


Table 1: For $r = n-1$, and full $\lambda$-relations; Poisson processes
at sources and unit time for transmission across each
link. Calculated from (4.4)-(4.7).


Mean Time from Source to Destination ($D$)


Table 2: Specification as in Table 1. Calculated from (4.4)-(4.8).


The procedure based on (4.4)-(4.8) with $c_a = c_s = 1$ gives
correct values for Poisson sources and exponentially distributed
link transmission time. In this case, $W = W_i = \lambda/(1-\lambda)$.
Also, the numerical results for $c_s$ in the range [0,1] are
bounded by the results for $c_s = 0$ and $c_s = 1$ in the case of
Poisson sources.


4.4.1 Accuracy of Approximation, Simulations 


Comparison with simulations has shown that for constant
link transmission time and light traffic, i.e. small $\lambda$, the
decomposition overestimates the queueing delay in all stages
beyond the first. On the other hand, the approximation is good
in heavy traffic. The net effect of the inaccuracies on the
mean time from source to destination in the fully extended
omega network is small. This is, of course, because in light
traffic this time is dominated by the transmission time. Also as
expected, the largest errors occur at about $\lambda = 0.6$. Generally,
the errors are smaller as we depart from constant link
transmission time and add more variability to it.


As previously observed, the procedure in (4.4)-(4.8) yields 
mean queueing delays which converge quickly, with increasing 
stages, to asymptotic values. This important fact may be used 
to simplify simulations. Thus, an omega network of small 
dimension, say with four stages, may be simulated, and the 
mean delay observed for the final stage may be taken to apply 
to all subsequent stages in calculations for large extended 
omega networks. We have undertaken the program just 
outlined. The results of the simulation of the four-stage omega 
network for Poisson sources and unit link transmission time are 
given in Table 3. The results of the extrapolation to large, 
fully extended omega networks are given in Table 4, which 
should be considered a refinement of Table 2. 


Mean Queueing Delay


Table 3: From simulations of omega network with 4 stages
for Poisson packet traffic at sources and unit time
for transmission across each link.


Mean Time from Source to Destination


Table 4: Extrapolated from data in Table 3. Compare with
Table 2.


A Comparison of Two Synchronization Primitives in an
Operating System for Parallel Processing Applications


Abstract -- It has been claimed [REED79] that the
synchronization primitives termed eventcounts and
sequencers may be used to implement semaphores,
and thus that they are "primitives at a lower level
than semaphores". However, by expressing the
functionality of conventional semaphores, and that of
semaphores implemented via eventcounts and
sequencers, via a common set of primitives, it may be
shown that in fact the eventcount/sequencer pair
imposes restrictions on process synchronization that
do not exist with the traditional semaphore. Two
such restrictions exist. First, the
eventcount/sequencer pair imposes an ordering on
the activation of suspended processes whereas
semaphores activate the processes nondeterministically.
Second, it is difficult to implement a "conditional P"
operation with eventcounts and sequencers because
one cannot return a ticket obtained from a sequencer
to the ticket pool: a ticket, once obtained, must be
used. Thus, the claim that eventcounts and sequencers
are lower-level primitives than the traditional
semaphore is contradicted by this analysis.


Introduction 


D. P. Reed and R. K. Kanodia, in their 1979 paper
"Synchronization with Eventcounts and Sequencers"
[REED79], present a new process synchronization
method with two primitives, advance and await. A
process may await(E,V) an eventcount E's reaching
a certain value V; the eventcount is made to reach
this value eventually by other processes performing
an advance(E). A separate ticket(S) primitive is
defined to give unique, consecutive, increasing integer
values on each invocation.


The authors observe that eventcounts and sequencers
are "primitives at a lower level than semaphores,"
and that semaphores can be built out of them as a
result. They also demonstrate that some more
powerful operations on semaphores can be built.


1. A third eventcount primitive, read(E), returns the current 
value of the eventcount. 


0190-3918/86/0000/0231 $01.00 © 1986 IEEE 

Concurrent Computer Corporation is presently
involved in research and development on primitives
to be used in synchronizing OS-related operations on
general-purpose, symmetrical, parallel processors. In
investigating the synchronization primitives available
in the literature, we chose Eventcounts and
Sequencers as a synchronization primitive that would
be compatible with our planned OS and hardware
architecture. During the course of the
implementation, however, it was found that problems
existed which would not have occurred if
conventional semaphores had been used, particularly
in situations in which entry to a critical section was 
desired only if such entry would not cause suspension 
of the process. These situations tend to be common 
in real-time applications, including our application of 
I/O with a goal of maximum parallelism in tightly- 
coupled multiprocessors. In such an application, 
precise process synchronization is required, but 
delaying of a process which cannot enter a critical 
section is not desirable: the process often is able to 
perform some other task until the critical section 
becomes available. Certain deadlock avoidance 
algorithms also needed to function in this way. 


As a result of the apparent limitations in
eventcounts and sequencers, the mechanisms
underlying these primitives were examined using
methods previously employed in investigating
interprocess communication primitives in
multiprocessors [ROSK84]. This paper demonstrates
that the implementation of semaphores given in
[REED79] does not provide the flexibility of Dijkstra's
original primitives [DIJK68] due to the combining of
two required primitives, those for token-oriented
synchronization and sequencing, into the one ticket
primitive. This is done by expressing both in terms
of comparable primitive operations; the differences
then become apparent.


Process Sequencing/Suspending Primitives


We first define two primitives which may be used to
effect the process suspension and resumption needed
in order to cause process synchronization. There is
one suspension primitive, which stops a process and
assigns it a specified integer "tag," and one
resumption primitive, which selectively resumes all
processes having a given tag.


stop(p,L).
Let stop(p,L) denote an operation which
places the process executing it on a list L,
assigning an associated integer "process group
tag" p. Execution of the process is then
suspended. The process group tag is simply
an integer which may be used to selectively
resume the process or group of processes which
have been assigned that tag.


start(L,p).
If the set of processes in L with tag p is
nonempty, the processes with tag p are
removed from L. Execution of the processes
thus removed is resumed.


Token-Oriented Primitives 


We next define a set of token-oriented primitives used
to arbitrate access to critical sections of code.


Let T be a set of n tokens, $t_i$, $1 \le i \le n$.


g(T).
Let g(T) denote a get operation defined as follows.
If $T \ne \emptyset$, g(T) returns an element $t_i \in T$,
with $i$ arbitrary; and indivisibly sets $T \leftarrow T - \{t_i\}$.
If $T = \emptyset$, g(T) returns the null value φ.

p(T,t).
Let p(T,t) denote a put operation which sets
$T \leftarrow T \cup \{t\}$.

ticket(T).
Let the number of tokens in a set T be countably
infinite; and let name(t) denote an
integer which may be arbitrarily assigned to a
token $t \in T$. Then let ticket(T) return g(T),
such that if on a given invocation of
ticket(T), name(ticket(T)) = i, then the next
consecutive invocation yields name(ticket(T))
= i+1; and on the initial invocation,
name(ticket(T)) = 0.


Comparison of Semaphores with 
Eventcounts/Sequencers 


We can now compare Dijkstra's original semaphore
operations [DIJK68] with the implementation of
semaphores described by Reed and Kanodia.


Dijkstra's semaphore operations may be defined in
terms of the above primitives as follows. Assume for
simplicity that t is a global variable in the calling
process' private address space, and thus is part of
the program state of the process invoking the
synchronization primitive.


As is the case for most semaphore implementations,
the sequences of primitive operations comprising the
P and V operations are themselves critical sections:
wherever a process has a copy of the state of some
outside object (a pool of tokens, or a counter) in a
local variable, the actual state of the object cannot
be changed as long as the process will in the future
act upon the copy of the state which it has in its
possession. This is an instance of the implicit copy
operation that occurs when a conventional memory is
read, as discussed in [ROSK84].


P(S):
    repeat
        t := g(S.tokens);
        if t=φ then stop(0,S.queue);
    until t<>φ;

V(S):
    p(S.tokens,t);
    start(S.queue,0);


In the above definition, a V causes all waiting
processes to briefly awaken and attempt to get a
token; but only one will succeed for each execution of
V, so the others remain blocked. An alternative
(and more commonly seen) implementation would be
to awaken only one arbitrary suspended process for
each execution of V; but the synchronization results
are equivalent, and the latter approach complicates
comparison with the eventcount-based semaphores.


The eventcount/sequencer semaphore
implementation as presented by Reed and Kanodia
may likewise be defined in terms of the above
primitives as follows. Assume that prior to the first
V, S.count is initialized to zero.


P_ES(S):
    p := ticket(S.sequencer);
    if S.count < name(p) then
        stop(name(p),S.queue);

V_ES(S):
    S.count := S.count+1;
    start(S.queue,S.count);
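
The ordering imposed by this construction can be seen in a small executable model. The following Python sketch is an assumption-laden illustration, not the implementation of [REED79]: the class names `Sequencer`, `EventCount` and `ESSemaphore`, the method `await_value`, and the helper `demo_fifo` are all hypothetical, and `threading.Condition` stands in for the stop/start primitives.

```python
import itertools
import threading


class Sequencer:
    """Hands out unique, consecutive, increasing integers starting at 0."""

    def __init__(self):
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def ticket(self):
        with self._lock:
            return next(self._counter)


class EventCount:
    """Monotonic counter with read, advance and await."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def read(self):
        with self._cond:
            return self._value

    def advance(self):
        with self._cond:
            self._value += 1
            self._cond.notify_all()

    def await_value(self, v):
        # Block until the eventcount has reached at least v.
        with self._cond:
            self._cond.wait_for(lambda: self._value >= v)


class ESSemaphore:
    """Semaphore built from one sequencer and one eventcount."""

    def __init__(self):
        self.sequencer = Sequencer()
        self.count = EventCount()

    def P(self):
        t = self.sequencer.ticket()   # the ticket fixes this caller's place in line
        self.count.await_value(t)
        return t

    def V(self):
        self.count.advance()


def demo_fifo(workers=3):
    """Waiters are released strictly in ticket order, whatever order they start in."""
    sem = ESSemaphore()
    sem.P()                                   # ticket 0 enters at once
    order = []

    def waiter(t):
        sem.count.await_value(t)
        order.append(t)
        sem.V()

    tickets = [sem.sequencer.ticket() for _ in range(workers)]
    threads = [threading.Thread(target=waiter, args=(t,)) for t in reversed(tickets)]
    for th in threads:
        th.start()
    sem.V()                                   # holder of ticket 0 leaves
    for th in threads:
        th.join()
    return order
```

In `demo_fifo` the tickets are drawn in the main thread and the waiting threads are started in reverse order, yet each waiter can only proceed once the eventcount reaches its own ticket value, so the release order is fixed by the tickets alone.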


Clearly, these are not equivalent semaphore operations.
Dijkstra's semaphore primitives do not employ
the priority queueing portion of the stop primitive,
and arbitrate the stop/proceed decision using
anonymous tokens which are returned to a common
pool of tokens after use. The eventcount/sequencer
implementation does employ priority queueing, such
that the processes, if required to stop, are started in
the order in which they were stopped; and it employs
tokens which are labelled by a sequence number and
discarded after use. Furthermore, the actual usage
of the tokens differs: in Dijkstra's semaphore,
possession of the token indicates the right to proceed;
whereas with eventcounts and sequencers, the token
merely provides an associated integer value which
the process must wait for the eventcount to reach.
This latter is fundamental to the distinction between
the two kinds of semaphore.


2. This may be generalized by making t a variable denoting a
set, with the assignment to t adding a member to the set,
and the reference in p(S.tokens,t) removing one member
from this set.


Except for the ordering imposed on processes by the
latter version of P and V, it might nevertheless be
argued that the two are equivalent, were it not that
a third semaphore operation, the conditional P, is
possible on Dijkstra's semaphore which is not possible
without great difficulty on the
eventcount/sequencer pair. The conditional P may
be defined as follows, when used with Dijkstra's
semaphore.


CP(S):
    t := g(S.tokens);
    if t=φ then return FALSE
    else return TRUE;


The conditional P is equivalent to Dijkstra's P,
except that it does not wait if no token is available.
It is useful in deadlock avoidance; see [BACH84]
for an example of its use in a large-scale
operating system.
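
The token-pool view of P, V and the conditional P can be sketched directly. The Python below is a minimal illustration under stated assumptions: the class name `TokenSemaphore` is hypothetical, tokens are modelled as the integers 0..n-1, a `threading.Condition` plays the role of stop and start on the queue, and `CP` returns the token (or `None`) rather than TRUE/FALSE so that the caller can hand it back to `V`.

```python
import threading


class TokenSemaphore:
    """Dijkstra-style semaphore over a pool of anonymous tokens."""

    def __init__(self, n):
        self._tokens = set(range(n))          # the token set T
        self._cond = threading.Condition()

    def _get(self):
        # g(T): remove and return an arbitrary token, or None if T is empty.
        # The caller must hold the condition's lock.
        return self._tokens.pop() if self._tokens else None

    def P(self):
        # Repeat g(T) until a token is obtained, sleeping in between.
        with self._cond:
            while True:
                t = self._get()
                if t is not None:
                    return t
                self._cond.wait()

    def CP(self):
        # Conditional P: one attempt, never suspends the caller.
        with self._cond:
            return self._get()

    def V(self, t):
        # p(T,t) followed by waking every waiter so each can retry g(T).
        with self._cond:
            self._tokens.add(t)
            self._cond.notify_all()
```

A caller that receives `None` from `CP` simply goes on to other work; nothing has been consumed, which is exactly the property that a ticket drawn from a sequencer lacks.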


In the eventcount/sequencer implementation, the
"token" used to arbitrate access is inseparably
joined with the sequencing mechanism built in to the
ticket primitive. This mechanism causes the priority
used in the call to the above stop and start
primitives to be inseparably joined with the tokens
used to arbitrate the right to proceed.


Consequently, once the ticket is obtained in the process
of determining the right to proceed, it is impossible
to simply return the ticket to the sequencer;
another process may have performed another ticket
operation in the interim. On the other hand, it cannot
be discarded, since it is the responsibility of the
process taking a ticket to advance the eventcount
past the value on the ticket, to allow the next process
to proceed.


Such problems do not exist for CP when used with
Dijkstra's semaphore, since the nature of the
anonymous tokens is such that there is a limited supply,
and their absence indicates the need to wait, as
compared to the infinite supply of tickets, whose
values in relation to an associated eventcount indicate
the need to wait (and impose an ordering on the
waiting immediately upon taking a ticket).


The distinction between the two definitions of P and
V, while subtle, leads to the above difficulty in backing
out of a series of P operations in order to avoid a
deadlock. While the authors do provide a simultaneous
P operation, it implies use of the strategy of
deadlock avoidance by locking all resources prior to
using any of them. The relative merits of various
deadlock avoidance strategies are an issue beyond
the scope of this paper; yet it should be borne in
mind that strategies other than that of locking all
required resources before using any of them may be
considered necessary in certain applications.


Conclusion 


The above decomposition demonstrates that fundamental
differences exist between Dijkstra's P and V,
and Reed and Kanodia's implementation of the same
primitives. It further suggests that the sequencer
primitives differ from those required to implement P
and V in that they combine two underlying primitives:
the token primitive presented above, and a
sequencing primitive. As such, the P and V
presented in [REED79] could more accurately be
considered sequencing semaphores, since they impose a
sequencing on the waiting processes. They also make
implementation of the conditional P operation less
straightforward, due to the inability to put back a
sequenced token without using it.


ABSTRACT-- The design of an operating
system kernel for Star is presented. Star is a
reconfigurable multiprocessor system based upon
a multistage interconnection network. The kernel
is intended to serve as a basis for further
operating system research. The kernel provides
mechanisms for process management, local
resource allocation, and inter-process
communication.

A key design goal of the kernel was to
minimize loss of transparency in the underlying
hardware. This led to the development of a novel
communication system to support the broadcast
and merging capabilities of the interconnection
network. The communication system presented
provides for asynchronous multicast
communication between processes which
simplifies the tasks of creating distributed
applications and of synchronizing cooperating
processes.

The current status of our system is that we
have a four node machine running and have
implemented the kernel in MODULA-2.


1 Introduction 


In this paper we present the design of the Star operating 
system kernel. This kernel is replicated at each processor 
node of Star, a reconfigurable multiprocessor system [1]. 
The kernel represents a common core of software which will 
be used to investigate the problems of resource management 
and processor scheduling. The kernel provides the low level 
primitives to interface the operating system with the hardware 
and it provides the primitives for process management, 
interprocess communication (including inter-node 
communication) and memory management. 


This work is motivated by a desire to develop an 
operating system which can fully exploit the inherent power 
of reconfigurable computer systems. Achieving this goal 
will require further research in the areas of resource 
management and scheduling. Our kernel design is a step 
towards this goal and has the following important 
characteristics: 


• Loss of transparency in the hardware is
minimized; by loss of transparency we mean
any operation that is feasible in the underlying
hardware but is not feasible using the primitives
provided by the kernel.


• All decisions concerning resource management,
program models, and scheduling are left to
higher levels of the operating system.


• The kernel provides a convenient basis upon
which a complete operating system can be built
and is designed to simplify this task.


*This work is supported in part by a grant from IBM
Corporation.


There is a substantial body of previous work in
multiprocessor operating systems including: Hydra [2],
Medusa [3], Eden [4], Micros [5] and many others. The
architectures which these operating systems support vary
widely. Star is different from most machines because of its
reconfigurable topology; however, there are other
reconfigurable systems including: TRAC [6], Cedar [7],
PASM [8], and RP3 [9]. All of these reconfigurable systems
feature shared memory while Star does not.


A common trend in multiprocessor operating systems is
to design the operating system to hide the underlying
communication system. Designs like HPC [10], Eden, and
Micros ignore topological issues altogether. In the Star
system we are trying to address the topology issue by
tailoring our operating system to obtain maximum
performance given the very real constraints of a
communication network.


Our work was influenced by a desire to ultimately
provide the kind of support for distributed programs that
Eden and HPC provide; however, while these projects were
principally concerned with programmability, we are principally
concerned with developing a system which will efficiently
utilize the interconnection network.


Because many of the design decisions of the overall
operating system are topics for future investigation, it was
essential that the kernel design be flexible, simple and easily
modified. The kernel has been written entirely in
MODULA-2 [11]. MODULA-2 is a convenient language for
operating system design because it provides the mechanisms
for modular design (Modules), the convenience of a higher
level language, and reasonable access to the system
hardware. MODULA-2 separates the definition of software
from the implementation. The implementation of a software
module can be modified at will without affecting other parts
of the design. By implementing the kernel software in
MODULA-2 we have been able to quickly prototype the
system with confidence that the design can easily be refined.


Our experimental system has a limited number of
processors; however, we are interested in developing
application software which could be used on machines with
large numbers of processors. One way of facilitating this
goal is to assume that application software is structured as a
number of communicating processes. Because there may not
be enough processors available to allocate a separate
processor to each process of an application, the kernel
supports the scheduling of multiple processes at a single
processor. In addition, we believe that the writer of an
application program should not have to be concerned with
how the processes of that application might be scheduled for
execution.


Because the processes of an application program might 
or might not share processors, our kernel provides the 
application programmer with a global communication model 
which, to communicating processes, appears the same 
whether the processes are at the same or different nodes. 


The Star kernel is level structured and consists of three 
basic levels [12]: 


LEVEL 1 Process Manager 
LEVEL 2 Memory Manager 
LEVEL 3 Global Communication


The process manager provides support for multiple 
processes and the primitives necessary for sharing of local 
resources among processes which are scheduled at that node. 
The memory manager controls an important shared resource, 
local memory, and uses the primitives of the process 
manager. The global communication layer provides the 
mechanisms necessary for communication among processes 
which are not scheduled at the same node. In order to 
provide a consistent view of the system to application 
programs, the global communication layer also permits 
communication among locally scheduled processes. 


In this paper we will examine levels 1 and 3. Level 2,
the memory manager, provides facilities for managing the
storage of a processor node and is not novel. The design
decisions which led to this level decomposition are
considered in later sections of the paper. In designing the
kernel we followed one basic principle: design decisions
were delayed as long as was feasible. Each layer provides
only a simple, basic set of operations, while more specific
operations were postponed, wherever possible, to higher
levels of the operating system.


The paper is organized as follows. In Section 2 we present 
a description of the target system, called Star. In Section 3 
we present the process manager and in Section 4 the global 
communication system. In Section 5 we discuss building a 
distributed operating system on top of the kernel and 
consider the ways in which our design results in loss of 
transparency. 


2 Target System 


Star, as illustrated in Figure 1, is composed of a 
collection of N processor elements (PE) and a 
communication network, called Starnet. N is generally a 
power of 2. Starnet can be composed of multiple baseline 
networks [13]. The prototype system, which is currently 
running, has one baseline network and four processor nodes. 


The baseline network, as illustrated in Figure 2, is a 
multistage interconnection network which has the 
characteristic of being reconfigurable under the control of the 
processor nodes and can form such topologies as rings, 
trees, and meshes [14]. The baseline network does have the 
property of blocking; that is, the creation of a connection 
prevents some set of connections that would otherwise be 
feasible. Therefore, the management of interconnection 
resources is an important consideration in designing an 
operating system. Connections in the baseline network 
originate on the left (in Figure 2) and may connect several 
destination nodes on the right (called multicasting). In 
addition the network provides the capability of merging 
connection requests from the left to a common destination on 
the right. 


FIGURE 1 Star System 


FIGURE 2 An 8x8 Baseline Network 


Processor nodes are connected to the interconnection 
network by hardware interface units which have limited 
buffering capacity and an additional processor to handle 
communications. The hardware interface is illustrated in 
Figure 3. The communication processor interfaces to the 
interconnection network and the main processor exchanges 
data with the communication processor through an area of 
shared memory. 


The essential characteristic of Starnet is that it is 
reconfigurable and may, at any moment, be partitioned into a 
number of private buses connecting two or more processor 
nodes. We will refer to a group of nodes connected by such 
a bus as a multicast group. 


The details of how the network is reconfigured are 
immaterial to this paper. Assuming such a multicast group 
exists, there is, at any moment, exactly one node which is the 
sender (has the right to send data over the bus). All other 
nodes are receivers. Receivers may, through hardware 
control, stop the flow of data and request the right to become 
the sender. The sender may send data (unless blocked by 
one of the receivers) or it may relinquish the right to talk. In 
order for data transmission to occur, all the receivers 
connected to a bus must accept every word of data 
transmitted. 
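

The sender and receiver discipline described above can be 
modeled in a few lines. The following Python sketch is an 
illustration only, not part of the Star hardware or kernel. The 
class and method names are hypothetical, and the rule that the 
right to send passes to the earliest requester is an assumption 
made for the example. 

```python
class MulticastBus:
    """Toy model of one Starnet multicast group (all names hypothetical)."""

    def __init__(self, nodes, sender):
        if sender not in nodes:
            raise ValueError("sender must belong to the group")
        self.nodes = list(nodes)
        self.sender = sender          # exactly one sender at any moment
        self.blocked_by = set()       # receivers currently stopping the flow
        self.pending = []             # receivers waiting for the right to send
        self.received = {n: [] for n in self.nodes}

    def block(self, node):
        """A receiver stops the flow of data."""
        if node != self.sender:
            self.blocked_by.add(node)

    def unblock(self, node):
        self.blocked_by.discard(node)

    def request_send(self, node):
        """A receiver stops the flow and asks to become the sender."""
        if node != self.sender and node not in self.pending:
            self.block(node)
            self.pending.append(node)

    def send(self, node, word):
        """Transmit one word; every receiver accepts it, or none does."""
        if node != self.sender or self.blocked_by:
            return False
        for n in self.nodes:
            if n != self.sender:
                self.received[n].append(word)
        return True

    def relinquish(self):
        """The sender gives up the right to talk (assumed: earliest requester wins)."""
        if self.pending:
            self.sender = self.pending.pop(0)
            self.blocked_by.discard(self.sender)
```

Because a word is appended to every receiver's list or to none, 
all receivers observe the same sequence of words, which is the 
property the kernel relies on. 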


The hardware supports a mode of communication which 
we will refer to as reliable multicasting. This is because 
message broadcasting within a multicasting group is both 
error free and order preserving. Reliable multicast is an 
important mode of communication. Gehani has shown that 
multicasting simplifies the development of distributed 
applications [15]. In addition, multicasting greatly reduces 
the message traffic between processors needed for insuring 
mutual exclusion of access to critical sections. The 
distributed mutual exclusion problem was studied by 
Lamport [16] and Ricart [17]. Ricart showed that in a 
general message based system which supports broadcast, but 
not strict message sequencing, N messages (where N is the 
number of processors) are needed to invoke a critical section. 
With strict message sequencing only 2 messages are needed 
to invoke a critical region. Because of the efficiency which 
reliable multicasting provides in insuring mutual exclusion 
and the ease of use which Gehani has demonstrated, we 
concluded that the multicast capability of the hardware is a 
powerful feature which the kernel must support. 


FIGURE 3 Hardware Interface 


3 Process Manager 


Support for the process abstraction is provided at the 
lowest level of the Star kernel by a process manager. The 
process manager provides mechanisms to create and destroy 
processes and low level synchronization and communication 
primitives. The primitives of the process manager may be 
invoked only by processes which are local to the node. 
Although these local primitives are not directly used by 
application programs, they are used in implementing the 
global communication system provided for application 
programs and distributed portions of the operating system. 


The choice of primitive operations and the major design 
issues addressed in the process manager have a profound 
influence on the rest of the operating system. In Section 3.1 
we discuss two important design issues: scheduling and 
memory management. The communication and 
synchronization primitives provided by the process manager 
are used in implementing the communication system 
provided by level 3 of the operating system. In Section 3.2 
we discuss the functional characteristics of the process 
manager. 


3.1 Design Considerations 


Two issues which affected the process manager design 
involve scheduling and memory management. Although the 
process manager does not make scheduling decisions, the 
design had to be flexible enough to permit the implementation 
of a process scheduler at a later phase of the operating 
system design. The second issue is whether memory 
management should have been provided at a level below the 
process manager. 


We chose not to implement the memory manager below 
the process manager, because we felt that it would be easier 
to implement using the primitives of the process manager. 
The process manager provides support for the objects 
process and semaphore. Creation and destruction of these 
two object types does involve the allocation and deallocation 
of memory. In addition the process manager must 
manipulate queues of process descriptors. However, in both 
cases the amounts of memory involved are small and are 
handled using statically allocated tables. 


One of the major responsibilities of the process manager 
is to maintain a list (the Ready List) of processes which are 
ready to run. The discipline followed in queuing processes 
on the ready list and in transferring processor control to a 
ready process clearly has an impact on the processor 
scheduling algorithms which may be implemented. It is of 
great importance that this queuing discipline be sufficiently 
general to support a wide variety of scheduling policies. 


We defined the following characteristics of our process 
manager: 


* The process manager supports priorities; 
operations are provided which permit the process 
priorities to be examined and modified. Processes 
are selected for execution based upon these 
priorities. The priority of the running process is at 
least as large as that of any ready process. 


* The ready list is structured as a priority queue 
using a FIFO discipline to break ties. 


* Although time slicing is not implemented in our 
kernel, this feature could be added to the process 
manager without modifying the interface to the 
process manager. 


Ruschitzka and Fabry [18] have shown that given these 
characteristics a function for assigning priorities can be found 
so that the following scheduling algorithms can be 
implemented: FIFO, LIFO, and preemptive shortest job first. 
With the addition of time slicing, round robin and feedback 
algorithms can be implemented. Thus process scheduling 
disciplines are not unduly restricted by the design of the 
process manager. 
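

To illustrate how one queuing discipline can express several 
policies, the following Python sketch models a ready list as a 
priority queue with FIFO tie-breaking and derives three run 
orders from three priority functions. It is a simplified model 
and not the kernel's MODULA-2 code: the names ReadyList and 
run_order are hypothetical, all jobs are assumed to be ready at 
once, and preemption is not modeled. 

```python
import heapq
import itertools


class ReadyList:
    """Priority queue of ready processes with FIFO tie-breaking."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # increasing sequence number

    def enqueue(self, process, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest
        # first; the arrival number breaks ties in first-in, first-out order.
        heapq.heappush(self._heap, (-priority, next(self._arrival), process))

    def dequeue(self):
        _, _, process = heapq.heappop(self._heap)
        return process

    def __len__(self):
        return len(self._heap)


def run_order(jobs, priority_of):
    """Return the order in which jobs run under a given priority function.

    `jobs` is a list of (name, arrival_index, service_time) tuples.
    """
    ready = ReadyList()
    for job in jobs:
        ready.enqueue(job[0], priority_of(job))
    return [ready.dequeue() for _ in range(len(ready))]


jobs = [("a", 0, 5), ("b", 1, 2), ("c", 2, 9)]
fifo = run_order(jobs, lambda job: 0)        # equal priorities: ties give FIFO
lifo = run_order(jobs, lambda job: job[1])   # later arrival, higher priority
sjf = run_order(jobs, lambda job: -job[2])   # shorter job, higher priority
```

Only the priority function changes between the three policies; 
the queue itself is untouched. 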


3.2 Facilities of Process Manager 


The process manager provides support for processes and 
for communication between processes. The procedures 
provided by the process manager fall into four categories: 
Process Control, Process Synchronization, Process 
Communication and Utility procedures. The process control 
procedures include: 


CreateProcess Create a new process; 

DestroyProcess Destroy a process; 

Suspend Prevent a process from executing; 

Resume Resume a suspended process; 

GetPriority Get the priority of a process; 

SetPriority Set the priority of a process; 

CurrentProcess Get the identity of the current 
process; and 

WaitIO Allows a process to wait on a 
particular hardware interrupt 
vector. When the interrupt 
occurs, the process is placed on 
the ready queue. This permits the 
structuring of device drivers as 
separate processes. 


The control procedures lead to the process state diagram 
shown in Figure 4. A running process may suspend itself, 
wait on an interrupt, or lose control of the processor through 
a rescheduling operation. Rescheduling occurs whenever a 
process with a priority higher than the running process 
becomes ready to run. 


*EXTERNAL EVENTS 
FIGURE 4 Basic Process States 


In addition to the process control procedures, the process 
manager provides abstract data types and procedures for 
process synchronization and communication. 


The synchronization and communication primitives of 
the process manager were chosen to provide a convenient 
basis for higher levels of the operating system. In the 
interest of flexibility, the process manager provides both 
semaphores and a primitive form of messages. 


Semaphores have been widely discussed in the 
literature [19]; we will say nothing further about them except 
to list the operations provided by the process manager: 
CreateSemaphore, DestroySemaphore, Wait, Signal. 


There are many different message systems and the 
design possibilities included blocking or non-blocking send, 
blocking or non-blocking receive, and buffering 
mechanisms. Because a fully synchronized message system 
may easily be implemented using semaphores, we concluded 
that a less tightly synchronized message system was 
appropriate at this level. Because there is no memory 
management facility available to the process manager, 
buffering had to be strictly limited. Process manager level 
messages (LocalMessage) are not intended to be used for 
transferring large amounts of data, but rather to provide what 
amounts to a private event flag with state information. In 
order to provide this capability at a minimum cost, 
LocalMessages were restricted to positive integers and the 
following LocalMessage primitives provided: 
SendLocalMessage (non-blocking), ReceiveLocalMessage 
(blocking), LocalMessageFlush (non-blocking receive). 
Because buffer space is limited (to a single LocalMessage) 
and SendLocalMessage is non-blocking, overflow 
LocalMessages are simply discarded. The process 
synchronization and communication primitives lead to the 
process state diagram shown in Figure 5. In addition to the 
state transitions shown in Figure 4, a running process may 
also lose control of the processor if it waits on a semaphore 
or waits for a local message. 
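

The behavior of the three LocalMessage primitives can be 
sketched as a one-slot mailbox. The Python model below is an 
illustration under stated assumptions rather than the kernel's 
implementation: the class name LocalMessageSlot is hypothetical, 
threads stand in for kernel processes, and the boolean returned 
by send is added only so that a discarded message is observable. 

```python
import threading


class LocalMessageSlot:
    """One-message buffer belonging to a process (hypothetical model)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._value = None

    def send(self, value):
        """Non-blocking send; returns False if the message was discarded."""
        if not (isinstance(value, int) and value > 0):
            raise ValueError("LocalMessages are restricted to positive integers")
        with self._cond:
            if self._value is not None:
                return False          # buffer full: the overflow is discarded
            self._value = value
            self._cond.notify()
            return True

    def receive(self):
        """Blocking receive: wait until a message is present."""
        with self._cond:
            while self._value is None:
                self._cond.wait()
            value, self._value = self._value, None
            return value

    def flush(self):
        """Non-blocking receive: return the message, or None if there is none."""
        with self._cond:
            value, self._value = self._value, None
            return value
```

A second send before the first message is consumed is dropped, 
so a LocalMessage behaves like an event flag that carries a 
small amount of state rather than like a queue. 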


*EXTERNAL EVENTS 
FIGURE 5 Complete Process State Diagram 


A significant issue in the design of the process manager 
was the question of process termination. In general, the 
destruction of processes may affect the higher levels of the 
operating system. This is because higher levels of the 
operating system may associate further object types with 
processes and may need to perform some bookkeeping when 
a process is destroyed. For this reason, the process manager 
provides a general facility for installing termination 
procedures. The process manager maintains a list of 
termination procedures, each of which is called when a 
process is destroyed. These procedures enable higher levels 
of the operating system to be notified at process destruction 
time so that they may perform any necessary bookkeeping 
operations. This feature is both powerful and dangerous; 
however, some protection is provided by dictating that only 
the initial process of the node may install these termination 
procedures. 
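

The termination-procedure facility can be pictured with the 
following Python sketch. It is a minimal model, not the kernel's 
interface: the class and method names are hypothetical, and the 
choice to call the procedures after the process has been removed 
from the process table is an assumption. 

```python
class ProcessManager:
    """Minimal model of termination-procedure handling (names hypothetical)."""

    def __init__(self, initial_process):
        self.initial_process = initial_process
        self.processes = {initial_process}
        self._termination_procedures = []

    def create_process(self, pid):
        self.processes.add(pid)

    def install_termination_procedure(self, caller, procedure):
        # Protection rule: only the initial process of the node may install.
        if caller != self.initial_process:
            raise PermissionError("only the initial process may install")
        self._termination_procedures.append(procedure)

    def destroy_process(self, pid):
        self.processes.remove(pid)
        # Every installed procedure is called for every destroyed process,
        # so each higher level can clean up its own per-process objects.
        for procedure in self._termination_procedures:
            procedure(pid)
```

A higher level, such as a memory manager, would install one 
procedure at start-up and use it to release whatever it has 
associated with the destroyed process. 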


4 Global Communication Layer 


As mentioned in Section 3, the process manager 
provides some local communication primitives. These 
primitives are provided to facilitate the implementation of 
higher levels of the kernel. In a distributed system such as 
Star a more general interprocess communication system is 
needed which can cross processor boundaries. In order to 
provide a coherent view of the system to application 
programs, the global communication system may also be 
used for local communications. The global communication 
system of the kernel is a message passing system supporting 
reliable multicasting. The design criteria for the low level 
communication system were: 


* To support hardware communication modes; 


* To support location independence of processes. 
The user of the system should not have to modify 
his code depending upon whether or not two 
communicating processes are to be scheduled at 
the same node; and 


* To keep the primitive system both simple and 
efficient. 


Because this is the first level of software above the 
underlying communication hardware, it was essential that it 
closely adhere to a communication model which the 
underlying hardware can efficiently support. In particular, 
we were concerned that the communication software fully 
support the multicasting capability of the hardware. The 
communication system provides for location independence of 
processes by providing a unified model of communication 
between locally executing and distributed processes. 


4.1 Global Communication System Definition 


The global communication system for Star supports the 
multicasting capability provided by multicast and merge in 
the hardware. The communication system is based upon the 
following abstractions. All communication occurs 
asynchronously between privately owned ports. Two ports 
may communicate if they are attached to the same channel. 
Channels are multiway, multidirectional communication 
paths. An arbitrary number of ports may be connected to a 
single channel. Communication between two ports occurs 
when a process writes a message to one of its ports and a 
receive operation is performed by a process on a port 
connected to the same channel. The channel model was 
chosen to closely match the multicasting buses which are 
supported by the interconnection network. 


All messages sent by a port are received, in the order 
sent, by all ports connected to the same channel. There are 
two types of channels, local and distributed. A local channel 
may connect ports which reside on a single processor node. 
Distributed channels may connect ports on different nodes. 
Distributed channels exist only where hardware connections 
between nodes exist. More specifically, distributed channels 
are realized by a particular configuration of the underlying 
interconnection network. 


4.2 Ports and Channels 


In this section we will consider what ports and channels 
are. There is a single interface for both local and distributed 
channels; however, the implementation of the two is 
different. When discussing implementation details we mean, 
for the moment, local channels. We will consider the 
implementation of distributed channels in Section 4.4. 


A channel is essentially a shared data structure consisting 
of a record for each connected port (the PortList), a fixed 
buffer space, and a pointer into the buffer space indicating 
where the next message should be deposited. The set of 
operations which can be performed on a channel is: 


CreateChannel -- Create a new channel with a 
specified name and buffer size. 

DestroyChannel -- A channel with no attached ports 
may be destroyed. 

ConnectPort -- Connect the specified port to the 
channel. Does not block if the 
named channel does not exist. 

DisconnectPort -- Disconnect the specified port 
from the channel. 


Ports may be connected to or disconnected from 
channels at any time. In theory, an arbitrary number of ports 
may be connected to a channel (in practice this is limited by 
our implementation). Monitoring of communication can be 
achieved without introducing delay by attaching a monitor 
port to the channel. 


A port is a privately owned object. Only the owner of a 
port may read data from or write data to a port; however, a 
port may be connected to or disconnected from a channel by 
a process other than its owner. The communication system 
supports the notion of a process waiting to receive data from 
any subset of its ports. This is implemented using 
LocalMessages. Any port which is connected to a channel 
may enable or disable a notification mode. If the notification 
mode of the port is enabled, a LocalMessage is sent to the 
owner of the port whenever a message is added to the buffer 
of the channel to which the port is connected. 


The following operations may be performed upon a 
port. 


ReadFromPort -- Read next message. This is 
non-blocking and returns NIL if 
no message is available. 

SendToPort -- Send a message through a port. 
If the port is not connected to the 
channel, the message will be lost. 

EnableNotification -- When the notification mode of a 
port is enabled, a LocalMessage 
is sent to the owner of the port 
whenever a message is added to 
the buffer of the channel to which 
the port is attached. 

DisableNotification -- Disable notification mode. 
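

One way a process might wait on a subset of its ports using the 
notification mode is sketched below in Python. This is a 
simplified model built on assumptions that the text does not 
state: the names are hypothetical, each port keeps its own list 
of messages in place of a pointer into the shared channel 
buffer, the port number is used as the LocalMessage value, and a 
non-blocking flush stands in for the blocking 
ReceiveLocalMessage. Because only one LocalMessage is buffered, 
the owner polls every watched port after it is woken. 

```python
class Process:
    """Owner of ports; holds the single-LocalMessage buffer."""

    def __init__(self):
        self.local_message = None

    def send_local_message(self, value):
        if self.local_message is None:     # overflow messages are discarded
            self.local_message = value

    def flush_local_message(self):
        value, self.local_message = self.local_message, None
        return value


class Port:
    def __init__(self, number, owner):
        self.number = number               # positive integer sent as the LocalMessage
        self.owner = owner
        self.inbox = []                    # stand-in for the port's view of the channel
        self.notification = False

    def read(self):
        """Non-blocking read: next message, or None if nothing is available."""
        return self.inbox.pop(0) if self.inbox else None


def deposit(ports, message):
    """Channel side: make the message visible to each port and notify owners."""
    for port in ports:
        port.inbox.append(message)
        if port.notification:
            port.owner.send_local_message(port.number)


def drain(owner, watched):
    """Owner side: after a LocalMessage arrives, poll every watched port."""
    if owner.flush_local_message() is None:
        return []
    messages = []
    for port in watched:
        message = port.read()
        while message is not None:
            messages.append((port.number, message))
            message = port.read()
    return messages
```

Ports whose notification mode is disabled never wake the owner, 
which is how the subset is selected in this sketch. 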


The ports connected to a channel may read the messages 
in the channel buffer at their own pace. We may view each 
port as having a private pointer into a shared list of 
messages. This private pointer is maintained as part of the 
channel data structure and is directly accessed only by the 
channel software. In Figure 6 the channel buffer model and 
the function of these private pointers is illustrated. In this 
figure, three ports are connected to the channel. Each port is 
represented by a private pointer into the channel buffer. 


FIGURE 6 Channel Buffer 


Because buffer space for a given channel is fixed, we 
can view the buffer as being circular with new messages 
eventually overwriting old ones. Although this event is not 
an error it is important to know when it has occurred. For 
this purpose a tail pointer is maintained by the channel, 
which indicates the oldest message which has not been read 
by all attached ports. The channel data structure also includes 
some status flags: a note flag, indicating whether any of the 
attached ports have enabled their notification mode, and a nil 
flag, indicating whether any of the attached ports have 
read the most recent message. 
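

The circular buffer with one private read pointer per port can 
be modeled as follows. The Python sketch is illustrative only 
and is not the kernel's MODULA-2 data structure: the names are 
hypothetical, positions are kept as running message counts 
rather than as a tail pointer, a sender is assumed to see its 
own messages, and a per-port overflow flag records that an 
unread message was overwritten. 

```python
class LocalChannel:
    """Fixed-size circular message buffer shared by several ports."""

    def __init__(self, name, buffer_size):
        self.name = name
        self.size = buffer_size
        self.buffer = [None] * buffer_size
        self.written = 0          # total messages deposited so far
        self.port_list = {}       # port name -> per-port record

    def connect(self, port):
        # A newly connected port starts at the next message to be written.
        self.port_list[port] = {"next": self.written, "overflow": False}

    def disconnect(self, port):
        del self.port_list[port]

    def send(self, port, message):
        if port not in self.port_list:
            return False          # unconnected port: the message is lost
        self.buffer[self.written % self.size] = message
        self.written += 1
        oldest = self.written - self.size
        for record in self.port_list.values():
            if record["next"] < oldest:       # an unread message was overwritten
                record["next"] = oldest
                record["overflow"] = True
        return True

    def read(self, port):
        """Non-blocking read: (message, overflow); message is None if empty."""
        record = self.port_list[port]
        overflow, record["overflow"] = record["overflow"], False
        if record["next"] == self.written:
            return None, overflow             # nothing left for this port
        message = self.buffer[record["next"] % self.size]
        record["next"] += 1
        return message, overflow
```

Each port advances at its own pace; a slow port loses the oldest 
messages and is told so on its next read, while faster ports 
are unaffected. 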


The PortList of a channel consists of a record for each 
connected port. This record includes the port name, a pointer 
to the next message to be read, a notificationmode flag, a nil 
flag, and an overflow flag. The notificationmode flag 
indicates whether the port should be notified when a new 
message is written, the overflow flag indicates whether a 
message has been lost since the last read, and the nil flag 
indicates whether there are any messages to read. 


4.3 Buffering 


The choice of asynchronous communication implied that 
some buffering capability must be built into the 
communication system. Because buffering is, of necessity, 
limited, we had to either institute flow control or accept the 
potential loss of data due to buffer overflow. We regarded 
the introduction of flow control at this level as a poor idea 
because it introduces additional overhead for all 
communication when some applications may not need it. 
This is particularly true if the communication protocol of an 
application is inherently bounded (can have a finite limit on 
the number of outstanding messages) and the bound is 
sufficiently small that overflow is impossible. Flow control 
may be easily introduced at a higher level. 


The question remained -- how much buffer capacity 
should be provided. Reid [20] solved this by providing a 
single message buffer. Each successive message overwrites 
the previous message. This approach has a serious flaw in 
that it is impossible to use broadcast to provide 
synchronization because we cannot be certain that a reply 
message did not overwrite the original message before all 
ports had read it. Any fixed buffering scheme has the 
limitation that it renders some communication protocols either 
impossible or difficult and provides excess capacity for 
others. 


In solving this problem we made the following 
assumption: a particular channel is shared by ports which use 
the same communication protocol. The protocol used must 
either be bounded or must handle potential message loss. In 
either case it is possible to allocate a fixed buffer space when 
the channel is first created. The buffer size is specified in the 
initial allocation request. Obviously it is possible that this 
request may not be met; however, this event will be 
discovered at the outset rather than at some point well into the 
execution of an application. 


4.4 Implementing Distributed Channels 


We have delayed discussing the implementation of 
distributed channels until now because their implementation 
is significantly different from local channels, although they 
respond to exactly the same set of operations. Distributed 
channels are realized through the coordinated efforts of two 
or more connected processor nodes. 


In the local channel model we can view a channel as a 
shared data structure with a number of attached ports. Each 
port is uniquely owned by a process. Unfortunately this 
model is much more complicated in the distributed case: 
first, because the distributed channel cannot be implemented 
through shared memory, and second, because the 
communicating ports are on different nodes. 


In order to realize the channel model in the distributed 
case we maintain, at each connected node, a copy of the 
message buffer. The other parts of the channel data structure 
are provided for maintaining port status and are only 
maintained locally. Thus each connected node 
maintains separate data for the channel, part of which is 
replicated and part of which is unique. 


Recall that the multicasting buses realized in the 
interconnection network have the capability of controlling 
data flow across the interconnection network and there is 
always a unique sender on the bus. The capability of 
controlling data flow insures that a receiving node can 
process messages as they are received. The fact that there is 
a single sender insures that only one node will be able to 
transmit messages at any moment. Updates to a copy of the 
channel buffer occur in only two ways: the node is the 
current sender and a send is performed by one of its 
connected ports or the node is a receiver and a message is 
received in the buffer of the hardware interface. 
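

The replication scheme can be sketched in Python as follows. 
This is a model under simplifying assumptions, not the Star 
implementation: the names are hypothetical, the buffer is an 
unbounded list rather than a fixed circular buffer, and the 
hardware exchange by which a receiver becomes the sender is 
reduced to a single assignment. 

```python
class NodeCopy:
    """Per-node data for one distributed channel."""

    def __init__(self):
        self.buffer = []        # replicated part: same contents at every node
        self.next_read = {}     # unique part: read position of each local port


class DistributedChannel:
    def __init__(self, nodes, sender):
        self.copies = {node: NodeCopy() for node in nodes}
        self.sender = sender    # the bus has exactly one sender at a time

    def connect(self, node, port):
        copy = self.copies[node]
        copy.next_read[port] = len(copy.buffer)

    def acquire_send_right(self, node):
        # Stands in for the hardware request/relinquish exchange.
        self.sender = node

    def send(self, node, message):
        if node != self.sender:
            self.acquire_send_right(node)
        # Update 1: the current sender deposits the message in its own copy.
        self.copies[node].buffer.append(message)
        # Update 2: every receiver deposits the message arriving from the bus.
        for other, copy in self.copies.items():
            if other != node:
                copy.buffer.append(message)

    def read(self, node, port):
        copy = self.copies[node]
        position = copy.next_read[port]
        if position == len(copy.buffer):
            return None
        copy.next_read[port] = position + 1
        return copy.buffer[position]
```

Since only the single sender can append, and every receiver 
appends each message it is sent, all copies of the buffer hold 
the same messages in the same order. 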


In practice multiple distributed channels can be supported 
over each multicasting bus. The multiple distributed channels 
share the bandwidth provided by the multicasting bus. 


5 Discussion 


In this section we discuss how the kernel can be used to 
build a complete operating system, how the design avoids 
loss of transparency, and the current status of the system. 

5.1 Building a Complete Operating System 


The kernel is not an operating system; however, it 
provides the facilities needed to build a distributed operating 
system. A distributed operating system for Star would 
consist of a copy of the kernel and some statically defined set 
of initial processes at each processor of the system. It would 
need a file system, some user interface, and a resource 
manager. An early version of a resource manager for Star 
has been described in [21]. 


When the system is initialized, the hardware is not 
configured and there are no communication channels. 
Therefore, it must be possible for the set of statically defined 
processes to create a set of communication channels so that 
they may cooperate to perform the operating system 
functions. The feasibility of this initialization step is 
illustrated by a simple example. Imagine that there is a 
process x residing at node X and a process y residing at 
node Y. The following code fragments demonstrate a 
method in which the two processes can become connected to 
a common channel. 


(* code fragment for x *) 


CONST buffersize = ?; 
VAR myport : port; success : BOOLEAN; 


BEGIN 
    CreateChannel("channelone", {X,Y}, buffersize); 
    success := ConnectPort(myport, "channelone"); 


    . . . 


(* code fragment for y *) 


VAR myport : port; success : BOOLEAN; 
BEGIN 
    REPEAT 
        success := ConnectPort(myport, "channelone"); 
    UNTIL success 


    . . . 


Additional protocol steps involving message passing 
between x and y are needed so that the two processes may 
reach an agreed upon state. Although this initialization 
process may seem to be a great deal of trouble, it is only 
necessary for initializing the higher levels of the operating 
system and should not be necessary for application 
programs. 


5.2 Loss of Transparency 


In Section 1 we mentioned that minimization of loss of 
transparency was a key design goal. We have achieved this 
by tailoring our global communication system to the 
capabilities of the interconnection network. The channel 
model supports the multicast connections among multiple 
processors. In addition the channel maintains the strict 
message sequencing provided by the underlying hardware. 


We have introduced some loss of transparency by 
forcing channel buffers to be fixed at creation; this 
introduces the possibility of message loss which does not 
exist in the hardware; however, we have argued that with a 
proper choice of communication protocols message loss can 
be avoided. 


5.3 Concluding Remarks 


In this paper we have presented the operating system 
kernel for Star, a reconfigurable multiprocessor system. The 
kernel is intended to serve as a basis for further operating 
systems research. It provides process management, local 
resource allocation, and inter-process communication 
mechanisms. The design is simple and avoids loss of 
transparency in the hardware by supporting the 
communication modes provided by the interconnection 
network. 


The kernel supports multiple processes and, at the 
process manager level, provides simple, yet powerful 
inter-process communication mechanisms. In addition the 
process manager provides for the implementation of a wide 
range of process scheduling algorithms. 


The global communication mechanism was designed to 
closely support the communication modes provided for by 
the underlying hardware. It supports the general notion of 
merging by allowing additional communicating entities 
(ports) to join into a conversation (a channel) at will. In 
addition the communication mechanism maintains the 
capability of the hardware to provide reliable multicasting of 
messages. 


The present status of our system is that we have a four 
node machine running. We have implemented and tested the 
process manager, memory manager and local portion of the 
distributed communication system. In addition we have 
implemented a linking loader which allows separately 
compiled MODULA-2 programs to be loaded, linked to the 
operating system and executed as independent processes. 
The implemented code constitutes about two thousand lines 
of MODULA-2. We plan to extend our work to provide 
greater support for distributed programs. 


THE INTERFACE BETWEEN DISTRIBUTED OPERATING SYSTEM 
AND HIGH-LEVEL PROGRAMMING LANGUAGE 


Michael L. Scott 
University of Rochester 
Rochester, NY 14627 


Abstract — A distributed operating system provides a process 
abstraction and primitives for communication between 
processes. A distributed programming language regularizes the use 
of the primitives, making them both safer and more convenient. 
The level of abstraction of the primitives, and therefore the division 
of labor between the operating system and the language support 
routines, has serious ramifications for efficiency and flexibility. 
Experience with three implementations of the LYNX distributed 
programming language suggests that functions that can be 
implemented on either side of the interface are best left to the 
language run-time package. 


Introduction 


Recent years have seen the development of a large number of 
distributed programming languages and an equally large number of 
distributed operating systems. While there are exceptions to the 
rule, it is generally true that individual research groups have focused 
on a single language, a single operating system, or a single 
language/O.S. pair. Relatively little attention has been devoted to 
the relationship between languages and O.S. kernels in a distributed 
setting. 


Amoeba [15], Demos-MP [16], Locus [26], and the V kernel [7] 
are among the better-known distributed operating systems. 
Each by-passes language issues by relying on a simple 
library-routine interface to kernel communication primitives. Eden [5] and 
Cedar [24] have both devoted a considerable amount of attention to 
programming language issues, but each is very much a 
single-language system. The Accent project at CMU [17] is perhaps the 
only well-known effort to support more than one programming 
language on a single underlying kernel. Even so, Accent is only 
able to achieve its multi-lingual character by insisting on a single, 
universal model of interprocess communication based on remote 
procedure calls [11]. Languages with other models of process 
interaction are not considered. 


In the language community, it is unusual to find implementations 
of the same distributed programming language for more than 
one operating system, or indeed for any existing operating system. 
Dedicated, special-purpose kernels are under construction for 
Argus [14], SR [1,2], and NIL [22,23]. Several dedicated 
implementations have been designed for Linda [6,10]. No distributed 
implementations have yet appeared for Ada [25]. 


If parallel or distributed hardware is to be used for 
general-purpose computing, we must eventually learn how to support 
multiple languages efficiently on a single operating system. Toward that 
end, it is worth considering the division of labor between the 
language run-time package and the underlying kernel. Which 
functions belong on which side of the interface? What is the 
appropriate level of abstraction for universal primitives? Answers to these 
questions will depend in large part on experience with a variety of 
language/O.S. pairs. 


This paper reports on implementations of the LYNX 
distributed programming language for three existing, but radically 
different, distributed operating systems. To the surprise of the 
implementors, the implementation effort turned out to be 
substantially easier for kernels with low-level primitives. If confirmed by 
similar results with other languages, the lessons provided by work 
on LYNX should be of considerable value in the design of future 
systems. 


The first implementation of LYNX was constructed during 
1983 and 1984 at the University of Wisconsin, where it runs under 
the Charlotte distributed operating system [3,9] on the Crystal 
multicomputer [8]. The second implementation was designed, but never 
actually built, for Kepecs and Solomon's SODA [12,13]. A third 
implementation has recently been released at the University of 
Rochester, where it runs on BBN Butterfly multiprocessors [4] under 
the Chrysalis operating system. 


Section 2 of this paper summarizes the features of LYNX 
that have an impact on the services needed from a distributed 
operating system kernel. Sections 3, 4, and 5 describe the three 
LYNX implementations, comparing them one to the other. The 
final section discusses possible lessons to be learned from the com- 
parison. 


LYNX Overview 


The LYNX programming language is not itself the subject of 
this article. Language features and their rationale are described in 
detail elsewhere [19, 20,21]. For present purposes, it suffices to say 
that LYNX was designed to support the loosely-coupled style of 
programming encouraged by a distributed operating system. Unlike 
most existing languages, LYNX extends the advantages of high-level 
communication facilities to processes designed in isolation, and 
compiled and loaded at disparate times. LYNX supports interaction not 
only between the pieces of a multi-process application, but also 
between separate applications and between user programs and 
long-lived system servers. 


Processes in LYNX execute in parallel, possibly on separate 
processors. There is no provision for shared memory. Interprocess 
communication uses a mechanism similar to remote procedure calls 
(RPC), on virtual circuits called links. Links are two-directional and 
have a single process at each end. Each process may be divided 
into an arbitrary number of threads of control, but the threads 
execute in mutual exclusion and may be managed by the language 
run-time package, much like the coroutines of Modula-2 [27]. 


Communication Characteristics 


(The following paragraphs describe the communication 
behavior of LYNX processes. The description does not provide 
much insight into the way that LYNX programmers think about 
their programs. The intent is to describe the externally-visible 
characteristics of a process that must be supported by kernel 
primitives.) 


Messages in LYNX are not received asynchronously. They 
are queued instead, on a link-by-link basis. Each link end has one 
queue for incoming requests and another for incoming replies. 
Messages are received from a queue only when the queue is open 
and the process that owns its end has reached a well-defined block 
point. Request queues may be opened or closed under explicit pro- 
cess control. Reply queues are opened when a request has been 
sent and a reply is expected. The set of open queues may therefore 
vary from one block point to the next. 


A blocked process waits until one of its previously-sent mes- 
sages has been received, or until an incoming message is available in 
at least one of its open queues. In the latter case, the process 
chooses a non-empty queue, receives that queue’s first message, and 
executes through to the next block point. For the sake of fairness, 
an implementation must guarantee that no queue is ignored forever. 


figure 1: link moving at both ends 


Messages in the same queue are received in the order sent. 
Each message blocks the sending coroutine within the sending pro- 
cess. The process must be notified when messages are received in 
order to unblock appropriate coroutines. It is therefore possible for 
an implementation to rely upon a stop-and-wait protocol with no 
actual buffering of messages in transit. Request and reply queues 
can be implemented by lists of blocked coroutines in the run-time 
package for each sending process. 
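
The queueing rules above can be illustrated with a short sketch. It is not taken from the LYNX run-time package: the names `LinkEnd` and `choose_message` are hypothetical, and the round-robin scan is just one assumed policy that satisfies the fairness requirement. 

```python
from collections import deque

class LinkEnd:
    """One end of a link: a request queue and a reply queue (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.queues = {"request": deque(), "reply": deque()}
        self.open = {"request": False, "reply": False}

def choose_message(ends, start=0):
    """At a block point, take the first message of some open, non-empty queue.

    The scan begins at position `start` and wraps around. The caller passes
    the returned position back in next time, so no queue is ignored forever.
    Returns (link name, kind, message, next start), or None if nothing is ready.
    """
    slots = [(end, kind) for end in ends for kind in ("request", "reply")]
    for i in range(len(slots)):
        pos = (start + i) % len(slots)
        end, kind = slots[pos]
        if end.open[kind] and end.queues[kind]:
            msg = end.queues[kind].popleft()   # messages leave in the order sent
            return end.name, kind, msg, (pos + 1) % len(slots)
    return None
```

A closed queue keeps its messages; they simply stay queued until the queue is opened and the process reaches another block point. 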


The most challenging feature of links, from an implementor’s 
point of view, is the provision for moving their ends. Any message, 
request or reply, can contain references to an arbitrary number of 
link ends. Language semantics specify that receipt of such a mes- 
sage has the side effect of moving the specified ends from the send- 
ing process to the receiver. The process at the far end of each 
moved link must be oblivious to the move, even if it is currently 
relocating its end as well. In figure 1, for example, processes A and 
D are moving their ends of link 3, independently, in such a way that 
what used to connect A to D will now connect B to C. 


It is best to think of a link as a flexible hose. A message put 
in one end will eventually be delivered to whatever process happens 
to be at the other end. The queues of available but un-received 
messages for each end are associated with the link itself, not with 
any process. A moved link may therefore (logically at least) have 
messages inside, waiting to be received at the moving end. In keep- 
ing with the comment above about stop-and-wait protocols, and to 
prevent complete anarchy, a process is not permitted to move a link 
on which it has sent unreceived messages, or on which it owes a 
reply for an already-received request. 
Kernel Requirements 


To permit an implementation of LYNX, an operating system 
kernel must provide processes, communication primitives, and a 
naming mechanism that can be used to build links. The major 
questions for the designer are then 1) how are links to be 
represented? and 2) how are RPC-style request and reply messages 
to be transmitted on those links? It must be possible to move links 
without losing messages. In addition, the termination of a process 
must destroy all the links attached to that process. Any attempt to 
send or receive a message on a link that has been destroyed must 
fail in a way that can be reflected back into the user program as a 
run-time exception. 


The Charlotte Implementation 


Overview of Charlotte 


Charlotte [3,9] runs on the Crystal multicomputer [8], a col- 
lection of 20 VAX 11/750 node machines connected by a 10- 
Mbit/second token ring from Proteon Corporation. 


The Charlotte kernel is replicated on each node. It provides 
direct support for both processes and links. Charlotte links were the 
original motivation for the circuit abstraction in LYNX. As in the 
language, Charlotte links are two directional, with a single process at 
each end. As in the language, Charlotte links can be created, des- 
troyed, and moved from one process to another. Charlotte even 
guarantees that process termination destroys all of the process’s 
links. It was originally expected that the implementation of 
LYNX-style interprocess communication would be almost trivial. 
As described in the rest of this section, that expectation turned out 
to be naive. 


Kernel calls in Charlotte include the following: 


MakeLink (var end1, end2 : link) 
Create a link and return references to its ends. 


Destroy (myend : link) 
Destroy the link with a given end. 


Send (L : link; buffer : address; length : integer; enclosure : link) 
Start a send activity on a given link end, optionally enclosing 
one end of some other link. 


Receive (L : link; buffer : address; length : integer) 
Start a receive activity on a given link end. 


Cancel (L : link; d : direction) 
Attempt to cancel a previously-started send or receive activity. 


Wait (var e : description) 
Wait for an activity to complete, and return its description (link 
end, direction, length, enclosure). 


All calls return a status code. All but Wait are guaranteed to com- 
plete in a bounded amount of time. Wait blocks the caller until an 
activity completes. 


The Charlotte kernel matches send and receive activities. It 
allows only one outstanding activity in each direction on a given end 
of a link. Completion must be reported by Wait before another 
similar activity can be started. 
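
The one-activity-per-direction rule can be made concrete with a toy model. This is only a sketch under stated assumptions: `LinkEndModel`, its method names, and the status strings "OK" and "BUSY" are invented for illustration and are not the Charlotte kernel interface. 

```python
class LinkEndModel:
    """Toy model of one Charlotte link end (not the real kernel interface)."""

    def __init__(self):
        self.outstanding = {"send": False, "receive": False}
        self.finished = []          # completions not yet reported by wait()

    def start(self, direction):
        """Model of Send/Receive: refuse a second activity in one direction."""
        if self.outstanding[direction]:
            return "BUSY"           # hypothetical status code
        self.outstanding[direction] = True
        return "OK"

    def kernel_completes(self, direction):
        """The kernel has matched this activity with one at the far end."""
        if self.outstanding[direction] and direction not in self.finished:
            self.finished.append(direction)

    def wait(self):
        """Report one completion; only now may a similar activity start.

        The real Wait blocks; this model raises IndexError if nothing finished.
        """
        direction = self.finished.pop(0)
        self.outstanding[direction] = False
        return direction
```

Note that a completed activity still counts as outstanding until `wait` reports it, which is the property that forces the run-time package to multiplex its queues onto single kernel activities. 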


Implementation of LYNX 


The language run-time package represents every LYNX link 
with a Charlotte link. It uses the activities of the Charlotte kernel to 
simulate the request and reply queues described in section 2.1. It 
starts a send activity on a link whenever a process attempts to send 
a request or reply message. It starts a receive activity on a link 
when the corresponding request or reply queue is opened, if both 
were closed before. It attempts to cancel a previously-started receive 
activity when a process closes its request queue, if the reply queue is 
also closed. The multiplexing of request and reply queues onto 
receive activities was a major source of problems for the implemen- 
tation effort. A second source of problems was the inability to 
enclose more than one link in a single Charlotte message. 


Screening Messages. For the vast majority of remote 
operations, only two Charlotte messages are required: one for the 
request and one for the reply. Complications arise, however, in a 
number of special cases. Suppose that process A requests a remote 
operation on link L. 


Process B receives the request and begins serving the operation. A 
now expects a reply on L and starts a receive activity with the 
kernel. Now suppose that before replying B requests another operation 
on L, in the reverse direction (the coroutine mechanism mentioned 
in section 2 makes such a scenario entirely plausible). A will receive 
B’s request before the reply it wanted. Since A may not be willing 
to serve requests on L at this point in time (its request queue is 
closed), B is not able to assume that its request is being served sim- 
ply because A has received it. 


A similar problem arises if A opens its request queue and 
then closes it again, before reaching a block point. In the interests 
of concurrency, the run-time support routines will have posted a 
Receive with the kernel as soon as the queue was opened. When 
the queue is closed, they will attempt to cancel the Receive. If B 
has requested an operation in the meantime, the Cancel will fail. 
The next time A’s run-time package calls Wait, it will obtain 
notification of the request from B, a message it does not want. 
Delaying the start of receive activities until a block point does not 
help. A must still start activities for all its open queues. It will 
continue execution after a message is received from exactly one of those 
queues. Before reaching the next block point, it may change the set 
of messages it is willing to receive. 


It is tempting to let A buffer unwanted messages until it is 
again willing to receive from B, but such a solution is impossible for 
two reasons. First, the occurrence of exceptions in LYNX can 
require A to cancel an outstanding Send on L. If B has already 
received the message (inadvertently) and is buffering it internally, 
the Cancel cannot succeed. Second, the scenario in which A 
receives a request but wants a reply can be repeated an arbitrary 
number of times, and A cannot be expected to provide an arbitrary 
amount of buffer space. 


A must return unwanted messages to B. In addition to the 
request and reply messages needed in simple situations, the imple- 
mentation now requires a retry message. Retry is a negative ack- 
nowledgment. It can be used in the second scenario above, when A 
has closed its request queue after receiving an unwanted message. 
Since A will have no Receive outstanding, the re-sent message from 
B will be delayed by the kernel until the queue is re-opened. 


In the first scenario, unfortunately, A will still have a Receive 
posted for the reply it wants from B. If A simply returned requests 
to B in retry messages, it might be subjected to an arbitrary number 
of retransmissions. To prevent these retransmissions we must 
introduce the forbid and allow messages. Forbid denies a process the 
right to send requests (it is still free to send replies). Allow restores 
that right. Retry is equivalent to forbid followed by allow. It can be 
considered an optimization for use in cases where no replies are 
expected, so retransmitted requests will be delayed by the kernel. 


Both forbid and retry return any link end that was enclosed in 
the unwanted message. A process that has received a forbid message 
keeps a Receive posted on the link in hopes of receiving an 
allow message.¹ A process that has sent a forbid message 
remembers that it has done so and sends an allow message as soon 
as it is either willing to receive requests (its request queue is open) 
or has no Receive outstanding (so the kernel will delay all messages). 
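
The screening rules can be summarized as two small decision functions. This is a sketch of the decisions as described in the text, not the actual run-time code; the function and parameter names are assumptions. 

```python
def screen_request(request_queue_open, receive_posted_for_reply):
    """Decide how a process responds to a request that arrives on a link.

    Returns 'serve', 'forbid', or 'retry'.
    """
    if request_queue_open:
        return "serve"
    if receive_posted_for_reply:
        # A Receive must stay posted for the expected reply, so a plain
        # retry could be followed by arbitrarily many retransmissions.
        return "forbid"
    # No Receive is outstanding: the kernel will delay the re-sent request.
    return "retry"

def should_send_allow(sent_forbid, request_queue_open, receive_posted):
    """A process that sent forbid sends allow once requests are welcome,
    or once it has no Receive outstanding."""
    return sent_forbid and (request_queue_open or not receive_posted)
```

In both unwanted cases the response also carries back any link end that was enclosed in the rejected message. 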


Moving Multiple Links. To move more than one link 
end with a single LYNX message, a request or reply must be bro- 
ken into several Charlotte messages. The first packet contains non- 
link data, together with the first enclosure. Additional enclosures 
are passed in empty enc messages (see figure 2). For requests, the 
receiver must return an explicit goahead message after the first 
packet so the sender can tell that the request is wanted. No goahead 
is needed for requests with zero or one enclosures, and none is 
needed for replies, since a reply is always wanted. 


simple case 
    connect  --- request -->  accept 
                              compute 
             <--- reply ----  reply 
multiple enclosures 
    connect  --- request -->  accept 
             <-- goahead --- 
             ---- enc -----> 
             ---- enc -----> 
                              compute 
             <--- reply ----  reply 
             <---- enc ----- 
             <---- enc ----- 


figure 2: link enclosure protocol 
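
The packetizing rule described above can be sketched as follows. The dictionary layout and the name `packetize` are illustrative assumptions; only the division into a first packet, empty enc messages, and an optional goahead comes from the text. 

```python
def packetize(kind, data, enclosures):
    """Split one LYNX message into Charlotte messages, one enclosure each.

    kind is 'request' or 'reply'. Returns (first packet, enc packets,
    whether the receiver must answer the first packet with a goahead).
    """
    first = {
        "type": kind,
        "data": data,                                   # all non-link data
        "enclosure": enclosures[0] if enclosures else None,
    }
    extras = [{"type": "enc", "data": b"", "enclosure": e}
              for e in enclosures[1:]]                  # empty enc messages
    # Only requests can be unwanted, and only multi-packet requests need
    # confirmation before the remaining enclosures are sent.
    needs_goahead = kind == "request" and len(enclosures) > 1
    return first, extras, needs_goahead
```

For example, a request that moves three link ends becomes one request packet and two enc packets, with a goahead from the receiver in between. 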


One consequence of packetizing LYNX messages is that links 
enclosed in unsuccessful messages may be lost. Consider the follow- 
ing chain of events: 


a) Process A sends a request to process B, enclosing the end of a 
link. 

b) B receives the request unintentionally; inspection of the code 
allows one to prove that only replies were wanted. 

c) The sending coroutine in A feels an exception, aborting the 
request. 

d) B crashes before it can send the enclosure back to A in a forbid 
message. 


From the point of view of language semantics, the message to B was 
never sent, yet the enclosure has been lost. Under such circumstances 
the Charlotte implementation cannot conform to the language reference 
manual. 


The Charlotte implementation also disagrees with the 
language definition when a coroutine that is waiting for a reply 
message is aborted by a local exception. On the other end of the link 
the server should feel an exception when it attempts to send a 
no-longer-wanted reply. Such exceptions are not provided under 
Charlotte because they would require a final, top-level acknowledgment 
for reply messages, increasing message traffic by 50%. 


¹ This of course makes it vulnerable to receiving unwanted messages 
itself. 


Measurements 


The language run-time package for Charlotte consists of just 
over 4000 lines of C and 200 lines of VAX assembler, compiling to 
about 21K of object code and data. Of this total, approximately 
45% is devoted to the communication routines that interact with the 
Charlotte kernel, including perhaps 5K for unwanted messages and 
multiple enclosures. Much of this space could be saved with a more 
appropriate kernel interface. 


A simple remote operation (no enclosures) requires 
approximately 57 ms with no data transfer and about 65 ms with 1000 bytes 
of parameters in both directions. C programs that make the same 
series of kernel calls require 55 and 60 ms, respectively. In addition 
to being rather slow, the Charlotte kernel is highly sensitive to the 
ordering of kernel calls and to the interleaving of calls by indepen- 
dent processes. Performance figures should therefore be regarded as 
suggestive, not definitive. The difference in timings between LYNX 
and C programs is due to efforts on the part of the run-time package 
to gather and scatter parameters, block and unblock coroutines, 
establish default exception handlers, enforce flow control, perform 
type checking, update tables for enclosed links, and make sure the 
links are valid. 


The SODA Implementation 


Overview of SODA 


As part of his Ph.D. research [12,13], Jonathan Kepecs set 
out to design a minimal kernel for a multicomputer. His 
“Simplified Operating system for Distributed Applications” might 
better be described as a communications protocol for use on a 
broadcast medium with a very large number of heterogeneous 
nodes. 


Each node on a SODA network consists of two processors: a 
client processor, and an associated kernel processor. The kernel 
processors are all alike. They are connected to the network and 
communicate with their client processors through shared memory 
and interrupts. Nodes are expected to be more numerous than 
processes, so client processors are not multi-programmed. 


Every SODA process has a unique id. It also advertises a col- 
lection of names to which it is willing to respond. There is a kernel 
call to generate new names, unique over space and time. The dis- 
cover kernel call uses unreliable broadcast in an attempt to find a 
process that has advertised a given name. 


Processes do not necessarily send messages, rather they 
request the transfer of data. A process that is interested in 
communication specifies a name, a process id, a small amount of 
out-of-band information, the number of bytes it would like to send and 
the number it is willing to receive. Since either of the last two 
numbers can be zero, a process can request to send data, receive 
data, neither, or both. The four varieties of request are termed put, 
get, signal, and exchange, respectively. 


Processes are informed of interesting events by means of 
software interrupts. Each process establishes a single handler which 
it can close temporarily when it needs to mask out interrupts. A 
process feels a software interrupt when its id and one of its adver- 
tised names are specified in a request from some other process. The 
handler is provided with the id of the requester and the arguments 
of the request, including the out-of-band information. The inter- 
rupted process is free to save the information for future reference. 


At any time, a process can accept a request that was made of 
it at some time in the past. When it does so, the request is com- 
pleted (data is transferred in both directions simultaneously), and 
the requester feels a software interrupt informing it of the 
completion and providing it with a small amount of out-of-band 
information from the accepter. Like the requester, the accepter specifies 
buffer sizes. The amount of data transferred in each direction is the 
smaller of the specified amounts. 
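
The naming of the four request varieties and the rule for transfer sizes can be written down directly. The function names below are hypothetical; the behavior follows the description of SODA given above. 

```python
def request_kind(send_bytes, receive_bytes):
    """Name a SODA request by the two byte counts it specifies."""
    if send_bytes > 0 and receive_bytes > 0:
        return "exchange"
    if send_bytes > 0:
        return "put"
    if receive_bytes > 0:
        return "get"
    return "signal"

def bytes_transferred(req_send, req_receive, acc_send, acc_receive):
    """Amounts moved when a request is accepted, in each direction.

    Each direction carries the smaller of what one side offers and what
    the other side is willing to take.
    Returns (requester to accepter, accepter to requester).
    """
    return min(req_send, acc_receive), min(acc_send, req_receive)
```

Accepting a put with a zero-length buffer therefore transfers no data at all, a property that the LYNX design for SODA relies on when links are destroyed or moved. 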


Completion interrupts are queued when a handler is busy or 
closed. Requests are delayed: the requesting kernel retries periodi- 
cally in an attempt to get through (the requesting user can proceed). 
If a process dies before accepting a request, the requester feels an 
interrupt that informs it of the crash. 


A Different Approach to Links 


A link in SODA can be represented by a pair of unique 
names, one for each end. A process that owns an end of a link 
advertises the associated name. Every process knows the names of 
the link ends it owns. Every process keeps a hint as to the current 
location of the far end of each of its links. The hints can be wrong, 
but are expected to work most of the time. 


A process that wants to send a LYNX message, either a 
request or a reply, initiates a SODA put to the process it thinks is 
on the other end of the link. A process moves link ends by enclos- 
ing their names in a message. When the message is SODA-accepted 
by the receiver, the ends are understood to have moved. Processes 
on the fixed ends of moved links will have incorrect hints. 


A process that wants to receive a LYNX message, either a 
request or a reply, initiates a SODA signal to the process it thinks is 
on the other end of the link. The purpose of the signal is to allow the 
aspiring receiver to tell if its link is destroyed or if its chosen sender 
dies. In the latter case, the receiver will feel an interrupt informing 
it of the crash. In the former case, we require a process that 
destroys a link to accept any previously-posted status signal on its end, 
mentioning the destruction in the out-of-band information. We also 
require it to accept any outstanding put request, but with a 
zero-length buffer, and again mentioning the destruction in the 
out-of-band information. After clearing the signals and puts, the process 
can unadvertise the name of the end and forget that it ever existed. 


Suppose now that process A has a link L to process C and 
that it sends its end to process B. 




If C wants to send or receive on L, but B terminates after receiving 
L from A, then C must be informed of the termination so it knows 
that L has been destroyed. C will have had a SODA request posted 
with A. A must accept this request so that C knows to watch B 
instead. We therefore adopt the rule that a process that moves a 
link end must accept any previously-posted SODA request from the 
other end, just as it must when it destroys the link. It specifies a 
zero-length buffer and uses the out-of-band information to tell the 
other process where it moved its end. In the above example, C will 
re-start its request with B instead of A. 


The amount of work involved in moving a link end is very 
small, since accepting a request does not even block the accepter. 
More than one link can be enclosed in the same message with no 
more difficulty than a single end. If the fixed end of a moving link 
is not in active use, there is no expense involved at all. In the above 
example, if C receives a SODA request from B, it will know that L 
has moved. 


The only real problems occur when an end of a dormant link 
is moved. In our example, if L is first used by C after it is moved, C 
will make a SODA request of A, not B, since its hint is out-of-date. 
There must be a way to fix the hint. If each process keeps a cache 
of links it has known about recently, and keeps the names of those 
links advertised, then A may remember it sent L to B, and can tell 
C where it went. If A has forgotten, C can use the discover com- 
mand in an attempt to find a process that knows about the far end 
of L. 
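
The hint-repair strategy can be sketched as a chain of lookups followed by a broadcast fallback. Everything named here is an assumption made for illustration: `HERE`, `locate_far_end`, the shape of `caches`, and the `discover` callable that stands in for the SODA discover call. 

```python
HERE = object()   # marker: "this process currently owns the end"

def locate_far_end(link, hint, caches, discover):
    """Find the process that owns the far end of `link`, starting at `hint`.

    caches maps a process id to {link name: HERE or the id the end was sent to}.
    discover(link) models the unreliable broadcast search; it returns an id,
    or None if nobody answers (the link is then assumed destroyed).
    """
    seen = set()
    current = hint
    while current is not None and current not in seen:
        seen.add(current)
        entry = caches.get(current, {}).get(link)
        if entry is HERE:
            return current          # hint (possibly corrected) is now good
        if entry is None:
            break                   # this process has forgotten the link
        current = entry             # follow the forwarding information
    return discover(link)
```

The `seen` set only guards the sketch against a cycle of stale cache entries; the text does not say how a real implementation would handle that case. 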


A process that is unable to find the far end of a link must 
assume it has been destroyed. If L exists, the heuristics of caching 
and broadcast should suffice to find it in the vast majority of cases. 
If the failure rate is comparable to that of other “acceptable” errors, 
such as garbled messages with “valid” checksums, then the heuris- 
tics may indeed be all we ever need. 


Without an actual implementation to measure, and without 
reasonable assumptions about the reliability of SODA broadcasts, it 
is impossible to predict the success rate of the heuristics. The 
SODA discover primitive might be especially strained by node 
crashes, since they would tend to precipitate a large number of 
broadcast searches for lost links. If the heuristics failed too often, a 
fall-back mechanism would be needed. 


Several absolute algorithms can be devised for finding 
missing links. Perhaps the simplest looks like this: 


• Every process advertises a freeze name. When C discovers its 
hint for L is bad, it posts a SODA request on the freeze name 
of every process currently in existence (SODA makes it easy to 
guess their ids). It includes the name of L in the request. 


• Each process accepts a freeze request immediately, ceases 
execution of everything but its own searches (if any), increments a 
counter, and posts an unfreeze request with C. If it has a hint 
for L, it includes that hint in the freeze accept or the unfreeze 
request. 


• When C obtains a new hint or has unsuccessfully queried 
everyone, it accepts the unfreeze requests. When a frozen 
process feels an interrupt indicating that its unfreeze request has 
been accepted or that C has crashed, it decrements its counter. 
If the counter hits zero, it continues execution. The existence 
of the counter permits multiple concurrent searches. 


This algorithm has the considerable disadvantage of bringing every 
LYNX process in existence to a temporary halt. On the other hand, 
it is simple, and should only be needed when a node crashes or a 
destroyed link goes unused for so long that everyone has forgotten 
about it. 
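
The role of the counter in the freeze algorithm can be shown with a small single-threaded model. The class and function names are hypothetical, and the model collapses SODA requests, accepts, and interrupts into direct method calls, so it illustrates the bookkeeping only. 

```python
class SearchableProcess:
    """A process that can be frozen by any number of concurrent searches."""

    def __init__(self, hints=None):
        self.hints = dict(hints or {})   # link name -> believed location
        self.freeze_count = 0

    @property
    def running(self):
        return self.freeze_count == 0

    def on_freeze_request(self, link):
        """Accept a freeze request: stop, count it, and report any hint."""
        self.freeze_count += 1
        return self.hints.get(link)

    def on_unfreeze_accepted(self):
        """The searcher accepted our unfreeze request (or has crashed)."""
        self.freeze_count -= 1

def search(link, processes):
    """Freeze everyone, collect a hint for `link` if any, then release all."""
    found = None
    for p in processes:
        hint = p.on_freeze_request(link)
        if hint is not None and found is None:
            found = hint
    for p in processes:
        p.on_unfreeze_accepted()
    return found
```

A process that is frozen by two searches at once resumes only after both searchers have released it, which is why a simple boolean would not suffice. 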


Potential Problems. As mentioned in the introduction, 
the SODA version of LYNX was designed on paper only. An 
actual implementation would need to address a number of potential 
problems. To begin with, SODA places a small, but unspecified, 
limit on the size of the out-of-band information for request and 
accept. If all the self-descriptive information included in messages 
under Charlotte were to be provided out-of-band, a minimum of 
about 48 bits would be needed. With fewer bits available, some 
information would have to be included in the messages themselves, 
as in Charlotte. 


A second potential problem with SODA involves another 
unspecified constant: the permissible number of outstanding 
requests between a given pair of processes. The implementation 
described in the previous section would work easily if the limit were 
large enough to accommodate three requests for every link between 
the processes (a LYNX-request put, a LYNX-reply put, and a status 
signal). Since reply messages are always wanted (or can at least be 
discarded if unwanted), the implementation could make do with two 
outstanding requests per link and a single extra for replies. Too 
small a limit on outstanding requests would leave the possibility of 
deadlock when many links connect the same pair of processes. In 
practice, a limit of a half a dozen or so is unlikely to be exceeded (it 
implies an improbable concentration of simultaneously-active 
resources in a single process), but there is no way to reflect the limit 
to the user in a semantically-meaningful way. Correctness would 
start to depend on global characteristics of the process 
interconnection graph. 
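
The two budgets for outstanding requests are simple to compute. The function below is an illustrative sketch; its name and its `shared_reply_slot` parameter are assumptions, and the counts come from the paragraph above. 

```python
def outstanding_requests_needed(links, shared_reply_slot=False):
    """SODA requests one process may have outstanding with one peer.

    Without sharing: a LYNX-request put, a LYNX-reply put, and a status
    signal for every link. With sharing: two per link plus one extra
    request that is used for replies.
    """
    if links == 0:
        return 0
    if shared_reply_slot:
        return 2 * links + 1
    return 3 * links
```

With two links between the same pair of processes, for instance, the simple scheme needs six outstanding requests and the shared-reply scheme needs five. 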


Predicted Measurements 


Space requirements for run-time support under SODA would 
reflect the lack of special cases for handling unwanted messages and 
multiple enclosures. Given the amount of code devoted to such 
problems in the Charlotte implementation, it seems reasonable to 
expect a savings on the order of 4K bytes. 


For simple messages, run-time routines under SODA would 
need to perform most of the same functions as their counterparts for 
Charlotte. Preliminary results with the Butterfly implementation 
(described in the following section) suggest that the lack of special 
cases might save some time in conditional branches and subroutine 
calls, but relatively major differences in run-time package overhead 
appear to be unlikely. 


Overall performance, including kernel overhead, is harder to 
predict. Charlotte has a considerable hardware advantage: the only 
implementation of SODA ran on a collection of PDP-11/23’s with a 
1-Mbit/second CSMA bus. SODA, on the other hand, was 
designed with speed in mind. Experimental figures reveal that for 
small messages SODA was three times as fast as Charlotte.² Charlotte 
programmers made a deliberate decision to sacrifice efficiency 
in order to keep the project manageable. A SODA version of 
LYNX might well be intrinsically faster than a comparable version 
for Charlotte. 


The Chrysalis Implementation 


Overview of Chrysalis 


The BBN Butterfly Parallel Processor [4] is a 68000-based 
shared-memory multiprocessor. The Chrysalis operating system 
provides primitives, many of them in microcode, for the manage- 
ment of system abstractions. Among these abstractions are 
processes, memory objects, event blocks, and dual queues. 


Each process runs in an address space that can span as many 
as one or two hundred memory objects. Each memory object can 
be mapped into the address spaces of an arbitrary number of 
processes. Synchronization of access to shared memory is achieved 
through use of the event blocks and dual queues. 


An event block is similar to a binary semaphore, except that 
1) a 32-bit datum can be provided to the V operation, to be 
returned by a subsequent P, and 2) only the owner of an event 
block can wait for the event to be posted. Any process that knows 
the name of the event can perform the post operation. The most 
common use of event blocks is in conjunction with dual queues. 


² The difference is less dramatic for larger messages; SODA's slow 
network exacted a heavy toll. The figures break even somewhere between 
1K and 2K bytes. 


A dual queue is so named because of its ability to hold either 
data or event block names. A queue containing data is a simple 
bounded buffer, and enqueue and dequeue operations proceed as 
one would expect. Once a queue becomes empty, however, 
subsequent dequeue operations actually enqueue event block names, on 
which the calling processes can wait. An enqueue operation on a 
queue containing event block names actually posts a queued event 
instead of adding its datum to the queue. 
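
The two modes of a dual queue can be modeled in a few lines. This is a single-threaded sketch, not the Chrysalis interface: the class and method names are assumptions, the `post` callback stands in for posting an event block, and the capacity bound of the real queue is omitted. 

```python
from collections import deque

class DualQueue:
    """Toy model of a dual queue: holds data or event block names."""

    def __init__(self):
        self.items = deque()
        self.holds_events = False    # True while the items are waiter names

    def dequeue(self, event_block):
        """Return a datum, or None after queueing `event_block` as a waiter."""
        if self.items and not self.holds_events:
            return self.items.popleft()
        self.items.append(event_block)
        self.holds_events = True
        return None

    def enqueue(self, datum, post):
        """Store a datum, or hand it to the oldest waiter via post(name, datum)."""
        if self.items and self.holds_events:
            waiter = self.items.popleft()
            if not self.items:
                self.holds_events = False   # queue is empty again
            post(waiter, datum)
        else:
            self.items.append(datum)
```

A caller that receives None from `dequeue` would then wait on its own event block; the datum arrives as the value supplied to the post operation. 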


A Third Approach to Links 


In the Butterfly implementation of LYNX, every process 
allocates a single dual queue and event block through which to receive 
notifications of messages sent and received. A link is represented by 
a memory object, mapped into the address spaces of the two 
connected processes. The memory object contains buffer space for a 
single request and a single reply in each direction. It also contains a 
set of flag bits and the names of the dual queues for the processes at 
each end of the link. When a process gathers a message into a 
buffer or scatters a message out of a buffer into local variables, it 
sets a flag in the link object (atomically) and then enqueues a notice 
of its activity on the dual queue for the process at the other end of 
the link. When the process reaches a block point it attempts to 
dequeue a notice from its own dual queue, waiting if the queue is 
empty. 


As in the SODA implementation, link movement relies on a 
system of hints. Both the dual queue names in link objects and the 
notices on the dual queues themselves are considered to be hints. 
Absolute information about which link ends belong to which 
processes is known only to the owners of the ends. Absolute infor- 
mation about the availability of messages in buffers is contained 
only in the link object flags. Whenever a process dequeues a notice 
from its dual queue it checks to see that it owns the mentioned link 
end and that the appropriate flag is set in the corresponding object. 
If either check fails, the notice is discarded. Every change to a flag 
is eventually reflected by a notice on the appropriate dual queue, 
but not every dual queue notice reflects a change to a flag. A link is 
moved by passing the (address-space-independent) name of its 
memory object in a message. When the message is received, the 
sending process removes the memory object from its address space. 
The receiving process maps the object into its address space, 
changes the information in the object to name its own dual queue, 
and then inspects the flags. It enqueues notices on its own dual 
queue for any of the flags that are set. 
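
The treatment of notices as hints can be sketched as follows. The data layout is an assumption made for illustration: notices are (link, flag) pairs, `flags` maps each link to the set of flag names currently set in its link object, and `link_objects` records only the dual queue name for the moved end. 

```python
def notice_is_valid(notice, owned_links, flags):
    """Check a dequeued notice against the absolute information.

    A notice is kept only if this process owns the mentioned link end
    and the corresponding flag is really set in the link object.
    """
    link, flag = notice
    return link in owned_links and flag in flags.get(link, set())

def adopt_link(link, my_queue_name, link_objects, flags, my_notices):
    """Receiving side of a link move, in the order given in the text."""
    link_objects[link] = my_queue_name            # name our own dual queue
    for flag in sorted(flags.get(link, set())):   # then inspect the flags
        my_notices.append((link, flag))           # and re-create the notices
```

Because the queue name is written before the flags are inspected, a flag set by the far end at any point is covered either by a notice sent to the new queue or by the inspection itself. 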


Primitives provided by Chrysalis make atomic changes to 
flags extremely inexpensive. Atomic changes to quantities larger 
than 16 bits (including dual queue names) are relatively costly. The 
recipient of a moved link therefore writes the name of its dual 
queue into the new memory object in a non-atomic fashion. It is 
possible that the process at the non-moving end of the link will read 
an invalid name, but only after setting flags. Since the recipient 
completes its update of the dual-queue name before inspecting the 
flags, changes are never overlooked. 


Chrysalis keeps a reference count for each memory object. 
To destroy a link, the process at either end sets a flag bit in the link 
object, enqueues a notice on the dual queue for the process at the 
other end, unmaps the link object from its address space, and 
informs Chrysalis that the object can be deallocated when its refer- 
ence count reaches zero. When the process at the far end dequeues 
the destruction notice from its dual queue, it confirms the notice by
checking it against the appropriate flag and then unmaps the link 
object. At this point Chrysalis notices that the reference count has 
reached zero, and the object is reclaimed. 
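
The destruction sequence can be sketched as follows. This is only an illustration under stated assumptions: MemoryObject, destroy_link and confirm_destruction are hypothetical names, not Chrysalis primitives, and the reference count is modeled as a plain integer.

```python
class MemoryObject:
    """Stand-in for a memory object holding one link (hypothetical)."""

    def __init__(self):
        self.refcount = 0
        self.flags = set()
        self.deallocate_at_zero = False
        self.reclaimed = False

    def _maybe_reclaim(self):
        # The object is reclaimed once nobody maps it and
        # deallocation has been requested.
        if self.refcount == 0 and self.deallocate_at_zero:
            self.reclaimed = True

    def map(self):
        self.refcount += 1

    def unmap(self):
        self.refcount -= 1
        self._maybe_reclaim()

    def request_deallocation(self):
        self.deallocate_at_zero = True
        self._maybe_reclaim()


def destroy_link(link, peer_queue, link_id):
    """Initiating end: set flag, notify peer, unmap, request deallocation."""
    link.flags.add("destroyed")
    peer_queue.append((link_id, "destroyed"))
    link.unmap()
    link.request_deallocation()


def confirm_destruction(link, notice_flag):
    """Far end: confirm the notice against the flag, then unmap."""
    if notice_flag in link.flags:
        link.unmap()
        return True
    return False
```

With both ends mapped, the object survives the first unmap and is reclaimed only after the far end has confirmed the notice and unmapped its own copy.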


Before terminating, each process destroys all of its links.
Chrysalis allows a process to catch all exceptional conditions that 
might cause premature termination, including memory protection 
faults, so even erroneous processes can clean up their links before 
going away. Processor failures are currently not detected. 


Preliminary Measurements


The Chrysalis implementation of LYNX has only recently 
become available. It consists of approximately 3600 lines of C and
200 lines of assembler, compiling to 15 or 16K bytes of object code
and data on the 68000. Both measures are appreciably smaller than
the respective figures for the Charlotte implementation. 


Message transmission times are also faster on the Butterfly, 
by more than an order of magnitude. Recent tests indicate that a 
simple remote operation requires about 2.4 ms with no data transfer 
and about 4.6 ms with 1000 bytes of parameters in both directions. 
Code tuning and protocol optimizations now under development are 
likely to improve both figures by 30 to 40%. 
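
As a rough reading of these figures, the short calculation below derives an approximate per-byte cost and the projected timings. It rests on two assumptions that the text does not state explicitly: that "1000 bytes in both directions" means 2000 bytes moved in total, and that an improvement of 30 to 40% means a corresponding reduction in elapsed time.

```python
no_data_ms = 2.4      # simple remote operation, no data transfer
with_data_ms = 4.6    # same operation with 1000 bytes each way
bytes_moved = 2 * 1000  # assumption: 1000 bytes in each direction

# Marginal cost of moving one byte, in microseconds.
per_byte_us = (with_data_ms - no_data_ms) / bytes_moved * 1000

# Projected times if tuning removes 30% to 40% of the elapsed time.
projected_no_data = (no_data_ms * 0.6, no_data_ms * 0.7)
projected_with_data = (with_data_ms * 0.6, with_data_ms * 0.7)
```

Under these assumptions the marginal cost is about 1.1 microseconds per byte, and the projected ranges are roughly 1.4 to 1.7 ms and 2.8 to 3.2 ms.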


Discussion 


Even though the Charlotte kernel provides a higher-level 
interface than does either SODA or Chrysalis, and even though the 
communication mechanisms of LYNX were patterned in large part 
on the primitives provided by Charlotte, the implementations of
LYNX for the latter two systems are smaller, simpler, and faster. 
Some of the difference can be attributed to duplication of effort 
between the kernel and the language run-time package. Such
duplication is the usual target of so-called end-to-end arguments [18].
Among other things, end-to-end arguments observe that each level 
of a layered software system can only eliminate errors that can be 
described in the context of the interface to the level above. Overall 
reliability must be ensured at the application level. Since
end-to-end checks generally catch all errors, low-level checks are
redundant. They are justified only if errors occur frequently enough to
make early detection essential.


LYNX routines never pass Charlotte an invalid link end.
They never specify an impossible buffer address or length. They
never try to send on a moving end or enclose an end on itself. To a
certain extent they provide their own top-level acknowledgments, in
the form of goahead, retry, and forbid messages, and in the
confirmation of operation names and types implied by a reply
message. They would provide additional acknowledgments for the
replies themselves if they were not so expensive. For the users of
LYNX, Charlotte wastes time by checking these things itself.

Duplication alone, however, cannot account for the wide
disparity in complexity and efficiency between the three LYNX
implementations. Most of the differences appear to be due to the
difficulty of adapting higher-level Charlotte primitives to the needs
of an application for which they are almost, but not quite, correct.
In comparison to Charlotte, the language run-time packages for
SODA and Chrysalis can


(1) move more than one link in a message
(2) be sure that all received messages are wanted
(3) recover the enclosures in aborted messages
(4) detect all the exceptional conditions described in the language
    definition, without any extra acknowledgments.


These advantages obtain precisely because the facilities for manag-
ing virtual circuits and for screening incoming messages are not pro-
vided by the kernel. By moving these functions into the language
run-time package, SODA and Chrysalis allow the implementation to
be tuned specifically to LYNX. In addition, by maintaining the
flexibility of the kernel interface they permit equally efficient
implementations of a wide variety of other distributed languages,
with entirely different needs.


It should be emphasized that Charlotte was not originally
intended to support a distributed programming language. Like the
designers of most similar systems, the Charlotte group expected
applications to be written directly on top of the kernel. Without the
benefits of a high-level language, most programmers probably would
prefer the comparatively powerful facilities of Charlotte to the com-
paratively primitive facilities of SODA or Chrysalis. With a
language, however, the level of abstraction of underlying software is
no longer of concern to the average programmer.


For the consideration of designers of future languages and 
systems, we can cast our experience with LYNX in the form of the 
following three lessons: 


Lesson one: Hints can be better than absolutes. 
The maintenance of consistent, up-to-date, distributed informa- 
tion is often more trouble than it is worth. It can be consider- 
ably easier to rely on a system of hints, so long as they usually 
work, and so long as we can tell when they fail. 


The Charlotte kernel admits that a link end has been moved 
only when all three parties agree. The protocol for obtaining 
such agreement was a major source of problems in the kernel, 
particularly in the presence of failures and simultaneously- 
moving ends [3]. The implementation of links on top of SODA 
and Chrysalis was comparatively easy. It is likely that the 
Charlotte kernel itself would be simplified considerably by 
using hints when moving links. 


Lesson two: Screening belongs in the application layer. 

Every reliable protocol needs top-level acknowledgments. A 
distributed operating system can attempt to circumvent this 
rule by allowing a user program to describe in advance the 
sorts of messages it would be willing to acknowledge if they
arrived. The kernel can then issue acknowledgments on the 
user’s behalf. The shortcut only works if failures do not occur 
between the user and the kernel, and if the descriptive facilities 
in the kernel interface are sufficiently rich to specify precisely 
which messages are wanted. In LYNX, the termination of a 
coroutine that was waiting for a reply can be considered to be a 
“failure” between the user and the kernel. More important, 
the descriptive mechanisms of Charlotte are unable to
distinguish between requests and replies on the same link.


SODA provides a very general mechanism for screening mes-
sages. Instead of asking the user to describe its screening func-
tion, SODA allows it to provide that function itself. In effect,
it replaces a static description of desired messages with a for-
mal subroutine that can be called when a message arrives.
Chrysalis provides no messages at all, but its shared-memory
operations can be used to build whatever style of screening is
desired.
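
The contrast between a static description and a user-supplied screening function can be sketched as below. This is an illustration only: the Message type and both screening helpers are hypothetical and do not reproduce the Charlotte or SODA interfaces. The static filter can name links but not message kinds, whereas the dynamic one consults run-time state.

```python
from typing import NamedTuple


class Message(NamedTuple):
    link: int
    kind: str   # "request" or "reply"
    body: str


def static_screen(open_links):
    """Kernel-side description: can name links, but not message kinds."""
    return lambda msg: msg.link in open_links


def make_dynamic_screen(accepting_requests, awaited_replies):
    """User-supplied screening function with access to run-time state."""
    def screen(msg):
        if msg.kind == "request":
            return msg.link in accepting_requests
        if msg.kind == "reply":
            return msg.link in awaited_replies
        return False
    return screen


def deliver(messages, screen):
    """Return only the messages the screening function accepts."""
    return [m for m in messages if screen(m)]


incoming = [
    Message(1, "request", "op_a"),
    Message(1, "reply", "result"),
    Message(2, "reply", "stale"),
]
# Link 1 awaits a reply but is not accepting requests; link 2 awaits nothing.
wanted = deliver(incoming, make_dynamic_screen(set(), {1}))
unscreened = deliver(incoming, static_screen({1}))
```

In this example the static description lets the unwanted request on link 1 through, while the user-supplied function admits only the awaited reply.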


Lesson three: Simple primitives are best. 

From the point of view of the language implementor, the
“ideal operating system” probably lies at one of two extremes: 
it either provides everything the language needs, or else pro- 
vides almost nothing, but in a flexible and efficient form. A 
kernel that provides some of what the language needs, but not 
all, is likely to be both awkward and slow: awkward because it 
has sacrificed the flexibility of the more primitive system, slow 
because it has sacrificed its simplicity. Clearly, Charlotte could 
be modified to support all that LYNX requires. The changes, 
however, would not be trivial. Moreover, they would probably 
make Charlotte significantly larger and slower, and would 
undoubtedly leave out something that some other language 
would want. 


A high-level interface is only useful to those applications for
which its abstractions are appropriate. An application that
requires only a subset of the features provided by an underly-
ing layer of software must generally pay for the whole set any-
way. An application that requires features hidden by an under-
lying layer may be difficult or impossible to build. For
general-purpose computing a distributed operating system must
support a wide variety of languages and applications. In such
an environment the kernel interface will need to be relatively
primitive.


Abstract


DRAGON SLAYER is a distributed operating system 
for microprocessor based networks. It is currently 
implemented on a Carrier Sense Multiple Access 
(CSMA) network of 16 Leading Edge PCs. It provides 
message passing and uses a standard client/server 
model. A unique aspect of the system is the lack of 
any global process existence, global system kernel or 
central authority. In this paper, we explain our use of 
broadcast messaging and our mechanism for consistent 
and unique process naming. We achieve fair resource 
scheduling without any assumption on global
management or control using a novel distributed 
algorithm derived from constructions using the Theory 
of Interaction Systems. Further, we address the 
usefulness of DRAGON SLAYER for dealing with 
data integrity in distributed data bases, for distributed 
vision processing applications, and for providing 
cheap, efficient, robust solutions for real-time process 
control problems in automatic manufacturing 
(factory-of-the-future).


Introduction 


An increasing number of operating systems for 
distributed applications on microprocessor networks 
have been developed during the past five or six years. 
If one considers as typical and established examples 
the Stanford V-Kernel [6,7] or the INRIA CHORUS 
[1,2,9] systems, one finds that in order to achieve a 
high modularity and flexibility at design and run time, 
all communication is based on message passing 
mechanisms. These systems create and maintain global 
name tables at every site through a remainder central 
service, a global kernel. Such an architecture makes
sense in a local area network, with a central node 
responsible for message handling, name tables, process 
scheduling, managing large data files, etc., and with a 
standard connection (Ethernet). In a network with peer 
authority at all nodes, there should be no asymmetrical 
solution to handling control and management 
problems. Asymmetric solutions can cause undesirable 
message overheads in view of envisioned large 
distances between nodes of future distributed systems. 
Also, systems built on a master-slave relationship are 
vulnerable to failure since a master breakdown 
immediately halts all communication. 


With DRAGON SLAYER, a successful attempt has 
been made to develop and implement a distributed 
operating system without any form of centralized 
service or global kernel. We are not aware of any other 
system of this type. In DRAGON SLAYER, resource 
management has been distributed to the process level, 
requiring that we solve the general conceptual problem 
of fair distributed resource scheduling. Since 
resources or server processes are normally available at 
several sites, we ensure that a set of resources
requested by a process will eventually be assigned so 
long as such resources are available anywhere in the 
system. This restriction holds even if server processes 
initially dedicated to providing the requested resources 
fail (e.g. through node or link communication failures). 
Our system is therefore extremely robust. Robustness 
and peer authority at all nodes were major design goals 
for DRAGON SLAYER. In our paper, we describe the 
main ideas and structures of this development. 


After explaining the hardware configuration for our 
current implementation, we mention our integration of 
available software (MS-DOS) and the principles 
behind our system construction. We discuss the 
particularities of our message passing mechanism, 
broadcast messages and distributed naming scheme. A 
detailed outline is given of our distributed process 
scheduling and resource allocation handling algorithm. 


In order to construct the needed distributed algorithm, 
we made use of formal results of the Theory of 
Interaction Systems, enhancing a deadlock-free, but 
not necessarily fair, algorithm by J. Winkowski [19] 
which was already in a suitable format and which was 
the first algorithm presented for this purpose which 
placed no unrealistic restrictions on the behavior of the 
processes involved. This algorithm and the formal 
background mentioned are specified in an appendix. 
Finally, the conceptual and developmental 
achievements of the DRAGON SLAYER operating 
system are discussed in light of existing solutions and 
of similar conceptual results. We point out the 
advantages of completely distributed operating 
systems like DRAGON SLAYER for future distributed 
applications in real-time processing 
(factory-of-the-future). We also discuss how
DRAGON SLAYER provides for cheap, efficient, 
robust implementations of distributed algorithms for 
canonical vision tasks. 


1. The Hardware Environment 


DRAGON SLAYER was targeted for use on a network 
of Intel 8088 or 8086 based systems. It currently runs 
on a Carrier Sense Multiple Access (CSMA) network 
of 16 Leading Edge Personal Computer systems. 
These systems each contain 512K of local memory, a 
memory mapped character display and two 320k byte 
floppy disk drives (see figure 1). The systems are 
interconnected with a 1 Megabit per second network. 
Two of the systems have printers connected to them,
two of the systems are connected via 9600 baud lines
to the university mainframe host and one of the 
systems has an additional 10 megabyte disk drive (see 
figure 2). None of the individual systems is truly 
dedicated to any particular function, and the operating 
system kernel resident on any given system has not 
been tailored to augment particular functions. 
However, the systems to which the printers are 
connected must function as print servers at some time, 
the systems connected to the mainframe host must 
serve as gateways and the system connected to the 
Winchester disk must serve as a file server. Our
intention has been to allow each system to fulfill its 
service function without requiring dedication to that 
particular function. 


2. The Software Environment 


The DRAGON SLAYER system functions within and 
augments the operation of the MS-DOS or PC-DOS 
operating system, a single user operating system not 
designed to allow process communication, multiprocess 
execution, multitasking, concurrency or parallel 
program execution. Of chief concern during the design 
and construction of DRAGON SLAYER has been the 
philosophical imperative that MS-DOS functions
would be unchanged and available to user processes. 
We have deviated from that philosophy as little as 
possible. Maintaining the MS-DOS functions has 
allowed development of applications under existing 
MS-DOS program development tools. Most user 
programs that currently can run under MS-DOS
continue to run under DRAGON SLAYER’s guidance. 
No disk or other hardware support facilities, other than 
network message facilities have been added. One clear 
advantage to using this approach was our ability to add 
a few procedures to TURBO PASCAL, a commercially 
available version of PASCAL for Personal Computer 
systems, and create a usable multiprogramming 
language in which to develop our application base. 


3. Principles of System Construction 


The DRAGON SLAYER System is based on the 
client/server model which is common to many of 
today’s distributed systems [1,2]. System resources
are provided to the client processes through server 
processes while the operating system attempts to 
ensure fairness of resource allocation and efficiency of 
operation. When a client process wishes to use a 
particular system resource it must request use of that
resource from the resource server and then become a
client of the resource server (see figure 3). The use of
self scheduling servers without global control of
resources introduces a conceptual problem of
synchronization and fair resource scheduling that is not
present in systems which feature centralized scheduling
strategies. Our system requires that processes follow a
strict model for resource request and use.


Figure 3.


The resource client is insulated from the details of
handling a particular resource by the server which
presents a well defined interface, an abstract or virtual
resource, to the client process. In the DRAGON
SLAYER system, the client and server are further
insulated by the necessity of communicating through
use of the message passing primitives provided by the
operating system. All server interfaces are defined in
terms of the messages sent to the server by client 
processes. 


The DRAGON SLAYER system has been developed 
without any form of central control. As a result, client 
and server processes act with a maximum amount of 
process autonomy and a large share of the 
responsibility for allowing the system to function 
properly. As in other message passing environments, 
DRAGON SLAYER processes compete for access to
resources which are available only for one of them at a
time, and can be considered as members of
temporary groups of processes requesting overlapping
sets of resources. However, we have no main process
or group manager, and the group members do not have
information about other group members other than
information available from the resources over which
peers may compete. The sequence of activities through
which a process accesses a mutually exclusive set of
resources needed for a particular process step or
critical section is as follows:
1. Establish a list of servers on the system.
2. Request commitment from servers in the resource list.
3. Initiate processing, once the resources/services are available.
4. Return resources and
5. Free the servers.


Once a process reaches step five, it may either
terminate existence or return to step two and initiate 
another processing cycle in which it utilizes a new set 
of resources. The main process may look for new 
resource servers to add to its list at any time. 
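
The five-step cycle can be sketched as a small simulation. The Server class and run_cycle function below are hypothetical names introduced for illustration, and the sketch simply backs off when a server is busy; it does not implement the fair competition algorithm that DRAGON SLAYER uses to resolve such conflicts.

```python
class Server:
    """A resource server that serves at most one client at a time."""

    def __init__(self, name, resource):
        self.name = name
        self.resource = resource
        self.client = None

    def commit(self, client):
        if self.client is None:
            self.client = client
            return True
        return False

    def free(self, client):
        if self.client == client:
            self.client = None


def run_cycle(client, servers, needed, work):
    # Step 1: establish a list of servers on the system.
    known = {s.resource: s for s in servers}
    # Step 2: request commitment from the servers in the resource list.
    committed = []
    for resource in needed:
        server = known.get(resource)
        if server is None or not server.commit(client):
            # Assumption: give up and release what was obtained so far.
            for held in committed:
                held.free(client)
            return None
        committed.append(server)
    # Step 3: initiate processing once all resources are available.
    result = work([s.resource for s in committed])
    # Steps 4 and 5: return the resources and free the servers.
    for held in committed:
        held.free(client)
    return result
```

A process that returns to step two would call run_cycle again, possibly with a different list of needed resources.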


4. Message Passing 


The DRAGON SLAYER operating system incorporates 
novel concepts in process addressing and organization. 
It is a message passing operating system based on a 
standard client/server model. The sending mechanism 
is displayed in figure 4. Unique aspects of this work 
include the lack of use of any global process existence, 
name or address table, and the use of only 
asynchronous non-blocking messages. Each processor 
on the network has a unique name assigned in 
hardware on the network interface board. Each active, 
memory resident process has a unique name composed 
of the name of the processor on which it resides and 
the segment-offset address of the first byte of 
executable code within the process. (In this way 
unique process naming is an indirect product of a 
process coming into existence at a node.) The 
mechanism utilized to locate a process is the universal 
broadcast primitive of the message passing function. 
This primitive replicates a transmitted message to all 
processes throughout the network system. As every
process instantiation will receive a copy of such a 
message, locating the process becomes the act of 
sending an “are you there” message and receiving a 
“yes I am” reply. Once two processes have exchanged
addresses they can communicate directly. The lack of
any centralized numbering or naming convention
allows the user complete flexibility in the broadcast
messages which he wishes to use and to which he
wishes to respond. (Obviously, if a user wishes to
communicate with a printer he will have to obey the
semantics of the messages of printer servers.)


Figure 4. Interprocess Communication


Message passing and interprocess communication is handled
through the functions of message buffer request, 
message buffer free, message send and message 
retrieve. Message buffer request is used to obtain a 
free message buffer from the operating system. This 
buffer may be used to transport information from one 
process to another. Message buffer free is used to 
return a used buffer to the operating system. Message 
send is used to request that a message buffer be 
transported from the current creating process to the 
process or processes identified as recipients of that 
message. Message buffer retrieve is used by a process 
to examine any message buffers that have been sent to 
it. Messages consist of a destination address, sender’s 
address, control information and text information. 
High level interface functions remove the possibility of 
loss of message buffers by providing composite 
functions that properly integrate requests for, and 
release of, buffers. Message buffers are maintained in 
independent local pools. Each node maintains its local 
pool. 


The sender’s address is always the three part segment 
number / offset number / processor number address 
of the sending process. The destination address is 
either the specific segment / offset / processor address 
of the desired process or may be specified as a 
broadcast message to be delivered to all processes on 
all systems. 
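

The naming scheme and the broadcast-based location of a process can be sketched as follows. The Address, Process, Network and locate names are hypothetical, the wording of the query text is invented for the example, and delivery is simulated synchronously, whereas DRAGON SLAYER itself uses only asynchronous non-blocking messages.

```python
from typing import NamedTuple


class Address(NamedTuple):
    """Three-part process name: segment / offset / processor."""
    segment: int
    offset: int
    processor: int


BROADCAST = None  # destination value meaning "all processes on all systems"


class Process:
    def __init__(self, address, service):
        self.address = address
        self.service = service

    def receive(self, sender, text):
        # Answer only queries for our own service; ignore everything else.
        if text == "are you there: " + self.service:
            return "yes I am"
        return None


class Network:
    def __init__(self):
        self.processes = []

    def send(self, sender, destination, text):
        """Deliver to one address or, for BROADCAST, to every process."""
        replies = []
        for proc in self.processes:
            if destination is BROADCAST or proc.address == destination:
                answer = proc.receive(sender, text)
                if answer is not None:
                    replies.append((proc.address, answer))
        return replies


def locate(network, sender, service):
    """Broadcast an "are you there" query and collect responders' addresses."""
    replies = network.send(sender, BROADCAST, "are you there: " + service)
    return [addr for addr, answer in replies if answer == "yes I am"]
```

Once locate has returned an address, later messages can name that address directly, so no node has to keep a table of resident processes.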


5. Process Scheduling 


Each processor is engaged in scheduling the execution 
of the processes that are locally memory resident. 
Ready-to-run processes are scheduled for execution 
in a round-robin order. A process is deemed ready to 
run if its timer has been decremented to zero or if it 
has a message pending. A key problem in process 
scheduling is fairness, sometimes called 
starvation-freeness. Requested resources must be 
available after some finite time. Because a server can 
join more than one group simultaneously it is possible 
that processes in each of two groups may 
simultaneously attempt to utilize their shared 
resources. We achieve fairness in deciding conflicts 
between clients using a distributed algorithm which can 
be outlined as follows: 


For entering one of its critical sections, say cs(I), a
process P must proceed through the sequence of
activities mentioned in section 3. Communication,
broadcast or direct, would have to occur in these
activity sections. For short, we call "establish a list of
servers on the system" rg ("registration"). The types of
resources needed in order to execute cs(I) would be
R(1),...,R(n), and cl(1), cl(2),...,cl(n) will be names for the
phases in which the servers providing access to
R(1),...,R(n), respectively, would have been freed.
Every section different from the critical one is
accessible. The critical section is free for access only
if none of the servers in the servers’ list is serving
another process. The remainder section (with respect
to cs(I)) is everything outside the activity sequence used
to attach the particular set of resources needed in cs(I)
(see section 3).


The behavior of process P in order to enter and execute 
a critical section is given by the flow chart scheme in 
figure 5 which explains the structure for making a 
transition from one of the sections specified above, 
into the next. As part of the algorithm processes will 
reach a state of competition in which they play a 
distributed game which is developed from an algorithm 
by J. Winkowski [19] and which is deadlock-free in the
sense that one of the players wins after a finite time. 
This game is not fair. Fairness is achieved as follows: 
The servers/resource managers are responsible for 
communication about resources. In particular, they are 
responsible for communication with or among 
competing processes. In phase cl(i), the given process 
P is subject to a locking influence from a neighbor 
competing with P about R(i) if this neighbor were in rg. 
In rg, P exerts a similar influence on its neighbors 
when these have just released R(i). While P has not yet 
left cs(I) it is subject to a temporarily locking influence
from any neighbor being in a state where it has 
released the shared resource but has not yet entered its 
remainder section. The corresponding communication 
is performed through the resource managers involved 
in the particular competition. Thus, it is really the 
partial knowledge of these managers alone on which 
the progress of the involved processes depends. Every
process proceeds according to the scheme given in 
figure 5, and all servers react to the best of their 
knowledge. In this way, we guarantee that P will, after 
a finite time, be able to enter its critical section cs(I).


The process and communication mechanism displayed 
in figure 5 for one process P, for entering one of its 
critical sections cs(I), looks fairly straightforward
because it is supposed only to give an idea about the 
two separate steps of communication needed for P to 
leave one section and enter the next, respectively. As 
usual for mutual exclusion problems, the correctness 
proof is not straightforward, even in the special case of 
only one critical section per process. It is not easy to 
see that no process can take advantage of a neighbor 
through a locking influence. The scheme in figure 5 is 
the same for cases of multiple critical sections with
varying sets and kinds of requested resources. In order 
to verify this modularity of our design approach, it has 
been proven [14] that the mechanisms implemented 
for each critical section do not “interfere”. More 
technical details about the implemented general 
algorithm and the game can be found in the appendix.
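

The local scheduling rule stated at the start of this section (round-robin order over processes that are ready because their timer has reached zero or a message is pending) can be sketched as below. The Proc class and the helper functions are hypothetical, and the sketch covers only the per-processor rule, not the distributed fairness algorithm.

```python
from collections import deque


class Proc:
    def __init__(self, name, timer=0):
        self.name = name
        self.timer = timer      # ticks remaining until the process is ready
        self.pending = deque()  # messages waiting to be retrieved

    def ready(self):
        # Ready if the timer has run down to zero or a message is pending.
        return self.timer == 0 or bool(self.pending)


def tick(procs):
    """Decrement every non-zero timer by one."""
    for p in procs:
        if p.timer > 0:
            p.timer -= 1


def next_ready(procs, start):
    """Index of the first ready process at or after start, wrapping around."""
    n = len(procs)
    for i in range(n):
        idx = (start + i) % n
        if procs[idx].ready():
            return idx
    return None
```

A scheduler loop would call next_ready with the index following the process that ran last, so that every ready process gets a turn before any process runs twice.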


6. Discussion and Future Work 


We have briefly described some main features of the 
DRAGON SLAYER operating system. It allows fully 
distributed control of interacting and competing 
processes, without central control and without use of 
distributed relics of central control such as distributed 
name tables, group identifiers or multigroup 
identifiers. Through implementation, we have 
demonstrated the viability of our construction. The 
DRAGON SLAYER design principles feature novel 
aspects. Under completely decentralized control we use 
broadcast messages as universal communication 
primitives, and for resource scheduling we 
implemented a general and fair distributed algorithm. 


Broadcast messaging is a necessary function to avoid 
use of central control for establishment of direct 
communication between processes. Rather than 
multicast, broadcast messaging was incorporated in 
DRAGON SLAYER and provides a means for any 
process to communicate with all other processes 


Figure 5. Critical Section entrance behavior of Process P.


filter. In a single broadcast network like ours, the
network load for broadcast is the same as for multicast 
messaging. Under our scheme, the operating system 
need not maintain any tables with information 
regarding resident processes. Admittedly, the use of 
broadcast messages requires that all processes be 
awakened so that they may actively ignore broadcast 
messages. On the other hand, the use of multicast 
messages requires that the operating system be aware 
of all local elements of all multicast groups introducing 
a kind of centralized control. We have avoided the use 
of such distributed lists in our system. 


A major focus of our current experimentation is on
distributed vision applications: distributed scene
analysis with moving objects to be located and
identified, as in the Autonomous Land Vehicle
project (DoD) or in factory-of-the-future
environments. In such distributed applications, the
amount of local computation is much higher than the
amount of communication. Since broadcast messaging
in DRAGON SLAYER only occurs when a process
locates the resources needed, we can benefit from the
high amount of process autonomy without being
impacted by the broadcast costs. Currently, we are
performing comparative studies using sequential and
parallel versions of algorithms for canonical vision
tasks (static object location and measurement of object
features) as a benchmark test for our network
performance. We compare up to 4 processors under
DRAGON SLAYER against a VAX 11/780. Similarly
to the results reported in [13], our first results give
evidence that the speedup resulting from the
parallelization more than outweighs the relatively
modest computational power of our micro-computers.
The performance of a mid-size computer can be reached
or surpassed at minimal hardware and software
expense. While the parallel algorithms used for the
benchmark tests could easily be realized in hardware
because of the primitive and repetitive nature of the
operations used, we envision a major practical benefit of our
system for dynamic scene analysis, with moving objects
and objects coming into, or leaving, a frame. In this
application, higher-level software operations would
typically be distributed into partitioned search tasks.


The scheduling algorithm we use results from solving a 
conceptual problem. The existence of such a general 
algorithm was investigated by using formal tools, 
within the Theory of Interaction Systems [11,18]. Our 
results easily allowed specification of another 
distributed algorithm for the same purpose, but without 
use of resource managers [14,15]. Recently another 
such algorithm, also not using resource managers, was 
found independently [5]. It uses the same idea of local 
precedence changes as found in [19]. 


In our implementation, time-out mechanisms prevent 
processes from being blocked in a situation where a 
node with resources dedicated to such processes breaks 
down during performance of its task. In case of such a 
node failure or link failure, the requesting process will 
make another request for the desired resources. 
Through our formal specification method, as an 
extended fairness requirement for which the proof can 
be found in [14], a process registered for accessing 
resources will eventually get them (maybe after several 
attempts) as long as such resources are available 
anywhere in the system. Thus, DRAGON SLAYER 
exhibits an unsurpassed degree of robustness, due to 
the complete lack of centralized control functions. 
Future research is targeted towards fault tolerance, 
data integrity, and error recovery at user levels, in the 
presence of faulty nodes. 


Our development provides the basis for innovative 
forms of distributed real-time processing with peer 
authority at every node and cheap, long-distance 
communication, through the envisioned progress in 
fiber optics technology. Such developments are under 
discussion in the vendor industry [3]. In order to 
provide for high-level real-time applications of 
distributed microprocessor networks, a truly
distributed compiler for a CSP-like language has been
developed in Milan, Italy [4]. It is now in the phase of
becoming an industrial product (Olivetti). We are 
cooperating with this group to implement it on 
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DRAGON SLAYER, as a first step into office 
automation applications. Another direction for 
distributed real-time applications research with the 
DRAGON SLAYER system, then running on a 
68000-based network with Ethernet at Wayne State 
University, is to support a project in real-time, 
multi-layer distributed production control with 
distributed information management 
(factory-of-the-future). 


Appendix 


The DRAGON SLAYER resource scheduling algorithm 
resulted from an exciting discussion about Dijkstra’s 
Dining Philosophers years ago: In 1979, one of the 
authors was asked to study Dijkstra’s most recent 
contribution to the theme of fair scheduling of 
distributed processes [8] which turned out not to be 
fair without substantial assumptions on centralized 
control. Using the Theory of Interaction Systems, a 
specification and analysis tool for distributed systems 
[11,14,18], a starvation-free solution for Dijkstra’s 
generalized distributed process scheduling problem 
was developed in an incremental procedure and proven 
correct. The main idea behind the formal construction 
was a suitable enhancement method for any 
deadlock-free realization of mutual exclusion without 
central control. Given such an algorithm one could 
immediately write down a starvation-free or fair 
distributed algorithm. J. Winkowski presented such an 
algorithm in 1981 [19] and no problem remained. 


Winkowski’s algorithm, besides being based on explicit 
message passing, was the first to make no assumption 
on boundedness of “hungry” periods (called 
registration rg phases in section 5). In its main outline, 
the algorithm works as follows: 


Each process is assumed to have just one critical 
section which may be periodically visited. Each 
process p, at a time when it wants to enter its critical 
section, needs a set A(p) of resources. In order to 
make the requests, p sends visiting cards to the 
resource/server processes in A(p). The requesting 
processes have a local priority list which contains 
priority information with respect to neighbor processes, 
i.e. processes also possibly requesting resources in 
A(p). The resource processes have queues for storing 
visiting cards and store the information about the local 
priorities obtained from their clients. In order to have 
coherent information, it is assumed that initially all 
local priorities are projections of a global precedence 
relation. (This is easily achieved in DRAGON SLAYER 
by using the processor, segment and offset numbers of 
the requesting processes.) 


The algorithm is a game in which a process wins iff it 
succeeds in having all visiting cards in the top position 
of each resource queue in A(p). 


If p is interested in accessing its critical section, it 
sends visiting cards while the following conditions 
hold: 


- There are no winning cards of the neighbor players in resource 
queues. 

- There are no visiting cards in resource queues from players with 
higher priority, or p’s card in such a queue already precedes such a 
card. 


P is to stop leaving cards and collect or take back 
already placed cards when one of the following 
situations occurs: 


- A winning card from another player appears at a resource in 
A(p). 

- A card from another player with a higher local priority is in the 
resource list before p has succeeded in getting its card enqueued. 


After p wins and the other competing processes have 
removed their cards from the resource lists, p accesses 
its critical section. Then p reverses all its local 
priorities and notifies the resource processes which in 
turn notify p’s neighbors in order to cause the 
corresponding changes. Then p releases the resources. 
The key idea of the algorithm is that all local priority 
lists, through the indicated local operations, can still be 
understood as local projections of a (changed) global 
precedence relation. No deadlock would occur in the 
game since one of the competing processes always 
wins. 
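
The priority-reversal idea can be illustrated with a small sequential simulation. This is only a sketch under strong simplifying assumptions: it replaces the message-passing card game by a round-based scan, it assumes every process is permanently hungry, and it models "reversing all local priorities" as moving a winner to the end of one global precedence list. The names `play_rounds` and `needs` are hypothetical and do not come from the DRAGON SLAYER implementation.

```python
def play_rounds(needs, rounds):
    """Round-based sketch of the precedence game.

    needs maps each process to the set A(p) of resources it requires.
    Returns the list of winners for every round.
    """
    # Assumed initial global precedence: earlier in the list = higher priority.
    order = sorted(needs)
    history = []
    for _ in range(rounds):
        taken, winners = set(), []
        for p in order:
            # p wins if no higher-priority winner holds a resource in A(p).
            if needs[p].isdisjoint(taken):
                winners.append(p)
                taken |= needs[p]
        # Winners reverse their priorities: they now yield to everyone else.
        order = [p for p in order if p not in winners] + winners
        history.append(winners)
    return history


# Five dining philosophers; philosopher i needs forks i and (i + 1) mod 5.
philosophers = {i: {i, (i + 1) % 5} for i in range(5)}
history = play_rounds(philosophers, 3)
```

In this toy setting at least one process wins every round, and demoting winners lets every philosopher eat within three rounds. It says nothing about the unbounded hungry periods or the failure cases handled by the real algorithm.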


The extension of this algorithm into a starvation-free 
one is constructed by using the Theory of Interaction 
Systems. The detail of this construction is available in 
[18]. The main idea is as follows: 


The formal objects corresponding to distributed 
processes are called parts, their relevant sections are 
called phases. Two formal relations between phases of 
different parts are used as primitives to define any kind 
of interaction or cooperation between distributed 
processes: a coupling relation specifying mutual 
exclusion of process states, and an excitement relation 
defining basic asymmetric forms of influence between 
processes. 


A concept of global state and of (formal) operational 
semantics is then developed without making any 
reference to global time, purely using local information 
such as local states or phases, process steps or local 
events. Entering a next phase is possible unless 
specified mutual exclusion requirements would be 
violated. In Dijkstra’s original Dining Philosophers’ 
Problem the 5 philosophers involved had a critical 
section ("eating") and a remainder section 
("thinking"). Figure 6 shows a corresponding 
Interaction Systems representation in which all events 
in the remainder sections are possible and where 
section e can be entered only if both neighbors do not 
have access to their sections e. Thus figure 6 is a 
formal framework for a deadlock-free algorithm. 


The excitement relations specify asymmetrical local 
influences from one part to another. If a part b(1), 
being in phase p(1), excites b(2), being in phase p(2), 
then b(1) cannot leave p(1) unless b(2) has left 
p(2). The effect on b(2) is described by axioms which 
account for the absence of any form of global control 
or management. For an elementary example, consider 
a process P, being in section r, which requests a 
resource R: P cannot leave r until R has been allocated 
to P. The request triggers service processes through 
which the global resource manager, or supervisor in a 
conventional environment, would eventually make the 
requested resource available. 


This formal theory, though unconventional, has been 
proven to be a powerful and flexible modeling and 
analysis instrument for distributed systems 
[11,14,16,17,18]. In [18] it was used to construct, in an 
incremental procedure, an easy extension for the 
solution in figure 6 such that gradually all conflicting 
requirements of the problem were satisfied. The result 
is shown in figure 7. It has been proven that given the 
same interaction specification between neighbors in an 
arbitrary topology, the resulting formal system 
specifies a fair or starvation-free solution for Dijkstra’s 
generalized resource scheduling problem. The flow 
diagram in figure 5 is then an immediate translation of 
this formal specification into an algorithmic one. 


In general, processes may have several critical 
sections, with varying numbers, and different kinds, of 
resources requested. They also may die or be newly 
generated. The major advantage of our formal 
specification method is that it allows for an easy proof 
of the fact that communication mechanisms 
constructed for two different critical section problems, 
with an appropriate formal specification similar to that 
in figure 7, do not interfere. (The technical details can 
be found in [14].) Thus, the final general distributed 
resource scheduling algorithm can, in a modular way, 
be composed through iterating distributed procedures 
each of which solves a single section problem and 
which was a result of directly breaking down a formal 
specification, after formally proving its correctness. 


Figure 6 
Mutual exclusion between dining philosophers 


Figure 7 
Fair interactions of dining philosophers 


Abstract 


Algorithms for the parallel solution of problems are usually designed 
assuming an unlimited number of processors. Physical parallel machines 
have a fixed number of processors. The algorithm contraction problem 
arises when an algorithm requires more processors than are available on 
the physical machine. We present tools for comparing algorithm contractions 
based on bottleneck communication paths. We apply these tools to 
minimum, matrix product and sorting. 


Introduction 


Algorithms for parallel computers are usually designed assuming an 
unlimited number of processors. For non-shared memory parallel algorithms, 
this assumption generally manifests itself by the algorithm utilizing 
one processor "per point", or some other input size-dependent processor 
allocation. The physical machine has only a fixed number of processors, 
of course, which will almost certainly be less than the number 
required by the algorithm. In order to make the logical processes of the 
algorithm conform to the physical processors of the machine, we must 
group processes together into a module to be executed on a single physical 
processor. This activity is called contraction[13]. The way this contraction 
is performed can have a significant effect on performance. 


Consider two examples based on an n x n grid of processes, i.e. the 
processes communicate with their four nearest neighbors: 

(1) There is much process-to-process communication and 
approximately equal computation required of each process. 

(2) There is little process-to-process communication and the 
amount of computation per process is proportional to its j index, 
e.g. process i,j iterates j times. 

Suppose we have only one fourth the required number of processors and 
now compare two ways of forming contractions of four processes per 
processor[4]: coalescing groups of adjacent 2x2 subarrays; folding 
groups as if the grid is folded in half and then in half again, i.e. i,j 
($1 \le i,j \le n/2$) is associated with i,n-j+1, n-i+1,j and n-i+1,n-j+1. 


Clearly, algorithm (1) should be contracted by coalescing because the 
process-to-process communication for the processes sharing the same 
processor will become intraprocessor communications (i.e. fast memory 
references) rather than slow interprocessor communication; folding 
would not be as attractive because no communication is saved by locality. 
Alternatively, algorithm (2) should be contracted by folding because the 
work is balanced since each processor will perform a matching amount of 
long and short computations; coalescing would not be as attractive 
because the processors receiving processes with large indexes will 
become a bottleneck. 
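
The trade-off between the two contractions can be checked numerically. The following is a minimal sketch, assuming a 1-indexed n x n grid with n even, work for process i,j equal to j (as in example (2)), and nearest-neighbor edges only. The function names are illustrative and not taken from the paper.

```python
from collections import defaultdict


def coalesce(i, j, n):
    """Map process (i, j) to the processor owning its 2x2 subarray."""
    return ((i - 1) // 2, (j - 1) // 2)


def fold(i, j, n):
    """Map process (i, j) to a processor by folding the grid twice."""
    return (min(i, n - i + 1), min(j, n - j + 1))


def loads(mapping, n):
    """Total work per processor when process (i, j) iterates j times."""
    load = defaultdict(int)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            load[mapping(i, j, n)] += j
    return load


def internal_edges(mapping, n):
    """Count nearest-neighbor edges whose endpoints share a processor."""
    count = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if j < n and mapping(i, j, n) == mapping(i, j + 1, n):
                count += 1
            if i < n and mapping(i, j, n) == mapping(i + 1, j, n):
                count += 1
    return count


n = 8
fold_loads = loads(fold, n)
coalesce_loads = loads(coalesce, n)
```

For n = 8, folding gives every processor the same load of 2(n+1) = 18 iterations but keeps only 16 edges internal, while coalescing keeps 64 edges internal at the price of loads ranging from 6 to 30.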


Using the results of Berman and her colleagues[3], an algorithm can 
be automatically contracted, and this seems to be the best approach 
when nothing is known about the algorithm. At the other end of the spec- 
trum, however, the programmer has "complete" knowledge about the 
algorithm. How should he be guided when performing his own contrac- 
tion? In this paper we develop some apparatus to guide the programmer 
who must contract an algorithm. We will provide some case studies of 
contraction that show an unexpected diversity and we offer some general 
contraction strategies that can find application in other algorithms. Con- 
traction is a nontrivial problem for parallel programmers[13], and so a 
secondary goal here is to expose it as an important topic for study and a 
subject suitable for rigorous analysis. 


Definitions 


The generic parallel architecture under consideration in this paper is a 
non-shared memory model. It is a collection of homogeneous sequential 
computers operating asynchronously and connected in a communication 
network that is a bounded degree graph[13]. A single "edge" in the graph 
provides bidirectional communication between two processors. The 
CHiP[11] architecture is an example of this generic architecture. 


The method used for programming this model consists of defining a 
sequential program for each processor and a communication graph. We 
are assuming a configurable architecture. (The problem of mapping a 
communications graph onto a different processor connection graph is discussed 
by Berman and Snyder[4] and Bokhari[5].) Communication is 
explicitly shown in the sequential code by specifying a data value to be 
sent to the processor connected by a given edge. 


The algorithm contraction problem arises when an algorithm that is 
designed for use on n processors must be mapped to a physical parallel 
computer with only p < n processors. The programmer must decide which 
logical processes are to be mapped to the physical processors. Assuming 
that the logical processes have balanced loads (they run for the same 
length of time), we would like the physical processors to have balanced 
loads. This is done by mapping nearly the same number of logical processes to 
each physical processor: the number of logical processes assigned to 
two arbitrary physical processors should differ by at most an additive 
constant c. For most contractions, it would be best to have c=1. 


The contraction induces a communication graph for the p physical 
processors. This new graph is defined by logical processes needing to 
communicate with other logical processes not mapped to the same physi- 
cal processor. We assume that if a logical process in processor i needs to 
communicate with a logical process in processor j, there is a physical 
edge connecting the two processors in the new graph. The contraction 
may map many of these logical edges to one physical edge in the new 
graph. That is, we are allowing only one edge between physical proces- 
sors. Under the assumption of a bounded degree graph for the generic 
architecture, this induced graph must also be of bounded degree. 


As an example of contraction, let us assume we have an algorithm 
with a tree graph. Consider the contraction to 5 processors shown in Figure 1. 
This contraction caused an increase in the degree; for example, the 
new root vertex has four descendants. Using this kind of contraction, it 
can be shown that given p processors, contracting an algorithm with at 
least p logical processes requires degree p-1. Figure 2a shows a contraction 
of the tree to 4 processors. An extension of this method yields a 
binary tree in the p processors. 


Figure 2b gives another contraction to 4 processors. This contraction 
is derived by the recursive tree construction given by Leiserson[9]. 
Given two instances of a tree each with an associated free node, we can 
build a new tree and an associated free node. This produces a linear area 
layout in the plane with several desirable properties, one of which is the 
constant number of external edges. 


In this paper, we will be considering the contribution of the communication 
time to the performance of the contracted algorithm. Unless otherwise 
stated, we assume that a communication between processing elements 
costs a fixed time $t_e$. During this time, no other communication in 
the same direction may take place. We are specifically allowing all edges 
to have simultaneous communication. Communication internal to a processor 
costs the fixed time $t_i$. We also assume that $t_e \gg t_i$. 


Figure 2: Two 4 processor contractions. 


We would like to develop tools for reasoning about the relative merits 
of different contractions. This includes their communication costs and 
their execution times. To aid in this objective we give the following 
definitions. 


Let $A=(V,E)$ be an algorithm where V is a set of logical processes 
(vertices and associated programs) with $|V|=n$, and E is a set of edges 
$(v_1,v_2)$, $v_1,v_2 \in V$. 

Let $M(A,p)=B$ be a contraction of algorithm A into algorithm B 
where B uses p processors and $p < |V_A|$. The contraction M maps elements 
of $V_A$ onto $V_B$ such that the number of elements of $V_A$ mapped to 
an arbitrary element of $V_B$ differs by no more than one from the number of 
elements of $V_A$ mapped to any other element of $V_B$. 


Let $w(e)$, the weight of e, for $e=(v_1,v_2)$, be the larger of the 
number of messages from $v_1$ to $v_2$ and the number of messages from $v_2$ 
to $v_1$. 

Let $K(A) = \max_{e \in E} w(e)$ be the communication "cost" of A. 


This cost is an estimate of the minimum communication time required for 
the algorithm. Due to dependencies, the actual communication cost may 
be more. 
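
The cost K can be computed mechanically once the message counts are known. Below is a minimal sketch, assuming message counts are given per directed pair of logical processes and that a contraction is a plain dictionary from logical process to physical processor. The function `cost` and its argument names are illustrative only.

```python
from collections import defaultdict


def cost(messages, mapping=None):
    """Return K: the largest one-directional message count over any edge.

    messages maps a directed pair (u, v) to the number of messages u sends v.
    mapping optionally maps logical processes to physical processors; edges
    whose endpoints share a processor become internal and are not counted.
    """
    mapping = mapping or {}
    traffic = defaultdict(int)
    for (u, v), count in messages.items():
        pu, pv = mapping.get(u, u), mapping.get(v, v)
        if pu != pv:
            traffic[(pu, pv)] += count
    # w(e) is the larger of the two directions, so the maximum over all
    # directed totals equals the maximum of w(e) over undirected edges.
    return max(traffic.values(), default=0)


# A 7-node complete binary tree (heap numbering); each child sends one
# message to its parent, as in a single global minimum.
tree_messages = {(child, child // 2): 1 for child in range(2, 8)}
# One hypothetical balanced contraction to 3 processors.
contraction = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "c", 7: "c"}
```

Here `cost(tree_messages)` is 1 for the uncontracted tree and `cost(tree_messages, contraction)` is 2, because two logical edges share each physical edge in the same direction.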


Let T(A ) be the execution time for A. 


PROPOSITION 1: For a given A, p, $M_1$, and $M_2$, and $t_e > t_i$, if 
$K(M_1(A,p)) < K(M_2(A,p))$ then $T(M_1(A,p)) \le T(M_2(A,p))$. 

This proposition formalizes the notion that the bottleneck edge 
will be a lower bound on the time required for the execution of the 
mapped algorithm. If the processors have a small amount of computation 
relative to the communication, the execution time will depend on the 
communication time. The bottleneck edge of the contraction $M_1$ will 
require a minimum of $t_e K(M_1(A,p))$ time, which is less than 
$t_e K(M_2(A,p))$. With a higher minimum communication time, we cannot 
expect $M_2$ to execute in less time than $M_1$. If the processors have a 
large amount of computation in ratio to the communication, the computation 
time will dominate, yielding near equal times. Even in this case, $M_1$ 
uses less time for communication than $M_2$. This proposition then 
motivates us to map the busiest edges of an algorithm to internal edges. 


Case Studies 


We now look at several parallel algorithms and some contractions. 
We approach these by considering algorithms with similar communica- 
tion graphs. The three graphs considered are the tree, grid, and binary n- 
cube. 


Tree algorithms 


There are several algorithms that run on complete binary trees (Figure 
1) having similar characteristics, like the aggregation operations of 
minimum and global sum. All processors have a value and we want to 
compute a global value that depends on all these values. Leaf processors 
send their value to their parents. Internal processors take the minimum 
(sums) of their own value and their children’s values and then send the 
result to their parents. The final value will be computed at the root processor 
in O(log n) time. The communication in these algorithms 
requires one message over each edge for each global minimum (sum). 
For a single minimum we have K(minimum) = 1. 


Consider the contraction in Figure 2a. Let us call this contraction 
$M_1(\mathrm{minimum},p)$. Each edge in the original algorithm requires one message. 
Each edge in the smaller graph has 4 edges from the original graph. 
Since we have only one connection between the physical processors, we 
have 4 messages for each edge. For an arbitrary n (size of original algorithm) 
and p (the number of processors) we have 

$$K(M_1(\mathrm{minimum},p)) = \frac{n}{p}.$$

Figure 3: Berman and Snyder tree contraction. 


A similar contraction to Figure 2a is touched on by Berman and 
Snyder[4]. Figure 3 shows this contraction. This is achieved by "folding" 
the tree. As Berman and Snyder notice, this contraction, $M_2$, has 

$$K(M_2(\mathrm{minimum},p)) = \frac{n}{p}.$$


Consider the contraction in Figure 2b. Let us call this contraction 
$M_3(\mathrm{minimum},p)$. We note that each edge in the smaller graph has at 
most one edge from the original graph in each direction. For an arbitrary 
n and p we have $K(M_3(\mathrm{minimum},p)) = 1$. 


Proposition 1 tells us that since $M_3$ has a smaller K, it is the preferable 
contraction. Both $M_1$ and $M_2$ depend on n and p for their cost. But 
$M_3$ has a constant cost, regardless of n and p. In fact, this contraction is 
optimum for all tree algorithms that have identical edge weights and unidirectional 
communication (all toward the root or all toward the leaves). 


We first look for a lower bound. Since the tree is connected, the phy- 
sical processors must be connected. This requires at least one incident 
edge for each physical processor. The smallest cost K(M(A,p)) would 
be where a maximum of one logical edge was mapped to a physical edge. 
Therefore, K (M (A,p )) = K (A), the cost of the original algorithm. 


LEMMA 2: For complete binary tree algorithms with balanced pro- 
cessor loads, equal edge weights, and unidirectional communication, 
algorithm contraction based on Leiserson’s binary tree layout technique 
yields optimum results. 


PROOF: For the mapping $M_3(A,p)$, each processor contains a complete 
subtree and an "extra" node. The extra nodes are used in the tree 
above the subtrees contained in the processors. Therefore, there are at most 
4 external connections. Of these four, two edges are used to 
receive (send) data from (to) the children of the extra node, and two edges 
are used to send (receive) data to (from) the subtree’s and the extra node’s 
parents. Since the root of the subtree and the extra node are not at the 
same level in the tree, edges with data flowing in the same direction cannot 
be connected to the same physical processor. (It is possible to have 
two of these edges over the same physical edge, but the data moves in 
opposite directions.) This gives the same weight to the physical edges as 
the original edges. Therefore, $K(M_3(A,p))=K(A)$, which is the lower 
bound. □ 


Notice that this layout technique will place two logical edges in the 
same physical edge for some physical edge. For tree algorithms with 
bidirectional communication, we then get $K(M_3(A,p)) = 2K(A)$. 


To help verify these results, the minimum algorithm was programmed 
using the Poker parallel programming environment[12]. Both $M_1$ and 
$M_3$ were programmed. Each contraction was timed using 4 and 16 data 
items per processor with 4 and 16 processors. The results of these timings 
are given in Table 1. Each "tick" represents a microsecond on the 64 
processor Pringle. 


Grid algorithms 


We next look at algorithms that run on a grid interconnection. Con- 
sider the matrix product algorithm for the Wavefront Array 
Processor(WAP)[8]. It uses n? processors for the nxn matrix product 
AB =C. The data is fed in along the top » processors and from the left 
n processors. The matrix A is arranged to enter column by column, start- 
ing with the first column. The matrix B is arranged to enter row by row, 
starting with the first row. (See Figure 4.) All processors execute identical 
procedures. The result, c;;, is initialized to zero. A loop is executed n 
times that reads an A value from the left and a B value from above, mul- 
tiplies them together, and adds the result to cj. The A and B values are 
sent to the right and down, respectively. This causes the upper left pro- 
cessor to be the first processor to start execution. As the data moves into 
the array, there is a wavefront of executing processors on the cross diago- 
nal. Each edge is used to send all of one row of A or one column of B. 
For the WAP algorithm we have K (WAP) =n. 
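
The message count on each edge can be confirmed with a small data-flow simulation. This is a sketch only: it visits the processes in row-major order instead of modeling the diagonal wavefront timing, which is enough to reproduce the values computed and the number of messages per edge. The function name `wap_product` is illustrative.

```python
def wap_product(A, B):
    """Simulate the data flow of the grid matrix product C = AB.

    Returns C and a dict mapping each directed edge to its message count.
    """
    n = len(A)
    left_in = [[None] * n for _ in range(n)]   # stream arriving from the left
    top_in = [[None] * n for _ in range(n)]    # stream arriving from above
    for i in range(n):
        left_in[i][0] = [A[i][t] for t in range(n)]   # A enters column by column
    for j in range(n):
        top_in[0][j] = [B[t][j] for t in range(n)]    # B enters row by row
    C = [[0] * n for _ in range(n)]
    messages = {}
    for i in range(n):
        for j in range(n):
            sent_right, sent_down = [], []
            for a, b in zip(left_in[i][j], top_in[i][j]):
                C[i][j] += a * b
                sent_right.append(a)   # pass the A value to the right
                sent_down.append(b)    # pass the B value downward
            if j + 1 < n:
                left_in[i][j + 1] = sent_right
                messages[((i, j), (i, j + 1))] = len(sent_right)
            if i + 1 < n:
                top_in[i + 1][j] = sent_down
                messages[((i, j), (i + 1, j))] = len(sent_down)
    return C, messages


C, messages = wap_product([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Every edge carries exactly n messages, so the largest edge weight in `messages` equals n, in agreement with K(WAP) = n.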

Consider the contraction in Figure 5. Let us call this contraction 
$M_1(\mathrm{WAP},p)$. This is the contraction done by cutting the graph into p 
equal size connected subgraphs and assigning one process from each subgraph 
selected from corresponding positions to a single processor. The 


Minimum: ticks for n (items) on p (processors) 

          16 on 4    64 on 4    64 on 16    256 on 16 
$M_1$     11650      0568       53496       105801 
$M_3$     4356       7682       8878        12067 

Table 1: Timings of the minimum algorithm. 


Figure 4: WAP organization 


Figure 5: A contraction of 16 logical processes to 4 processors. 


physical connection graph, shown in Figure 5, is a grid with end around 
(i.e. toroidal) connections. For each logical process in a physical processor, 
there are horizontal and vertical communication paths. Since we have 
$n^2/p$ logical processes in a processor, the number of logical edges using 
one processor-to-processor connection is $n^2/p$. Since all horizontal and 
vertical edges have the same number of messages, n, we have 

$$K(M_1(\mathrm{WAP},p)) = \frac{n^3}{p}.$$


Consider the contraction in Figure 6. Let us call this contraction 
$M_2(\mathrm{WAP},p)$. This is the contraction done by cutting the graph into p 
equal size connected subgraphs and assigning an entire subgraph to a processor. 
We see that only the perimeter processes have edges that go from 
processor-to-processor. Also, notice that no end around connections are 
needed. The number of communication paths over one processor-to-processor 
connection is $n/\sqrt{p}$. Each communication path requires n 
messages giving 

$$K(M_2(\mathrm{WAP},p)) = \frac{n^2}{\sqrt{p}}.$$


Comparing the two contractions, we see that $K(M_2(\mathrm{WAP},p))$ is 
smaller than $K(M_1(\mathrm{WAP},p))$ by a factor of $n/\sqrt{p}$. Proposition 1 tells us 
that $M_2$ is the better contraction. We conjecture that $M_2$ is the best 
contraction that can be achieved for grid algorithms. The basis for this 
conjecture is that this contraction has the smallest perimeter for a given area, 
and has been commonly used for contraction in published algorithms, for 
example for the Jacobi iterative method[1] and for the conjugate gradient 
method[6]. 
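
The two grid costs can be reproduced by direct counting. The sketch below assumes that p is a perfect square whose root divides n, that every rightward and downward edge carries n messages (as in the WAP product), that the interleaved contraction sends process (i, j) to processor (i mod √p, j mod √p), and that the block contraction sends it to the processor owning its (n/√p) x (n/√p) block. These mappings are one reading of the two contractions, and all names are illustrative.

```python
import math
from collections import defaultdict


def interleaved(i, j, n, s):
    """One process from each s x s subgraph goes to the same processor."""
    return (i % s, j % s)


def blocked(i, j, n, s):
    """An entire (n/s) x (n/s) subgraph goes to one processor."""
    b = n // s
    return (i // b, j // b)


def grid_cost(n, p, mapping):
    """K of a contracted n x n grid whose edges each carry n messages."""
    s = math.isqrt(p)
    traffic = defaultdict(int)
    for i in range(n):
        for j in range(n):
            src = mapping(i, j, n, s)
            for ni, nj in ((i, j + 1), (i + 1, j)):   # right and down neighbors
                if ni < n and nj < n:
                    dst = mapping(ni, nj, n, s)
                    if dst != src:
                        traffic[(src, dst)] += n
    return max(traffic.values())


k_interleaved = grid_cost(8, 16, interleaved)
k_blocked = grid_cost(8, 16, blocked)
```

For n = 8 and p = 16 the counts are 32 and 16, matching n³/p and n²/√p, and their ratio is n/√p = 2.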

Both $M_1$ and $M_2$ were programmed using Poker. Table 2 summarizes 
the results of the timings. As predicted, $M_2$ was the faster contraction, 
but because the communication time is not the only time consuming 
part in these algorithms the difference is perhaps not as dramatic as might 
be seen on a larger problem. 


Binary n-cube algorithms 

We now look at two algorithms for the binary n-cube. The first algorithm 
is the divide-and-conquer algorithm for matrix product given by 
Nelson[10]. The other algorithm is Batcher’s bitonic sorting algorithm[2]. 


The matrix product algorithm takes two $n \times n$ matrices, A and B, and 
computes their product C = AB. A and B are assumed to be in row 
major order in the binary n-cube of order 2k, where k = log n. The algorithm 
views A and B as a 2x2 matrix of $\frac{n}{2} \times \frac{n}{2}$ matrices. The 2x2 matrix 
algorithm is then used to multiply the submatrices. Figure 7 shows an 
order 4 cube laid out in the plane using the CHiP architecture. The 
numbers in the boxes show the index of the matrix elements initially contained 
in that processor. We are assuming that the processors are numbered 
in row major order. The dotted boxes show cubes of order 2. 


These cubes, which generally have order 2(k-1), contain an $\frac{n}{2} \times \frac{n}{2}$ 
submatrix of both A and B. Note that these cubes are constructed by 
"removing" the edges of order k and 2k, where an edge of order k connects 
processors that are $2^{k-1}$ distance apart. 


To compute the 2x2 matrix product, all processors exchange values 
of B on the order 2k edge and values of A on the order k edge. After the 
exchange, each cube of order 2(k-1) contains 4 submatrices of size 
$\frac{n}{2} \times \frac{n}{2}$. This is all the data that is required for each cube of order 2(k-1) 
to compute its part of the 2x2 matrix product independently. If the submatrix 
is not a single element, two matrix products of $\frac{n}{2} \times \frac{n}{2}$ matrices are 
required. These matrix products are done using the same algorithm. 
Matrix addition is done element by element. Because corresponding elements 
of the matrices are contained in the same processor, no communication 
is required. 


To find the cost of this cube matrix multiply, K(CMM), we need to 
find the edge with the most messages. At the first level of recursion, the 
order k and 2k edges were used to send a message each way. This is the 
only use of these edges in the algorithm. Therefore, w(e) = 1, where e is 
an order k or 2k edge. At the second level of recursion, two matrix products 
are computed using the order k-1 and 2k-1 edges. Each matrix 
product sends one message each way on each edge giving w(e) = 2, 
where e is an order k-1 or 2k-1 edge. At level l of the recursion, 
$w(e)=2^{l-1}$ messages are sent over the order k-(l-1) and 2k-(l-1) edges. The 
recursion stops when we have order 2 cubes. This is at the log n level of 
recursion. There are $\frac{n}{2}$ matrix multiplies done by order 2 cubes. These 
order 2 cubes use the order 1 and k+1 edges. Each matrix multiply sends 
1 message each way giving $w(e) = \frac{n}{2}$, where e is an order 1 or k+1 edge. 
Since this is the largest value, $K(\mathrm{CMM}) = \frac{n}{2}$. 


Figure 7: An order 4 binary n-cube. 
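
The per-level counting can be tabulated directly. This is a small sketch of the bookkeeping only (it does not multiply matrices), assuming k = log2 n and the rule that level l of the recursion sends 2^(l-1) messages each way over the edges of order k-(l-1) and 2k-(l-1). The function name is illustrative.

```python
def cmm_edge_weights(k):
    """Return {edge order: messages each way} for the cube matrix multiply."""
    weights = {}
    for level in range(1, k + 1):
        msgs = 2 ** (level - 1)
        weights[k - (level - 1)] = msgs       # edges carrying values of A
        weights[2 * k - (level - 1)] = msgs   # edges carrying values of B
    return weights


weights = cmm_edge_weights(3)   # n = 8, cube of order 6
```

For n = 8 the order 3 and 6 edges carry 1 message, the order 2 and 5 edges carry 2, and the order 1 and 4 edges carry 4 = n/2, which is the maximum.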


Consider any contraction, $M(\mathrm{CMM},p)$, where $p=2^m$ for some 
m < 2k. $M(\mathrm{CMM},p)$ will map $\frac{n^2}{p}$ logical processes to every processor. 
This allows us to put a cube of order $\log\frac{n^2}{p} = 2k-m$ into each processor. 
The processor-to-processor connection graph is also a cube and is of 
order m. Each processor-to-processor connection supports $\frac{n^2}{p}$ communication 
paths in the original graph. The real question is which sub-cube 
do we map to each processor. The cost of the contraction, 
$K(M(\mathrm{CMM},p))$, will be $\frac{n^2}{p}$ times the maximum w(e), where e is 
mapped to a physical edge. If e is order 1 or k+1 from the original 
cube, $K(M(\mathrm{CMM},p)) = \frac{n^3}{2p}$. 


Consider the contraction that maps the edges of order 1 through 
$\frac{2k-m}{2}$ and order k+1 through $k+\frac{2k-m}{2}$ to internal edges. This 
makes the edge of order $k+\frac{2k-m}{2}+1$ the edge with the most messages. 
This edge is used by level $k-\frac{2k-m}{2}$ of the recursion. From 
before we know that $w(e)=2^{k-\frac{2k-m}{2}-1} = \frac{\sqrt{p}}{2}$. Therefore 
$K(M(\mathrm{CMM},p)) = \frac{n^2}{2\sqrt{p}}$. Clearly, this contraction is better in terms of 
the number of messages over the busiest physical edge than any contraction 
that does not keep the high traffic logical edges internal to a processor. 
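
The effect of choosing which edge orders to keep internal can be checked with a few lines of arithmetic. The sketch assumes n = 2^k, p = 2^m with 2k-m even, per-edge weights of 2^(l-1) at recursion level l, and that every physical connection carries n²/p logical paths. The function and parameter names are illustrative, not the paper's.

```python
def cmm_contraction_cost(k, m, internal_orders):
    """K of a contracted cube matrix multiply for a given set of internal edge orders."""
    n, p = 2 ** k, 2 ** m
    weights = {}
    for level in range(1, k + 1):
        weights[k - (level - 1)] = 2 ** (level - 1)
        weights[2 * k - (level - 1)] = 2 ** (level - 1)
    # Only edges left external contribute to the bottleneck.
    external = [w for order, w in weights.items() if order not in internal_orders]
    return (n * n // p) * max(external)


k, m = 3, 2                       # n = 8, p = 4
h = (2 * k - m) // 2              # orders kept internal in each half
busy_internal = set(range(1, h + 1)) | set(range(k + 1, k + h + 1))
busy_external = {2, 3, 5, 6}      # a contraction leaving orders 1 and k+1 external
good = cmm_contraction_cost(k, m, busy_internal)
poor = cmm_contraction_cost(k, m, busy_external)
```

With n = 8 and p = 4, keeping the busiest orders internal gives 16 = n²/(2√p), while leaving the order 1 and k+1 edges external gives 64 = n³/(2p).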


By contrast, let us consider the Batcher bitonic merge sort. This sort 
runs on an order k cube to sort $n=2^k$ elements. The final sorting will 
have the smallest element in the first processor and the largest element in 
the last processor. Figure 8 shows a graphical representation of the algorithm. 
The arrows represent a data exchange and a compare, leaving the 
larger number at the end with the arrow and the smaller at the other end. 
It is obvious from the figure that the order 1 edge has the most messages. 
Therefore, K(SORT) = log n. 
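
The edge usage can be counted by running a standard bitonic sorting network and recording which cube dimension each compare-exchange step uses. This is a sequential sketch of the usual textbook network, not necessarily the exact formulation drawn in Figure 8; `uses[d]` counts exchange steps over edges of order d+1.

```python
def bitonic_sort_on_cube(values):
    """Sort 2^k values with a bitonic network; also count uses of each edge order."""
    n = len(values)
    k = n.bit_length() - 1
    assert 1 << k == n, "length must be a power of two"
    data = list(values)
    uses = [0] * k
    for stage in range(1, k + 1):              # merge blocks of size 2^stage
        for d in range(stage - 1, -1, -1):     # exchange across dimension d
            uses[d] += 1
            for node in range(n):
                partner = node ^ (1 << d)
                if node < partner:
                    ascending = (node >> stage) & 1 == 0
                    low, high = data[node], data[partner]
                    # Swap when the pair is out of order for this block's direction.
                    if (low > high) == ascending:
                        data[node], data[partner] = high, low
    return data, uses


result, uses = bitonic_sort_on_cube([5, 3, 7, 1, 8, 2, 6, 4])
```

For n = 8 the order 1 edges are used log n = 3 times, the order 2 edges twice, and the order 3 edges once.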


Again, to contract this algorithm, we see that we want to assign a 
sub-cube into a processor. Consider the contraction $M(\mathrm{SORT},p)$ where 
the edges of order 1 through order log p are mapped to internal edges. 
We are assuming that $p=2^m$, for some m < log n. This contraction 
assigns the busiest logical edges to be internal edges. The busiest remaining 
edges carry log n - log p messages. Since each processor contains $\frac{n}{p}$ logical 
processors, $K(M(\mathrm{SORT},p)) = \frac{n(\log n-\log p)}{p}$. Any contraction that does not 
map these first log p edges to internal edges will have a higher communication 
cost. These results agree with and explain the results of Hsiao[7], 
even though his final algorithm was embedded in a grid instead of 
another cube. 


Figure 8: Batcher’s bitonic merge sort. 


In comparing the contractions for matrix multiply and Batcher’s sort, 
we see that the same size cube is mapped in a different way when 
mapped to the same number of processors. The busiest edges are different 
for the two algorithms; thus, the contractions are different. 


Conclusion 


The algorithm contraction problem is an important problem for parallel 
programmers. The way in which an algorithm is contracted can have 
a significant effect on performance. Processor-to-processor communication 
can be used as a lower bound on the execution time for an algorithm. 
It is the processor-to-processor communication that is affected by different 
contractions. 


We have looked at algorithms for the tree, grid, and binary n-cube 
interconnections. For each algorithm we have compared possible con- 
tractions of these algorithms. For trees, we proved that Leiserson’s lay- 
out technique was the best for contracting tree algorithms such as 
minimum and sums. For grid algorithms, we conjectured that coalescing 
by maximizing the area for a given perimeter is optimal for the algo- 
rithms with balanced edge loadings. Finally, we showed two algorithms 
for binary n-cubes that required different contractions to produce the 
optimal results for the algorithm. 
Abstract: Convolution is a basic operation of linear systems
theory, and 2-D convolution is a frequently used operation in
image processing. However, 2-D convolution is computationally
intensive. For an N by N search area and an M by M window,
the time complexity of a serial algorithm is $O(N^2M^2)$. This paper
presents several parallel convolution algorithms for array processors
with $N^2$ processing elements connected by various communication
networks. By using inter-PE communication networks
efficiently, each PE requires only a small local memory, many
unnecessary data transmissions are eliminated, and the time complexity
is reduced to $O(M^2)$.


I. INTRODUCTION 


Great efforts have been devoted to developing parallel processing 
architectures and associated algorithms for many popular 
mathematical operations [DuLe81, Fulc82, HwFu83, LiNi85, 
NiHw85, NiJa85]. Convolution is a very important operation in 
linear systems theory and image processing. The convolution of 


two functions G and W is a function C of a displacement y
defined as

$$C(y) = \int G(x)\,W(x-y)\,dx.$$

We generalize it to the following form:

$$C(y) = \Omega\,[\,G(x) \bullet W(x-y)\,],$$

where $\bullet$ is a binary operator, which could be a multiplication (*),
addition (+), inf, logical AND, etc.; and $\Omega$ is a function defined on
all values of $[G(x) \bullet W(x-y)]$ over the entire region of x. $\Omega$
could be an integration (continuous case), summation (discrete
case), sup, or maximum.


In two dimensions, one function is "rubbed over" the other
[Ball82]. The value of the convolution at any displacement can be
defined as the following general form:

$$C(i,j) = \Psi\,\Phi\,[\,G(i+s,\,j+t) \bullet W(s,t)\,],$$

where $\Phi$ is a function defined over the region of t, and $\Psi$ is a
function defined over the region of s. In the finite discrete case,
G is defined on a square grid array of size $N^2$ as

$$G = \{\, G_{ij} \mid i,j \in [0,N-1] \,\},$$

and W is defined on a finite rectangular grid of size M by M'
(M, M' < N),

$$W = \{\, W_{st} \mid s \in [0,M-1],\ t \in [0,M'-1] \,\}.$$

Hence, the 2-D digital convolution of G and W can be defined as:

$$C_{ij} = \sum_{s=0}^{M-1}\sum_{t=0}^{M'-1} G_{i+s,\,j+t} * W_{st}. \qquad (1)$$


Equation (1) is also called image correlation or image template 
matching. 
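As a point of reference, Eq. (1) can be evaluated serially as sketched below. This is only an illustrative sketch and not one of the parallel algorithms of this paper; the function name `correlate2d_serial` and the list-of-lists data layout are assumptions made for the example.

```python
def correlate2d_serial(G, W):
    """Evaluate Eq. (1) directly: C[i][j] = sum_s sum_t G[i+s][j+t] * W[s][t].

    G is an N x N search area and W an M x M' window (lists of lists).
    Only displacements at which the window fits inside G are computed.
    """
    N = len(G)
    M, Mp = len(W), len(W[0])
    C = [[0] * (N - Mp + 1) for _ in range(N - M + 1)]
    for i in range(N - M + 1):
        for j in range(N - Mp + 1):
            C[i][j] = sum(G[i + s][j + t] * W[s][t]
                          for s in range(M) for t in range(Mp))
    return C
```

The four nested loops make the serial cost proportional to (N-M+1)(N-M'+1)MM', which is the baseline the parallel algorithms are compared against.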


Many other applications, such as object labeling (including the fuzzy
model, discrete model, and linear stochastic model) and morphological
operations (including dilation and erosion), also require some
form of 2-D convolution. They differ in their
choices of the $\Psi$, $\Phi$, and $\bullet$ functions. For example, the linear stochastic
model for object labeling in computer vision requires the
computation of the probability that a label y fits object x. Let $W_{st}$
denote the probability that label t fits object s. Also let $G_{xy,st}$
denote the conditional probability that label y fits object x given
that t is the correct label for object s. Then the probability that
y is the correct label for object x at the (k+1)th iteration is

$$C(i,j) = \Psi\,\Phi\,[\,G(i+s,\,j+t) \bullet W(s,t)\,],$$

where i = j = 0, $\bullet$ defines a multiplication, $\Phi$ defines a summation
over the range of t, and $\Psi$ defines a weighted summation over s
with weighting factors $\beta_s$ for different values of s. That is,

$$W^{(k+1)}(x,y) = \sum_s \beta_s \Big[ \sum_t \big[ G(x,y;s,t) * W^{(k)}(s,t) \big] \Big],$$

or in a more familiar form [RoHB76]:

$$P_x^{(k+1)}(y) = \sum_s c_{xs} \Big\{ \sum_t \big[ P_{xs}(y \mid t)\, P_s^{(k)}(t) \big] \Big\}.$$


We are concerned with the communication complexity of 2-D convolutions.
Various definitions of $\Psi$, $\Phi$, and $\bullet$ only affect the computational
complexity. Equation (1) will be used as a representative
case to indicate the computational requirement for a wide variety
of window-based image processing tasks. G is called the search
area and W is called a window. 2-D digital convolution is used to
find sub-search-areas (subareas) from G that match closely with
W. For convenience, each M by M' subarea of G can be
uniquely referenced by its upper left corner coordinate (i,j). Let
SUB(i,j) denote the subarea at location (i,j). There are
(N-M+1)(N-M'+1) such subareas for $i \in [0,N-M]$ and
$j \in [0,N-M']$, as shown in Fig. 1.


The window W is matched against every subarea SUB(i,j) in G,
and the convolution $C_{ij}$ is computed as the proximity measure
between W and each subarea SUB(i,j). This measure is also
called non-normalized correlation, or cross correlation. The normalized
correlation is discussed in Section VII. To simplify the
discussion, we assume that M = M' and M is a power of 2. We
also assume the cross correlation $C_{ij}$ is used. In Section IV, we
discuss the general case for array processors where M ≠ M' and
M is not necessarily a power of 2.


Fig. 1. An N by N search area G and an M by M' window W

During the convolution computation of all displacements,
(N-M+1)(N-M'+1) similarity values are generated, and one of
the following results can be obtained:


1. another 2-D array, formed by the similarity values of all
subarea locations;

2. all subareas with a similarity which exceeds a given threshold;

3. the subarea with the greatest similarity to a window; and

4. the first subarea whose similarity exceeds a predefined threshold.


Convolution can be used as a simple filtering method, for which
option 1 above is desired. For edge detection, the edge operator is 
a window and option 2 is produced [FrCh77]. Image registration, 
or scene matching, is another application of convolution where 
option 3 or 4 is used [WoHa78]. 


Our paper discusses parallel algorithms for 2-D convolution, as 
defined in Eq.(1), on array processors with three different types of 
interconnection networks. Usually, 2-D convolution operations are 
extremely time-consuming, involving the time required both to 
perform the calculations and for communication among the PEs. 
With a single processor and traditional algorithms, the number of 
steps for convolution operations is (N-M+1)(N-M'+1)MM',
approximately $N^2MM'$ for M, M' << N. With multiple processors
and parallel processing algorithms, the total time can be
significantly reduced. These algorithms can be easily modified for 
other applications of convolution, such as those listed in Table 1. 


Different convolution applications have different computational 
complexities in their evaluation of different functions $\Phi$, $\Psi$ and $\bullet$.
For each application, however, the time complexity of the compu- 
tation performed in the PEs is exactly the same, regardless of the 
type of network. It is the time complexity for communication 
among the PEs that varies widely from network to network. We 
will compare the communication complexities of three different 
types of networks, independent of the forms of the functions $\Phi$, $\Psi$
and $\bullet$.


In this paper, parallel algorithms are developed and discussed for 
2-D digital convolution on array processors with different types of 
interconnection networks. Section II introduces a model of an 
array processor and the three types of interconnection networks. 
Detailed algorithms for array processors with $N^2$ PEs with a mesh
network, a hypercube network, and a shuffle-exchange network
are given in Sections III, IV and V, respectively. Section VI provides
an explanation of the modifications required to the algorithms
for array processors of different sizes. Section VII discusses
the computation of generalized convolution. Section VIII presents
our conclusions, including a table comparing the network communication
complexities.
II. MODEL OF AN ARRAY PROCESSOR



An array processor is comprised of $Q = 2^q$ processing elements
(PEs), each having some local memory [HwBr84]. We assume that 
the PEs are indexed 0 through Q-1 and refer to the pth PE as 
PE(p ). The synchronized PEs execute instructions issued from a 
control unit. The control unit broadcasts an instruction to all 
PEs, and all enabled PEs simultaneously execute the instruction.
The enable/disable mask can be used to select a subset of the PEs 
that are to perform an instruction. The set of enabled PEs can be 
changed from instruction to instruction. 


A network is needed to provide inter-PE communication. While 
several types of interconnection networks have been proposed 
[Sieg79], some are topologically equivalent and some are not 
[WuFe80]. The control transfer algorithm among the equivalent
networks has been studied in [Fang84, FaDe85]. Three interconnection
networks which are not topologically equivalent are considered
in this paper: the mesh, hypercube, and shuffle-exchange
networks.


Mesh Interconnection Network 


In this model, the PEs may be thought of as being physically
arranged in a k-dimensional array $A(n_{k-1}, n_{k-2}, \ldots, n_0)$, where $n_i$
is the size of the ith dimension and $Q = n_{k-1} * n_{k-2} * \cdots * n_0$. The PE
at location $A(p_{k-1}, \ldots, p_0)$ is connected to the PEs at locations
$A(p_{k-1}, \ldots, p_j \pm 1, \ldots, p_0)$, $0 \le j < k$, provided they exist. Data may be
transmitted from one PE to another via this interconnection pattern
only. The interconnection scheme for 16 PEs with k = 2 is
given in Fig. 2a.


Fig. 2a. A mesh topology with 16 PEs 


Hypercube Interconnection Network 


Let $p_{q-1} \ldots p_0$ be the binary representation of p, the index of
PE(p), for $p = 0, 1, \ldots, 2^q - 1$. Let $p^{(b)}$ be the number with binary
representation $p_{q-1} \cdots p_{b+1}\,\bar{p}_b\,p_{b-1} \cdots p_0$, where $\bar{p}_b$ is the complement
of $p_b$ and $0 \le b < q$. In the hypercube model, PE(p) is
directly connected to PE($p^{(b)}$) for $0 \le b < q$. A 4-cube (16 PEs)
interconnection network is shown in Fig. 2b. The hypercube has
been proposed and used in both SIMD and MIMD computer systems
[PrVu79, Seit85].


Fig. 2b. A hypercube topology with 16 PEs 


Shuffle-Exchange Interconnection Network 


Let q, p and $p^{(b)}$ be the same as in the hypercube model. Let
$p_{q-1} \ldots p_0$ be the binary representation of p. Define SHUFFLE(p)
and UNSHUFFLE(p) to be the integers with binary representation
$p_{q-2}p_{q-3} \ldots p_0p_{q-1}$ and $p_0p_{q-1} \ldots p_1$, respectively. In the shuffle-exchange
model, PE(p) is connected to PE($p^{(0)}$),
PE(SHUFFLE(p)) and PE(UNSHUFFLE(p)). These three connections
are called exchange, shuffle and unshuffle, respectively. Once
again, data transmission from one PE to another is possible only
via the connection scheme. A shuffle-exchange network of 16 PEs
is shown in Fig. 2c.


Fig. 2c. A shuffle-exchange topology with 16 PEs


It should be noted that the shuffle-exchange model requires at 
most three connections per PE, while the mesh model requires 2k 
connections per PE, and the hypercube model requires q connections
per PE. It should also be emphasized that, in any time
instance, only one unit of data can be transmitted along an inter- 
connection line, although all lines can be busy at the same time. 


To evaluate the efficiency of parallel algorithms on array processors,
one must consider the amount of parallelism that can be
exploited in order to fully utilize the large number of PEs. Such a
study on image correlation was conducted in [SiSF82]. In their 
approach, each PE has a large local memory to handle a number 
of subimages, and only neighboring pixels need be sent among 
PEs. In our approach, we use the interconnection networks 
efficiently to obtain an optimal solution. Each local memory is 
small, containing only M locations. 


In describing our algorithms, we follow the notations and assumptions
used in [DeNS81] as stated below. PE(p) denotes the PE
with index p. R(p), A(p), B(p), and C(p) are registers in
PE(p). MAR(p) denotes the local memory address register in
PE(p), and M[MAR(p)] denotes the local memory location with
address in MAR(p). The symbol "←" signals an assignment
involving data routing between directly connected PEs, while the
symbol ":=" indicates an assignment in which all variables on
both sides of ":=" are local to the same PE. "⇐" indicates movement
involving common data or constants from the control unit
memory to the local memories whose address is specified in
MAR(p). For any integer i, $i_b$ will denote bit b of the binary
representation of i, and $i_{b:c}$ will signify the number whose binary
representation is $i_b i_{b-1} \ldots i_c$. PEs may be enabled by providing a
selectivity function following a statement. For example, if we want
to perform R(p) := A(p) + B(p) for those PEs whose index has bit
b equal to 0, the statement will be

R(p) := A(p) + B(p), ($p_b = 0$).


The communication complexity of an algorithm includes both the
time needed to route data from PE to PE, and the time needed to
broadcast data from control unit memory to local memories. A
unit-route is a data transmission from a PE to a directly connected
PE. The following unit-route statements will be used in the three
interconnection networks in this paper:


(1) In the 2-dimensional mesh model:
    R(p1-1, p2) ← R(p1, p2) (* go up *);
    R(p1+1, p2) ← R(p1, p2) (* go down *);
    R(p1, p2-1) ← R(p1, p2) (* go left *);
    R(p1, p2+1) ← R(p1, p2) (* go right *).

(2) In the hypercube model:
    R(p^(b)) ← R(p) where 0 ≤ b < q.

(3) In the shuffle-exchange model:
    R(p^(0)) ← R(p) (* go exchange connection *);
    R(SHUFFLE(p)) ← R(p) (* go shuffle connection *);
    R(UNSHUFFLE(p)) ← R(p) (* go unshuffle connection *).


The above notations will be frequently used in the algorithms in
the upcoming sections. The following procedure ROTATE will be
used extensively and is demonstrated below. Consider a 2-D
search area, $G = \{G_{ij}$, for $i,j \in [0,N-1]\}$, and an array processor
with $Q = 2^q = N^2$ PEs, where the element $G_{ij}$ is stored in register
R(p) of PE(p = iN + j) and $N = 2^n$. For convenience, we may
use PE(i,j) or R(i,j) to denote PE(iN + j) or R(iN + j). Thus,
initially, we have $R(i,j) = G_{ij}$. We may wish to rotate, or end-around
shift, the 2-D search area $2^k$ positions either

right ($R(i,j) = G_{i,\,(j-2^k) \bmod N}$),
left ($R(i,j) = G_{i,\,(j+2^k) \bmod N}$),
up ($R(i,j) = G_{(i+2^k) \bmod N,\,j}$),
or down ($R(i,j) = G_{(i-2^k) \bmod N,\,j}$),

for $0 \le k \le n-1$. The rotation procedure for a 2-D mesh machine is
straightforward. The procedure to perform a rotate operation in a
hypercube array processor is shown below. The rotate procedure
for a shuffle-exchange network is explained in Section V. In procedure
ROTATE, s indicates the bit position where the bit complement
operation begins; r is the bit position where the bit complement
ends. Flag is set to 0 for left and up rotate operations; it
is set to 1 for right and down rotate operations.


procedure ROTATE(R, s, r, flag);
begin
  R(p^(s)) ← R(p);
  for b := s+1 to r do
    if flag = 0 then
      R(p^(b)) ← R(p), (p_{b-1} = 1 & ... & p_s = 1)
    else (* Note that & is a logical AND operator *)
      R(p^(b)) ← R(p), (p_{b-1} = 0 & ... & p_s = 0);
end;
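The behaviour of ROTATE can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not the authors' implementation: the registers R(p) are modelled as a Python list indexed by p, and the helper names `route` and `rotate` are hypothetical. Each call to `route` models one unit-route along hypercube dimension b for the PEs selected by the selectivity function.

```python
def route(values, b, enabled):
    """One unit-route R(p^(b)) <- R(p) for every enabled PE p."""
    out = list(values)
    for p, v in enumerate(values):
        if enabled(p):
            out[p ^ (1 << b)] = v
    return out


def rotate(values, s, r, flag):
    """Simulate ROTATE on bits s..r of the PE index.

    flag = 0 moves data from p to p - 2^s (left/up); flag = 1 moves it
    to p + 2^s (right/down), both end-around within the bit field s..r.
    Returns the new registers and the number of unit-routes used.
    """
    want = 1 if flag == 0 else 0          # borrow (1s) or carry (0s) condition
    vals = route(values, s, lambda p: True)
    unit_routes = 1
    for b in range(s + 1, r + 1):
        vals = route(
            vals, b,
            lambda p, b=b: all(((p >> c) & 1) == want for c in range(s, b)))
        unit_routes += 1
    return vals, unit_routes
```

For example, `rotate(list(range(8)), 0, 2, 0)` returns the registers `[1, 2, 3, 4, 5, 6, 7, 0]` after 3 unit-routes, that is, r - s + 1 unit-routes for a rotation over bits s through r.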


In the following sections, we shall present detailed convolution 
algorithms for array processors with three different types of inter- 
connection networks. 


III. MESH MODEL WITH N² PEs


In this section an $O(M^2)$ convolution algorithm MESH-N2 is
developed for a two-dimensional mesh model with $N^2$ PEs, which
can handle various sizes of windows. These $N^2$ PEs may be
viewed as an N by N 2-D array. We assume that $N = 2^n$ and
$M = 2^m$. Initially, the element $G_{ij}$ is stored in register A(i,j) of
PE(i,j) for $i,j \in [0,N-1]$. The M by M window is stored in the
control unit memory in column-major order with a starting address
a. The resulting similarity measures $C_{ij}$ will be stored in register
C(i,j) of each PE.


This algorithm is designed to process one column of the window at
a time. For a fixed column t, the M window elements
$W_{0t}, \ldots, W_{M-1,t}$ will be broadcast to all PEs. Thus, M local
memory locations are needed in each PE to hold these elements.
If we rotate up the search area G one row per step, we can compute
a partial sum of M products for a given t. Therefore,
PE(i,j), which is responsible for evaluating $C_{ij}$, will accumulate
the following M products:

$$\sum_{s=0}^{M-1} G_{i+s,\,j+t} * W_{st} \qquad (2)$$


MESH-N2 matches an M by M window in an N by N search
area on a mesh machine with $N^2$ PEs. The initial configuration is
$A(i,j) = G_{ij}$.

procedure MESH-N2(A, C)
begin
  C(i,j) := 0; (* initialization *)
  for t := 0 to M-1 do begin
    (* read one window column to local memory *)
    MAR(i,j) := 0; (* initialize MAR *)
    for s := 0 to M-1 do begin
      CMAR := a + t*M + s;
      M[MAR(i,j)] ⇐ CM[CMAR];
      MAR(i,j) := MAR(i,j) + 1
    end; (* end loop s *)
    (* Rotate-multiply-add on the search area segment *)
    MAR(i,j) := 0;
    C(i,j) := C(i,j) + A(i,j)*M[MAR(i,j)]; (* get first product *)
    for k := 1 to M-1 do begin
      A(i-1,j) ← A(i,j);
      MAR(i,j) := MAR(i,j) + 1;
      C(i,j) := C(i,j) + A(i,j)*M[MAR(i,j)];
    end; (* partial summation of M products *)
    for k := 1 to M-1 do A(i+1,j) ← A(i,j); (* recover the original value *)
    A(i,j-1) ← A(i,j); (* rotate G left one position *)
  end; (* end loop t, process one window column *)
end;


Clearly, the communication time complexity of the MESH-N2
algorithm is

$$\sum_{t=0}^{M-1} 3M = 3M^2 = O(M^2).$$

It is trivial to extend MESH-N2 to the general case where either
M ≠ M' or M is not a power of 2.
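The rotate-multiply-add pattern of MESH-N2 can be traced with a sequential simulation. The following is a minimal sketch under stated assumptions rather than the authors' code: every PE register is an entry of a Python list of lists, the shifts wrap around at the array boundary (an assumption made for simplicity, so only entries with i, j ≤ N-M are meaningful), and the names `mesh_n2`, `up`, `down` and `left` are hypothetical.

```python
def mesh_n2(G, W):
    """Sequentially simulate MESH-N2 for an N x N area G and an M x M window W.

    Returns (C, unit_routes, broadcasts). Entries C[i][j] with
    i, j <= N - M agree with Eq. (1); the rest depend on the assumed
    wraparound and should be ignored.
    """
    N, M = len(G), len(W)
    A = [row[:] for row in G]               # A(i,j) = G[i][j]
    C = [[0] * N for _ in range(N)]
    unit_routes = broadcasts = 0

    def up(X):                              # A(i-1,j) <- A(i,j)
        return X[1:] + X[:1]

    def down(X):                            # A(i+1,j) <- A(i,j)
        return X[-1:] + X[:-1]

    def left(X):                            # A(i,j-1) <- A(i,j)
        return [row[1:] + row[:1] for row in X]

    for t in range(M):
        local = [W[s][t] for s in range(M)]  # window column t, held by every PE
        broadcasts += M
        for s in range(M):
            if s > 0:
                A = up(A)
                unit_routes += 1
            for i in range(N):
                for j in range(N):
                    C[i][j] += A[i][j] * local[s]
        for _ in range(M - 1):               # recover the original rows
            A = down(A)
            unit_routes += 1
        A = left(A)                          # move on to the next window column
        unit_routes += 1
    return C, unit_routes, broadcasts
```

In this sketch each window column costs M broadcasts and 2M-1 unit-routes, so the total communication count is M(3M-1), in line with the 3M² figure above.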
IV. HYPERCUBE MODEL WITH N² PEs


Now we turn to CUBE-N2, the convolution algorithm for a hypercube
array processor with $N^2$ PEs. As with the mesh model, the
M by M window is stored in the control unit memory in column- 
major order with a starting address a; and each PE has M loca- 
tions in the local memory. We logically divide each column of the 
search area G into N/M segments, where each segment has M 
consecutive elements. 


There are four phases involved in order to compute Eq.(2) for a 
given window column. In phase 1, all M elements of the window 
column are sent to the local memory of all PEs. These M ele- 
ments are located in addresses from 0 to M-1. In phase 2, we per- 
mute the elements within each segment. Then, we multiply the 
element in register A with the window element, which is stored in 
local memory. Finally, we add the product to register C. These
permute-multiply-accumulate (PMA) steps are repeated so that 
every element of the segment appears exactly once in each position 
of the segment. Since we only permute elements within the same 
segment, some positions which need elements from the next seg- 
ment cannot be satisfied. For instance, the last position of a seg- 
ment needs M-1 elements of G from the segment below it. Thus, 
only half of the products are generated in phase 2 (this will be 
clear from the later example and discussion). 


In phase 3, we first rotate G up M positions. Then, we repeat the 
PMA steps in the same way as in phase 2 in order to generate the 
remaining half of the products needed in Eq.(2). Before we go to 
the next phase, we have to rotate G down M positions to recover 
it to the original status. In phase 4, we rotate G left one position 
so that we can start to work on the next window column. After 
we have processed all M window columns, the final result of $C_{ij}$ is
stored in C(i,j).


To simplify our description of the PMA steps, we omit the second
subscripts j and t for G and W. We use $G_0, G_1, \ldots, G_{M-1}$ to
denote the M elements of one segment. Initially, we have
$A(i) = G_i$ and $C(i) = 0$. After the execution of
C(i) := C(i) + A(i)*M[MAR(i)], we have $G_i * W_0$ in register C of
every PE.


Now we will discuss the PMA sequence. First, we exchange one
element in each segment so that $A(i) = G_j$, where $j = i^{(0)}$. We
assign MAR(i) = 1. Only the PE(i) with $i_0 = 0$ can perform the
multiplication to generate $G_{i+1} * W_1$.


In the second step, we exchange pairs of elements within each segment.
If there is a valid local memory address (to be discussed
later), only PE(i) with $i_1 = 0$ can multiply A(i)*M[MAR(i)].
The product will be added to C(i). Since the second step should
generate two terms in Eq.(2), $G_{i+2} * W_2$ and $G_{i+3} * W_3$, in PE(i)
with $i_1 = 0$, a procedure GRAY (to be defined later) is employed to
exchange the elements in each pair within the segment. In the
same way, the (k+1)th step exchanges $2^k$-element pieces within
each segment. Using a valid address, we perform

C(i) := C(i) + A(i)*M[MAR(i)], ($i_k = 0$).

Just as in the second step, the (k+1)th step generates $2^k$ terms in
Eq.(2), and procedure GRAY executes a permutation subsequence.
After executing the m-1 steps, each element in a segment should
appear in every position of the segment once and only once.


We employ the GRAY sequence concept [Roth79] to generate the
permutation subsequence for the small section of $2^k$ elements in
each step. Given an integer i in the range [0,M-1], where
$i = i_{m-1:0}$, and a given k ($0 < k \le m-1$), starting with integer i, we
can change one bit of the binary expression at a time to generate a
sequence of $2^k - 1$ integers in the range
$\{\lfloor i/2^k \rfloor * 2^k, \ldots, (\lfloor i/2^k \rfloor + 1) * 2^k - 1\}$ with i excluded. Let $S_{k-1}$
denote the sequence of bit positions on which the GRAY sequence
with $2^k - 1$ integers will take a complement operation to generate
the corresponding integers. $S_{k-1}$ can be recursively defined as follows:

$$S_0 = 0$$
$$S_{k-1} = S_{k-2},\ k-1,\ S_{k-2}.$$

The length of the GRAY sequence can be obtained by the recursive
equation $L(S_{k-1}) = 2L(S_{k-2}) + 1$, where $L(S_0) = 1$. The following
lemmas are useful in developing our parallel algorithms and
can be easily proved by induction on k.


Lemma 1: Given an integer $i = i_{m-1:k}\,i_{k-1}\,i_{k-2:0}$ in [0,M-1], starting
from i, the last number in the GRAY sequence of $2^k - 1$
numbers is $i' = i_{m-1:k}\,\bar{i}_{k-1}\,i_{k-2:0}$.
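The recursive definition of the bit-position sequence and the claim of Lemma 1 can be checked numerically. The sketch below is illustrative only; the function names `gray_positions` and `gray_visits` are hypothetical and do not appear in the paper.

```python
def gray_positions(k):
    """Return S_{k-1}: S_0 = [0], S_{k-1} = S_{k-2}, k-1, S_{k-2} (k >= 1)."""
    if k == 1:
        return [0]
    inner = gray_positions(k - 1)
    return inner + [k - 1] + inner


def gray_visits(i, k):
    """Integers reached from i by complementing the bits of S_{k-1} in order."""
    visited = []
    for b in gray_positions(k):
        i ^= 1 << b
        visited.append(i)
    return visited
```

For instance, `gray_positions(3)` is `[0, 1, 0, 2, 0, 1, 0]`, and starting from i = 5 the sequence visits the other seven integers of {0, ..., 7} exactly once, ending at 1, which is 5 with bit k-1 = 2 complemented.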

Let F be a boolean variable, which alternates
between true and false in the generation of numbers of a GRAY
sequence. The initial value of F is false. In the sequence of bit
positions $S_{k-1}$, each bit position b ($0 \le b < k-1$) is involved in pairs
of complement operations. In each pair, the first complement is
from $i_b$ to $\bar{i}_b$ when F is true. The second complement is from $\bar{i}_b$
to $i_b$ when F is false.


Lemma 2: During the GRAY sequence, when $i_b$ is to be complemented
to generate the next number, MAR(i) is modified by the
following formula.

MAR(i) := MAR(i) + $2^b$ if ($F$ & $i_b = 0$) | ($\bar{F}$ & $i_b = 1$)
MAR(i) := MAR(i) - $2^b$ if ($F$ & $i_b = 1$) | ($\bar{F}$ & $i_b = 0$)

Lemma 3: After executing $A(i^{(k)}) \leftarrow A(i)$, MAR(i) is incremented
by $2^k$ if $i_{k-1} = 0$. If $i_{k-1} = 1$, then MAR(i) is unchanged.


The recursive procedure GRAY(U, F, phase) is designed to generate
the GRAY sequence, modify MAR(p), and execute the
multiply-accumulate operation on C(p).


recursive procedure GRAY(U, F, phase)
begin
  if U = 0 then return else
  begin
    flag := true;
    GRAY(U-1, flag, phase);
    A(p^(U+n-1)) ← A(p);
    if F then
      MAR(p) := MAR(p) + 2^(U-1), (p_{U+n-1} = 0);
      MAR(p) := MAR(p) - 2^(U-1), (p_{U+n-1} = 1);
    else
      MAR(p) := MAR(p) - 2^(U-1), (p_{U+n-1} = 0);
      MAR(p) := MAR(p) + 2^(U-1), (p_{U+n-1} = 1);
    end; (* end if-else *)
    if phase = 2 then C(p) := C(p) + A(p)*M[MAR(p)], (p_k = 0)
    else C(p) := C(p) + A(p)*M[MAR(p)], (p_k = 1);
    flag := false;
    GRAY(U-1, flag, phase);
  end; (* end if-else *)
end;
Finally we come to procedure CUBE-N2, a convolution algorithm
for a hypercube array processor with $N^2$ PEs.


procedure CUBE-N2(A, C)
begin
  C(p) := 0; (* initialization *)
  (* process one window column at a time *)
  for t := 0 to M-1 do begin
    (* Phase 1: read one window column to local memory *)
    MAR(p) := 0; (* initialize MAR *)
    for s := 0 to M-1 do begin
      CMAR := a + t*M + s;
      M[MAR(p)] ⇐ CM[CMAR];
      MAR(p) := MAR(p) + 1;
    end;
    (* Phase 2: PMA on its own segment *)
    MAR(p) := 0;
    C(p) := C(p) + A(p)*M[MAR(p)]; (* get first product *)
    for k := n to n+m-1 do begin
      A(p^(k)) ← A(p);
      if k = n then MAR(p) := MAR(p) + 1
      else MAR(p) := MAR(p) + 2^(k-n), (p_{k-1} = 0);
      C(p) := C(p) + A(p)*M[MAR(p)], (p_k = 0);
      F := false;
      U := k - n;
      GRAY(U, F, 2);
    end (* end loop k *)
    (* Phase 3: PMA on the next segment *)
    (* rotate the image up one segment *)
    A(p^(n+m-1)) ← A(p); (* recover original value *)
    ROTATE(A, n+m, 2n-1, 0);
    (* PMA on the new segment *)
    MAR(p) := M;
    for k := n to n+m-1 do begin
      A(p^(k)) ← A(p);
      if k = n then MAR(p) := MAR(p) - 1
      else MAR(p) := MAR(p) - 2^(k-n), (p_{k-1} = 1);
      C(p) := C(p) + A(p)*M[MAR(p)], (p_k = 1);
      F := false;
      U := k - n;
      GRAY(U, F, 3);
    end (* end loop k *)
    (* rotate G down one segment *)
    A(p^(n+m-1)) ← A(p); (* recover original value *)
    ROTATE(A, n+m, 2n-1, 1);
    (* Phase 4: Rotate G left one position *)
    ROTATE(A, 0, n-1, 0);
  end; (* end loop t *)
end;


Algorithm CUBE-N2 uses M local memory locations for each PE.
The inter-PE communication time in procedure GRAY with
parameter U is $2^U - 1$ unit-routes. Therefore, the communication
time in phase 2 or phase 3 requires $\sum_{k=0}^{m-1} (2^k - 1) + m + 1 = M$ unit-routes,
counting the m direct exchanges and the one route that recovers the
original value. To rotate G up or down one segment takes $\log_2(N/M)$ unit-routes.
Phase 4 takes $\log_2 N$ unit-routes to rotate G left one position. The
communication complexity is

$$\sum_{t=0}^{M-1} \big(3M + 2(\log_2 N - \log_2 M) + \log_2 N\big) = M(3M + 3\log_2 N - 2\log_2 M) \le 2M \max(3M, 3\log_2 N) = O(M^2).$$

Note that the complexity of the traditional computer is $O(N^2M^2)$.


Fig. 3. An example showing the permutation and address generation sequences for 16 PEs, where M=8


To further illustrate our algorithm, Fig.3 shows an example of processing
one window column where M=8. The first column of the
figure gives the initial configuration, where we list 16 PEs (PE(0,j)
to PE(15,j)). The first value, D, associated with each PE, is the
value of its MAR, and the second value, $G_i$, is the index i of $G_{ij}$.
Again, subscripts j and t are omitted from Fig.3. The results of
all enabled operations are followed by the symbol *. The first
column in phase 3 demonstrates the rotation of G up one segment
(8 positions). The second to last column in both phase 2 and
phase 3 shows the restoration of G to its original value (before permutation).
The last column in phase 3 shows the effects of rotating
G down one segment. Fig.3 demonstrates that, after phase 3,
all eight products are available for the evaluation of Eq.(2).


In the general case, a window could be an M by M' rectangle
(M<M'), and M might not be a power of 2. Let
$M = M_1 + M_2 + \cdots + M_q$, where $M_d$ is an integer power of 2, and
d = 1,2,...,q. Algorithm CUBE-N2 can be modified by looping k
from n to n+m-1 inside another loop on d from 1 to q. For
each value of d, the variable m takes the value $\log_2(M_d)$. In the
outermost loop, t increments from 0 to M'-1 to process columns
in the window. It is clear that the time complexity of the modified
CUBE-N2 algorithm is $O(M * M')$.
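One way to obtain such a decomposition is to read off the binary expansion of M. The text only requires that each part be a power of 2, so the choice of the binary expansion, like the function name `power_of_two_parts`, is an assumption of this sketch.

```python
def power_of_two_parts(M):
    """Split M into distinct powers of 2 (its binary expansion), largest first."""
    return [1 << b for b in range(M.bit_length() - 1, -1, -1) if (M >> b) & 1]
```

For example, a window height of 13 would be handled as segments of 8, 4 and 1 rows, with m taking the values 3, 2 and 0 in turn.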


V. SHUFFLE-EXCHANGE MODEL WITH N² PEs


An $O(M^2)$ convolution algorithm for an $N^2 = 2^q$ shuffle-exchange
array processor can be arrived at by simulating the routes in procedure
CUBE-N2, using the technique used in [DeNS81]. We shall
use the same logically two-dimensional view of PEs and the same
initial configuration described in Section III and Section IV. Initially,
$A(i,j) = G_{ij}$ for $i,j \in [0,N-1]$, and the M by M window is
stored in the control unit memory in column-major order with the
starting address a. The resulting similarity measures will be in the
register C.


The basic steps in the shuffle-exchange model convolution algorithm
are the same as those used in CUBE-N2. The only
difference is the data routing. In the shuffle-exchange model, we
have to employ EXCHANGE, SHUFFLE, and UNSHUFFLE to
simulate data transmission in the hypercube model. In the hypercube
model, PE(p) is directly connected to PE($p^{(b)}$) and the fundamental
data transmission is $R(p^{(b)}) \leftarrow R(p)$ for $0 \le b < q$.
Clearly, the following subroutine in the shuffle-exchange model
implements the basic data transmission in the hypercube model.
procedure TRANSMIT(R: register; b: integer)
begin
  for i := 1 to b do R(UNSHUFFLE(p)) ← R(p);
  R(p^(0)) ← R(p);
  for i := 1 to b do R(SHUFFLE(p)) ← R(p);
end;
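That TRANSMIT reproduces a hypercube exchange along dimension b can be verified with a short simulation. This is a sketch under the assumption that registers are held in a Python list indexed by the q-bit PE number; the helper names `shuffle`, `unshuffle`, `route_by` and `transmit` are hypothetical.

```python
def shuffle(p, q):
    """Rotate the q-bit index p left by one bit (SHUFFLE)."""
    return ((p << 1) | (p >> (q - 1))) & ((1 << q) - 1)


def unshuffle(p, q):
    """Rotate the q-bit index p right by one bit (UNSHUFFLE)."""
    return (p >> 1) | ((p & 1) << (q - 1))


def route_by(R, dest):
    """One unit-route: every PE p sends its register to PE dest(p)."""
    out = [None] * len(R)
    for p, v in enumerate(R):
        out[dest(p)] = v
    return out


def transmit(R, b, q):
    """Simulate TRANSMIT: b unshuffles, one exchange, b shuffles.

    Returns the new registers and the number of unit-routes used.
    """
    unit_routes = 0
    for _ in range(b):
        R = route_by(R, lambda p: unshuffle(p, q))
        unit_routes += 1
    R = route_by(R, lambda p: p ^ 1)        # exchange connection
    unit_routes += 1
    for _ in range(b):
        R = route_by(R, lambda p: shuffle(p, q))
        unit_routes += 1
    return R, unit_routes
```

With 16 PEs (q = 4), `transmit(list(range(16)), b, 4)` leaves register p holding the value that started in PE p with bit b complemented, using 2b+1 unit-routes.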


Note that the SHUFFLE and UNSHUFFLE connections are com- 
plementary operations. If several UNSHUFFLE operations are fol- 
lowed by the same number of SHUFFLE operations, the result will 
recover the search area to its original value. Since SHUFFLE or
UNSHUFFLE rotates the bit positions in the binary representation
of index p, the selectivity function must be adjusted
correspondingly where necessary.


Procedure ROTATE(R, s, r, flag), developed in Section II for the
hypercube model, should be modified to perform rotation for the
shuffle-exchange model. The parameters in the new procedure
PS-ROTATE retain the same meanings as in the procedure
ROTATE.


procedure PS-ROTATE(R, s, r, flag);
begin
  for i := 1 to s do R(UNSHUFFLE(p)) ← R(p);
  R(p^(0)) ← R(p);
  for i := 1 to r-s do begin
    R(UNSHUFFLE(p)) ← R(p);
    if flag = 0 then
      R(p^(0)) ← R(p), (p_{q-1} = 1 & ... & p_{q-i} = 1)
    else (* Note that & is a logical AND operator *)
      R(p^(0)) ← R(p), (p_{q-1} = 0 & ... & p_{q-i} = 0);
    end; (* end of if-then-else *)
  end; (* end loop i *)
  for i := 1 to r do R(SHUFFLE(p)) ← R(p);
end;


It is important to note that PS-ROTATE requires only twice as
many unit-routes as does ROTATE in the hypercube model. Recall
that each PE in a shuffle-exchange network is connected to up to three
PEs, while in a hypercube each PE is connected to q other PEs.


The parallel convolution algorithm SHEX-N2 for a shuffle-exchange
network with $N^2$ PEs is obtained by (i) replacing the
data transmitting statement $R(p^{(b)}) \leftarrow R(p)$ with procedure
TRANSMIT; and (ii) substituting PS-ROTATE for ROTATE in
both the recursive procedure GRAY and algorithm CUBE-N2 from
Section IV. SHEX-N2 also requires M local memory locations for
each PE. As mentioned above, PS-ROTATE takes about twice as
many unit-routes as does ROTATE in the hypercube model. Thus,
it takes $2\log_2(N/M)$ unit-routes to rotate G up or down one segment.
Similarly, rotating G left one column takes $2\log_2 N$ unit-routes.


Now we need to investigate the number of unit-routes required by
procedure GRAY with parameter U. By definition of the GRAY
sequence in Section IV, $S_U = S_{U-1}, U, S_{U-1}$, where $S_U$ represents the
sequence of bit positions on which the GRAY sequence with $2^U - 1$
integers will take a complement operation to generate the
corresponding integers. It is clear that $S_U$ is the same as an
inorder traversal list of a complete binary tree with its root
labelled U, and every other node of level z labelled U-z+1.
From this fact, an interesting property of the GRAY sequence is
described in the following lemma.


Lemma 4: In $S_U$, corresponding to the GRAY sequence with
parameter U, the number of integer r's, $1 \le r \le U$, is $2^{U-r}$.


Each integer r in $S_U$ corresponding to a GRAY sequence involves
a data transmission $A(p^{(r-1)}) \leftarrow A(p)$ in algorithm CUBE-N2.
Therefore, by procedure TRANSMIT, it will take 2r-1 unit-routes
in the shuffle-exchange model. The inter-PE communication time
in procedure GRAY with parameter U is

$$\sum_{r=1}^{U} (2r-1) * 2^{U-r} = 3 * 2^U - 2U - 3.$$
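Both Lemma 4 and the closed form of this sum can be confirmed by enumeration. The sketch below assumes that $S_0$ is the empty sequence, which the text implies since $S_U$ has $2^U - 1$ entries; the names `s_labels` and `gray_unit_routes` are hypothetical.

```python
def s_labels(U):
    """S_U = S_{U-1}, U, S_{U-1}, with S_0 taken to be empty (labels 1..U)."""
    if U == 0:
        return []
    inner = s_labels(U - 1)
    return inner + [U] + inner


def gray_unit_routes(U):
    """Unit-routes for GRAY(U) when each label r costs 2r - 1 unit-routes."""
    return sum(2 * r - 1 for r in s_labels(U))
```

For U = 2 the sequence is `[1, 2, 1]`, which costs 1 + 3 + 1 = 5 unit-routes, matching 3 * 2^2 - 2 * 2 - 3 = 5.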


Therefore, phase 2 takes

$$\sum_{k=0}^{m-1} (3 * 2^k - 2k - 3) + \sum_{k=0}^{m-1} (2k+1) = 3 * 2^m - 2m - 3 = 3M - 2\log_2 M - 3$$

unit-routes. The overall time complexity of the SHEX-N2 algorithm is

$$\sum_{t=0}^{M-1} \big(M + 2((3M - 2\log_2 M - 3) + 2\log_2(N/M)) + 2\log_2 N\big) < M(7M + 6\log_2 N) \le 2M \max(7M, 6\log_2 N) = O(M^2).$$


Note that the complexity of algorithm CUBE-N2 is $O(M^2)$, while
SHEX-N2 uses only about three times as many unit-routes as does CUBE-N2
in the hypercube model. However, each PE in the shuffle-exchange model is
connected to at most 3 PEs, instead of q PEs as in the
hypercube model.


VI. ARRAY PROCESSORS WITH DIFFERENT 
NUMBERS OF PEs 


The algorithms in the previous sections were developed for array
processors in which the number of PEs equals the size of the
search area. We now demonstrate how algorithm CUBE-N2 can be
modified so that it applies to other numbers of PEs. The
modification of the algorithms for other types of array processors
would be similar to this example.


First, let us consider the case of $L^2$ PEs, where M<L<N.
Again we assume that L is an integer power of 2. The search area
G can be partitioned into $(N/L)^2$ equal-sized blocks, where each
block $G_{x,y}$ is L by L for $0 \le x, y < N/L$. For a fixed y, we
call procedure CUBE-N2 for blocks from $G_{0,y}$ to $G_{x,y}$, where
x=(N/L)-1. Then we repeat the procedure call from y=0 to
y=(N/L)-1. The only modification we have to make is to procedure
ROTATE. In algorithm CUBE-N2, there are three places that call
ROTATE: one for rotate up, one for rotate down, and one
for rotate left. Instead of performing a rotate or end-around shift
for those boundary elements, we must shift G in from the
neighboring (either down, up, or left) blocks. ROTATE can be
easily modified to add this feature by reading the neighboring
elements before performing the rotation. For example, to shift a
block left one position, we can add the following statement to the
beginning of ROTATE.


read block[x*L : (x+1)*L-1; (y+1)*L+t : (y+2)*L+t-1]
     (p_{n-1:0} = 0...0);


Note that the leftmost column of the block to the right of the
current block is read. Here $n = \log_2 L$.


An alternative approach does not involve modifying ROTATE. 
We can read the neighboring elements into another register, for 
instance, B. Suppose that we want to shift G up 4 positions. The 
top four rows of PEs (or Bs) will be loaded with the element 
values of the top four rows of the blocks below it. Before we per- 
form the shift on A, we exchange the contents of A and B. 
Then, we perform ROTATE to get the desired result. 


Now we present algorithm CUBE-L2 for a hypercube with $L^2$ PEs.
G is initially stored in the I/O system and can be read block by
block. The window W is stored in the control unit memory in the
same way as was specified in CUBE-N2. Once it is obtained, the
resulting similarity measure C is sent to the I/O system block by
block.


procedure CUBE-L2
begin
  for y := 0 to (N/L)-1 do
    for x := 0 to (N/L)-1 do begin
      read block[x*L : (x+1)*L-1; y*L : (y+1)*L-1];
      Modified-CUBE-N2(x, y);
    end;
end;


It is simple to show that the time complexity of CUBE-L2 is
$O(N^2 M^2 / L^2)$.
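The block traversal of CUBE-L2 and the resulting cost can be illustrated with a short sequential sketch. It assumes that N is a multiple of L and that each call of the modified CUBE-N2 costs on the order of M*M steps; the names `cube_l2_blocks` and `cube_l2_cost` and the default per-block cost are hypothetical and stand in for the real parallel procedure.

```python
def cube_l2_blocks(n, l):
    """Yield (x, y, first_index_range, second_index_range) in the order
    in which CUBE-L2 reads blocks: y in the outer loop, x in the inner."""
    assert n % l == 0, "assumption: N is a multiple of L"
    for y in range(n // l):
        for x in range(n // l):
            yield x, y, (x * l, (x + 1) * l - 1), (y * l, (y + 1) * l - 1)


def cube_l2_cost(n, m, l, per_block=lambda m: m * m):
    """Sum an assumed per-block cost (default M*M) over all blocks."""
    return sum(per_block(m) for _ in cube_l2_blocks(n, l))


# (N/L)^2 blocks, each costing M^2, gives N^2 * M^2 / L^2 in total.
assert cube_l2_cost(8, 2, 4) == 8 * 8 * 2 * 2 // (4 * 4)
```

With N = 8 and L = 4 there are four blocks, so the total is four times the per-block cost, matching the stated order of growth.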


Algorithm CUBE-N2M2, developed in [FaLN85], is a parallel
template matching algorithm for an array processor with $N^2 M^2$
PEs. By combining algorithms CUBE-N2M2 and CUBE-N2, one can
obtain an efficient parallel convolution algorithm, CUBE-N2K2,
for a hypercube with $N^2 K^2$ PEs, where $1 \le K \le M$. Algorithm
CUBE-N2M2 has a time complexity of $O(\log_2 N \cdot \log_2 M)$.
Meanwhile, algorithm CUBE-N2K2 has a time complexity of
$O(M^2/K^2 + \log_2 N \cdot \log_2 K)$. One readily sees that when K=1,
CUBE-N2K2 works exactly like CUBE-N2, and when K=M,
CUBE-N2K2 works as CUBE-N2M2. When $K = M/\log_2 M$ and
there are $N^2 M^2/(\log_2 M)^2$ PEs, it is interesting to see that the
complexity of the algorithm CUBE-N2K2 is still
$O(\log_2 N \cdot \log_2 M)$.


VII. COMPUTATION FOR GENERALIZED
CONVOLUTION


For some applications we need to normalize both arrays (search
area and window) before the computation of convolution so that
the similarity measure is a generalized digital convolution, i.e.,
normalized correlation. First we define


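\[ H = \sum_{e=0}^{M-1} \sum_{t=0}^{M'-1} G_{s+e,\,g+t} \]

\[ Z = \sum_{e=0}^{M-1} \sum_{t=0}^{M'-1} G_{s+e,\,g+t}^{2} \]

\[ T = \sum_{e=0}^{M-1} \sum_{t=0}^{M'-1} W_{e,t} \]

\[ D = \sum_{e=0}^{M-1} \sum_{t=0}^{M'-1} W_{e,t}^{2} \]

The normalized correlation is thus defined as

\[ \frac{C - \dfrac{HT}{MM'}}{\sqrt{\left(Z - \dfrac{H^{2}}{MM'}\right)\left(D - \dfrac{T^{2}}{MM'}\right)}} \]
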
In each PE(p), we use four registers MSI(p), MWD(p), MSIS(p)
and MWDS(p) to store H, T, Z, and D, respectively. The
modifications to algorithm CUBE-N2 include the following:


1. After a window column is read to local memory, accumulate
   M[MAR(p)] into MWD(p) and M[MAR(p)]^2 into MWDS(p).

2. During the execution of multiplication-addition in both the
   main loop body and in procedure GRAY, accumulate A(p)
   and A(p)^2 into MSI(p) and MSIS(p), respectively.

3. Before exiting, calculate the final result from the contents of
   C(p) and the four new memory locations.


The complexity of the algorithm generating the normalized
correlation coefficient is still $O(M \cdot M')$.
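The bookkeeping with four running sums can be illustrated sequentially. The sketch below is not the parallel algorithm: it assumes the standard normalized correlation coefficient built from the four sums H, Z, T, D and the raw correlation C over an M by M' window, and the function name `normalized_correlation` and its argument layout (lists of lists, window anchored at `row`, `col`) are hypothetical. It also assumes that neither the patch nor the window is constant, since the denominator would otherwise be zero.

```python
import math


def normalized_correlation(G, W, row, col):
    """Normalized correlation of window W with the patch of G whose
    top-left corner is (row, col), using one pass and five accumulators."""
    M, Mp = len(W), len(W[0])
    n = M * Mp
    C = H = Z = T = D = 0.0
    for e in range(M):
        for t in range(Mp):
            a = G[row + e][col + t]
            w = W[e][t]
            C += a * w      # raw correlation
            H += a          # sum of search-area values
            Z += a * a      # sum of squared search-area values
            T += w          # sum of window values
            D += w * w      # sum of squared window values
    denom = math.sqrt((Z - H * H / n) * (D - T * T / n))
    return (C - H * T / n) / denom


# A patch that is an affine copy of the window (2*W + 1) correlates perfectly.
W = [[1, 2], [3, 5]]
G = [[0, 0, 0], [0, 3, 5], [0, 7, 11]]
assert abs(normalized_correlation(G, W, 1, 1) - 1.0) < 1e-9
```

Only a constant number of extra accumulations is added per multiply-add, which is consistent with the statement that the order of complexity does not change.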


VIII. CONCLUSIONS


Two-dimensional digital convolution, a basic operation in image
processing, is computationally time consuming on traditional
machines. We have proposed several parallel algorithms for array
processors with different types of interconnection networks. PE
operations can be carried out between the data movement steps
among PEs, eliminating many unnecessary data transmissions.
Our approach employs a small local memory requiring only M
locations for each PE.


With $N^2$ PEs and $M = M' = 2^m$, the communication time
complexity of all three networks has the same order of magnitude,
$O(M^2)$. The degree of communication (connection links) of each
PE is 4, q, and 3, for mesh, hypercube, and shuffle-exchange,
respectively. However, the mesh connection provides the smallest
communication complexity, as summarized in Table 1. This is due
to the inherent topological matching between the mesh network and
the 2-D search area.


Table 1. Communication Time Complexity

  Network            Communication time (unit-routes)
  mesh               M^2
  hypercube          3M^2 + M log2 N - 2M log2 M
  shuffle-exchange   M * max(7M, 6 log2 N)

Abstract


In this paper we consider the Mesh of Trees organization for
several fundamental tasks in computer vision. We illustrate the
suitability of this architecture by showing fast parallel
algorithms for many tasks in low to medium level vision. We
derive polylogarithmic algorithms for many problems on digitized
pictures such as identifying and labeling figures in a 0/1 image,
drawing digitized straight lines, computing convexity properties
of digitized images, determining distances, etc. For all the above
problems, the Mesh of Trees organization has superior time
performance compared with the pyramid and mesh connected
computer of corresponding size.


I. INTRODUCTION 


Mesh Connected Computers (MCC) and Cellular Arrays have been
considered suitable for image processing, since images can be
naturally mapped onto these so that neighboring pixels are mapped
onto neighboring processing elements [MILL 85, ROSE 83]. But
solving many image processing problems on an N×N 2-MCC
requires as much as O(N) time. An alternate parallel
architecture is the pyramid organization, which has logarithmic
diameter [TANI 83, MILL 84]. The hierarchical organization of
the pyramid results in logarithmic time performance for many
image processing tasks. However, many other problems require as
much as $O(N^{1/2})$ time on an N×N base pyramid [MILL 84]. In
this paper, we explore the suitability of the Mesh of Trees for
image processing applications.


The Mesh of Trees organization has been introduced to solve
several problems such as sorting, matrix multiplication and some
graph problems [LGHT 81, NATH 83]. In this paper we propose
this as an alternate architecture to the pyramid computer and
evaluate its performance with respect to several fundamental
tasks in image processing. For many problems on digitized
images, the Mesh of Trees organization is shown to have
superior performance compared to either the pyramid or
2-MCC of corresponding size.


The rest of this paper is organized as follows. In the next
section, we define the structure of the Mesh of Trees and the
standard operations on it which will be used throughout the
paper. In section III, we present O(log N) parallel algorithms
for the following problems: determining the extreme points of
the convex hull of a figure, enumerating and deciding the PEs
within the convex hull of a figure, deciding if two sets of
processors are linearly separable, finding the PEs along a
digitized straight line, determining the nearest neighboring
figure to a figure, etc. We also derive polylogarithmic
algorithms for several problems including labeling 0/1 images,
estimating convexity properties for digitized pictures having
multiple objects, etc. For several problems, the solution on the
Mesh of Trees turns out to be considerably simpler than those on
the pyramid. Due to space limitations, only proof sketches are
provided. Details appear elsewhere [PRAS 86b].


II. MESH OF TREES


The Mesh of Trees network can be looked upon as an N×N matrix
of processors in which each row and each column of processors
forms the leaves of a binary tree [LGHT 81, NATH 83]. The root
and the internal nodes of each binary tree are also processors.
This structure is called an (N×N)-MOT for short. The $N^2$ leaf
processors form the base of the network and are called base
processors (BPs). Most of the processing is done by the BPs. The
internal processors (IPs) are used for communication between BPs.
During the course of this communication, the IPs may also be
required to carry out some simple operations such as summing and
extracting the minimum on the data. A (4×4)-MOT and its layout
are shown in figure 1, where the BPs are represented by white
circles and the IPs are represented by black circles. Any two
adjacent rows or columns of the base are O(log N) distance apart.
Since there are N rows and N columns, the total VLSI area of the
layout is $O(N^2 (\log N)^2)$, which can be shown to be optimal
[LGHT 81]. Notice that, in the standard VLSI model, the (N×N)
2-MCC and the (N×N) pyramid have an $O(N^2)$ area requirement.


Fig. 1 (4×4)-MOT and its layout


The following are some of the commonly used com- 
munication operations on the Mesh of Trees [NATH 83]. 


1) ROOTOLEAF - The contents of the data register of the root of
the corresponding tree are broadcast to the leaves of the tree.


2) LEAFTOROOT - Selector specifies one BP whose register R
contents are sent to the root of the corresponding tree.


3) LEAFTOLEAF - This can be expressed as a sequence of
LEAFTOROOT and ROOTOLEAF operations.


The time required for the above operations is O(log N), which is
the height of the tree.
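The three operations and their logarithmic step counts can be mimicked on a single row or column tree. This is a sequential simulation under stated assumptions, not a description of the hardware: the number of leaves is taken to be a power of two, one tree level is traversed per step, and the function names `root_to_leaf`, `leaf_to_root` and `leaf_to_leaf` are hypothetical.

```python
def root_to_leaf(value, n):
    """Broadcast value from the root to n leaves, one tree level per step.
    Returns (leaf values, number of steps)."""
    level = [value]
    steps = 0
    while len(level) < n:
        level = [v for v in level for _ in (0, 1)]  # each node copies to 2 children
        steps += 1
    return level, steps


def leaf_to_root(leaves, selected):
    """Send the register of the selected leaf up to the root.
    Returns (value at root, number of steps)."""
    level = [v if i == selected else None for i, v in enumerate(leaves)]
    steps = 0
    while len(level) > 1:
        # each parent forwards whichever child carries the selected value
        level = [a if a is not None else b
                 for a, b in zip(level[0::2], level[1::2])]
        steps += 1
    return level[0], steps


def leaf_to_leaf(leaves, src):
    """LEAFTOROOT followed by ROOTOLEAF."""
    value, up = leaf_to_root(leaves, src)
    out, down = root_to_leaf(value, len(leaves))
    return out, up + down
```

For eight leaves each one-way operation takes three steps, the height of the tree, and LEAFTOLEAF takes twice that.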


III. PARALLEL GEOMETRIC ALGORITHMS


In this section we consider several basic tasks on digitized
images and derive fast parallel algorithms for these tasks. For
all our algorithms the input is an N×N image with the base
PE(i,j) storing the pixel(i,j).


3.1 Connected Components


Given a 0/1 image, a fundamental task is to identify figures in
the image. Figures correspond to connected 1’s in the image. The
labeling problem is to identify and associate a unique ID with
the connected 1’s in the image. An essential part of this
algorithm is:


Lemma 1: [NATH 83] Given an N×N adjacency matrix
of an N-vertex graph stored in the BPs of an (N×N)-MOT,
the connected components can be determined in
O((logN )°) time. 

Theorem 1: Given an N×N 0/1 image, all figures can
be labeled in O((logN )*) time using an (N XN)-MOT. 


Proof sketch: The basic idea is to label blocks of size k×k and
merge 4 adjacent blocks to form bigger size blocks. This is
repeated until the block size becomes N×N. To merge 4 blocks of
size k×k, we look at the boundary of the blocks as shown in
figure 2. At the boundary between two blocks we can associate an
adjacency graph as follows: each figure incident on the boundary
corresponds to a vertex. Two figures adjacent to each other at
the boundary have an edge between them. There can be O(k) edges
and O(k) nodes across the boundary of two adjacent blocks. We
then convert the O(k) nodes and edges into adjacency matrix
format using the basic data movement operations in O(log k)
time. Now using lemma 1, the connected components of
the adjacency graph is found in O((logk )*) time. Repeat- 
ing this along the horizontal boundary, we can determine 
the new labels within a block of size 2k×2k. This information is
propagated to the outer boundary of the block of size 2k×2k as
shown in figure 2. Repeating this log N times, all the figures
are labeled. By traversing the above steps in reverse, we can
relabel all the figures. □

Fig. 2 Merging 4 blocks of size k×k


3.2 Convexity Algorithms 


Now we consider convexity problems. We use the 
following definition of convexity: A set of PEs is said to 
be convex if and only if the corresponding set of integer 
lattice points is convex. Given a set S of PEs, the con- 
vex hull of S, denoted Hull(S), is the smallest convex set 
of PEs containing S. 


Theorem 2: In an (N×N)-MOT, in O(log N) time one
can identify the extreme points of 1’s in an N×N image.


Proof sketch: The algorithm operates in two steps. It first finds
the Right Most (RM) and the Left Most (LM) 1 in each row using
the LEAFTOROOT and ROOTOLEAF operations. Then it verifies
whether these points are extreme points or not. The verification
is done by having the ith column compute the enclosing angle
made by the 1s in the image with the RM of the ith row ($RM_i$).
This angle is defined to be the smallest angle $\phi_i$ such that
all the 1s are enclosed inside the region $X\,RM_i\,Y$ as shown in
figure 3. This can be easily computed in O(log N) time, using
ROOTOLEAF and LEAFTOROOT operations. $RM_i$ is an extreme point
if and only if $\phi_i < 180$ degrees. This step is repeated for
the LMs. □


Fig. 3 Enclosing angle $X\,RM_i\,Y$
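The two-step test in the proof sketch of Theorem 2 can be reproduced sequentially. The sketch below is a plain single-processor illustration, not the tree-based procedure: it collects the leftmost and rightmost 1 of every row as candidates and keeps a candidate only if the smallest wedge at that point containing all other 1s is narrower than 180 degrees. The function names and the use of `atan2` with a sort to obtain the wedge are assumptions made for the example.

```python
import math


def row_extremes(image):
    """Leftmost and rightmost 1 of every non-empty row, as (row, col) pairs."""
    pts = set()
    for i, row in enumerate(image):
        cols = [j for j, v in enumerate(row) if v == 1]
        if cols:
            pts.add((i, cols[0]))
            pts.add((i, cols[-1]))
    return pts


def enclosing_angle(p, points):
    """Smallest angle at p, in degrees, whose wedge contains all other points."""
    angles = sorted(math.atan2(q[0] - p[0], q[1] - p[1])
                    for q in points if q != p)
    if not angles:
        return 0.0
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 2 * math.pi - angles[-1])  # wrap-around gap
    return math.degrees(2 * math.pi - max(gaps))


def extreme_points(image):
    """Row extremes whose enclosing angle is strictly below 180 degrees."""
    ones = [(i, j) for i, row in enumerate(image)
            for j, v in enumerate(row) if v == 1]
    return sorted(p for p in row_extremes(image)
                  if enclosing_angle(p, ones) < 180.0 - 1e-9)
```

On a 3 by 3 block of 1s the four corners have an enclosing angle of 90 degrees and are reported, while the edge midpoints have exactly 180 degrees and are rejected.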


Lemma 2: In an (N×N)-MOT, suppose the extreme
points of a set S have been marked. Then,
a. in O(log N) time the extreme points of S can be
enumerated.
b. in O(log N) time the base PEs within the hull
can be marked.


Using Lemma 2 we have: 


Corollary: Using an (N×N)-MOT, it can be decided if
two sets of base processors are linearly separable in
O(log N) time.

Theorem 3: Using an (N X N)-MOT, 
a. the diameter of a figure can be determined in 
O(logN ) time. 
b. a smallest enclosing box, a smallest enclosing 
circle for a figure can be determined in O(logN) 
time. 


Proof sketch: a) The algorithm finds the diameter by finding the
maximum of the distances between any two points on the boundary.
The LM and the RM of each row having an extreme point are
broadcast to all the BPs along that row. The ith column computes
the distance between the extreme points in the ith row and the
rest of the extreme points. In the next step, the maximum of
those values is obtained in O(log N) time.


b) Given a set S of points in the plane, a smallest enclosing box
is a rectangle of least area containing S. This rectangle must
contain an extreme point of S on each side, and at least one of
its sides must contain two consecutive extreme points [FREE 75].
Therefore, after finding the extreme points of the convex hull
enclosing the points, using theorem 2, each pair of consecutive
extreme points is assumed to form the base of a rectangle. In
O(log N) time, the extreme points of the convex hull of the
figure lying on the other three sides can be found. Then the
area of each of these rectangles is obtained. Using the data
movement operations of section II, the rectangle with the
minimum area is identified as the smallest enclosing box.
Similarly, the smallest enclosing circle is a circle of least
area containing S. This can also be computed in O(log N) time
using a property of a smallest enclosing circle [FREE 75], and
using the following fact [VOSS 82]: in an N×N digitized image, a
figure can have at most $O(N^{2/3})$ extreme points. Details can
be found in [PRAS 86b]. □
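Part (a) of the proof reduces the diameter to a maximum over pairwise distances between row extremes. A brute-force sequential sketch of that reduction follows. It is an illustration only: it uses the leftmost and rightmost 1 of every non-empty row as candidates (a superset of the rows that hold extreme points), Euclidean distance between pixel coordinates, and a hypothetical function name `figure_diameter`.

```python
import math
from itertools import combinations


def figure_diameter(image):
    """Largest Euclidean distance between any two row-extreme 1 pixels."""
    candidates = []
    for i, row in enumerate(image):
        cols = [j for j, v in enumerate(row) if v == 1]
        if cols:
            candidates.append((i, cols[0]))          # leftmost 1 of the row
            if cols[-1] != cols[0]:
                candidates.append((i, cols[-1]))     # rightmost 1 of the row
    return max((math.dist(p, q) for p, q in combinations(candidates, 2)),
               default=0.0)
```

Since the two points realizing the diameter are always hull vertices, and every hull vertex is the leftmost or rightmost 1 of its row, restricting the search to these candidates does not change the result.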


By combining adjacent convex hulls we can show: 


Theorem 4: In an (NXN)-MOT, one can determine 
and enumerate the extreme points of the convex hull of 
all figures simultaneously in O((logN )*) time. 


Proof sketch: The idea is to find convex hulls locally within
blocks of size k×k, and to merge the convex hulls within 4
adjacent blocks to construct convex hulls of figures within
blocks of size 2k×2k, $1 \le k \le N/2$. Notice that given two
adjacent blocks of size k×k, there can be at most O(k) figures
that run across the boundary between the blocks. Thus, O(k)
convex hulls need to be merged. Merging of convex hulls involves
finding tangent lines pq and rs as shown in figure 4. This can
be done by a binary search on the extreme points of A and B.
Using a suitable representation for the convex hulls and a basic
technique in [OVER 80] we can show:


Lemma 3: Given convex hulls within blocks of size 
k Xk, the convex hulls can be merged to yield the con- 
vex hull of figures in block of size 2k X2k, in O((logN)’) 
time simultaneously for all the figures. 


A basic step in the proof of the above lemma is to
gather information about the relationship of a given line
xy (joining an extreme point of A with an extreme point
of B) to the convex hulls A and B. If this line is tangential
to both A and B, then we are done; otherwise we
need to identify one of several cases [OVER 80]. A crucial
step is to identify this information in parallel for all
the O(k) figures. By recursively representing information
about the convex hulls and using efficient data move- 
ment this can be done in O((logk )*) time [PRAS 86b]. o 

Fig. 4 Merging of convex hulls


3.3 Distance Problems


Now we consider several distance problems. In the following
discussion we use the $l_1$ metric. However, it can be modified
to operate for any $l_p$ metric.


Theorem 5: Using an (N X N)-MOT, 
a. the nearest neighbor 1 to each PE having a 1 
can be found in O(logN ) time. 
b. the nearest neighboring figure to a given figure 
can be found in O(logN) time. 


Proof Sketch: a) The algorithm starts by finding the nearest
neighboring 1, along the X and Y axes, to each BP. This can be
easily implemented by traversing the row and column trees. At the
end of this step, each PE has the coordinates of the nearest PE
with a 1 along the X and Y axes. Then each PE having a 1
broadcasts its coordinates to all the PEs within its nearest
neighbor 1 in each direction. Each PE calculates the minimum
distance between the broadcast address and its nearest neighbor 1
along the X and Y axes. In the return movement, the minimum of
all these distances is sent back to the PE having a 1, which sets
its nearest neighbor.

b) The idea of the algorithm is to first find the nearest
neighbor for all the boundary PEs of the figure, and then to find
the nearest figure by finding the minimum among all those values
obtained for the PEs at the boundaries. Details will appear in
the full paper. □

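As a reference against which a parallel nearest-neighbor routine could be checked, the result of part (a) can be computed by brute force. The sketch below deliberately swaps the tree-based broadcast scheme for an exhaustive search and assumes the city-block distance |di| + |dj| between pixel coordinates; the function name `nearest_neighbor_l1` and the tie-breaking by coordinate order are choices made only for this example.

```python
def nearest_neighbor_l1(image):
    """Map every 1 pixel to its nearest other 1 pixel under the
    city-block distance (assumed metric); ties go to the smaller coordinate."""
    ones = [(i, j) for i, row in enumerate(image)
            for j, v in enumerate(row) if v == 1]
    result = {}
    for p in ones:
        others = [q for q in ones if q != p]
        if others:
            result[p] = min(
                others,
                key=lambda q: (abs(p[0] - q[0]) + abs(p[1] - q[1]), q))
    return result
```

The exhaustive search inspects every pair of 1 pixels, so it is useful only as a correctness check on small images.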
A related problem is the closest pair problem, for which
efficient solutions are known on the pyramid computer [STOU 85].
This can be easily solved on the Mesh of Trees:

Corollary: Given an (N×N)-MOT, the closest pair problem can be
solved in O(log N) time.


IV. CONCLUSION 


In this paper we showed that many of the image processing tasks
which require O(N) computation time on an (N×N) mesh connected
computer can be computed in logarithmic time using an (N×N) Mesh
of Trees. Indeed, for several problems on digitized images, the
Mesh of Trees organization seems to be a natural candidate as
opposed to the pyramid organization. This is illustrated by
simple and elegant parallel algorithms for these problems.
Abstract -- A criterion is presented for validating the
optimality of solutions acquired by parallel A* and AO*
algorithms using multiple processors and a global OPEN queue or
local best solution graph selection. It is shown analytically
that the parallel A* algorithm with an OPEN queue can achieve
linear speedups for up to a large number of processors when
processor contention to access the queue is small. For the
parallel AO* algorithm, utilizing Nilsson’s solution graph cost
revision in every processor gives a much higher performance than
ordering solution graphs in a global queue.

Introduction 


The introduction of Very Large Scale Integration has enabled
the construction of massively parallel systems [1,2]. In the
field of Artificial Intelligence, parallel processing has lately
been recognized as a powerful tool for accelerating the execution
of search algorithms. In this paper we concentrate on both the
state-space search and the AND-OR search. An optimality
termination criterion for the parallel versions of the A* and
AO* algorithms is presented and its correctness demonstrated.
The solution that has the best merit at the time the criterion is
first satisfied is the optimal solution.


In the past, research on performance evaluation of search
algorithms concentrated on Branch-and-Bound search. Wah and
Ma [3] have proposed an architecture, called MANIP, that uses a
global register to hold the current best heuristic value. Wah has
also given the performance speedups for Branch-and-Bound
search [4,5]. Lai and Sahni [6] find that when using multiple
processors it is possible to be trapped in an anomaly where the
speedup dips below 1.0. Our emphasis, however, is on the
state-space search and the AND-OR search algorithms that utilize
admissible heuristic evaluation functions.

Algorithm E*

Throughout this paper we will use the same notation for
heuristic evaluation functions (HEFs) as given in Nilsson [7]. We
will refer to a heuristic evaluation function (HEF) as being
admissible if the cost estimate function h(n) of the HEF f(n)
satisfies the relation $h(n) \le h^*(n)$ for all nodes n. A HEF
f(n) is also said to be monotonic if $f(n_1) \le f(n_2)$ for any
pair of nodes $(n_1, n_2)$ such that $n_1$ is a predecessor of
$n_2$. We will call our parallel version of the A* algorithm the
E* algorithm. The algorithm E* runs on every PE in a system
consisting of p processors (PEs), a global OPEN queue for
ordering unexpanded nodes, and a shared variable α which is
initially set to ∞. The algorithm E* is identical to A* except
that every time a solution is found by the system, α is updated
to contain the cost of the least costly solution found thus far.
The purpose of α is explained in the next section.


Proofs for the Alpha Criterion 


Since our multiprocessor system is not synchronized, it is
possible to obtain a suboptimal solution before an optimal one is
found. It is therefore essential to derive a sufficient condition,
called the α-criterion in this paper, for determining the
optimality of a solution that has been acquired. We will omit the
proofs of the following sequence of propositions, which show that
a condition does indeed exist which optimally terminates the E*
algorithm. For the derivations of all the expressions in this
paper the reader should refer to [8].


Lemma 1: If an optimal solution exists, E*, without the
α-criterion, will acquire it.


Lemma 2: Before algorithm E* acquires the optimal solution there
exists at least one node n on the OPEN queue such that its
heuristic merit f(n) < α.


Theorem 1: If for every node $n_i$ on the OPEN queue the condition
$f(n_i) > \alpha$ holds for all i, then the least costly solution
that has been found by E* is the optimal solution.


The condition $f(n_i) > \alpha$ for all i is referred to as the
α-criterion.
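The role of the shared variable α and of the termination test can be illustrated with a single-processor simulation. This is a sketch under stated assumptions, not the parallel E* algorithm: nodes are expanded one at a time from a binary heap, a solution is recorded when a goal node is removed from OPEN, the search graph is a tree, and the names `alpha_criterion_met` and `best_first_with_alpha`, together with the callback interface, are hypothetical.

```python
import heapq
import math


def alpha_criterion_met(open_f_values, alpha):
    """True when every f value on OPEN is strictly greater than alpha."""
    return all(f > alpha for f in open_f_values)


def best_first_with_alpha(start, successors, h, is_goal):
    """Best-first search that keeps the cheapest solution cost in alpha and
    stops as soon as the alpha-criterion holds. Returns alpha."""
    alpha = math.inf                       # shared variable, initially infinity
    open_heap = [(h(start), 0, start)]     # entries are (f, g, node)
    while open_heap:
        if alpha_criterion_met((f for f, _, _ in open_heap), alpha):
            break                          # no OPEN node can improve on alpha
        f, g, node = heapq.heappop(open_heap)
        if is_goal(node):
            alpha = min(alpha, g)          # record the least costly solution
            continue
        for child, cost in successors(node):
            heapq.heappush(open_heap, (g + cost + h(child), g + cost, child))
    return alpha


# Tiny tree: two solutions of cost 6 and 5; the search stops without
# ever removing the costlier goal from OPEN.
tree = {"s": [("a", 1), ("b", 4)], "a": [("g1", 5)], "b": [("g2", 1)]}
cost = best_first_with_alpha(
    "s", lambda n: tree.get(n, []), lambda n: 0, lambda n: n.startswith("g"))
assert cost == 5
```

While α is still infinite the test can never succeed on a non-empty OPEN queue, so the search cannot stop before at least one solution has been found.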

Performance Evaluation


Assumptions for A* and E* 

Our purpose in analyzing the E* algorithm is primarily to
derive an upper bound for the number of nodes that E* needs to
expand in order to acquire the optimal solution. If the contention
by the PEs to access the global queue is minimal, then this upper
bound becomes indicative of the potential speedup possible for E*.
Therefore, in the analysis of the E* algorithm we will assume the
amount of contention is negligible for one of two reasons: (1) p
is small, so there is little contention, or (2) hardware support
for queue management is available for reducing the processor idle
time due to contention caused by queue management overhead.
Moreover, we make the following assumptions to simplify the
evaluation of algorithms A* and E*.

(1) There exists a unique optimal solution of cost N (=WN,). 
There are 7-1 other suboptimal solutions with costs 
N,,No,.--,N41. The heuristic merit f (n), where node lies 
on the i” solution path, lies in the range between 0 and N;. 
If node n is on a path that does not lead to any goal, then 
f(n) lies between 0 and co. The cost of each arc in the 
search graph is assumed to be 1. 


(2) The search graph is acyclic and every node except the root s 
has exactly one parent. 


(3) The heuristic evaluation function f is admissible and
    monotonic.


(4) $N_i < 2N$ for all i. The probability density of f(n), where n
    is on the i-th solution path, at any point before reaching its
    goal is $\dfrac{1}{N_i - g(n)}$, where the end points of the
    density function are g(n) and $N_i$.


(5) For evaluation, E* is assumed to execute on a
    cycle-synchronized multiprocessor.


The assumption of admissibility implies that A* will expand
a path from s up to a node n such that n is the last node on this
path to have $f(n) < f^*(s)$. Probabilistically, f(n) of the final
node n expanded along a suboptimal path is equally likely to be
less than N as to be greater than N. Therefore,


\[ N_i - N = N - g(n) \]


Expansion Cost for A* 


From Nilsson [7], if a HEF is admissible and monotonic then
a node on OPEN, excluding the optimal goal node, will be
expanded by A* if and only if f(n) < N. Henceforth, a node
expanded by A* will be referred to as a critical node, and
$E_J(z)$ will denote the expected number of nodes expanded by
algorithm J before the optimal solution is obtained.


E,.(z)=N+ Fat) 


s=1 


where E(L,) is the expected length of a suboptimal solution path 
before the search stops for that path. To compute E(L,;) we note 
that the expected length is at the point on the density function 
where the probability of f(n)<N equals the probability that 
f(n)>N. Therefore, 


E,.(z)=N+ 3 (2N-N;) 


i=1 


Expansion Cost for E* 


Assume that all p processors are used for every expansion
cycle. Then there is a distinct possibility that some of the nodes
expanded during a cycle may not be critical. We call these cycles
mixed cycles. A cycle during which only critical nodes are
expanded is called a perfect cycle. It can be shown that every
node expanded by A* will also be expanded by E*. When
monotonicity is considered as an additional constraint on the HEF,
any non-critical nodes expanded by E* in mixed cycles cannot
produce critical successors that may rank ahead of the nodes on
the optimal path and slow down the search. This leads to the
following theorem.


Theorem 2: If a HEF is admissible and monotonic, then the critical
nodes expanded by A* are exactly those critical nodes expanded
by E*.

Li and Wah [4] have shown that the number of mixed cycles
y in an E* search is less than or equal to N. Then $E_{E^*}(z)$ is


E,+(z) < E,.(z) | yp Em] 


s=1 


where $m_i$ is the number of critical nodes expanded during the
i-th mixed cycle. Knowing $y \le N$ and $m_i \ge 1$ for all i,
using the uniform density assumption for all the nodes, and
letting all suboptimal solutions have the same depth $N_1$, we
have


Bg.(z) < pN + (1-1)(2N-Ny) 


Now we can estimate the speedup $U_p$ for a p-processor E*
algorithm:


EB, +(z)p 


= E,+(z) 


2 6 (42%) 
oN + (141 ) (29-0, ) 


From the equations above we can see that the search speedup 
is at its best when we have a bushy search graph where the values 
of N, N,and y are large. To define the range of processors in 
which the speedup is almost p, we rearrange the speedup equation 
into 


p? N-pN 


p~P- 


For the speedup in the above equation to be approximately
p-1, p must be greater than 1 and less than


1+ /r+=(1 I 2N-N,, } As we can see this range 


increases for a larger and/or deeper search graph. 

Algorithm AO*

The AO* algorithm is the equivalent of the A* algorithm for
AND-OR problem reduction search. The purpose of AO* and its
parallel counterparts is to find a solution graph from s to the
terminal set T that has the minimum cost. The algorithm consists
of two major operations (see Nilsson [7]): (1) a top-down
graph-growing operation to find the best partial solution graph
(PSG) and (2) a bottom-up, cost-revising, connector-marking
operation called upward cost revision (UCR). Starting with the
node n just expanded, this step revises its cost using the newly
computed costs of its successors, and marks the outgoing connector
from n on the estimated best path to the terminals. This revision
process is repeated for every node in the PSG from the parent of
n to the start node s.


Assumptions for AND-OR Algorithms 

All the assumptions pertaining to the search graph for
state-space search are assumed valid for the PSG in AND-OR search.
In addition, every node in a PSG has l connectors and every
connector is a k-connector. The cost of every connector is exactly
k. The cost estimation function $f_i(s)$ (the estimated cost of
the i-th PSG $G_i$) is uniformly distributed between its known arc
cost and the actual total cost $N_i$.


Total Cycle Time for AO* 


From our assumptions it follows that the algebraic expressions
for $E_{AO^*}(z)$ and $E_{EO^*}(z)$ are identical to those for
$E_{A^*}(z)$ and $E_{E^*}(z)$, respectively, if the parameters are
properly interpreted. Unlike the cases for A* and E*, however,
the speedup performance for AO* is different because the
comparison cost in UCR must be taken into account. Similarly, the
speedup expression for EO* will also be different because of the
comparison costs in queue management and the processor idle time
spent waiting for PSG insertion into the global OPEN queue.


For the calculations that follow, let w be the unit cost for a
node expansion, and let

\[ e = \frac{\text{unit comparison cost}}{\text{unit expansion cost}} \]

where the unit comparison cost is the time taken to compare two
numerical values in a given computer system. The unit comparison
cost is used in computing the UCR comparison cost, and in the
search for a place in the OPEN queue when ordering a PSG. The
unit expansion cost is the cost of expanding a node in a PSG.


The AO* cycle time in the i-th cycle varies with the cycle
number, the height of the PSG being expanded in the i-th cycle,
the unit expansion cost and the unit comparison cost. In our
calculation, the height of the expanded PSG will be taken to be
the height of a uniform search graph that has been expanded
breadth-first.


The average height is 
hetghtae,(t) = flogy (ilk )] 


where i(cycle number ) = 1,2,3..... 
all cycles is 


The total execution time over 


‘10 #4?) 


aus 


s=1 


= feof 


e(i-1) 
In Lk 


TCT, (. +e(/-1) [heightarg (i 1) 


e(/-1) In(lk E,+(2 
In Ik 


) eI-1) 


In lk 


+ 


(In lk 14] 


Algorithm EO* and Its Total Cycle Time 


To execute AND-OR search in parallel, Algorithm EO*
employs a global queue to order PSGs. Since its implementation is
virtually identical to that of E*, EO* will also terminate when
the α-criterion is satisfied. However, in computing the
performance of EO* we will include the cost of comparison
intrinsic to the insertion operations in queue management, and
show that the comparison cost causes an unacceptable bottleneck
at the OPEN queue.
The average number of comparisons needed to locate a place to 


OPEN, 
insert in OPEN will be taken to be nage where |OPEN,| is 


the size of the OPEN queue in the i-th cycle. The expansion process
is partitioned into two phases: the initial phase in which
|OPEN_i| < p, and the second phase in which |OPEN_i| > p. Then


. E :(z) 
my EO é 


+5 (Bio: (2)(74) 


+ Byo-(z \(2 flog; p |pl +216: _epl*flog, p +) 


The E_EO*(x) term is the dominant term in the whole equation.
The speedup for EO* is tabulated in Table 1, where the
increasing size of the OPEN queue apparently increases the ordering
cost, which in turn increases the EO* single-cycle cost dramatically
over that of AO* as the cycle number increases.


Table 1
EO* Speedup (= TCT_AO* / TCT_EO*)
l = 3, k = 2, ε = 0.1,
y = 100, N = 50, N, = 75


Distributed Queueing AND-OR Search: The FO* Algorithm 


An algorithm that is more "distributed" than EO* is needed
to eliminate the global bottleneck. Such an algorithm, called FO*,
assumes that an interconnection network connects all PEs to enable
PSGs to be passed during the execution of FO*. However, for the
performance evaluation that follows, we assume that every PE has
one or more PSGs and that no communication between PEs takes
place. Every PE does UCR to determine the next best PSG to
expand. To terminate optimally, the system periodically checks if
f’(s)>qa is satisfied for all j, where f’(s) is the cost of the best
PSG in the j-th processor.


Cycle Types 

Unlike E* and EO*, FO* contains five distinct cycle types
classified according to the patterns with which critical PSGs (those
having f_i(s) < N) and non-critical PSGs (f_i(s) > N) appear in the
PEs. Let Q be the total number of critical PSGs in all the PEs at
an instant during execution. Then the five cycle types can be
described as:


(1) PB_a (p < Q)
The p-best type a cycle. Cycles belonging to this type have a
critical PSG as the local best PSG in every PE, and every
local best PSG is among the global p-best in the system.
PB_a is identical to the perfect cycle type in EO*.

(2) PB_b (p > Q)
The p-best type b cycle. This type corresponds to the mixed
cycle in EO*. This cycle expands not only all the critical
PSGs in the system but also some of the non-critical ones.

(3) NPB_a (p < Q)
The Not-p-best type a cycle. Every OPEN has a critical
PSG as its local best PSG, although not every one is global
p-best.

(4) NPB_b (p < Q)
The Not-p-best type b cycle with p < Q. At least one PE in
this cycle has a non-critical PSG, and therefore a non-global
best PSG, as its local best PSG.

(5) NPB_b (p > Q)
The Not-p-best type b cycle with p > Q. At least one PE in
this cycle contains at least two globally best PSGs.


We may now define two probability parameters and a parameter
for measuring the relative expansion power of an algorithm.
First, the existential probability P^t(I) is defined to be the probability
of occurrence of cycle type t in algorithm I during execution.
The expansion probability P^t_exp(I) is the probability of expanding
the optimal solution in a cycle of type t of algorithm I. The product
P^t(I)·P^t_exp(I) gives us the expansion power of cycle type t in
algorithm I, where the expansion power of a cycle type is the probability
that, given a random cycle, it is a cycle of that type that expands
the optimal solution. Using these two parameters we may
define a new parameter, the power ratio PR, which measures the
ratio of expansion powers of certain cycle types M and N in
algorithms C and D, respectively.


LPC) Pow (C) Epn (2) 
PR(CuDw) = pip) Pi,(D) Foul) 
JEN 


where E_DN(x) and E_CM(x) are the expected numbers of nodes
expanded by cycle set N in algorithm D and cycle set M in
algorithm C respectively.


Expected number of nodes for FO*

Let cycle type 1 and type 2 be the perfect cycle and the mixed
cycle in EO* respectively. Then


E,0+(2) 


E_,.(z) = ———————— 
ro‘(2) PR (FOi234,5 £91, ) 
<< #,(y{§ fm EO elzO ) Pop (EO")P2(EO*) 
e(Z]° =< pe PA SV lak 7 * i * 
= "EO Y Pi, (FO*)P!(FO*) X Pow (FO )P.(FO ) 
é=1,3,4 ‘= 


We will denote the first term inside the parentheses power ratio
PR_1, and the second term PR_2.


Assumptions for FO*


If the number of critical PSGs in a EO° cycle is taken to be 
Q, then we can assume that the probability of expanding the 


optimal PSG in this cycle to be S if all p<Q. Thus 


P31, (£O")== and P2,(EO*)=1.0 for the perfect and the mixed 


cycles of EO“ respectively. For the cycle types of algorithm FO“ 
we have: 


(1) Type 1. 

Same as the perfect cycle of EO“. That is, | scl (FO 9 : 
(2) Type 2. 

Same as the mixed cycle of HO“. That is, Pi,, (FO*)=1.0. 
(3) Type 3. 


In this cycle type every local best PSG is a critical PSG, but 
not every local best PSG is also global best. Again w esti- 


mate the expansion probability to be Pac (FO a5 , 
(4) Type 4. 


In this cycle there is at least one PE that has a non-critical,
non-global best as its local best PSG. Letting the number of
local best PSGs that are also global best be r, then


Pe: (FO"}=<, where r<p. 


(5) Type 5. 


In this cycle there exists at least one PE that has at least two 
critical PSGs as its top two PSGs. We use a lower bound 


1 ‘ 
G for P3(FO*), where G; is the number of global best 


PSGs in the PE that has the largest number of global best 
PSGs in the system. To understand this lower bound see [8]. 


Since we are only interested in an upper bound of Exo+(2); 
we will assume that P'(EO°) = P2(EO") =1.0. The existential 


probabilities of all five cycle types of FO* are computed in the
following section by using combinatorial analysis.


The Power Ratio and Combinatorial Analysis 


The existential probabilities can be determined by examining
how p-best PSGs rank among other PSGs in heuristic merit in
every processor. The number of combinations of p-best PSG rankings
in different PEs, given the number of local queues that do not
possess any p-best PSGs (known as the bad queues), is known as
an i-combination (denoted by BQ_i), where i is the number of
local queues lacking p-best PSGs. For example, if p = 5 and i = 2,
a possible configuration in the 2-combination is

queue 1: 1,4
queue 2: none
queue 3: 3
queue 4: 2,5
queue 5: none


where the numbers 1,4 in queue 1 indicate that queue 1 has the
first and the fourth best PSGs in the system. We also define a
pattern to be a set of numbers where each number corresponds to the
number of p-best PSGs in an unspecified queue. For instance, if
i = 2, then there are two possible patterns: 3-1-1 and 2-2-1, where
each number separated by a dash denotes the number of p-best
PSGs queued up at a local OPEN queue, without considering which
processor this queue belongs to. Every such number in the pattern is
referred to as a digit. The number of combinations in BQ_i
(combinations with i bad queues) can be written as:


[Poke 
ay 1=0 
. I ed 41 


where cj =0 for all J 1 <i < p-l, and 7 is the total number of 
patterns possible for 7 bad queues, p is the number of PE’s, ¢/ is 
the / digit of the ie pattern, k/ is the number of times the /** 
digit appears in the j pattern and m is the total number of dis- 
tinct digits in the 7 pattern. The existential probability of a 
cycle type 7 can be expressed as 


» BQn 

Pi(FO*) = “<7 
y 3a, 
n=0 
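
As an illustration of the pattern definition above, the following Python sketch enumerates the patterns for p PEs and i bad queues. It assumes (our reading of the 3-1-1 / 2-2-1 example, not a statement from the paper) that a pattern is a way of writing p as a sum of p - i positive digits with order ignored. The function name `patterns` is hypothetical.

```python
def patterns(p, bad):
    """List the patterns for p PEs with `bad` bad queues.

    Assumption: a pattern is a partition of p into exactly (p - bad)
    positive digits, written largest digit first.
    """
    def parts(total, count, largest):
        # Yield non-increasing tuples of `count` positive integers
        # summing to `total`, each at most `largest`.
        if count == 0:
            if total == 0:
                yield ()
            return
        for first in range(min(total - (count - 1), largest), 0, -1):
            for rest in parts(total - first, count - 1, first):
                yield (first,) + rest

    return list(parts(p, p - bad, p))


# The example from the text: p = 5 PEs, i = 2 bad queues.
print(patterns(5, 2))
```

For p = 5 and i = 2 this lists the digits of the two patterns named in the text, 3-1-1 and 2-2-1.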


Since every OPEN queue in FO* determines its best PSG by 
using local UCR, the equation for total cycle cost for FO" is identi- 


s Exot ag 
cal to that of AO except for a substitution of for 
E,9+(z). Therefore, 
(hE 5: (2) 
B,(2)|  €(I-1)n—2 
FO e(/-1) 
TCT...» < w]| ——— a 
rot = “(| a nik Inlk 


na ) (1+1nz}1 


Table 2 shows a decreasing trend for processor efficiency.
However, the rate of decrease is relatively small, and the speedup is
large enough to justify the use of local queueing over global
queueing for small p.


Table 2
FO* Measurements


Conclusion 


In this paper we have examined the performance of two
parallel search algorithms. When only the number of nodes
expanded is taken into account, we show that the speedup is linear
for a large number of processors. However, when one also considers
the cost of managing a global queue, especially in the context of
AND-OR search, then the availability of more processors in EO*
actually slows down execution. Finally, it has been shown that
algorithm FO*, which uses local UCR, performs much better than
EO*.


PARALLEL PREFIX ON FULLY CONNECTED
DIRECT CONNECTION MACHINES


Clyde P. Kruskal 
Department of Computer Science 
University of Maryland 
College Park, Maryland 20742 


ABSTRACT This paper presents an algorithm to perform
the parallel prefix operation on a linked list of items. Let N
be the length of the list, and let P be the number of processors.
The algorithm described here runs on a fully connected
direct connection machine (DCM) in time
O(N/P + P^2). It attains linear speedup for N = Ω(P^3).
Previously, linear speedup has only been attained on shared
memory models.


1. Introduction 


Graph algorithms for sequential machines have received 
intensive study, and by now the basics, at least, are well 
understood. In contrast, the study of parallel algorithms 
began fairly recently; even for some seemingly simple problems,
efficient parallel algorithms are hard to find. Two fundamental
problems are product computation and initial
prefix computation. The product computation problem is to
compute the product a_0 o a_1 o ... o a_{N-1}, given N elements
a_0, a_1, ..., a_{N-1}, and a binary, associative operation,
denoted o. The initial prefix problem is to compute all N
initial prefixes a_0, a_0 o a_1, ..., a_0 o a_1 o ... o a_{N-1}.
The initial prefix problem when
solved in parallel is known as parallel prefix. In this paper,
we present efficient parallel algorithms for product computa- 
tion and initial prefix computation, when the elements are 
stored in a linked list. As shown in [8], these results can be 
used to obtain efficient parallel algorithms for other graph 
theoretic problems. 


Since parallel processors will be built with a fixed 
number of processors, and since users tend to push the lim- 
its of their machine, it is important to consider the situation 
in which there are fewer processors than data elements. In 
such cases, the user wishes to use all the processors 
efficiently. We assume that there are p < N processors
cooperating (to solve a problem of size N). We desire parallel
algorithms that solve a problem in O(t(N)/p) time,
where t(N) is the sequential time to solve the problem; this
is called linear speedup and is always optimal.


Various models of parallel computations have been stu- 
died. In the PRAM model, a set of processors can all con- 
currently access shared memory. Three variants are dis- 
tinguished depending on the type of simultaneous action 
permitted to each memory cell: The most powerful model, 
the CRCW, allows concurrent reading and writing to any 
cell by any number of processors. The CREW model 
requires that no two processors simultaneously write to the 
same cell; however, simultaneous reading is allowed. The 
EREW model restricts reading and writing so that no two 
processors can simultaneously access (read or write) the
same cell. In contrast to the PRAM model, a somewhat 
more realistic model is the Direct Connection Machine 
(DCM): Memory is divided into memory modules with an 
equal number of processors and memory modules. Any pro- 
cessor can access any cell in any module; however, simul- 
taneous access to any module is prohibited (even if the 
accesses are to different cells). 


The product and prefix problems have been extensively
studied for the case that the memory layout of the input is
data independent, i.e. the location of x_i is determined by the
index i [3],[5],[6],[9],[11],[14]. In the data dependent versions
of the product and initial prefix problems, the locations of
the elements are given but it is not known which element is
which. Only the location of the first element is given along
with a map from the ith element to the (i+1)st element.
In practice, the mapping will be represented as a linked list,
so that if element i is contained in A[j] then element i+1
is contained in A[Next[j]]. It is easy to solve the product
and initial prefix problems sequentially in O(N) time by
starting at the first element and following the links. Examples
of such problems are that of computing the rank of
each element on a linked list, or that of labeling each element
of a linked list with the name of the first element of
the list.
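
The O(N) sequential solution mentioned above can be sketched in Python as follows. This is a minimal illustration, not code from the paper: the name `sequential_prefix` and the use of `None` for the end-of-list link are our assumptions, while `a` and `nxt` play the roles of the arrays A and Next.

```python
from operator import add


def sequential_prefix(a, nxt, first, op=add):
    """Initial prefixes of a linked list stored in arrays.

    a[j] is the element stored at location j, nxt[j] is the location
    of the following element (None at the end of the list), and
    `first` is the location of the first element.
    """
    out = [None] * len(a)
    j = first
    acc = a[j]
    out[j] = acc
    while nxt[j] is not None:      # follow the links, one step each
        j = nxt[j]
        acc = op(acc, a[j])
        out[j] = acc
    return out


# Ranking: with every element equal to 1 and op = +, the prefix at a
# location is the rank of that element in the list.
# The list order here is location 2, then 0, then 1.
print(sequential_prefix([1, 1, 1], [1, None, 0], first=2))
```

With `op` set to addition and all elements equal to 1, the result is the rank of each element, which is one of the example problems named in the text.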


The parallel prefix operation may be described as follows.
We are given a set S of items, together with an associative
binary operation o on these items. Let

(x_1, x_2, ..., x_N)

be a list of items, x_i ∈ S, 1 ≤ i ≤ N. The parallel prefix
operation produces the list of items

(x_1, x_1 o x_2, ..., x_1 o x_2 o ... o x_N),

i.e. the i-th item is the i-th partial product x_1 o x_2 o ... o x_i.
Parallel prefix is an extremely important operation on parallel
machines. For example, it is used in efficient parallel
algorithms to solve the following problems (see
[1],[2],[4],[20],[12],[15]):

(1) summing, i.e. the operation is ordinary addition;

(2) broadcast a value to all processors;

(3) packing data items;

(4) preorder and postorder traversal of trees;

(5) graph problems, such as finding connected components.


In general, the efficiency of the known algorithms to 
implement parallel prefix depends upon the data structure 
used to represent the list of items. When the items are 
stored in contiguous memory locations, i.e. in an array, we 
have the following previous results. Let P denote the 


number of processors. There is an O(N /P + log P) algo- 
rithm for parallel prefix on the Shuffle Exchange Machine 
(see [12]). This readily yields an algorithm of the same time 
complexity for a fully connected Direct Connection Machine 
(DCM) and for the (EREW) PRAM (exclusive-read, 
exclusive-write parallel RAM). Parallel prefix when the data 
items are stored in a linked list was first discussed in [18]. 


There is an O (Ee) algorithm for the EREW (7]. 
P log(2N /P ) 
It attains linear speedup for N = Ω(P^(1+ε)), ε > 0 any constant.
See [17] for randomized shared memory algorithms.


This paper describes an algorithm for parallel prefix on
a linked list that runs in time O(N/P + P^2) on a fully
connected DCM. It attains linear speedup for N = Ω(P^3).


Recently, researchers have become interested in finding
good models of parallel computation, and studying the
differences between the models. Upfal and Wigderson [16]
show that any on-line simulation of T steps of a PRAM
takes at least Ω(T log P / log log P) steps on a DCM. Furthermore,
they show that a fully-connected DCM can simulate any
computation of an EREW PRAM at the cost of a factor of
O(log P (log log P)^2) in running time (assuming that the
size of memory is only polynomial in the number of processors).
Together, the two results say that PRAM's are, in
some sense, more powerful than DCM's, but not much more
powerful.


To really prove that PRAM's are more powerful than
DCM's, one should produce a (natural) problem that can be
solved faster on a PRAM than on a DCM, not merely an
algorithm that cannot be simulated efficiently step by step.
While it certainly may be the case that such problems exist
and may even be common, it is not easy to think of reasonable
candidates, much less actually prove that some problem
is hard on a fully connected DCM. Very small problems
will not separate the models, because, if the problem is small
enough, every data item can be placed in a separate memory
module. (Depending on the exact rules for accessing common
locations in a memory module, a fully connected DCM
with at most one data item in each memory module will be
equivalent to some PRAM model.) Intuitively, large graph
problems would seem to be plausible candidates because of
the difficulty in following pointers in parallel when there is
more than one concurrent reference to the same memory
module. The techniques in this paper, coupled with some
recently discovered techniques [8], will produce efficient
DCM algorithms for many graph problems; thus, perhaps
surprisingly, large graph problems do not separate (the
power of) the two models. Our results, however, leave open
the possibility that intermediate sized problems separate the
models.


2. Preliminaries 


The model of parallel computation that we will be primarily
concerned with in this paper is the fully connected
DCM. There are P processing elements (PEs), each of
which has a (large) local memory. We denote the i-th processor
by PE_i, 1 ≤ i ≤ P, and will refer to its local
memory by M_i. Each pair of processors is connected by a
data link. During one cycle of the machine clock a processor
may access its own or any other processor's local memory.
However, only one processor may gain access to the same
local memory during any single cycle. So if k processors
attempt to access the same local memory at the same time,
then it will take k cycles before all of their requests are
satisfied. In particular, during a single cycle, a set of
memory cells may be simultaneously accessed only if they
are all in different local memories. This is in contrast with
the EREW PRAM, where any set of k ≤ P distinct
memory cells can be accessed simultaneously.
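
The access rule above can be made concrete with a small Python sketch. It is only a cost model under the stated rule (k simultaneous requests to one local memory take k cycles); the function name `dcm_cycles` and the input encoding are hypothetical.

```python
from collections import Counter


def dcm_cycles(requests):
    """Cycles needed to serve one batch of memory requests on a DCM.

    requests[i] is the index of the local memory that PE_i wants to
    access in this batch, or None if PE_i is idle. Requests to the
    same local memory are served one per cycle, so the batch takes as
    many cycles as the most heavily requested memory receives.
    """
    per_module = Counter(m for m in requests if m is not None)
    return max(per_module.values(), default=0)


print(dcm_cycles([0, 1, 2, 3]))   # all different local memories
print(dcm_cycles([2, 2, 2, 0]))   # three PEs collide on memory 2
```

On an EREW PRAM both batches would take one step as long as the cells are distinct; on the DCM the second batch is serialized at the shared module.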


3. The Parallel Prefix Algorithm: An Outline 


3.1. The Algorithm 


The main idea of our parallel prefix algorithm is to
reduce efficiently the size of the list by a constant factor, call
the algorithm recursively, and then obtain the parallel prefix
for the whole list from the parallel prefix for the collapsed
list. Each of the nodes in the list consists of two parts: a
data item and a next link pointing to the next item in the
list. The data items are stored in arrays Value[ ], and the
next links are in arrays Next[ ]. We reduce the size of the
list by pairing adjacent nodes, as illustrated in Fig. 1.


We assume that the N data items are distributed
evenly among the processors, i.e. about N/P items in each
local memory. If N ≤ P, there is at most one item per
processor, and we can use the usual parallel prefix algorithm,
which is given below. We assume without loss of generality
that the data items are located in the first N local
memories.


for j ← 1 to log N do
    forall i, 1 ≤ i ≤ N, in parallel do
        if Next[i] ≠ NIL then
            Value[Next[i]] ← Value[i] o Value[Next[i]];
            Next[i] ← Next[Next[i]];
        end if
    end forall
end for
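
The loop above can be simulated sequentially to see what it computes. The Python sketch below is an illustration under our own assumptions: `None` stands for NIL, locations are numbered from 0, and each round works on a snapshot of the arrays to imitate the synchronous "forall ... in parallel" step. The function name is hypothetical.

```python
def pointer_jumping_prefix(value, nxt, op):
    """Simulate the pointer-jumping prefix loop for N <= P.

    value[i] is the item at location i and nxt[i] the location of the
    following item (None at the end). Returns the prefix values.
    """
    value, nxt = list(value), list(nxt)
    n = len(value)
    rounds = (n - 1).bit_length()            # ceil(log2 n) rounds
    for _ in range(rounds):
        old_value, old_next = list(value), list(nxt)   # synchronous step
        for i in range(n):                   # "forall i in parallel"
            j = old_next[i]
            if j is not None:
                value[j] = op(old_value[i], old_value[j])
                nxt[i] = old_next[j]
    return value


# List order: location 2 ('a'), 0 ('b'), 3 ('c'), 1 ('d').
result = pointer_jumping_prefix(["b", "d", "a", "c"], [3, None, 0, 1],
                                op=lambda u, v: u + v)
print(result)
```

String concatenation is used as the operation because it is associative but not commutative, so the output also shows that operands are combined in list order.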


Now we consider the case when N > P. The following
is a general scheme for solving this problem. We assume
the existence of a predecessor map Pred, which is easy to
construct in O(N/P) time.


General Parallel Prefix Algorithm
if the list contains more than one element then
    pick a set S of nonadjacent elements in the list;
    for each x ∈ S do
        {replace a pair of adjacent elements by one element}
        Value[Succ[x]] := Value[x] o Value[Succ[x]];
        Succ[Pred[x]] := Succ[x];
        Pred[Succ[x]] := Pred[x]
    end for;
    execute algorithm recursively;
    for each x ∈ S do
        {expand element back into a pair}
        Value[x] := Value[Pred[x]] o Value[x];
        Succ[Pred[x]] := x;
        Pred[Succ[x]] := x
    end for
end if


The first part of the algorithm solves the product prob- 
lem, successively compacting the list until only one item is 
left; the second part, where the recursion unfolds, expands 
the list back, and computes the missing partial products. 
Any parallel algorithm that solves the product problem by 
successive compaction can be used to create a parallel prefix 
algorithm that expands the list back, by matching step by 
step the compression operations done by the parallel product 
algorithm. The resulting algorithm will have about twice 
the running time of the original algorithm. We shall hen- 
ceforth consider only the product part. 
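
The compress, recurse, expand scheme can be checked with a sequential Python sketch. This is an illustration only, with our own assumptions: S is taken to be every other element starting at the head (one simple nonadjacent choice, not the matching-based choice developed later), `None` stands for a missing link, and the function name `list_prefix` is hypothetical.

```python
from operator import add


def list_prefix(value, succ, pred, head, op=add):
    """Overwrite value[] with prefix products along the list at `head`."""
    if succ[head] is None:                   # a single element is done
        return
    # Pick S: every other element, starting with the head (nonadjacent).
    chosen, x, take = [], head, True
    while x is not None:
        if take:
            chosen.append(x)
        take = not take
        x = succ[x]
    # Compress: fold each chosen element into its successor, unlink it.
    for x in chosen:
        s, q = succ[x], pred[x]
        if s is not None:
            value[s] = op(value[x], value[s])
            pred[s] = q
        if q is not None:
            succ[q] = s
    list_prefix(value, succ, pred, succ[head], op)   # collapsed list
    # Expand: each chosen element gets its prefix from its predecessor
    # and is linked back in.
    for x in chosen:
        q, s = pred[x], succ[x]
        if q is not None:
            value[x] = op(value[q], value[x])
            succ[q] = x
        if s is not None:
            pred[s] = x


# A five-element list whose order in memory is scrambled.
order = [3, 0, 4, 1, 2]
value, succ, pred = [None] * 5, [None] * 5, [None] * 5
for k, loc in enumerate(order):
    value[loc] = "abcde"[k]
for u, v in zip(order, order[1:]):
    succ[u], pred[v] = v, u
list_prefix(value, succ, pred, order[0], op=lambda u, v: u + v)
print([value[loc] for loc in order])
```

The deleted elements keep their own Succ and Pred entries, which is what allows the expansion phase to find the predecessor whose prefix is already final.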


At the outermost level of the recursion, each processor 
will process its N/P elements one at a time (replacing a 
pair of adjacent elements by one element). An efficient 
parallel implementation will process O(P ) elements at each 
time step. This must be done while avoiding two potential 
conflicts: First, no two adjacent elements can be processed 
at the same time (which would destroy the linked list). 
Second, no two processors can access the same memory 
module at the same time. The first type of conflict must be 
avoided when one is implementing this algorithm on a 
shared memory machine; the second type of conflict is 
unique to DCM machines. We ensure that neither type of 
conflict occurs by using a memory access routine described 
below. After pairing elements at the outermost level of the 
recursion, the elements are packed so that there are approxi- 
mately an equal number of elements in each processor. The 
algorithm is then called recursively. 


4. A Memory Access Technique 


In this section we will describe and analyze a method 
for organizing memory references on a DCM. This is essen- 
tially a refinement of an idea due to Gottlieb and Kruskal 
[4]. It will allow us to obtain efficient parallel algorithms on 
the DCM (for N >> P). This section is organized as fol- 
lows. One of the components of the main algorithm is a dis- 
tributed maximal matching algorithm, which is discussed in 
the first subsection. The next subsection describes the 
memory access method itself, which is actually an algorithm 
schema intended to be used as a procedure in other algo- 
rithms. In the third subsection we analyze the parallel time 
required by our algorithm, under natural assumptions that 
will hold in all of our applications. 


4.1. Maximal Matching Algorithm 


A matching in a graph is a set of edges in the graph, no
two of which share the same vertex. A matching is maximal
if no edge from the graph can be added to it and still obtain
a matching. There is an obvious sequential algorithm for
finding a maximal matching in O(E) time: start with the
empty matching, and iterate through the edges one at a
time, adding the edge to the matching if neither of its endpoints
is already covered. Here we informally describe an
O(V^2) sequential algorithm that is easy to parallelize.


Let N be the number of vertices, and number them 0
to N-1. For convenience we assume that N is a power of
two. We use a Boolean array Matched[0..N-1]. We initialize
Matched[i] ← false for all i, 0 ≤ i < N. The basic
idea of the algorithm is as follows. We try to match the
first N/2 vertices with the last N/2 simply by checking
each possible pair of vertices (i, j), with 0 ≤ i < N/2 and
N/2 ≤ j < N. We then call the algorithm recursively,
trying to match up the first N/2 vertices among themselves,
and the last N/2 among themselves. Here is a program
describing the first level of recursion.


for k ← 0 to N/2-1 do
    for i ← 0 to N/2-1 do
        j ← N/2 + (i+k) mod (N/2);
        if not Matched[i] and not Matched[j]
                and {i, j} ∈ E then
            add edge {i, j} to M;
            Matched[i] ← true; Matched[j] ← true;
        end if
    end for i
end for k


Assume that we have a fully connected DCM with N
processors. This algorithm is trivially converted to a parallel
algorithm by simply performing the inner loop in parallel
with N/2 processors on each level of the recursion. It is
easy to see that the active processors (e.g. the first N/2 at
the first level, the first N/4 and the third N/4 at the
second level, etc.) do not have any memory conflicts among
themselves. Moreover, since active processors only try to
match with inactive ones, we avoid the undesirable situation
where PE_i is trying to match with PE_j, which in turn is
trying to match with PE_k. Note that the algorithm is truly
distributed in the sense that each processor only needs to
know the identities of those other processors that it wants
to match up with.
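
The full recursion (not just its first level) can be sketched sequentially in Python. This is an illustration under our own conventions: edges are given as a set of `frozenset` pairs, N is assumed to be a power of two as in the text, and the inner loop over i is the part that the N/2 active processors would execute in parallel. The function name is hypothetical.

```python
def maximal_matching(n, edges):
    """Greedy maximal matching by recursive halving.

    n must be a power of two; edges is a set of frozenset({u, v}).
    """
    matched = [False] * n
    matching = []

    def match_block(lo, size):
        if size < 2:
            return
        half = size // 2
        for k in range(half):
            for i in range(half):          # the parallel inner loop
                u = lo + i                 # active vertex, first half
                v = lo + half + (i + k) % half   # partner, second half
                if (not matched[u] and not matched[v]
                        and frozenset((u, v)) in edges):
                    matching.append((u, v))
                    matched[u] = matched[v] = True
        match_block(lo, half)              # first half among themselves
        match_block(lo + half, half)       # second half among themselves

    match_block(0, n)
    return matching


path = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
print(maximal_matching(4, path))
```

For a fixed k the partners v = N/2 + (i+k) mod (N/2) are all distinct, which is why the active processors never contend for the same memory module within one step.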


The recurrence relation for the parallel time required 
by the algorithm is 


T(N) = cN/2 + T(N/2)


where c is some constant. The first term arises because the
inner loop (the for i) can be carried out in constant time
using N/2 processors, and the second term is the time
needed for the recursive calls. These calls are carried out 
independently in parallel. This is easily solved to obtain 


T(N) = O(N). 
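
For completeness, the recurrence can be unrolled as follows. This is a standard geometric-series argument written out in our own notation, assuming N is a power of two and T(1) is a constant.

```latex
\begin{align*}
T(N) &= \frac{cN}{2} + T\!\left(\frac{N}{2}\right)
      = \frac{cN}{2} + \frac{cN}{4} + T\!\left(\frac{N}{4}\right)
      = \cdots \\
     &= cN\left(\frac12 + \frac14 + \cdots + \frac1N\right) + T(1)
      = c\,(N - 1) + T(1) = O(N).
\end{align*}
```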


4.2. Memory Access Algorithm 


During a computation there will be times when each 
processor wishes to access several (possibly many) locations 
and at each access process the data therein. Often it does 
not matter in what order the accesses occur or in what order 
the data is processed. Each processor will have a list of 
locations it wishes to access, and therefore a list of memory 
modules along with the locations in each module it wishes to
access. An efficient parallel program will arrange the 
accesses so that no two processors conflict at a module. 

Formally, a memory access map is a directed graph
G = (V, E) with |V| = P. Associated with each
directed edge e ∈ E there is an integer weight w_e > 0.
There are no multiple edges; however, loops (i.e. edges of the
form (v, v)) are permitted. Each vertex v ∈ V represents a
processor. There is an edge e = (i, j) ∈ E if PE_i needs to
access M_j w_e times. In addition we associate a set X_ij of
elements, |X_ij| = w_e, with the edge e = (i, j). These
sets are not really a part of the memory access map, but are
kept locally in each processor. They represent items that
PE_i needs to process. As part of the processing of these
items it is necessary for PE_i to access M_j. The order in
which the items are processed is irrelevant to the overall
algorithm. That is, there do not exist processors PE_i and
PE_k and items x ∈ X_ij, y ∈ X_kl such that PE_i must
process x before PE_k can process y.

If the set of edges E' represents a set of accesses to
memory modules, then domain(E') is the set of processors
involved in those accesses. Formally, if E' ⊆ E, where
G = (V, E) is a directed graph, then the domain of E' is
defined by


domain(E') =
{i ∈ V | there exists j ∈ V such that (i, j) ∈ E'}.


We are now ready to describe the memory access pro- 
cedure. Each of the steps is executed in parallel by each 
processor. Throughout the algorithm only one processor is 
ever active in a given local memory at the same time. The 
only information that each processor PE; initially needs is 
the sets X;,; for j i. 


Let τ be a threshold. We will determine its exact value
later.


PROCEDURE: ACCESS MEMORY.


[A1] Considering only edges whose weights are larger than
the threshold τ, find a maximal matching M in the
current memory access map.

[A2] Find the minimum weight w of any edge in the matching.
Broadcast w to all of the processors.

[A3] Each PE_i with i ∈ domain(M) executes the following
code. (If i is not in domain(M) then PE_i is idle.)

    let j ∈ V be such that (i, j) ∈ M;
    for k ← 1 to w do
        choose an item x ∈ X_ij;
        X_ij ← X_ij - {x};
        process x;
    end for

[A4] Each processor now updates the part of the memory
access map contained in its local memory. To do this,
set w_e ← w_e - w for each edge e ∈ M.

[A5] Repeat steps [A1]-[A5] until the weight of every edge is
less than the threshold τ.

[A6] Each PE_i executes the following code.

    for j ← 0 to P-1 do
        for k ← 1 to τ do
            choose an item x ∈ X_{i, (i+j) mod P}
                (if there is one);
            X_{i, (i+j) mod P} ← X_{i, (i+j) mod P} - {x};
            process x
        end for
    end for


4.3. Analysis

We now analyze the ACCESS MEMORY algorithm.


In order to calculate how many iterations there are of steps
[A1]-[A5], consider any vertex v in the memory access map.
Since we always have a maximal matching of the memory
access map for edges whose weights are greater than τ, during
an iteration either a vertex v is matched, all of v's
neighbors are matched, or v has no incident edge with
weight greater than τ. In the first case, the weight of v will
be reduced by w > τ; in the second case, the weight of all its
neighbors is reduced by w > τ. Thus the total number of
iterations of steps [A1]-[A5] involving vertex v is at most
the sum of its weight and the maximum weight of any of its
neighbors, divided by τ. Since each processor has weight at
most ⌈N/P⌉, there are at most 2⌈N/P⌉/τ iterations.


Step [A1] is the maximal matching algorithm discussed
above, and each invocation takes time O(P). So all
together, step [A1] takes O(N/τ) time. Step [A2] involves
the standard minimization and broadcast operations, and
at each iteration takes time O(log P); hence, the total time
required for [A2] is O((N log P)/(P τ)).

Reasoning as above, the total number of iterations of 
the for loop in step [A3] involving vertex v is at most the 
sum of its weight and the maximum weight of any of its 


neighbors. This is O(N/P ) iterations. 
Step [A4] takes constant time at each iteration. 


Step [A6] iterates through all the memory modules,
spending O(τ) time at each module. Thus step [A6] takes
O(Pτ) time.

By summing the costs of all of the steps, we find that
the parallel running time of the entire ACCESS MEMORY
algorithm is O(N/P + N/τ + Pτ). This is minimized for
τ = Θ(√(N/P)), making the total time O(N/P + √(NP)).
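
The choice of threshold can be verified with a short calculation. Writing τ for the threshold and treating it as a continuous variable (a simplification on our part), only the two τ-dependent terms need to be balanced:

```latex
\begin{align*}
f(\tau) &= \frac{N}{\tau} + P\tau, \qquad
f'(\tau) = -\frac{N}{\tau^{2}} + P = 0
\;\Longrightarrow\; \tau = \sqrt{N/P}, \\
f\!\left(\sqrt{N/P}\right) &= \sqrt{NP} + \sqrt{NP} = 2\sqrt{NP}.
\end{align*}
```

Note also that \(\sqrt{NP} \le N/P\) exactly when \(N \ge P^{3}\), so for such N the O(N/P) term dominates the cost of ACCESS MEMORY.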


5. The Parallel Prefix Algorithm: Details 


5.1. The Algorithm 
We now give a detailed description of the fully con- 

nected DCM parallel prefix algorithm outlined in Section 3. 

ALGORITHM: PARALLEL PREFIX ON A LINKED LIST 

(DCM) 

[P1] First we construct the memory access map as follows.
Each processor partitions its N/P data items into P
groups. The items in group j are those whose next
links reference local memory M_j of processor PE_j.
Group j for PE_i is denoted by X_ij. Put w_ij = |X_ij|.
Module i will contain w_ij for all 0 ≤ j < P.

[P2] Each processor now calls the procedure ACCESS
MEMORY defined in the preceding section. To process
an item x ∈ X_ij, the following code is used to pair the
item with the next item. All of the items are initially
undeleted (i.e. Deleted[x] = False).

    Value[Succ[x]] := Value[x] o Value[Succ[x]];
    Succ[Pred[x]] := Succ[x];
    Pred[Succ[x]] := Pred[x];
    Deleted[x] := True;

Note that a second invocation of ACCESS MEMORY
is required to adjust the Pred values. The matching
automatically ensures that two adjacent elements will
not both be updated at the same time.

[P3] For each item in the original linked list, either it has 
been paired with another item, or its predecessor and 
successor have been paired. Thus the size of the col- 
lapsed list is no more than 2N /3 + O(1). Now repack 
the undeleted items in order to ensure that the items of 
the collapsed list are distributed evenly among the pro- 
cessors. The repacking procedure is discussed 
separately below. Then call the algorithm recursively 
on the collapsed list. 


[P4] Upon return from the recursive call, we must unpack
the list, "unpair" the items that were paired in step
[P2], and correctly compute the parallel prefix for the
original list. In order to do this we will need to keep a
record of the items that are paired together in step [P2]
and where they are moved in step [P3]. An easy
modification of the repacking algorithm itself can be
used to do the unpacking. Using our additional information,
it is straightforward to obtain the parallel
prefix for the original list from the parallel prefix for
the collapsed list.


Next we describe an algorithm to accomplish the 
repacking needed in step [P3]. The similarity of the repack- 
ing routine to the entire parallel prefix algorithm should be 
apparent. 


5.2. Repacking a Linked List 
ALGORITHM: REPACK A LINKED LIST (DCM). 


[R1] Construct a memory access map as follows. Each processor
counts the number of list items in its local
memory. For $PE_i$ we denote this quantity by $N_i$.
Each processor then sends to every other one a message
indicating the number of items that it has. Now compute
the sum $N' = \sum_{i=1}^{P} N_i$. If $N_i > N'/P$ then $PE_i$
will be a sender. If $N_i < N'/P$ then $PE_i$ will be a
receiver. Let $S$ denote the set of senders, $R$ the set of
receivers. Order the sets $S$ and $R$ arbitrarily but
deterministically, e.g. by processor index. A sender
$PE_i$ has an excess of items, defined by
$e_i = N_i - N'/P$; a receiver $PE_j$ has a deficit,
$d_j = N'/P - N_j$. We now define the bipartite
digraph that will serve as the memory access map.
The set of vertices will be $S \cup R$. A directed edge will
always go from a vertex in $S$ to a vertex in $R$. Each
edge is labeled with a weight. We determine the edges
and their weights as follows. Let $PE_i$ be the first
sender, and let $PE_j$ be the first receiver. Add the edge
$(i, j)$ to the digraph. Label it with the weight
$w_{ij} = \min\{e_i, d_j\}$. We now readjust the excess
$e_i \leftarrow e_i - w_{ij}$ and also readjust the deficit
$d_j \leftarrow d_j - w_{ij}$. If $e_i = 0$ then $PE_i$ is not considered
further. Similarly, if $d_j = 0$ then $PE_j$ is not considered
further. We repeat this process with the
remaining senders and receivers until there are no more
processors to consider. Each $PE_i$ that is a sender will
select $w_{ij}$ items arbitrarily and put them in the set
$X_{ij}$ for each edge $(i, j)$ in the digraph. (Note: The
sets $X_{ij}$ constructed here are distinct from the ones in
step [P1].)
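
The greedy construction of the memory access map in step [R1] can be sketched as below. This is an illustrative, sequential sketch under stated assumptions: the function name `build_access_map` is hypothetical, processors are indexed from 0, and exact fractions are used so that a non-integer $N'/P$ (a rounding detail the text does not address) causes no error.

```python
from fractions import Fraction

def build_access_map(counts):
    """Greedy sender -> receiver digraph of step [R1].

    counts[i] is N_i, the number of list items held by processor i.
    Returns a dict mapping each edge (i, j) to its weight w_ij.
    """
    P = len(counts)
    target = Fraction(sum(counts), P)            # N'/P
    # Senders and receivers, ordered by processor index.
    excess = {i: n - target for i, n in enumerate(counts) if n > target}
    deficit = {j: target - n for j, n in enumerate(counts) if n < target}
    senders, receivers = list(excess), list(deficit)
    edges = {}
    s = r = 0
    while s < len(senders) and r < len(receivers):
        i, j = senders[s], receivers[r]
        w = min(excess[i], deficit[j])
        edges[(i, j)] = w
        excess[i] -= w
        deficit[j] -= w
        if excess[i] == 0:
            s += 1            # sender i is not considered further
        if deficit[j] == 0:
            r += 1            # receiver j is not considered further
    return edges

edges = build_access_map([10, 2, 6, 2])
# edges == {(0, 1): 3, (0, 3): 2, (2, 3): 1}
```

Every iteration retires at least one sender or one receiver, which is why the map has at most $P$ edges.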
[R2] Now call the procedure ACCESS MEMORY to move
list items from the senders to the receivers. To process
an item $x \in X_{ij}$, $PE_i$ does the following:

let y denote a new location in $M_j$;
A[y] ← A[x];
Next[y] ← Next[x];
Next[x] ← y;
mark x as "moved";

[R3] We must now readjust the next links. To do this we
follow the links in parallel. This actually involves
another application of the ACCESS MEMORY procedure;
however, it is exactly the same as used before
where every processor had to examine the successor of
each node. When we follow a link and find that
Next[x] is marked as moved, then we set
Next[x] ← Next[Next[x]]. This is done for all of the
data items, including the newly created copies, but
excluding those items marked as moved. Fig. 2 illustrates
how a list item is moved, and how the next links
are then readjusted.
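
The move of step [R2] and the link readjustment of step [R3] can be sketched sequentially as follows. This is a minimal sketch assuming dictionary-based storage: `A`, `Next`, and `moved` are hypothetical stand-ins for the data array, the next links, and the "moved" marks, and the location names are invented for the example.

```python
def move_item(x, y, A, Next, moved):
    """Copy item x into the new location y and leave a forwarding link at x."""
    A[y] = A[x]
    Next[y] = Next[x]
    Next[x] = y
    moved.add(x)

def readjust_links(Next, moved):
    """Skip over moved items: Next[x] <- Next[Next[x]] for every live item x."""
    for x in Next:
        if x in moved:
            continue
        if Next[x] in moved:
            Next[x] = Next[Next[x]]

# List a -> b -> c -> d; items b and c are moved to new locations b2 and c2.
A = {"a": 1, "b": 2, "c": 3, "d": 4}
Next = {"a": "b", "b": "c", "c": "d", "d": None}
moved = set()
move_item("b", "b2", A, Next, moved)
move_item("c", "c2", A, Next, moved)
readjust_links(Next, moved)
```

A single pass suffices because the forwarding link left at a moved item always points to its copy, which is never itself marked as moved. The example ends with the list a -> b2 -> c2 -> d.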


5.3. Analysis 


In this section we will obtain the time bound for our
parallel prefix algorithm. First note that step [P1] of the
algorithm can easily be accomplished in time
$O(N/P + P)$. Each next link may be viewed as an
ordered pair $(i, a)$, where $i$ designates a local memory $M_i$
and $a$ specifies an address within $M_i$. Each processor sorts
on the $i$ coordinate of the next links. This takes time
$O(N/P + P)$ using a bucket sort with $P$ buckets. Construction
of the first maximal matching requires $O(P)$ time
as described previously.

Step [P2], the ACCESS MEMORY routine, takes time
$O(N/P + \sqrt{NP})$.

Let $N'$ denote the number of items in the new (collapsed)
list. The purpose of the repacking in step [P3] is to
ensure that the items of the new list are distributed evenly
across the processors, i.e. $N'/P$ per processor. The repacking
routine takes time $O(N/P + \sqrt{NP})$. To prove this,
first note that the memory access map
defined in step [R1] has at most $P$ vertices and $P$ edges,
and the method of construction clearly takes time $O(P)$.
For step [R2], a maximal matching can be found in time
$O(P)$ and at least one edge is (effectively) eliminated at
each iteration of steps [A1]-[A5] of ACCESS MEMORY. So
the total time spent finding maximal matchings is $O(P^2)$.
Now we can apply our theorem again. In this case the
weight of each node is its excess or deficit, which must be at
most $N/(2P)$. Thus the total time spent moving items is
$O(N/P)$, and so steps [R1] and [R2] of the repacking take
time $O(N/P + \sqrt{NP})$. Finally, by a similar analysis it
can be shown that step [R3] of the repacking takes time
$O(N/P + \sqrt{NP})$. The memory access map needed for
that step is similar to the one constructed in step [P1] of the
parallel prefix algorithm.

It is readily seen that after step [P2], all of the links in
the list will have been considered. It follows that for every
node in the list, either it has been paired with its predecessor
or its successor, or else both its predecessor and its successor
(with the obvious exception for the head and tail of
the list) have been paired. Thus the length of the collapsed
list is no more than $2N/3 + O(1)$. Let $T_P(N)$ denote the
time required by the entire parallel prefix algorithm on a list
of size $N$. Then we have the recurrence relation

$T_P(N) \le c \left( N/P + \sqrt{NP} \right) + T_P(2N/3)$

for some constant $c$. We stop the recurrence when the
number of items is less than $P$. At this point, the algorithm
for one item per processor can be used to solve the
problem in $O(\log P)$ time. The solution to the recurrence
thus becomes:

$T_P(N) = O(N/P + \sqrt{NP})$.

For $N \gg P$ we have that $T_P(N) = O(N/P)$, so that
the algorithm is totally efficient for $N$ large relative to $P$.
In fact it is easy to see that the algorithm is totally efficient
for $N = \Omega(P^3)$.


6. Remarks 


For large P a fully connected DCM is not feasible. It 
is well known, however, that some bounded degree connec- 
tion machines, e.g. the Shuffle-Exchange machine, can simu- 
late a fully-connected DCM at the cost of only an O (log P) 
factor in the running time. Nevertheless, due to the P 
term in the running time, our algorithm as it stands is not 
practical for large P. Thus the obvious open problem is to 
find an efficient parallel algorithm for parallel prefix on a 
linked list that will run on a DCM and that is totally 
efficient for $N$ smaller than $\Omega(P^3)$.
Abstract. An algorithmic model and a methodology to evaluate
the effectiveness of replication, pipelining, and local
parallelism in the implementation of multiple-instance algo- 
rithms are presented. In the class of algorithms considered, 
instances are divided into groups with dependences between 
corresponding instances of consecutive groups. The imple- 
mentations use a variable number of identical operation units, 
and the methodology permits the selection of the most 
efficient combination of concurrency techniques for a given 
execution time. The methodology is illustrated with an exam- 
ple, and has been used to obtain implementations for the 
Singular Value Decomposition computation. 


1.- Introduction 


Replication, pipelining, and local parallelism are well- 
recognized methods used, separately or in combination, to 
improve the speed of execution of digital systems. For a 
given computation, there are many alternatives in the use of 
these concurrency techniques to satisfy performance and cost 
constraints. In this paper, we study the effectiveness of those 
techniques for a class of computations encountered in a 
variety of important applications. This class has the following 
characteristics: 


• The overall computation corresponds to the execution of
many instances of a certain algorithm. This characteristic is 
the basis for the use of replication and pipelining. The in- 
stances are not totally independent but can be divided into 
groups of independent instances. This property limits the de- 
gree of replication and pipelining that can be used in an im- 
plementation, and complicates the tradeoffs between these 
techniques. 


• An instance of the algorithm is described by a directed
graph (where nodes represent subcomputations and arcs
correspond to precedences among the subfunctions), without
loops or conditionals except those required to detect the
end of the computation.


• The class of implementations we consider uses only one
type of operation unit. This unit can be single function or 
multiple function. 


As an example of computations that belong to the class 
outlined above, we can mention those that are described by 
expressions of the form 


A,” =A,)~! (+,*) BY! 


where A and B are terms of any complexity, usually involving 
matrices. This includes, for instance, matrix multiplication, 
LU-decomposition, transitive closure, and singular value 
decomposition (a) (SVD), which are frequently used in signal
processing applications [1,2,3]. The main recurrences in 
these algorithms are procedures to update the values of a ma- 
trix using a modifier term that may depend on the current 
value of the matrix. These algorithms have been considered 
suitable for systolic array implementations [4, 5, 6]. 


Our goal for the class of computations described is to 
reduce the execution time for a given number of operation un- 
its. We are interested in identifying those implementations 
that offer better efficiency in terms of speed/cost. Since the 
algorithms of interest are compute bound and have implemen- 
tations with negligible communication delay [4,5, 6], only the 
computation time and the cost of the operation units are the 
relevant parameters for the design. 


Our design methodology consists of top-down decompo- 
sition and bottom-up grouping of the nodes in the graph of an 
algorithm. Our results show that: 


• for low throughput (or equivalently, up to a speed-up equal
to the number of independent instances), replicated or pipe- 
lined implementations with efficiency equal to 1 can be 
achieved. 


• for larger throughput, it is necessary to use combinations of
the concurrency techniques, including the local parallelism in 
the graph of the algorithm. If portions of the graph have 
varying degrees of parallelism, pipelined implementations can 
be more effective because they require the addition of opera- 
tion units only to those stages that exhibit greater local paral- 
lelism. 


We have used this methodology to evaluate implementa- 
tion alternatives for the singular value decomposition (SVD). 
It is shown in [7] that significant improvement in efficiency is 
possible by using a combination of pipelining and local paral- 
lelism, with respect to what is achieved in the replicated im- 
plementation (i.e. a linear systolic array) proposed in [5]. 


Previous research in the evaluation and selection of concurrency
techniques includes [8] and [9]. However, these
works deal with particular implementations (two-dimensional
arrays and general purpose architectures, respectively), which are
different from the ones sought here. Other related work
presents formal models for the analysis of algorithms and architectures
[10,11]. These approaches provide formal
descriptions of hardware structures and algorithms, but assume
the existence of an array of processors and attempt to
map algorithms onto it, without evaluating other possible
concurrent implementations. A combination of concurrency
techniques is described in [12], which presents a two-level
pipelined systolic array to perform convolutions. Such a
scheme deals with one specific algorithm and does not provide
a methodology to select, among the alternatives for concurrency,
an implementation for any given algorithm.

(a) In the SVD, data dependencies are with two instances of the previous
group instead of one. However, the model can be easily adapted to
account for such a situation.


This paper is organized as follows. In Section 2, we 
present an algorithmic model for the class of computations of 
interest and a methodology for the design and evaluation of 
implementation alternatives. In Sections 3 and 4 we discuss
the characteristics of the concurrency techniques when ap- 
plied to this class of algorithms, and we evaluate different im- 
plementation alternatives using only one or a combination of 
the techniques. 


2.- Algorithmic Model and Methodology 


In this section, we formalize a model for the class of algorithms
considered, and define a set of performance and cost
measures for implementations using concurrency. We also
describe the methodology for the design and evaluation of 
such alternatives. 


To design the system for a given computation it is
necessary to describe the algorithm in a suitable form.
Several models have been used to represent the dependences 
between subfunctions, including conditionals and loops 
[13,14]. We use a directed graph in which nodes correspond 
to subcomputations and arcs indicate the precedences 
between these subcomputations. For the class of algorithms 
considered here, it is sufficient to use AND conditions as 
shown in Figure 1. 




Figure 1 - Dependence Graph Elements 


The primitive subcomputations used in the algorithm 
can have any level of complexity. Depending on this level 
(also called the granularity of the algorithm), implementa- 
tions with different degrees of concurrency can be obtained. 
It is clear that larger concurrency can be obtained for finer 
granularity, which would indicate that it is always convenient 
to consider the representation with the finest granularity. 
However, this can lead to descriptions with large number of 
nodes, making the design of the system complex and unstruc- 
tured. Therefore, it is convenient to resort to a structured 
design, in which one begins with the algorithm consisting of
a relatively small number of nodes and then refines each of
the nodes into subalgorithms. This top-down approach has
the limitation that it may lose some potential concurrency.
The bottom-up approach is not totally satisfactory
either, because the use of concurrency in the implementation
of the node depends on how critical its execution time is in
the overall algorithm. Consequently, the design process consists
of several iterations until a satisfactory solution is found.


Model of the Algorithm and the Implementation 


From the considerations in Section 1, we can formalize 
the following model: 


• The overall computation requires that an algorithm be executed
for $M$ instances. These instances are divided into
groups of $r$ each, where instance $x$ of group $y$ depends on instance
$x$ of group $y-1$, as shown in Figure 2.


• The algorithm is described by a directed graph, in which the
nodes are indivisible for a level of the implementation (i.e.
the implementation of a node cannot be split across different
processors or stages within a processor). Each node
corresponds to a subfunction of the algorithm, and may exploit
its internal concurrency using more than one operation
unit. Therefore, for each node there might be more than one
alternative implementation. To reflect this, the node is
specified by a set of values $t_i(j)$ corresponding to its time of
execution in an implementation with $j$ operation units.




Figure 2 - Dependences between Groups 
of Instances 


• Only one type of operation unit is used to execute all nodes.
These operation units may be single function or multiple 
function and are considered indivisible. The use of a single 
type has implications in the design, since in such a case pipe- 
lining a sequential implementation of an algorithm requires 
additional units. (In contrast, if each node of the algorithm 
used a different type of unit, pipelining would not increase 
the number of those units). 


• An increase in the number of operation units used to compute
a node reduces its computation time at most proportionally
to the number of units. Therefore,


$t_i(j) \ge t_i(1) / j, \quad j \ge 1$


For example, if a node corresponds to five independent operations
as shown in Figure 3, then three or four operation units
require two time units to compute the node, and the implementation
with four units is not advantageous; however, the
use of five units reduces the computation time to one.
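
The five-operation example can be checked with a few lines of code. This is a small sketch assuming that each operation takes one time unit and that the operations are fully independent; the function name `node_time` is introduced here only for illustration.

```python
import math

def node_time(operations, units):
    """Time units needed to run independent unit-time operations on `units` units."""
    return math.ceil(operations / units)

times = {j: node_time(5, j) for j in range(1, 6)}
# times == {1: 5, 2: 3, 3: 2, 4: 2, 5: 1}

# The time drops at most proportionally to the number of units.
at_most_proportional = all(j * t >= times[1] for j, t in times.items())
```

Three and four units both give a time of two, so the fourth unit sits idle part of the time, which matches the remark that the four-unit implementation is not advantageous.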




Figure 3 - Tradeoffs in Computation Time 
and Number of Operation Units 


Performance, Cost Measures, and Design Objective 


Alternative implementations are compared using the fol- 
lowing performance and cost measures: 


• $t$: Time of execution of the computation (all $M$ instances of
the algorithm).

• $N$: Total number of operation units.

• $SU = t_{cs}/t$: Speedup of the concurrent implementation
with respect to a completely-sequential implementation
(with time $t_{cs}$), which uses only one operation unit to execute
all nodes.

• $E = (t_{cs} N_{cs})/(t N)$: Efficiency of a concurrent implementation
with respect to the reference system. For our case,
$N_{cs} = 1$.


Since we assume that there is only one type of operation
unit, and since the computation time decreases at most proportionally
to the number of units, the speedup of any alternative
is less than or equal to $N$ and its efficiency is less than or
equal to 1.
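
The two measures translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the timing figures in the usage lines are made-up numbers rather than values taken from the tables of this paper.

```python
def speedup(t_cs, t):
    """SU = t_cs / t, relative to the completely-sequential implementation."""
    return t_cs / t

def efficiency(t_cs, t, n_units, n_cs=1):
    """E = (t_cs * N_cs) / (t * N); N_cs = 1 for the reference system."""
    return (t_cs * n_cs) / (t * n_units)

# Hypothetical implementation: 4 units finishing in 650 time units,
# against a completely-sequential time of 2080.
su = speedup(2080, 650)
e = efficiency(2080, 650, n_units=4)
```

With these numbers the speedup is 3.2 and the efficiency is 0.8, so both bounds stated above (speedup at most the number of units, efficiency at most 1) hold.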


In terms of these measures, the main objective of the 
design can be described as selecting the implementation that 
for a given speedup (or computation time) has the largest 
efficiency (or uses the minimum number of operation units). 


3.- Characteristics of Replicated, 
Parallel, and Pipelined Systems 


We now apply the concepts of replication, pipelining
and local parallelism to the class of algorithms described ear- 
lier. We evaluate the characteristics, performance, and cost 
measures for different implementations, using only one of the 
concurrency approaches in a given system, and provide a 
comparison of the resulting measures. In the next section, we 
will look at systems which use a combination of approaches. 


To illustrate the choices that can be made we use the following
example. We consider an algorithm that is executed
for $M = 104$ instances and with dependences in groups of
$r = 8$. At the topmost level in the design methodology, this
algorithm appears as one node which requires only one operation
unit and its computation time is $20M$ (completely-sequential
implementation), as shown in Figure 4a. If higher
throughput in the computation is desired, it is possible to
decompose the node into subcomputations and then study the
effectiveness of the concurrency techniques in the implementations
of the decomposed algorithm. This process is repeated
until a desired throughput is achieved.


[Figure 4a - Topmost level view ($M = 104$, $r = 8$). Figure 4b - Decomposed
Graph, with node descriptors 3/1, 2/2, 1/3; 2/1, 1/2; 8/1, 4/2, 3/3, 2/4;
5/1, 3/2, 2/3, 2/4. Figure 4c - Node 5.]


Figure 4 - Example Algorithm with
Concurrent Capabilities


Consider now that the algorithm is described by Figure
4b. In this decomposed graph, each node has several alternative
implementations with different numbers of operation units.
These data were obtained from an analysis of each node,
searching for internal parallelism and devising possible implementations
for them. Figure 4c shows the structure of
node 5 resulting from such analysis. The alternative implementations
are indicated by the descriptors t/n next to each
node (t: computation time, n: number of units).


Sequential Implementations 


The sequential implementations are considered first, 
since they form the basis for some of the others. An imple- 
mentation of an algorithm is sequential if only one node 
(subfunction) is executed at a time. To obtain this type of 
implementation the graph of the algorithm must have a total 
ordering of the nodes, which can be obtained by adding precedences.
Since nodes may use more than one operation unit,
we call $t_{seq}(j)$ the computation time of the sequential implementation
that uses $j$ operation units. The following expressions
describe the performance and cost measures for the
sequential implementations:


Comput. Time: $t_{seq}(j) = M \sum_i t_i(j)$

Speedup: $SU_{seq}(j) = t_{cs} / t_{seq}(j)$

Efficiency: $E_{seq}(j) = t_{cs} / (t_{seq}(j) \, j)$

Table 1 includes the performance achievable with this 
scheme for the algorithm used as example. 


Replicated Systems 


For our purposes, a replicated implementation of an
algorithm performs several instances of the computation
simultaneously, using identical and separate hardware
resources (processors) for each of the simultaneous instances.
Figure 5 shows an example of a replicated implementation
of an algorithm. Due to the dependencies between
instances in the class of algorithms considered, the number of
instances executing simultaneously should be less than or equal to
$r$, the number of instances in a group. However, instances
$x+1, x+2, \ldots, r$ in a group may be computed simultaneously
with instances $1, 2, \ldots, x-1$ in the following group.


Consequently, if the hardware required to process an instance
(a processor) is replicated $P$ times, the execution time
of the $M$ instances, for $P \le r$, is


$t_{rep}(j, P) = \lceil M/P \rceil \, t_{seq}(j) / M$


since the instances are performed in sets of $P$, excepting the
last set which might be smaller than $P$. In particular, if $P = r$
then all instances in a group are processed at once and


$t_{rep}(j, r) = t_{seq}(j) / r$




Figure 5 - Replicated Implementation 
of an Algorithm 


The speedup and efficiency with respect to the
completely-sequential implementation are


$SU_{rep}(j, P) = t_{cs} / t_{rep}(j, P) = (M / \lceil M/P \rceil) \, SU_{seq}(j)$

$E_{rep}(j, P) = t_{cs} / (t_{rep}(j, P) \, j P) = ((M/P) / \lceil M/P \rceil) \, E_{seq}(j)$

For the important case in which $M \gg r \ge P$, these expressions
become


$SU_{rep}(j, P) = P \, SU_{seq}(j)$


$E_{rep}(j, P) = E_{seq}(j)$


Consequently, the maximum speedup is $r \, SU_{seq}(j)$ and
the efficiency is the same as that of the sequential implementation
that is being replicated. Therefore, replication of the
completely-sequential implementation is an optimal solution
for speedup less than or equal to $r$, since efficiency $E = 1$ is
preserved. For larger speedup, it is necessary to replicate a
sequential implementation with $j$ units, which has lower
efficiency.
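
The replication measures can be evaluated numerically with a short sketch. It assumes the execution time takes the form $\lceil M/P \rceil$ times the per-instance sequential time, uses the example values $M = 104$ and a completely-sequential time of $20M$, and introduces the function names only for illustration.

```python
import math

def t_rep(t_seq, M, P):
    """Time for M instances on P copies; t_seq is the sequential time for all M."""
    return math.ceil(M / P) * t_seq / M

def su_rep(t_cs, t_seq, M, P):
    return t_cs / t_rep(t_seq, M, P)

def e_rep(t_cs, t_seq, M, P, j):
    return t_cs / (t_rep(t_seq, M, P) * j * P)

M = 104
t_cs = 20 * M   # completely-sequential time of the example

# Replicating the completely-sequential implementation (j = 1, t_seq = t_cs).
results = {P: (su_rep(t_cs, t_cs, M, P), e_rep(t_cs, t_cs, M, P, j=1))
           for P in (1, 2, 3, 4, 8)}
```

For $P = 1, 2, 4, 8$ the value of $M/P$ is an integer, so the speedup equals $P$ and the efficiency is 1. For $P = 3$ the last set of instances is incomplete and the efficiency falls slightly below 1.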

Table 1 - Replication of Sequential Implementations 


Table 1 depicts this alternative for the algorithm in Fig- 
ure 4, for different values of j and P, up to the maximum re- 
plication P =r =8. 


Pipelined Systems 


In a pipelined implementation, an algorithm is divided 
into stages and different instances are executed simul- 
taneously at different stages [15]. To perform the partition- 
ing into stages, the algorithm should be sequential (i.e. pre- 
cedences among nodes are such that a total ordering of nodes 
exists). Since each of the nodes in the graph is viewed as in- 
divisible, two restrictions arise: 


• A stage is composed of one node or a set of consecutive
nodes. This partitioning is done (adding delays if necessary) 
so that the resulting stages have an (approximately) uniform 
delay. 


• The maximum possible number of stages in the pipeline is
equal to the number of nodes in the graph. 


The number of stages $S$ is also limited by the number of
independent instances, such that $S \le r$. In terms of the
number of stages $S$ and the stage delay $t_s$, the pipelined implementations
are described by


Comput. Time: $t_{pipe} = [S + (M-1)] \, t_s$


Speedup: $SU_{pipe} = t_{cs} / ([S + (M-1)] \, t_s)$


Assuming that the implementation that uses $j$ operation
units is pipelined, then the stage delay with $j$ units is


$t_s(j) \ge t_{seq}(j) / (M S)$

Therefore,

$SU_{pipe}(j, S) \le \frac{MS}{S + (M-1)} \cdot \frac{t_{cs}}{t_{seq}(j)} = \frac{MS}{S + (M-1)} \, SU_{seq}(j)$


In these expressions, the equality is achieved when the partitioning
of the sequential implementation can be done perfectly,
that is, into stages of uniform delay.


The total number of operation units is the sum of the units
used in each stage. Since we are pipelining the sequential
implementation with $j$ units, $j$ units per stage are used, which
results in an efficiency with respect to the completely-sequential
implementation of


$E_{pipe}(j, S) = \frac{t_{cs}}{t_{pipe}(j, S) \, j \, S} \le \frac{M}{S + (M-1)} \, E_{seq}(j)$


In this case, the efficiency of the sequential implementation is
reduced as a result of the startup time of the pipeline.


Of special interest is the situation in which $M \gg S$. In
such a case, the speedup and the efficiency tend to


$SU_{pipe}(j, S) \le S \, SU_{seq}(j)$


$E_{pipe}(j, S) \le E_{seq}(j)$


For perfect pipelining (no delay added to the stages) these expressions
are identical to those obtained for replicated systems.
Consequently, this alternative would be preferred to replication
when the latter has implementation problems (such
as interconnection complexity), since pipelining only requires
communications between stages. However, pipelining
is limited by the number of nodes in the graph of the algorithm.
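
The pipelining measures can be sketched in the same way. The usage lines assume a hypothetical perfect partition of the completely-sequential implementation (per-instance time 20) into $S = 4$ stages of delay 5, with one unit per stage; whether the example graph admits such a partition is not claimed here, and the function names are illustrative.

```python
def t_pipe(M, S, t_s):
    """Pipeline fill time plus one stage delay per remaining instance."""
    return (S + (M - 1)) * t_s

def su_pipe(t_cs, M, S, t_s):
    return t_cs / t_pipe(M, S, t_s)

def e_pipe(t_cs, M, S, t_s, units_per_stage):
    return t_cs / (t_pipe(M, S, t_s) * units_per_stage * S)

M, S, t_s = 104, 4, 5
t_cs = 20 * M
su = su_pipe(t_cs, M, S, t_s)                    # equals M*S / (S + M - 1)
e = e_pipe(t_cs, M, S, t_s, units_per_stage=1)   # equals M / (S + M - 1)
```

The startup term $S - 1$ keeps the speedup just under $S$ and the efficiency just under 1, and both gaps shrink as $M$ grows relative to $S$.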


Larger efficiency using pipelining can be obtained if not all
stages need the same number of units. Consider for example
the algorithm shown in Figure 6, in a sequential implementation
which uses three units. If such an algorithm is pipelined,
the stages in the middle do not need three units but
only one (since fewer operations are involved), without affecting
the computation time. As a consequence, the resulting
efficiency is higher than what was obtained when all
stages had the same number of units.

Figure 6 - Saving Operation Units in a
Pipelined Implementation

Table 2 - Pipelining of Sequential Implementations


Table 2 shows what is achievable with this approach in 
the algorithm described in Figure 4. The last column in this 
table indicates the number of units which are not needed for 
each one of the implementations. 


The pipelined implementations require registers between
stages. The addition to the cost and to the execution time that
these registers produce is assumed to be negligible. This is
true if the cost and delay of the stages are much larger than
those of the registers. Also, the control of the system becomes
somewhat more complex since each stage has to be
controlled independently and the delays of the stages have to
be made equal.


Systems with Local Parallelism 


We consider that an implementation of an algorithm 
has local parallelism when it exploits the parallelism 
present in the graph to perform independent nodes con- 
currently. Consequently, the time of execution of the 
sequential implementation might be reduced, but this may re- 
quire additional operation units. 


The speedup depends on the characteristics of the graph 
and on the scheduling of the nodes. An optimal schedule has 
to be devised to obtain the maximum speedup for a given 
number of operation units. In general, the determination of 
this schedule requires an exhaustive search, so several subop- 
timal heuristics and upper and lower bounds have been 


developed [16]. However, for algorithms with few nodes the 
exhaustive search is possible and convenient. 


Since the improvements in computation time are, in the 
best case, proportional to the number of units, the speedup 
and the efficiency are bounded by 


$SU_{par}(j) \le j$, $E_{par}(j) \le 1$


Without performing the actual scheduling it is not possible to 
know whether the parallel implementation (with j operation 
units) is more efficient than the corresponding sequential one. 
That might be the case in computations with few instances or 
algorithms with dependences between groups of few in- 
stances. 


This scheme is described in Table 3 for the example. A 
scheduling was performed for each value of j, which resulted 
in some implementations where nodes are executed in parallel 
(marked with *). The table also indicates the implementation 
chosen for each node. 


In this section we have evaluated implementation alter- 
natives for a given algorithm, using only one concurrency 
technique. The results obtained for the example algorithm 
are presented in Figure 7, where they are also compared with 
implementations using combinations of concurrency ap- 
proaches, which are discussed in the next section. 


Implementation of Nodes 
1 2 3 4 5


* : nodes in parallel 


Table 3 - Implementations Using Local Parallelism 


4.- Combinations of Replication, 
Local Parallelism and Pipelining. 


In an implementation it is possible to combine two or all 
three of the approaches discussed previously. The characteris- 
tics of these combinations are described now and the 
corresponding performance and cost measures are evaluated. 


Replication and pipelining 


In this case the pipelined processor is replicated. Because
of the necessity of having enough independent instances,
for the class of algorithms considered here the total
number of processors and stages combined is limited so that
$PS \le r$. The following two possibilities exist:


• Replication of the pipelined completely-sequential implementation.
This alternative produces a speedup of $PS$ and
has efficiency 1 for large $M$ and values of $S$ that result in perfect
pipelining. In such a case, this is equivalent in speedup
and efficiency to the implementation that uses only replication.
Therefore, its only advantage is that it requires fewer
processors than the replicated implementation (which may replicate
processors up to $r$, eventually creating realization
problems such as a complex interconnection among the processors).


• Replication of the pipelined implementations that use more
than one operation unit in some stages. These implementa- 
tions increase the speedup of the single pipelined processor 
and maintain its efficiency. They are therefore suitable for 
higher speedups than that available with replication of the 
completely-sequential implementation. 


For the example discussed, this scheme corresponds to
implementations that are replications of the alternatives in
Table 2, with $PS \le 8$. Table 4 shows, for each possible
number of stages, the configuration that offers the highest
efficiency with that many stages.
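
The constraint on the product of processors and stages limits the design space to a handful of configurations, which can be enumerated directly. The sketch assumes the constraint is $PS \le r$ with $r = 8$ and that at most five stages are possible (one per node of the decomposed graph, which is an assumption made for this illustration); the function name is hypothetical.

```python
def replicated_pipeline_configs(r, max_stages):
    """All (S, P) pairs with S stages, P pipelined processors, and P*S <= r."""
    return [(S, P)
            for S in range(1, max_stages + 1)
            for P in range(1, r // S + 1)]

configs = replicated_pipeline_configs(r=8, max_stages=5)
# Largest replication factor allowed for each number of stages.
max_P = {S: max(P for s, P in configs if s == S) for S in range(1, 6)}
# max_P == {1: 8, 2: 4, 3: 2, 4: 2, 5: 1}
```

Choosing among these configurations still requires the stage delays and unit counts of each pipelined alternative, which this enumeration does not model.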



Table 4 - Replication of Pipelined Processor with 
more than one Unit per Stage 


Replication and graph parallelism 


A processor which uses graph parallelism is replicat- 
ed in this implementation. It provides an increase of speedup 
with an efficiency equal to that of the implementation using 
only graph parallelism. Usually this efficiency will be 
significantly smaller than 1. Consequently, this scheme is 
only effective if the speedup cannot be achieved using a more 
efficient technique. 


For the example discussed, this approach corresponds to
implementations which are replications of the alternatives in
Table 3. The maximum speedup achievable in the example
with this scheme and the corresponding efficiency are

$SU_{rep,par}(P = 8, j = 8) = 40.0$

$E_{rep,par}(P = 8, j = 8) = 0.63$


Pipelining and graph parallelism 


In this case, one or more of the stages of the pipeline 
uses graph parallelism (i.e. it executes more than one node 
concurrently). The partitioning into stages has to be modified 
(with respect to the implementation using pipelining only) to 
obtain stages of equal delay. 


This scheme might be effective in increasing the 
efficiency of the pipeline, since it can help to get stages of 
equal delay and to tailor the number of operation units to the 
exact requirements of the stage. The number of stages is res- 
tricted by the number and the characteristics of the nodes. 
Nodes may be rearranged (preserving the dependences, of 
course) to achieve stages of (approximately) equal delay. The 
smallest stage delay possible is determined by the node with 
the longest computation time. The maximum number of 
stages is determined by the number of nodes in the critical 
path of the graph. 


For the example discussed, Table 5 shows alternative
implementations for different stage delays. Since the critical
path in the graph traverses three nodes, up to three stages are
possible.


Table 5 - Pipelining with Parallelism from the Graph


All three approaches 


A possible application of this alternative is to replicate 
the processor obtained by the use of a combination of 
pipelining and graph parallelism. For the analysis of its ef- 
fectiveness the same considerations apply as those discussed 
in the section on pipelining and graph parallelism, and on re- 
plication and pipelining. 


For the example under discussion, the pipelined imple- 
mentations in Table 5 are replicated as long as PS < 8. Table 
6 shows the possible implementations. This approach pro- 
duces implementations with high speedup and high 
efficiency. The system with highest speedup has the follow- 
ing parameters: 


$SU(S = 3, P = 2, j_{tot} = 40) = 39.22$

$E(S = 3, P = 2, j_{tot} = 40) = 0.98$

where $j_{tot}$ is the total number of units in all processors and
stages.


Table 6 - Implementations with All Techniques
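
The efficiency figures quoted for the example are consistent with taking efficiency as speedup divided by the total number of operation units. The sketch below checks this; it assumes that the replicated graph-parallel case with P = 8 and j = 8 uses 8 x 8 = 64 units in total, which is inferred from the quoted numbers rather than stated explicitly.

```python
def efficiency(speedup, total_units):
    """Efficiency taken as speedup per operation unit (assumed definition)."""
    return speedup / total_units

# Replication of a graph-parallel processor: P = 8 processors, j = 8 units each.
e_rep_par = efficiency(40.0, 8 * 8)

# All three techniques combined: S = 3, P = 2, 40 units in total.
e_all = efficiency(39.22, 40)

print(e_rep_par, e_all)
```

The first value is 0.625, which the text rounds to 0.63; the second is close to 0.98.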


Figure 7 illustrates the largest speedup achievable for
different numbers of operation units, using either one
concurrency technique or a combination of them. From the figure
we infer that the selection of a particular implementation
depends heavily on the characteristics of the algorithm and
the number of operation units available. Depending on the
characteristics of the algorithm, combinations of the different
concurrency techniques might prove more convenient than
replication or pipelining of a completely sequential processor.
However, such a conclusion is only possible after an evaluation
of the alternative implementations of the algorithm.


Conclusions 


We have identified replicated and pipelined implementa- 
tions for computations that consist of multiple instances of a 
basic algorithm. We have concentrated on the case in which 
the multiple instances can be divided into groups, and there is 
a dependency between the corresponding instances of con- 
secutive groups. Moreover, the implementations considered 
use only one type of operation unit. 


We have devised a methodology to analyze such algo- 
rithms, using a dependence graph as the description tool. 
Such methodology leads to the evaluation of different imple- 
mentations for an algorithm, at any level of decomposition. In 
this methodology, at a particular level one looks at all possi- 
ble alternatives using either one or a combination of the con- 
currency techniques, searching for those cases which offer 
the highest efficiency. The analytical results obtained in this 
paper should be useful to facilitate the identification of those 
alternatives with higher efficiency. It might seem that this
approach implies an exhaustive search for every possible
implementation and is too costly in terms of the effort involved.
However, that is not the case, since the structured methodology
allows one to keep the number of nodes under evaluation at a
given time restricted to a reasonable quantity. Therefore, it is
possible to perform an intelligent search in the selection of
the alternatives, which reduces the number of cases to evaluate.

Figure 7 - Speedup for Implementations with Highest Efficiency

Our results demonstrate that the selection of the most
efficient implementation is strongly dependent on the algo- 
rithm that is implemented. This is particularly true when the 
dependence graph of the algorithm shows widely varying de- 
grees of concurrency at different steps in the computation. 
Furthermore, we have been able to establish conditions for
the concurrency techniques to be more effective in producing
the highest speedup with a given number of operation units.
Our results show that, for the class of algorithms considered, 
replication (or pipelining in some cases) alone is convenient 
up to a certain speedup value, with the limit imposed by the 
data dependences in the computation. For higher throughput, 
it is necessary to use combinations which include the local 
parallelism in the graph of the algorithm.

Figure 8 - Speedup and Throughput in Singular Value
Decomposition for a 40 by 40 Matrix

We have applied this methodology to the analysis of
alternatives for a Singular Value Decomposition (SVD)
processor [7]. In particular, we have used it to compare
pipelined implementations using graph parallelism with a
linear systolic array (i.e. replication of a sequential
implementation using graph parallelism) proposed for such
computation [5]. Figure 8 depicts the difference in throughput
obtained with those implementations. The plots show that the
linear array is convenient only for lower throughput, whereas
higher speedup is achieved with better efficiency in the
pipelined architecture.

Abstract--The HAP is a MIMD-type highly parallel
processor with 4096 PEs that basically uses nearest
neighbor mesh (NNM) connection. It offers exceedingly
high data transfer capability and reliability.
Data transfer capability is upgraded by
multi-layering the PE arrays and utilizing the upper
layer to transfer data in parallel to and from the
lower layer, and also to reduce the inter-PE data
transfer delay. To upgrade reliability,
automatic high-speed system reconfiguration during
failure is realized by a new fault-tolerant
configuration and a new parallel diagnosis method
that uses neighboring PEs. Thus, the HAP can be
expected to attain high system performance in
proportion to its number of PEs (max. 16 GIPS, 1.8
GFLOPS) and to realize high reliability, with an
availability of nearly 1 and an MTTF of 1 year.


1. Introduction 


Research into parallel processors is being
carried out on a worldwide scale [1]-[4] to meet
the increasing need for high performance
computers. Approaches that use $10^3$-$10^4$
or more processing elements (PEs) are receiving
particular attention [5][6], stimulated by
recent progress in LSI technology. In such a
highly parallel processor, upgrading data transfer
capability and reliability appears to hold the most
promise for developing a practical machine,
although the development of a parallel processing
algorithm for each application is presupposed.

The system performance of a parallel processor 
is evaluated on the total time from initial data 
supply to processed data output. It is 
progressively limited by the data transfer time to 
and from all the PEs, rather than their processing 
time, as the number of PEs increases and therewith 
the overall processing power increases. 
Accordingly, the most important problem concerns 
data transfer to and from all PEs. The next 
important problem is the data transfer delay 
between PEs. This is because the maximum 
internode distance (1 + the number of PEs used as 
a relay) increases with the number of PEs in any 
network-type parallel processor [7], and the data 
transfer delay increases accordingly. 

In considering reliability, imagine a system
consisting of $10^4$ PEs, each with a failure rate of
$10^4$ FIT. This means a mean time between
failures (MTBF) of only 10 hours. In this case, one
repair is needed every 10 hours on average, and if
the repair is time consuming, the availability is lowered.
Thus automatic failure recovery and a short
recovery time become important.
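
The 10 hour figure can be checked with a short calculation. The sketch below assumes the standard definition of 1 FIT as one failure per $10^9$ device-hours, and assumes that the system fails as soon as any one PE fails (constant, independent failure rates); the function name is illustrative.

```python
FIT_HOURS = 1e9  # 1 FIT = one failure per 10^9 device-hours

def system_mtbf_hours(num_pes, fit_per_pe):
    """MTBF of a system that fails when any one of its PEs fails."""
    total_fit = num_pes * fit_per_pe  # failure rates of independent PEs add up
    return FIT_HOURS / total_fit

print(system_mtbf_hours(10**4, 10**4))  # 10.0
```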

Conventional research rarely touches on such
problems. To cope with them, we studied the
architecture of a system with the nearest neighbor
mesh (NNM) connection, selected as suitable for a
highly parallel structure. From those studies, we
have developed a highly parallel processor, the
hierarchical array processor (HAP). A small scale
version is under trial at this time.

The HAP is a MIMD type processor with 4096 PEs 
designed for scientific calculation and speech, 
picture and other recognition processing 
applications. The previously mentioned problems 
are handled with a hierarchical PE array structure 
that utilizes its upper layer for data transfer, 
together with a new fault-tolerant configuration 
and parallel diagnosis of PEs. 


2. System Architecture 


2.1 System Configuration and Data-Transfer 

In order to cope with the data transfer
problems, we adopted a hierarchical PE array
structure similar to the EGPA [8], namely a large
scale array with a small scale array above it.
The number of PEs in the smaller array is
approximately the square root of that in the larger one.
By using the small array to carry out
data transfer to and from the large PE array and
between PEs in the array, we aim to realize a data
transfer rate that matches the large array's
processing power and to reduce the inter-PE
data transfer delay. Hence, even if
the number of PEs increases, a high
system performance that is not limited by data
transfer capability can be realized.


2.1.1 System Configuration 

Figure 1 shows the system configuration of the
HAP. In the HAP, multiple users are assumed in
order to make full use of the processing power of
all the PEs. That is, it is intended to be used as the
back-end processor for a number of user computers.


1) Configurations and Roles of Each Block 

The PE array consists of a maximum of 4096
(64x64) PEs, and executes parallel tasks. The
control PE array (cPE array) has a maximum of 64
(8x8) control PEs (cPEs). Together with the data
I/O mechanism, it performs the input and output of
data to the PE array. It also relays the inter-PE
data transfers. Besides these, hierarchical
parallel tasks can also be executed. The PEs and
the control PEs are basically microcomputers. A
system management processor (SMP), using a general
purpose computer, controls the whole system. The
data I/O mechanism performs the input and output of
data to user computers and cPEs, and also data
buffering.


2) Physical Connection among PEs 

PEs inside the PE array are physically connected as an
NNM with torus connection, in consideration of the ratio
of total data transfer capability to the hardware quantity
needed for connection between PEs, and the ease of
physical expansion. Therefore, each PE is
connected with its four nearest neighbor (north,
south, east and west) PEs. Although
torus-connection slightly increases the inter-PE
wiring length, it has the merit of reducing the
maximum internode distance to half, compared to
that of only an NNM, and is therefore applied.
The cPE array has the same connection scheme
as the PE array. In inter-layer connection, such
as the connection between PEs and cPEs, called
lower PEs and upper PEs, respectively, a bus is
used to reduce the amount of connecting hardware.
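
The halving of the maximum internode distance can be checked with the standard diameter formulas for a square mesh and a square torus. The sketch below is only an illustration; the function names are not from the paper.

```python
def mesh_diameter(side):
    """Maximum internode distance of a side x side NNM without wraparound."""
    return 2 * (side - 1)

def torus_diameter(side):
    """Maximum internode distance of a side x side torus-connected NNM."""
    return 2 * (side // 2)

for side in (8, 64):  # cPE array (8x8) and PE array (64x64)
    print(side, mesh_diameter(side), torus_diameter(side))
```

For the 64x64 PE array this gives 126 hops without the torus connection and 64 hops with it, i.e. roughly half.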


3) Logical Coupling between PEs 


Logical coupling via memory is used between
PEs, between cPEs, and between PEs and cPEs, in order
to simplify program description and debugging (Fig. 2).
Consequently, each PE can, as part of its own memory,
directly access the memory contents of its neighboring PEs.
In addition, a cPE can also directly access the
memory of the PEs to which it is connected. In
this case, when a particular PE is
specified, only the memory of that PE can be accessed, but
when none is specified, the memories of all PEs can be
accessed.


4) Bypass of the cPE Array 


The cPE can be bypassed during the debugging of 
a PE program or during processing that does not 
require it. During such circumstances, the SMP 
and the PE array are directly logically-coupled. 


2.1.2 Data Transfer 

In the operation of a parallel processor, the
program and the required initial data are supplied
to the PEs in the first phase. Then, tasks
expanded in parallel are executed in the PEs with
the necessary data transfer between the PEs. Finally,
the processed data are collected by the user
computer. That is, there are usually four types
of data transfer in a parallel processor, namely,
program load, initial data supply, inter-PE data
transfer and result collecting. Since system
performance is evaluated by the total processing
time, which includes these data transfers,
improvement in the transfer capabilities increases
the value of a parallel processor. In the HAP,
rapid data transfer is realized through the
following approach.

Fig. 1 Configuration of HAP System


1) Program Load 

Load distribution, instead of function
distribution, is used to improve the performance
in highly parallel processors. The programs for
all PEs are basically identical, so they are
broadcast to all cPEs and PEs by the SMP.
Moreover, independent program load to each PE is
also possible, in consideration of occasions when
the programs are different for part of the PEs.
These are realized through memory accesses (write
operations) to the lower PEs or PE by the upper
PE.


2) Initial Data Supply and Result Collecting 


Since the data to be processed by each PE is
different, it would ordinarily be transferred from the
SMP in serial fashion, which is a potential bottleneck in
system performance. Therefore, in the HAP, data
elements are supplied in parallel through the
cPEs. Specifically, the initial data buffered in
the data I/O mechanism is transferred
simultaneously to all the cPEs, then each cPE
transfers it to the PEs that are coupled to it.
Result collecting is basically similar, except that the
direction of data flow is reversed. These data
transfers are DMA transfers by cPEs.
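
The benefit of supplying data through the cPEs can be sketched with a very rough timing model. The model below is an assumption made purely for illustration and is not taken from the paper: every link moves one word per time unit, the data I/O mechanism feeds all cPEs at the same time, and each cPE then forwards data to its own PEs one after another.

```python
def serial_supply_time(n_pes, words_per_pe):
    """Time units when one source feeds every PE in turn (assumed model)."""
    return n_pes * words_per_pe

def hierarchical_supply_time(n_pes, n_cpes, words_per_pe):
    """Time units when all cPEs are fed and forward in parallel (assumed model)."""
    pes_per_cpe = n_pes // n_cpes
    to_cpes = pes_per_cpe * words_per_pe  # all cPEs receive simultaneously
    to_pes = pes_per_cpe * words_per_pe   # each cPE serves its own PEs in turn
    return to_cpes + to_pes

print(serial_supply_time(4096, 100))            # 409600
print(hierarchical_supply_time(4096, 64, 100))  # 12800
```

Under these assumptions the two-step supply is faster by a factor of about Nc/2; the real gain depends on bus and DMA characteristics that this sketch does not model.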


3) Inter-PE Data Transfer 


A parallel processor with NNM connection is
very suitable for processing problems where
inter-PE data transfer occurs locally or regularly
(e.g. solving Poisson's equation by the ODD-EVEN
SOR method, or a Fast Fourier Transform (FFT))
[4][9]. However, it is not suitable for problems
where transfers occur irregularly between PEs with
large internode distances, e.g., logic simulation.
This is because the maximum internode distance is
still as large as $\sqrt{N}$ (N: the number of PEs) in this
kind of parallel processor even if
torus-connection is employed together with the
NNM; therefore, much time is needed for these data
transfers in such a problem.


Table 1 Inter-PE Data Transfer Method

Method                     Summary                          Maximum            Maximum
                                                            Internode          Transfer
                                                            Distance           Parallelization

a) Routing in PE Array     Same directional transfer        $2\sqrt{N} - 2$    N
                           repeated by all PEs
b) Array Relaying          Transfer relaying other          $\sqrt{N}$         N
                           PEs in PE Array
c) SMP Relaying            Transfer relaying the SMP        2                  1
                           (cPEs are bypassed)
d) Hierarchical Relaying   Transfer from PE to cPE,         $\sqrt{N_c} + 2$   $N_c$
                           relaying in cPE array, and
                           then from cPE to PE

N: Number of PEs
$N_c$: Number of cPEs


To make it more suitable for the latter type of
problem, a new data transfer mode utilizing the
cPE (Table 1d) in addition to the usual modes
(Tables 1a-1c) is provided to reduce the maximum
internode distance and the inter-PE data transfer
delay in the HAP. This mode is used for data
transfer between PEs where the internode distance
is greater than $\sqrt{N/N_c}$ ($N_c$: the number of cPEs).
Hence, for the HAP, wherein N = 4096 and $N_c$ = 64
(Fig. 1), the maximum internode distance is 10.
This is shorter than that (equal to 12) of a
hypercube type [7] parallel processor with an equal
number of PEs. Moreover, this mode can be used
together with b), thus helping to improve the
transfer rate. In the HAP, any of the several
data transfer modes shown in Table 1 can be used,
depending on the type of problem. These data
transfers are realized through the repetition of
memory access to the neighboring PE, memory
accesses to the lower PE by the upper PE, or both
of them.
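
The distances quoted above can be reproduced with a small calculation. The sketch below uses the distance expressions as read from Table 1 (the table is damaged in this copy, so the expressions for modes c and d are a reconstruction), and the standard hypercube diameter of log2 of the node count; the function names are illustrative.

```python
import math

def max_internode_distances(n_pes, n_cpes):
    """Maximum internode distance of each transfer mode, as read from Table 1."""
    side = math.isqrt(n_pes)        # side of the square PE array
    cpe_side = math.isqrt(n_cpes)   # side of the square cPE array
    return {
        "a": 2 * side - 2,   # routing in the PE array
        "b": side,           # relaying in the torus-connected PE array
        "c": 2,              # PE -> SMP -> PE
        "d": cpe_side + 2,   # PE -> cPE, relay in the cPE array, cPE -> PE
    }

def hypercube_diameter(n_nodes):
    """Diameter of a hypercube with n_nodes nodes (n_nodes a power of two)."""
    return n_nodes.bit_length() - 1

print(max_internode_distances(4096, 64))  # {'a': 126, 'b': 64, 'c': 2, 'd': 10}
print(hypercube_diameter(4096))           # 12
```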


2.2 Fault-Tolerant Configuration 
1) Redundancy Configuration and its Inter-PE
Connection Network


There have been a number of proposals [10][11]
for establishing redundancy in an NNM
configuration. However, we use a new redundancy
configuration that features simple inter-PE
connection and easy switching control. Figure 3
shows its configuration schematically. One row
and one column of spare PEs are provided for an
n x n PE array. If the number of faulty PEs in
a column does not exceed one, the column is
considered good; thus n good PEs, out of n+1 PEs,
are available. If there is more than one faulty
PE, the column is considered faulty; thus n good
columns, out of n+1 columns, are available,
forming an n x n good PE array.

With the fabrication of a switching circuit
inside each PE, the switching of a faulty PE in
this redundancy configuration is realized through
the inter-PE connection network (called an IX
network due to its shape) shown in Fig. 4 a). In
the IX net, each PE is physically connected with 6
neighboring PEs. The switching of a faulty PE or
a faulty column is done by bypassing it, as
shown in Figs. 4 b) and c), respectively. That
is, four connections are used out of a PE's six
possible connections, and the IX net becomes
functionally an NNM.
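
The two-level rule described above (a column tolerates one faulty PE, and the array tolerates one faulty column) can be sketched as a simple check. The sketch assumes the physical array is (n+1) x (n+1), i.e. the n x n array plus one spare row and one spare column; the function name and the fault map are hypothetical.

```python
def array_is_usable(faulty):
    """Return True if an n x n good array can still be formed.

    faulty: (n+1) x (n+1) grid of booleans, faulty[r][c] is True
            when the PE in row r, column c has failed.
    """
    size = len(faulty)
    bad_columns = 0
    for c in range(size):
        faults_in_column = sum(faulty[r][c] for r in range(size))
        if faults_in_column > 1:   # more than one faulty PE: column is faulty
            bad_columns += 1
    return bad_columns <= 1        # only one spare column is available

# Hypothetical 3 x 3 physical array (n = 2).
grid = [
    [False, True,  False],
    [False, False, False],
    [False, True,  True],
]
print(array_is_usable(grid))  # True: one faulty column, one column with a single fault
```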


2) Effect of Application of Redundancy 
Configuration 
Fig. 5 MTTF improvement by Two-Level
n-out-of-n+1 Fault-Tolerant Array

The result of the application is shown in
Fig. 5. Here [Degree of MTTF improvement G] =
[MTTF of the redundantly configured system] /
[MTTF of the nonredundantly configured system],
where MTTF (Mean Time to Failure) is defined as
the elapsed time from when all PEs are placed in
good condition to when all spare PEs are used up.
Figure 5 shows that this redundancy configuration
has a better effect than simply duplicating each
PE, in spite of the smaller amount of redundancy,
causing G to exceed $\sqrt{N}$ (N: the number of PEs) in
the area where n is relatively small (<10). In
the HAP, this configuration is used on each group
of PEs that are connected to a cPE. Hence, if the
PE failure rate is $10^3$ FIT, the MTTF of the HAP,
excluding the SMP and the data I/O mechanism, 
would be about 1 year. 


3. Processing Element (PE) 


In realizing a highly parallel
processor with 4096 PEs, it is necessary to
miniaturize the PE. Though it would be ideal to
fabricate the whole PE on a single LSI chip,
commercially available microprocessors, RAMs and gate
arrays are used for the following reasons:

i) It is difficult to fabricate the PE for a
MIMD machine, in which large memory is
essential, on one LSI chip, even with the
present advanced LSI technology.

ii) Fabrication of a memory-less PE on one LSI
chip does not have much effect on
miniaturization.

iii) Existing compilers and other software can
be utilized without having to develop them
when commercially available microprocessors are
used.

Moreover, the PE and the cPE have the same configuration,
thus reducing the number of LSI types that need to
be developed for the HAP.


3.1 Configuration and Function 

Figure 6 shows the configuration and Table 2
shows the specification of the PE.


1) Central Processing Unit (CPU)

Any of Intel's 80186, 80286, or 80386
processors can be used as the CPU. It executes
fixed point arithmetic and controls the whole PE.


2) Arithmetic Processing Unit (APU) 


This is the co-processor for floating point 
arithmetic. Any of Intel's 8087, 80287 or 80387 
processors can be used as the APU in a combination 
with the CPU shown in Table 2. 


3) Memory (MEM) 


Besides the storage of PE programs and data,
the MEM is used for various types of data
transfer. From the point of view of upgrading PE
performance, the MEM should be configured as two
independent memory banks to avoid contention
between CPU and APU memory accesses and data
transfer memory accesses. However, in the PE of a
MIMD machine, there are far fewer of the latter type
than the former (e.g. less than one tenth in the
ODD-EVEN SOR method). Accordingly, degradation of
PE performance due to contention can be ignored
and the MEM is configured as one bank. This makes
miniaturization of the PE and expansion of the
memory area used for data transfer possible.


4) Memory Control Unit (MCU) 


The MCU arbitrates the contention between 
memory accesses of the CPU and APU, and memory 
accesses due to data transfers, and it transmits 
data transfer requests and the several control 
requests to the CCU. In addition, it provides the 
interface between the microprocessors and the 
memory, namely the CPU and the APU and the MEM. 

The MCU is configured so that it can be used by
various types of microprocessors such as the CPU
and the APU shown in Table 2. It is realized on a
1-chip gate array packaged in a pin grid array
(PGA) case with 176 pins.

Fig. 6 Configuration of Processing Element (PE)


Table 2 PE Specification


5) Communication Control Unit (CCU)


The CCU performs the data transfers between
PEs, cPEs and the SMP and provides the control
that is needed for them. Packeted information
(Fig. 7) is transferred asynchronously, bidirectionally
and in time-division fashion. Every interface,
namely, north, south, east, west, upper and lower,
has four data lines with bus structure. Moreover,
all the interfaces between the PEs possess a
bus-arbiter function so that any PE, cPE or SMP
can operate as a master or as a slave.
Additionally, the CCU performs the control
concerning the synchronization of PEs and the
start, stop or other relevant control (when used
as a cPE). These controls will be discussed in
the next section. The CCU is realized by the same
type of gate array as the MCU.


6) Bus Interface (B.INT) 


This is used for interfacing with general-purpose buses
such as the Multibus, the VME bus and others. The cPE
is connected to the data I/O mechanism through it.


3.2 Control Method 
1) Fundamental Control Scheme 

It is necessary to have controls for starting,
stopping, data transfer and synchronization of PEs
in a parallel processor. In the HAP, these controls are
initiated by one instruction to ease program
development, and are realized in hardware to
provide high speed execution. Figure 8
shows the fundamental control scheme, where all the
control requests of the controlling PE (including
the SMP and the cPE) are made in the form of a
label access to variables. These become
pseudo-memory accesses during processor operation.
Based on this access information, the control
circuits recognize the control request and
generate the necessary control information.

The control information is transferred to the
controlled PE (or PEs) through a physical link
such as data lines or control lines. The
controlled PE uses this control information to
realize the various control operations through the
generation of interrupt vectors, interrupt
operations and execution of the required
interrupt handling routines.

However, controls that demand rapid execution,
such as the synchronization of PEs (in this case,
the controlling PE is also the controlled PE), are
realized by controlling the "Ready" signal of the
CPU. Specifically, as each PE generates its
synchronization request, its CPU enters the wait
condition for the "Ready" state. Once the
concurrence of the synchronization requests of every
PE is detected, the "Ready" signal is generated and
the CPU proceeds to the execution of the next
instruction. With this control, the synchronization
of all the PEs is realized in less than one
microsecond in the HAP.


2) Flexible Synchronization Mechanism 

A flexible synchronization mechanism utilizing
the hierarchical structure is realized in the HAP
(Fig. 9). Its key element is the sync-mask register,
which specifies whether the synchronous signal is
returned to its own layer or propagated to the upper
layer. The propagation of synchronous signals is
controlled with this register, so that the
synchronization of all PEs including the cPEs and the
SMP, local synchronization of the PEs that are
connected to one cPE, and other functions are possible.
This mechanism is expected to be especially useful in
recognition processing (particularly the pattern
matching it involves), such as character
recognition, speech recognition and others where
hierarchical parallel algorithms are often
used. Furthermore, the synchronization between
neighboring PEs (or cPEs) can easily be realized
through the setting and referring of shared
variables such as flags, semaphores and others,
since the couplings are via memory, as shown in
Fig. 2.

Fig. 9 Flexible Synchronization Mechanism
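
The role of the sync-mask register can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the HAP circuit: each group of PEs under one cPE forms a local AND of its synchronization requests, and a per-group mask bit (here called `propagate_up`, a hypothetical name) decides whether that result is returned to the group itself or combined with the other groups in the upper layer.

```python
def ready_signals(requests, propagate_up):
    """Return the "Ready" state seen by each group of PEs.

    requests:     list of groups, each a list of booleans
                  (True when that PE has issued its sync request)
    propagate_up: one boolean per group; True sends the group's sync
                  signal to the upper layer, False returns it locally
    """
    local_ready = [all(group) for group in requests]
    # The upper layer waits for every group that propagates its signal.
    upper_ready = all(r for r, up in zip(local_ready, propagate_up) if up)
    return [
        upper_ready if up else local
        for local, up in zip(local_ready, propagate_up)
    ]

requests = [[True, True], [True, False], [True, True]]
print(ready_signals(requests, [True, True, False]))  # [False, False, True]
```

In this example the third group synchronizes locally and proceeds, while the first group keeps waiting because the second group, with which it is synchronized through the upper layer, is not yet ready.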


4. Performance Estimation


The basic goal in developing the HAP is to
realize a system whose performance is
proportional to the number of its PEs. System
performance $P_N$ is given by the expression

$P_N = P_0 N \alpha$,

where $P_0$ is the processing capability of a single
PE, N is the number of PEs and $\alpha$ is the operating
efficiency of one PE. $\alpha$ is defined as the ratio
of the performance per PE with N PEs operating in
parallel to $P_0$. $\alpha$ is dependent on the
parallelization algorithm for the problem, with
$\alpha = 1$ in the ideal case. The assumptions are that
the total processing quantities do not vary with
the parallel expansion, the overhead of the
parallel processor, i.e., data transfer and
synchronization, can be ignored, and the loads are
equally divided among all PEs. Under such
conditions, the peak system performance of the
HAP, using an 80386 and an 80387 as the CPU and
the APU respectively, is

$P_N = 4\ \mathrm{MIPS} \times 4096 \times 1 \approx 16\ \mathrm{GIPS}$
(for fixed point arithmetic or general data
processing)

$P_N = 0.45\ \mathrm{MFLOPS} \times 4096 \times 1 \approx 1.8\ \mathrm{GFLOPS}$
(for floating point arithmetic).
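
The peak figures follow directly from the expression for system performance. A minimal sketch, using the per-PE ratings quoted in the text (the function and variable names are illustrative):

```python
def system_performance(p0, n_pes, alpha=1.0):
    """System performance as single-PE capability x number of PEs x efficiency."""
    return p0 * n_pes * alpha

peak_mips = system_performance(4, 4096)       # fixed point, 4 MIPS per PE
peak_mflops = system_performance(0.45, 4096)  # floating point, 0.45 MFLOPS per PE

print(peak_mips / 1000, "GIPS")
print(peak_mflops / 1000, "GFLOPS")
```

This gives about 16.4 GIPS and 1.84 GFLOPS, which the text rounds to 16 GIPS and 1.8 GFLOPS. Any operating efficiency below 1 scales both figures down proportionally.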

However, $\alpha$ is usually not equal to 1. In the
HAP, the factors that lower $\alpha$ from the hardware
point of view are sufficiently coped with by the
improvement in data transfer capability mentioned
in 2.1.2 and by the speed-up in synchronization
mentioned in 3.2. An example of this effect is
shown in Fig. 10. This is for the case of solving
simultaneous linear equations with M
unknowns using Gaussian elimination. The
main factor in the lowering of $\alpha$ is the method of
supplying initial data (i.e. coefficients of the
equations) to the PEs. In the HAP, this is
performed in parallel by the cPEs. This improves
$\alpha$ up to the solid line shown in the same figure.
Hence, the HAP can be expected to have a higher
system performance than the usual NNM model on a
number of problems, including the aforementioned
ones.


5. System Auto-Reconfiguration


Automatic, speedy system reconfiguration is
realized in the HAP to improve its availability
and extend its maintenance period. That is, in
the HAP, a faulty PE is promptly detected, and an
NNM connected array excluding it is automatically
reconfigured. This is realized through the use of
the newly proposed parallel diagnosis and
switching control methods.


1) Parallel Diagnosis of PEs 


This method employs a technique that uses the
diagnosis from neighboring PEs to judge the
condition of each PE (called NV diagnosis).
Figure 11 shows the concept of NV diagnosis.
This diagnosis is carried out on all the PEs
simultaneously using the following procedure:

i) The diagnosed PE runs the test program inside
itself.

ii) The test result is sent to the 4 neighboring
PEs. A comparison between the actual and the
expected result is done in both the
neighboring PEs and itself.

iii) The neighboring PEs send their comparison
results back to the diagnosed PE.

iv) Using the results for a decision by
majority, the diagnosed PE judges its own
condition.

A faulty condition of the diagnosed PE can be
known through comparison between the actual and
the expected result in the neighboring PEs, since
the test result will differ from the one expected.
Moreover, even if one or two neighboring PEs are
faulty, so long as 3 out of 5 (4 neighboring PEs
and the diagnosed PE) are in good condition, the
faulty condition of the diagnosed PE can be
detected through these 3. Consequently, assuming
that no more than 2 out of every 5 PEs are faulty,
autonomous, parallel diagnosis is possible within
the PE array.
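
The majority decision in step iv) can be sketched as follows. This is an illustrative model only: each of the five comparison results (the diagnosed PE itself plus its 4 neighbors) is represented as a boolean that is True when the test result matched the expected result, and a faulty judge is allowed to report a wrong verdict. The function name is hypothetical.

```python
def pe_is_faulty(verdicts):
    """Majority decision over the comparison results.

    verdicts: booleans from the diagnosed PE and its 4 neighbors,
              True meaning "test result matched the expected result".
    """
    passes = sum(verdicts)
    # The PE is judged faulty unless a strict majority reports a match.
    return passes <= len(verdicts) // 2

# Faulty PE: its own verdict and one faulty neighbor wrongly report a match,
# but the three good neighbors report a mismatch.
print(pe_is_faulty([True, True, False, False, False]))  # True

# Good PE with two faulty neighbors that wrongly report a mismatch.
print(pe_is_faulty([True, False, False, True, True]))   # False
```

As long as at least 3 of the 5 verdicts come from PEs in good condition, the majority reflects the true condition of the diagnosed PE.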

2) Switching Control Method


In the IX net in 2.2, a switching control to
exclude the faulty PE and column is necessary
(Figs. 4b and c). This control is done as
follows:

i) The control signals, Fr and Fc, for indicating
the existence of a faulty column and a faulty
PE, respectively, are transmitted in the
manner shown in Fig. 12 a). Figure 12 b)
shows the generation circuit for these
signals.

ii) The interface direction and bypass of the PE
are determined and then the switching is
performed in each PE using these signals
together with the condition of the PE itself
(Fpe) and the condition of its own column
(Fclm).

Once a faulty PE is detected, the switching is
automatically done within the PE array.

Fig. 12 Propagation and Generation of Switching
Control Signals


3) Auto-Reconfiguration Procedures


The system reconfiguration is done at the time
of power on, or during system reset when a PE
fails. Initially, all PEs are considered as good
and diagnosis of all PEs, except spare ones, is
carried out. As a result, if faulty PEs are
detected, their flags are set (Fig. 11). Then an
NNM connected array, excluding the faulty PEs, is
automatically reconfigured by the method in 2).
This procedure is executed three times at most
before system reconfiguration is completed
(Fig. 13). In the HAP, the maintenance period can
be extended to the MTTF defined in 2.2 with an
availability of nearly 1 by means of this system
auto-reconfiguration.


6. Programming for HAP 


A lot of research [12]-[14] has been done to
develop a parallel processing oriented language
that makes the abstraction of parallelism in the
problem easy. However, the present automatic
abstraction of parallelism tends to be at low
levels, such as the instruction level or the DO
loop level. We believe that to achieve
an effective improvement in performance, which is
the primary goal of a parallel processor,
attention must be paid to the higher level of
parallelism in the MIMD machine, but that the
abstraction of this parallelism still requires
human involvement. Consequently, in the HAP,
the programmer must be conscious of the parallel
structure corresponding to the level of the
program written.


1) Software Structure 


In order to allow performance 
ease-of-use, and also in consideration of the 
utilization of the HAP as a back-end processor, 
the software shown in Fig. 14 is provided. To the 
HAP user, only writing the program for his 
computer by utilizing the library program, not 
being conscious of the HAP hardware, is required. 
That is, the user's job is expanded to parallel 
tasks and subrouted at the library level. | 

Specifically, a parallel processing algorithm 
suitable to the job is developed at this level, 
and then the programs for executing the algorithm, 
divided into a PE program and an SMP program, are 
written. The main function of the PE program is 
the parallel algorithm, and the SMP program is for 
program load, control of PEs and other functions. 
When assigning general processing to the cPE, 
aside from data transfer and synchronization, the 
writing of a cPE program is necessary too. 
Fig. 14 Software Structure 


SMP Program 

Procedure SMP; 
begin 
    (Program Load) 
    s_prog_load('PE'); 
    (PE Start) 
    s_PEstart; 
    (Initialization) 
    s_sync(PE_init); 
    (Initial Data Supply) 
    for N=1 to PE_number do 
        s_data_load('PE_file(N)'); 
    (Synchronization by SMP) 
    s_PErestart; 
    (Synchronization by PEs) 
    s_sync(PE_end); 
    (Result Collection) 
    for N=1 to PE_number do 
        s_data_get('PE_file(N)'); 
end. 

PE Program 

Procedure PE; 
begin 
    (Initialization) 
    s_sync(end_init); 
    (Wait) 
    (Process) 
    s_sync(sync_mode);    (Inter-PE Synchronization) 
    (Process) 
    s_sync(end); 
end. 


Fig. 15 Example of Program Description (PASCAL) 


However, writing these programs is easy (Fig. 15) 
since they can use the control programs provided 
as subroutines, and the fundamental controls are 
described by single instructions, as mentioned in 
3.2. The control programs are written for the 
various controls mentioned in 3.2. 


2) Program Development Environment 
In consideration of library program development 
using a general computer (not parallel), procedure 
calls and function calls are used instead of macro 
instructions in order to reduce dependency on the 
machine and language used. 

For programming, high level languages such as 
PASCAL, C, Ada, FORTRAN and others can be used 
without any modification. 


7. Conclusion 


The configuration of a MIMD type highly 
parallel processor, the HAP, with 4096 PEs is 
discussed. The HAP appears capable of ensuring 
practical reliability and data transfer capability 
that matches the processing capability of its PEs, 
which are considered the main problems in high 
parallelization. 

The problems of data transfer are solved by 
adopting a hierarchical PE array structure, 
namely, a large scale array with a small scale 
array above it, and utilizing the latter to 
transfer data. Specifically, the small PE array 
is used to parallelize the data transfer to and from 
the large PE array and to reduce the inter-PE data 
transfer delay by relaying it. With these 
approaches, the HAP is expected to attain a system 
performance near the peak value of 16 GIPS, 1.8 
GFLOPS (in the case of using 80386 and 80387 
processors) over a broad range of applications. 

Reliability problems are solved with the 
application of a fault-tolerant configuration 
where one row and one column of spare PEs are 
provided to each n x n PE array and automatic 
switching is done in PE and column units. Also, 
it applies a newly proposed parallel diagnosis 
method, called NV diagnosis, that uses neighboring 
PEs. Fast, automatic system reconfiguration is 
realized through these features, and thereby the 
maintenance period (equal to the MTTF) is about 1 
year (with a PE failure rate of 10° Fit) with an 
availability of approximately 1. 

A small scale version with 256 PEs and 16 cPEs 
is under fabrication. Each PE consists of 13 
LSIs, namely 80186 and 8087 type processors, 9 
DRAMs (256kbits/chip) and 2 gate arrays (PE size 
9cm x 6cm x 3cm). Testing of its capability for 
general scientific calculation and various types 
of recognition together with an overall evaluation 
is scheduled. 
Abstract 


In this paper, Mesh-Connected Computer (MCC) algorithms 
for computing several properties of a set of possibly intersecting 
rectangles are presented. Given a set of n iso-oriented rectangles, 
we describe MCC algorithms for determining the following properties: 
(i) The area of the logic "OR" of these rectangles (i.e. the area 
of the region covered by at least one rectangle). (ii) The area of 
the logic "AND" of the rectangles (i.e. the area of the region 
covered by two or more rectangles). (iii) The largest number of 
rectangles that overlap. This solves the fixed-size rectangle placement 
problem, i.e. given a set of planar points and a rectangle, find a 
placement of the rectangle in the plane so that the number of 
points covered by the rectangle is maximal. (iv) The minimum 
separation between any pair of a set of non-overlapping rectangles. 
All these algorithms can be implemented on a $\sqrt{n} \times \sqrt{n}$ MCC in 
$O(\sqrt{n})$ time. The best known algorithms for the above problems 
are sequential and have optimal $O(n \log n)$ time complexity. 


I. Introduction 


A two-dimensional Mesh-Connected Computer (MCC) con- 
sists of a number of identical processors arranged in a two- 
dimensional array with interconnections between every pair of 
horizontally and vertically adjacent processors. Each processor 
has a fixed number of registers and is capable of performing arith- 
metic and boolean operations. The MCC operates as a single- 
instruction stream, multiple-data stream (SIMD) computer in 
which several processors execute the same instruction in parallel 
on different data items. The simplicity of the inter-processor com- 
munication pattern and the economical layout afforded by an 
MCC have resulted in actual implementations in recent years 
[1,2,3]. 


Several MCC algorithms for problems in diverse computa- 
tional areas have been discovered. Thompson and Kung [4] and 
Nassimi and Sahni [5] presented fundamental MCC algorithms for 
sorting, which form the basis of data routing techniques [6]. Algo- 
rithms for solving matrix computations [7], solving graph- 
theoretic problems [8,9,10,11,12,13,14] and more recently for 
problems in computational geometry [15,16] have also been 
presented. 


In this paper we present MCC algorithms for computing 
several interesting properties of a set of n, possibly overlapping, 
iso-oriented rectangles. (An iso-oriented rectangle is one whose 
sides are parallel to the coordinate axes). These problems belong 
to the class of rectangle intersection problems in computational 
geometry that has been widely studied for its applications in VLSI 
design [17], computer graphics and architectural data bases [18]. 
The specific problems that we cover are as follows. 


Given a set of n iso-oriented rectangles determine: 

1. The area of the logic "OR" of these rectangles, that is the area of 
the region that is covered by at least one rectangle. 

2. The area of the logic "AND" of these rectangles, that is the area 
of the region that is covered by two or more rectangles. 

3. The maximal number of rectangles that overlap. This also 
solves the fixed rectangle placement problem, which asks one to 
determine the placement of a given rectangle in the plane so that it 
includes the largest number of a set of given planar points. 

4. If the rectangles are non-intersecting, the distance between the 
closest pair of rectangles. 


The algorithms presented in this paper employ a divide and 
conquer technique, based on the "separational principle" for planar 
sets proposed by Güting [19], and require $O(\sqrt{n})$ time on a 
$\sqrt{n} \times \sqrt{n}$ MCC. Up to constant factors, our algorithms are optimal 
on the "standard" MCC model being used, and compare favorably 
with the known sequential algorithms which have $O(n \log n)$ time 
complexity. 


In sections II through V of the paper we present our algo- 
rithms for the problems mentioned earlier. In the discussion we 
assume familiarity with techniques for sorting a set of records dis- 
tributed evenly among the processors (see [4]) and for performing 
parallel random access reads (RAR) or writes (RAW) on an MCC 
(see [6]). 

II. “OR” Area Reporting 


A rectangle with its sides parallel to the coordinate axes is 
called an iso-oriented rectangle. Given a set of n iso-oriented rec- 
tangles, the "OR" area reporting problem is to find the area of the 
region that is covered by one or more rectangles. 


Consider the 2n vertical segments that make up the n rectangles. 
The set of horizontal lines passing through the points at 
the bottom and the top of those segments partitions the plane into 
horizontal strips. Each strip is bounded in the vertical dimension 
by two horizontal rectangle edges (not necessarily belonging to 
the same rectangle), and each horizontal edge of a rectangle coin- 
cides with a strip’s boundary. The area of the region covered by 
the logic "OR" of the set of rectangles is the sum of the covered 
area in each strip. The y-interval covered by each strip is simply 
the difference in the y-coordinates of the horizontal lines bounding 
the strip; so what we are interested in is only the x-interval 
covered by each strip. We solve the problem efficiently by using a 
divide-and-conquer approach as described below. 
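Before turning to the divide-and-conquer scheme, the strip decomposition itself can be illustrated sequentially. The sketch below is not the MCC algorithm; it is a plain reference computation, and the function name `union_area` and the `(x1, y1, x2, y2)` tuple layout are assumptions made for illustration. For each horizontal strip it merges the x-intervals of the rectangles spanning the strip and multiplies the covered x-length by the strip height.

```python
def union_area(rects):
    """Area covered by at least one rectangle.

    rects: list of (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2
    (hypothetical layout chosen for this sketch).
    """
    # Horizontal lines through every bottom and top edge define the strips.
    ys = sorted({y for (_, y1, _, y2) in rects for y in (y1, y2)})
    total = 0
    for y_lo, y_hi in zip(ys, ys[1:]):
        # x-intervals of the rectangles that span this strip completely.
        spans = sorted((x1, x2) for (x1, y1, x2, y2) in rects
                       if y1 <= y_lo and y_hi <= y2)
        covered = 0
        cur_lo = cur_hi = None
        for x1, x2 in spans:
            if cur_hi is None or x1 > cur_hi:
                # Start a new disjoint interval; close the previous one.
                if cur_hi is not None:
                    covered += cur_hi - cur_lo
                cur_lo, cur_hi = x1, x2
            else:
                # Overlapping or touching interval: extend the current one.
                cur_hi = max(cur_hi, x2)
        if cur_hi is not None:
            covered += cur_hi - cur_lo
        # Covered x-length times strip height.
        total += covered * (y_hi - y_lo)
    return total
```

Two 2 x 2 squares offset by one unit in each direction, `union_area([(0, 0, 2, 2), (1, 1, 3, 3)])`, share a unit square, so the covered area is 4 + 4 - 1 = 7.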


Sort all the vertical segments by their x-coordinates in non- 
decreasing order. Divide the plane into two slabs by a vertical 
line, L, that partitions the set of vertical segments into two equal- 
sized subsets so that every segment to the left (respectively right) 
of the dividing line L, has an x-coordinate less than (respectively 
greater than or equal to) the x-coordinate of L. 


Iteratively partition each slab obtained into a left and right 
slab in a similar manner, until each slab finally contains one vertical 
segment. Each slab (except for the two at the ends) is bounded 
by two dividing lines, which we refer to as the left boundary and 
right boundary of the slab. Take the left boundary of the 
leftmost slab to be a line passing through the leftmost segment 
and the right boundary of the rightmost slab to be a line passing 
through the rightmost segment. Figure 1 illustrates an example of 
the partitioning for eight vertical segments. 


Figure 1. (b) Partitioning of vertical segments into slabs (slabs 0 through 7). 


The algorithm proceeds to merge together adjacent slabs in a 
binary tree fashion. All merges at a level of the computation tree 
are done in parallel, and the computation is completed after $\log_2 n$ 
iterations. Every merge step merges together slabs that lie to the 
left and right respectively of a common dividing line. Let us 
denote the left and right slabs being merged as L and R respectively. 
Let $l_1, l_2, \ldots, l_k$, $l_i \le l_{i+1}$, be the y-coordinates of the segments 
in L and $r_1, r_2, \ldots, r_k$, $r_i \le r_{i+1}$, the y-coordinates of the segments 
in R. 


The horizontal lines through $l_i$, $i = 1, \ldots, k$, partition the slab L 
into k horizontal strips, where the i-th strip of L, $1 \le i \le k-1$, 
denoted by $\langle l_i, l_{i+1} \rangle$, consists of the region between $l_i$ and $l_{i+1}$, and 
the k-th strip (dummy) has width zero. In a similar fashion, the 
horizontal lines through $r_i$, $i = 1, \ldots, k$, partition R into k horizontal 
strips. 


Immediately prior to the step which merges L and R, the 
algorithm would have computed for each strip in L (and R), the 
portion of the total area that is contained in that strip. The area 
contributed by strip $\langle l_i, l_{i+1} \rangle$ in L is maintained implicitly by the 
pair $l_i$.width, $l_i$.length, where $l_i$.width for L is $(l_{i+1} - l_i)$ and 
$l_i$.length for the strip has been computed just prior to the current 
merge step. Similar definitions hold for a strip $\langle r_i, r_{i+1} \rangle$ in R. 


Fig. 2(a) Prior to merge of L and R. Shaded areas represent the area of 
the strip that is part of the total area. Strip lengths listed in the 
figure, in order: for L: b, b+d, d, 0; for R: a, c, c, 0. 

Fig. 2(b) Following the merge of L and R into S. 


Figure 2 illustrates a possible situation where k=4. The slabs 
L and R are merged to form a slab S, bounded on the left by the 
left boundary of L with x-coordinate $x_{min}$, and on the right by the 
right boundary of R with x-coordinate $x_{max}$. 


Slab S contains all the line segments in L and R. Let $s_i$, 
$i = 1, \ldots, 2k$, $s_i \le s_{i+1}$, be the ordered list of y-coordinates of segments 
in S. Obviously, $s_1, \ldots, s_{2k}$ is obtained by merging together 
the ordered lists of points $l_1, \ldots, l_k$ and $r_1, \ldots, r_k$ corresponding to 
segments in L and R respectively. That is, the strips are refined in 
the merge step. To simplify the presentation of the merging algorithm, 
we present some definitions. The strip $\langle s_i, s_{i+1} \rangle$ of S, 
which lies between the horizontal lines passing through $s_i$ and 
$s_{i+1}$, is called a right_originating strip (respectively 
left_originating strip) if $s_i = r_j$ (respectively $s_i = l_j$), for some 
j, $1 \le j \le k$. If u and v are vertical segments of the same rectangle, 
then u (v) is a partner of v (u). If the x-coordinate of u is less 
(respectively greater) than that of its partner v, then u is a left 
(respectively right) segment of the rectangle. A segment in S is 
said to be a left_open segment if it is the right segment of a rectangle, 
lies in R and does not have a left partner in S. A segment in 
S is said to be a right_open segment if it is the left segment of a 
rectangle, lies in L and does not have a right partner in S. A strip 
$\langle s_i, s_{i+1} \rangle$ is a left_open (respectively right_open) strip if it is 
crossed by a left_open (respectively right_open) segment. 

At the end of the step that merges L and R, we should determine 
$s_i$.length, the length of the strip $\langle s_i, s_{i+1} \rangle$ that is to be 
included in the overall area. This is computed by determining 
$(s_i.\mathit{length})_L$ and $(s_i.\mathit{length})_R$, which are the portions of $s_i$.length 
that lie to the left and right of the dividing line respectively, and 
then taking their sum. 


The algorithm which computes $s_i$.length for each $s_i$, $i = 1, \ldots, 2k$, 
is presented in Algorithm 1. 


Algorithm 1: 
for each strip $\langle s_i, s_{i+1} \rangle$, $i = 1, \ldots, 2k-1$, do 
    if $\langle s_i, s_{i+1} \rangle$ is a left_originating strip, where $s_i = l_j$ 
    then 
    begin 
        if $\langle s_i, s_{i+1} \rangle$ is a left_open strip, then 
            /* the entire portion of the strip in L must be included */ 
            $(s_i.\mathit{length})_L = x_{divider} - x_{min}$ 
        else /* no change in the portion covered in L */ 
            $(s_i.\mathit{length})_L = l_j.\mathit{length}$ 
        if $\langle s_i, s_{i+1} \rangle$ is a right_open strip, then 
            /* the entire portion of the strip in R must be included */ 
            $(s_i.\mathit{length})_R = x_{max} - x_{divider}$ 
        else /* no change in the portion covered in R */ 
            /* Determine the strip $\langle r_p, r_{p+1} \rangle$ in R that previously 
               covered the region now covered by $\langle s_i, s_{i+1} \rangle$ */ 
            /* $r_p$ is the point in $r_1, \ldots, r_k$ such that 
               $r_p \le s_i < r_{p+1}$ */ 
            $(s_i.\mathit{length})_R = r_p.\mathit{length}$        (**) 
    end 
    else /* $\langle s_i, s_{i+1} \rangle$ is a right_originating strip, where $s_i = r_j$ */ 
    begin 
        if $\langle s_i, s_{i+1} \rangle$ is a right_open strip then 
            $(s_i.\mathit{length})_R = x_{max} - x_{divider}$ 
        else 
            $(s_i.\mathit{length})_R = r_j.\mathit{length}$ 
        if $\langle s_i, s_{i+1} \rangle$ is a left_open strip then 
            $(s_i.\mathit{length})_L = x_{divider} - x_{min}$ 
        else 
            /* let $l_p$ be the point in $l_1, \ldots, l_k$ 
               such that $l_p \le s_i < l_{p+1}$ */ 
            $(s_i.\mathit{length})_L = l_p.\mathit{length}$ 
    end 
    $s_i.\mathit{length} = (s_i.\mathit{length})_L + (s_i.\mathit{length})_R$ 
end 


We now describe the MCC algorithm for the OR-Area 
reporting problem. The PEs are assumed to be indexed in shuffled 
row-major order [4]. 


MCC Algorithm 

/* Initialization */ 

1. Sort the vertical segments by their x-coordinates into non- 
decreasing order on the mesh in shuffled row-major order. 


Each PE contains one segment. 


2.  /* Determine left and right boundaries of current merge 
regions */ 
PE(i) executes: 
if i is even   $x_{max}$ = x-coordinate of segment in PE(i+1) 
               $x_{min}$ = x /* x-coordinate of segment in PE(i) */ 
if i is odd    $x_{min}$ = x-coordinate of segment in PE(i-1) 
               $x_{max}$ = x /* x-coordinate of segment in PE(i) */ 


3.  /* initialize length covered by each segment */ 
/* PE(i) maintains the strips corresponding to the top $y_t$ and 
bottom $y_b$ points of the segment it contains */ 
$y_t$.length = 0 /* dummy strip */ 
if left segment then $y_b$.length = $x_{max}$ - x 
else $y_b$.length = x - $x_{min}$ 


We now present details of the MCC implementation of the 
merge step corresponding to Algorithm 1. 


Immediately prior to the merge step, points corresponding to 
segments in L and R are available in separate adjacent submeshes, 
two points per PE, sorted by y-coordinates. 

Each point has with it the following information: 


(i) the length of strip covered by it, in variable length; 

(ii) a flag bottom/top indicating if it is the top or bottom point 
of a vertical segment; 

(iii) a flag left/right indicating if it is the left or right segment of 
a rectangle; 

(iv) a flag full indicating if the partner of the edge has been 
found or not; 

(v) the co-ordinates of the point; 

(vi) the id. of the rectangle to which the point belongs. 

Each point also has the following global information: 

(i) $x_{divider}$, $x_{min}$ and $x_{max}$; 

(ii) the variable local_rank which is the position of the point in 
the sorted (by y-coordinates) list of points in its half. 


Steps executed by a PE: 


1. Merge the sorted lists of points in the two submeshes into a 
single sorted list, (using the y-coordinates as the key,) into 
non-decreasing row-major order. Ties are broken by the id. 
of the rectangle to which the point belongs. This ensures 
that points corresponding to segments that are partners will 
be adjacent in the sorted list. Let global_rank denote the 
position of a point in the merged list. 


2. Each PE checks if the point adjacent to a point within it in 
the merged list has the same rectangle id. If so it sets full to 
1. 


3.  Determine whether each strip in the merged list is left_open 
or right_open. We explain the steps needed to determine if 
it is left_open (symmetrical steps will determine if the strip 
is right_open). Basically, to determine if $\langle s_i, s_{i+1} \rangle$ is 
left_open we count the number of right segments in R whose 
bottom y-coordinate is less than $s_i$ and top y-coordinate 
exceeds $s_i$. 

Every point that is the top (respectively bottom) of a 
left_open segment (true iff full = 0, it belongs to R and is a 
right segment) sets a local counter, count, to 1 (respectively 
-1). By finding the sum of all "counts" that precede the 
point, every point can determine whether it lies in a 
left_open segment or not (if the sum is greater than 0, it 
does, else it does not). 

Since the points are sorted in row-major order, this is 
easily accomplished by a row-sweep that obtains the sum of 
counts within a row, followed by a column-sweep, which 
determines the sum up to the beginning of a row, followed 
by a final row-sweep, which distributes this partial sum to 
each PE in that row. 
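The three sweeps can be sketched sequentially as follows. This is only an illustration of the data flow, not mesh code: the function name `mesh_prefix_counts` and the list-of-rows representation of the mesh are assumptions, and each loop stands in for a sweep that the PEs would carry out in parallel across rows or down a column.

```python
def mesh_prefix_counts(counts):
    """For every cell of a row-major mesh, return the sum of all values
    that strictly precede it in row-major order.

    counts: list of rows, each a list of integers (hypothetical layout).
    """
    # Row sweep: running sum inside each row, plus the row total.
    within_rows = []
    row_totals = []
    for row in counts:
        running = 0
        prefix = []
        for value in row:
            prefix.append(running)
            running += value
        within_rows.append(prefix)
        row_totals.append(running)

    # Column sweep: sum of the totals of all earlier rows.
    row_offsets = []
    running = 0
    for total in row_totals:
        row_offsets.append(running)
        running += total

    # Final row sweep: distribute each row's offset to its cells.
    return [[offset + p for p in prefix]
            for offset, prefix in zip(row_offsets, within_rows)]
```

For the 2 x 2 mesh `[[1, 0], [-1, 1]]`, the row-major sequence is 1, 0, -1, 1, so the sums of preceding values are 0, 1, 1, 0 and the result is `[[0, 1], [1, 0]]`.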


4. Execute the steps described in Algorithm 1. 

The only information that is not locally available to a PE 
corresponds to the case marked by a "**" in Algorithm 1. 
However, each PE can in constant time determine the index 
of the PE which contains the desired information as follows. 
Since local_rank is the position of the point among points 
only in its sorted list (say A) and global_rank is the position 
of the point in the list obtained by merging lists A and B, the 
difference (actually global_rank - local_rank - 1) gives the 
position of the desired point in the sorted list B. By per- 
forming one RAR from the PE with index (global_rank - 
local_rank - 1 + Base address of the block of PEs containing 
the sorted list B ), a PE can obtain the required information. 
(We presented a detailed description of such a scheme for 
other MCC algorithms earlier in [16].) 
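The rank arithmetic can be checked with a small sequential sketch. It assumes 0-based ranks and distinct keys, and the function name `predecessor_offsets` is hypothetical. For each point of list A it returns the offset, within list B, of the last point of B that precedes it in the merged order (or -1 if there is none); adding the base address of B's block would give the PE index used by the RAR.

```python
def predecessor_offsets(a, b):
    """a, b: sorted lists with distinct keys (assumption for this sketch).

    Returns a list whose entry i is global_rank - local_rank - 1 for a[i],
    i.e. the 0-based offset in b of the largest element of b below a[i].
    """
    # Tag every key with its source list and its local rank, then merge.
    merged = sorted([(v, 0, i) for i, v in enumerate(a)] +
                    [(v, 1, i) for i, v in enumerate(b)])
    offsets = [None] * len(a)
    for global_rank, (_, source, local_rank) in enumerate(merged):
        if source == 0:
            # global_rank - local_rank counts the elements of b that precede
            # this point; subtracting 1 gives the offset of the last of them.
            offsets[local_rank] = global_rank - local_rank - 1
    return offsets
```

With `a = [1, 4, 6]` and `b = [2, 3, 5, 7]`, the point 4 has local rank 1 and global rank 3, so its offset is 1 and `b[1] = 3` is the point bounding the strip of B that contains it.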


The time needed to find the area of the "OR" of a set of n 
iso-oriented rectangles on a $\sqrt{n} \times \sqrt{n}$ MCC is $O(\sqrt{n})$. The sorting 
in the initialization step can be done in $O(\sqrt{n})$ time. Every merge 
of two sets of k vertical segments takes place in a submesh of size 
no larger than $2\sqrt{k} \times 2\sqrt{k}$. Step 1, the merging of two sorted 
lists, can be done in $O(\sqrt{k})$ time. Row and column sweeps 
required in Step 3 also take no more than $O(\sqrt{k})$ time. Finally, 
the RAR required in Step 4 also can be done in $O(\sqrt{k})$ time. The 
total time $T(n)$ is therefore $T(n) \le c(\sqrt{n} + \sqrt{n/2} + \sqrt{n/4} + \cdots + 1)$, 
which is $O(\sqrt{n})$. 
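The bound can be made explicit with a short calculation. It assumes that the merge at level $i$ of the computation tree operates on sets of $n/2^i$ segments and costs at most $c\sqrt{n/2^i}$; the constant $c$ is the one hidden in the $O(\sqrt{k})$ bounds for the individual steps, and bounding the finite sum by the infinite geometric series is a convenience of this sketch.

```latex
\begin{align*}
T(n) &\le c \sum_{i \ge 0} \sqrt{\frac{n}{2^i}}
      = c \sqrt{n} \sum_{i \ge 0} \left(\frac{1}{\sqrt{2}}\right)^{i} \\
     &= \frac{c \sqrt{n}}{1 - 1/\sqrt{2}}
      = (2 + \sqrt{2})\, c \sqrt{n}
      = O(\sqrt{n}).
\end{align*}
```

The sum over all merge levels is thus within a constant factor of the cost of the top-level merge alone.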


III. “AND” Area Reporting 


The "AND" of two rectangles A and B is a third rectangle 
which includes the region that is overlapped by A and B. Given a 
set of n iso-oriented rectangles, the "AND" area reporting problem 
is to find the area of the region which is covered by at least two 
rectangles. 


We present an algorithm to solve this problem on a $\sqrt{n} \times \sqrt{n}$ 
MCC in $O(\sqrt{n})$ time. The algorithm follows the same general 
principle as the algorithm for reporting the OR-area detailed in 
section II, differing in the merge step which combines vertical segments 
on opposite sides of a dividing line. In fact, this algorithm 
may be considered a generalization of the "OR" problem, in that 
the solution to the latter is also obtained simultaneously. 


Consider a strip $\langle l_i, l_{i+1} \rangle$ in L, the region to the left of the 
dividing line currently being merged as in section II. 


Assume that up to now the algorithm has computed the two 
quantities $l_i$.length and $l_i$.overlap for each strip $\langle l_i, l_{i+1} \rangle$ in L. 
$l_i$.length is the quantity computed by Algorithm 1 and represents 
the region in $\langle l_i, l_{i+1} \rangle$ covered by at least one rectangle. $l_i$.overlap 
represents the region in $\langle l_i, l_{i+1} \rangle$ covered by two or more rectangles 
(i.e. the overlapped area) and has been computed in the previous 
iteration. 


In a similar manner, the corresponding quantities $r_i$.length 
and $r_i$.overlap for all strips $\langle r_i, r_{i+1} \rangle$ in R have been computed. 


The merging of regions L and R partitions the merged 
region S into strips $\langle s_i, s_{i+1} \rangle$. Algorithm 2 below presents the 
computation of $s_i$.overlap using the values of $l_i$.overlap, $l_i$.length, 
$r_i$.overlap, $r_i$.length already computed. 


Algorithm 2 
for each $\langle s_i, s_{i+1} \rangle$ in parallel do 
    if $\langle s_i, s_{i+1} \rangle$ is a left_originating strip where $s_i = l_j$ 
    then begin 
        if $\langle s_i, s_{i+1} \rangle$ is a left_open strip then 
            /* entire portion of strip $\langle l_j, l_{j+1} \rangle$ that was part 
               of some rectangle is now in the overlapped area */ 
(1)         $(s_i.\mathit{overlap})_L = l_j.\mathit{length}$ 
        else /* no change in overlapped area in L */ 
(2)         $(s_i.\mathit{overlap})_L = l_j.\mathit{overlap}$ 
        if $\langle s_i, s_{i+1} \rangle$ is a right_open strip then 
            /* Let $\langle r_p, r_{p+1} \rangle$ be the strip in R that is now 
               crossed by $\langle s_i, s_{i+1} \rangle$ */ 
(3)         $(s_i.\mathit{overlap})_R = r_p.\mathit{length}$ 
        else 
(4)         $(s_i.\mathit{overlap})_R = r_p.\mathit{overlap}$ 
    end 
    else /* $\langle s_i, s_{i+1} \rangle$ is a right_originating strip */ 
    begin 
        /* symmetrical code */ 
    end 
    $s_i.\mathit{overlap} = (s_i.\mathit{overlap})_L + (s_i.\mathit{overlap})_R$ 
end 


Note that we also need to update s;.length during this merge 
step, using Algorithm 1. Fig. 3 shows the situation corresponding 
to (1) and (3) of Algorithm 2. 


Figure 3. (a) Before merge: $l_j$.overlap = a+c, $r_p$.overlap = 0; 
$l_j$.length = a+b+c+d, $r_p$.length = p. The figure marks a right_open 
segment and a left_open segment. 
(b) After merge: Case 1: $(s_i.\mathit{overlap})_L$ = a+b+c+d; 
Case 3: $(s_i.\mathit{overlap})_R$ = p; $s_i$.overlap = a+b+c+d+p. 


The MCC algorithm is similar to that in section II, and computes 
the AND-area in $O(\sqrt{n})$ time. 
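When experimenting with an implementation of the merge-based algorithms, a simple sequential reference is useful for checking results. The sketch below is such a reference and not the MCC algorithm: it uses coordinate compression and counts, for every elementary grid cell, how many rectangles cover it. The function name `area_covered_at_least` and the `(x1, y1, x2, y2)` tuple layout are assumptions; with `k = 1` it gives the "OR" area and with `k = 2` the "AND" area. It runs in cubic time and is meant only for small test inputs.

```python
def area_covered_at_least(rects, k):
    """Area of the region covered by at least k of the given rectangles.

    rects: list of (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2
    (hypothetical layout chosen for this sketch).
    """
    xs = sorted({x for (x1, _, x2, _) in rects for x in (x1, x2)})
    ys = sorted({y for (_, y1, _, y2) in rects for y in (y1, y2)})
    total = 0
    for x_lo, x_hi in zip(xs, xs[1:]):
        for y_lo, y_hi in zip(ys, ys[1:]):
            # Number of rectangles that contain this elementary cell.
            depth = sum(1 for (x1, y1, x2, y2) in rects
                        if x1 <= x_lo and x_hi <= x2
                        and y1 <= y_lo and y_hi <= y2)
            if depth >= k:
                total += (x_hi - x_lo) * (y_hi - y_lo)
    return total
```

For the two squares `(0, 0, 2, 2)` and `(1, 1, 3, 3)`, the "OR" area is 7 and the "AND" area is the shared unit square, 1.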


IV. Maximum Overlap Problem 


Given a set of n iso-oriented rectangles, the maximum overlap 
problem is to determine the largest number of rectangles that 
overlap, i.e. include a common region. We present below an 
$O(\sqrt{n})$ time algorithm to solve the problem on a $\sqrt{n} \times \sqrt{n}$ MCC. 
At the expense of some additional bookkeeping, the algorithm 
can be modified to report a region of maximal overlap as well. 


Just prior to the merge step that we are about to describe, 
each strip $\langle l_i, l_{i+1} \rangle$ in L has computed $l_i$.max_overlap, equal to the 
maximum number of rectangles that overlap some region in strip 
$\langle l_i, l_{i+1} \rangle$. Similarly, each strip $\langle r_i, r_{i+1} \rangle$ in R has computed 
$r_i$.max_overlap. To determine the maximum number of 
rectangles that overlap a strip $\langle s_j, s_{j+1} \rangle$ in the merged region 
S, we determine two quantities, viz. $(s_j.\mathit{max\_overlap})_L$ and 
$(s_j.\mathit{max\_overlap})_R$, which are the maximum number of rectangles 
overlapping a point in strip $\langle s_j, s_{j+1} \rangle$ that lies in L or in R respectively. 
$s_j$.max_overlap is then the larger of these two computed 
quantities. 


Algorithm 3 below presents the details of the merge for a 
left_originating strip $\langle s_j, s_{j+1} \rangle$. A symmetrical set of steps would 
apply if $\langle s_j, s_{j+1} \rangle$ were a right_originating strip. 


Algorithm 3 
/* assuming $\langle s_j, s_{j+1} \rangle$ is a left_originating strip, $s_j = l_i$ */ 

1. Find the number of left_open segments crossing $\langle s_j, s_{j+1} \rangle$. 
Denote this by $n_L$. 

2. Each such left_open segment (which by definition belongs 
to R) must overlap all rectangles crossing $\langle l_i, l_{i+1} \rangle$. Thus the 
maximum number of overlapping rectangles in the portion of 
$\langle s_j, s_{j+1} \rangle$ that lies in L is computed as follows. 

$(s_j.\mathit{max\_overlap})_L = (l_i.\mathit{max\_overlap}) + n_L$; 

3. Find the number of right_open segments crossing $\langle s_j, s_{j+1} \rangle$. 
Denote this by $n_R$. This region prior to the merge lay in strip 
$\langle r_p, r_{p+1} \rangle$, where $r_p \le l_i < r_{p+1}$. 

4. $(s_j.\mathit{max\_overlap})_R = (r_p.\mathit{max\_overlap}) + n_R$; 

5. $s_j.\mathit{max\_overlap} = \mathrm{Max}[(s_j.\mathit{max\_overlap})_L, (s_j.\mathit{max\_overlap})_R]$; 


At the end of the last merge step, the algorithm has computed 
for each strip, the maximum number of rectangles overlapping 
within that strip. The algorithm is completed by finding the 
largest of these quantities over all the strips. As before the time 
complexity of the algorithm is $O(\sqrt{n})$ on a $\sqrt{n} \times \sqrt{n}$ mesh. 


A related problem in computational geometry is the fixed 
size rectangle placement problem. Given n points in the plane, 
and an iso-oriented rectangle of fixed size, the problem is to find a 
placement of the rectangle in the plane so that the number of 
points covered by the rectangle is maximized. This problem is 
equivalent to the maximum overlap problem. Let the center of the 
rectangle be the intersection of its two diagonals. Generate n iso- 
oriented rectangles of the size of the given rectangle and let each 
be centered at one of the given points. Find the region of max- 
imum overlap for the rectangles, i.e. find the common region 
which is covered by the largest number of rectangles. Note that if 
a rectangle A covers rectangle B’s center then rectangle B can 
certainly cover A’s center since A and B are of the same size (see 
Fig. 4). The given rectangle should be placed centered at a region 
of maximal overlap. 
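The reduction can be sketched sequentially as follows. This is a brute-force illustration under stated assumptions, not the MCC algorithm: rectangles are treated as closed (a point on the boundary counts as covered), the function name `best_placement` is hypothetical, and the search relies on the observation that a region of maximum overlap among closed rectangles contains a point whose x-coordinate is some rectangle's left edge and whose y-coordinate is some rectangle's bottom edge.

```python
def best_placement(points, w, h):
    """Place a w-by-h iso-oriented rectangle to cover the most points.

    points: list of (x, y) pairs. Returns (count, center), where center is
    the (x, y) position of the rectangle's center, or None if points is empty.
    """
    # One rectangle of the given size centered at each point.
    rects = [(px - w / 2, py - h / 2, px + w / 2, py + h / 2)
             for px, py in points]
    best_count, best_center = 0, None
    # Candidate centers: every (left edge, bottom edge) combination.
    for cx in {r[0] for r in rects}:
        for cy in {r[1] for r in rects}:
            depth = sum(1 for (x1, y1, x2, y2) in rects
                        if x1 <= cx <= x2 and y1 <= cy <= y2)
            if depth > best_count:
                best_count, best_center = depth, (cx, cy)
    return best_count, best_center
```

For the points (0, 0), (1, 0) and (5, 5) with a 2 x 2 rectangle, the first two generated rectangles overlap, so the best placement covers two points; a rectangle centered at the returned position covers exactly the points whose generated rectangles contain that position.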




Figure 4 


V. Minimum Separation Between Rectangles 


Given a set of n non-overlapping iso-oriented rectangles, the 
minimum separation problem is to determine the distance 
between the closest pair of rectangles. The distance between two 
rectangles is defined to be the minimal (Euclidean) distance 
between all pairs of points, one belonging to each rectangle. If p 
and q are points on two non-overlapping rectangles P and Q 
respectively, then there are only three possibilities by which p and 
q can determine the distance between P and Q, namely : 

(i) p lies on the left (right) vertical segment of P and q on the 
right (left) vertical segment of Q, and p and q have the same y- 
coordinate. 

(ii) p lies on the top (bottom) horizontal segment of P and q on 
the bottom (top) horizontal segment of Q, and p and q have the 
same x-coordinate. 

(iii) p is the northeast (northwest, southeast, southwest) corner of 
P and q is the southwest (southeast, northwest, northeast) corner 
of Q. In this case no points of P and Q have either a common x- 
or a common y-coordinate. 
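The three cases collapse into one formula once the horizontal and vertical gaps between the rectangles are computed. The sketch below is a direct sequential illustration; the function name `rect_distance` and the `(x1, y1, x2, y2)` tuple layout are assumptions. A zero vertical gap corresponds to case (i), a zero horizontal gap to case (ii), and two positive gaps to the corner-to-corner case (iii).

```python
import math

def rect_distance(p, q):
    """Euclidean distance between two non-overlapping iso-oriented
    rectangles, each given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    px1, py1, px2, py2 = p
    qx1, qy1, qx2, qy2 = q
    # Horizontal gap: positive only if the x-ranges are disjoint.
    dx = max(qx1 - px2, px1 - qx2, 0)
    # Vertical gap: positive only if the y-ranges are disjoint.
    dy = max(qy1 - py2, py1 - qy2, 0)
    # dy == 0: case (i); dx == 0: case (ii); both positive: case (iii).
    return math.hypot(dx, dy)
```

For example, the unit square `(0, 0, 1, 1)` and the rectangle `(4, 5, 6, 7)` share neither an x- nor a y-coordinate, so their distance is the corner-to-corner distance `hypot(3, 4) = 5`.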


We refer to the distance between P and Q defined by 
each of the three cases above as the horizontal, vertical and diagonal 
separation respectively. The minimum separation between the 
rectangles is the smallest of the horizontal, vertical and diagonal 
separations between all pairs of rectangles for which the 
corresponding separation is defined. Our algorithm to compute 
the minimum separation uses a divide-and-conquer strategy along 
the basic lines of the previous algorithms. 


Assume inductively that we have computed $\delta_L$ and $\delta_R$, the 
minimum separation between all pairs of rectangles (possibly 
incomplete) in the left and right regions, L and R, being merged. 
In the merge step, we compute $\delta_S$, the minimum separation 
between rectangles in region S, as follows: Let $\delta = \mathrm{Min}[\delta_L, \delta_R]$. 
We determine whether (a) the horizontal separation, (b) the vertical 
separation or (c) the diagonal separation between any pair of 
rectangles in S is less than $\delta$. If so we update $\delta_S$ to the smallest of 
the three values computed above, else $\delta_S = \delta$. We outline the main 
steps in the algorithm. 


Horizontal Separation 

We need to identify pairs of segments, one in L and the 
other in R, which might have a horizontal separation less than $\delta$. 
Since the rectangles are non-overlapping, the only possibility for 
this occurs if a right vertical segment in L and a left vertical segment 
in R intersect a common horizontal strip, and the segment in 
L (respectively R) is the rightmost (respectively leftmost) segment 
crossing that common strip. (See Fig. 5(a).) 


To implement the merge step, we therefore need to maintain 
for each strip in L (respectively R) information about the rightmost 
(respectively leftmost) edge in the strip. If in a refined strip 
we find that the rightmost segment in L and the leftmost segment 
in R crossing that strip are a right segment and a left segment 
respectively, we compute the horizontal separation between the two 
rectangles, equal to the difference in the x-coordinates of the segment 
in R and the segment in L. To complete the merge, the rightmost 
(respectively leftmost) segment in the refined strip is updated to be 
the rightmost (respectively leftmost) segment that crossed the strip 
in R (respectively L). Finally, determine $\delta_h$, the minimum of the 
horizontal separations computed for each strip. 


Set $\delta_S = \mathrm{Min}[\delta, \delta_h]$. It is easy to see that these steps may be 
implemented on an MCC in a manner similar to the previous three 
algorithms. 


Vertical Separation 
We need to identify pairs of segments, one in L and the other 
in R, which define two rectangles whose vertical separation is less 
than $\delta$. 


Figure 5. (a) Left segment in R. (b) Right-open segment and left-open segment. 


Observe first that, since the rectangles are non-overlapping, 
any strip must be bounded by either the top and bottom horizontal 
edges of the same rectangle or by a pair consisting of the top and 
the bottom edges respectively of two different rectangles. A strip 
of the second type can influence the minimum vertical separation 
if and only if the horizontal edges of the two rectangles that bound 
the strip share a common x-coordinate. Consider a strip $\langle s_i, s_{i+1} \rangle$ 
in the merged region S, where $s_i = l_j$ and $s_{i+1} = r_m$. The strip 
defines a vertical separation of $(s_{i+1} - s_i)$ between two rectangles if 
and only if either $l_j$ is a right_open segment, or $r_m$ is a left_open 
segment (see Fig. 5(b)). A symmetrical set of conditions holds if 
$s_i = r_j$ and $s_{i+1} = l_m$. 


In the merge step of the algorithm, for each refined strip 
$\langle s_i, s_{i+1} \rangle$ we determine whether it satisfies the conditions stated 
above. If so, we determine the vertical separation in the strip to be 
equal to $s_{i+1} - s_i$. Finally, update $\delta_S = \mathrm{Min}[\delta_S, \delta_v]$. The implementation 
of this step on the MCC is simpler than all other cases considered 
so far, and does not require any random-access read. As 
before $\delta_v$ is the minimum of these computed values for all the 
strips. 


Finally, we present the algorithm for examining the diagonal 
separation between the rectangles. 


Diagonal Separation 


We need to identify pairs of segments, one in L and the other 
in R, which define two rectangles whose diagonal separation is 
less than $\delta$. 


In the merge step we check whether any northeast (respectively 
southeast) point in L is within a distance $\delta$ of a southwest 
(respectively northwest) point of R. A crucial observation that 
enables efficient implementation of the step is that all the points of 
the same type (e.g. all the southwest points in R or all the 
northeast points in L) obey the sparsity restriction, in that the distance 
between any two such points will be greater than $\delta$. This 
follows because the distance between two such points must always 
exceed the minimum separation between the rectangles computed 
so far which in turn is at least $\delta$ (see Fig. 6). Let us focus only on 
the northeast points in L and the southwest points in R. (Similar 
considerations hold for southeast points in L and northwest points 
in R.) We need to consider only those northeast points in L and 
those southwest points in R that lie within $\delta$ of the dividing line 
and determine if any such pair of points, one from L and one from 
R, are closer than $\delta$ to each other. Since all the points under consideration 
in R and L are at least $\delta$ apart, by the density of planar 
point packing [20], it follows that for any northeast point in L, 
there can be at most a constant number (four in this case) of 
southwest points in R which can be closer to the point in L than 
the minimum separation between rectangles computed so far, i.e. 
$\delta$. Thus, for a northeast point in L we only need to examine the 
distances between it and at most four southwest points in R to 
determine if the pair from L and R are closer than the minimum 
distance between rectangles found so far. This step of the problem 
thus reduces to a variant of the closest pair problem for a set of 
planar points. We can use the merging step of the MCC algorithm 
presented in [16] to determine the closest pair of points consisting 
of a northeast point from L and a southwest point from R. We 
very briefly outline the steps in the implementation. Sort all the 
northeast points in the left slab and all the southwest points in the 
right slab which are within a horizontal distance of $\delta$ from the 
boundary together. Record for each point, its global_rank, which 
is its position in the sorted list. Then sort the northeast and 
southwest points separately, and record for each point its 
local_rank, which is its position in its own sorted list. The index 
of a processor which contains one of the desired points on the 
other side is then obtained as described in Section II. The other 
three points are in processors immediately adjacent to this. 
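

The diagonal check can be illustrated with a small sequential sketch. It is
not the MCC implementation (there are no global_rank or local_rank
computations and no processor indexing); it only shows the filtering of
points to a band of width δ around the dividing line and the comparison of
each northeast point with the few southwest points at a nearby height. The
function name, the point representation as (x, y) tuples, and the parameter
names are assumptions made for this illustration.

```python
import math

def diagonal_separation(ne_points_left, sw_points_right, x_div, delta):
    """Return min(delta, smallest NE-in-L to SW-in-R distance).

    Sequential illustration only; all names are hypothetical.
    """
    # Keep only points lying within delta of the dividing line x = x_div,
    # sorted by y so a sliding window over the right points can be used.
    left = sorted((p for p in ne_points_left if x_div - p[0] < delta),
                  key=lambda p: p[1])
    right = sorted((p for p in sw_points_right if p[0] - x_div < delta),
                   key=lambda p: p[1])

    best = delta
    start = 0
    for lx, ly in left:
        # Right points more than delta below this point can never match
        # this or any later (higher) left point, so skip them for good.
        while start < len(right) and right[start][1] < ly - delta:
            start += 1
        k = start
        # Examine right points whose height is within delta of ly; by the
        # sparsity argument in the text this window holds O(1) points.
        while k < len(right) and right[k][1] < ly + delta:
            rx, ry = right[k]
            best = min(best, math.hypot(rx - lx, ry - ly))
            k += 1
    return best
```

For example, with northeast points (4, 0) and (3, 10) in L, southwest points
(6, 1) and (7, 20) in R, a dividing line at x = 5, and δ = 5, only the pair
(4, 0), (6, 1) falls inside a common window.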


Find the minimum over all the diagonal distances computed
above and set it equal to δ_d. Finally set δ_4 = Min[δ_3, δ_d].


Each of the three merging steps can be implemented in
O(√k) time, where k is the number of segments being merged.
The overall time complexity of the algorithm is therefore O(√n).


Figure 6

VI. Summary 


We have presented MCC algorithms for several rectangle
intersection problems. Given a set of n iso-oriented rectangles,
we presented algorithms to determine the area of the logic "OR"
and the logic "AND" of these rectangles. We then presented an
algorithm for determining the maximum overlap for the set of
rectangles, which also provides a solution for the fixed-size rectangle
placement problem. Finally, we described an algorithm for computing
the minimum separation of a set of non-overlapping iso-oriented
rectangles. All the algorithms require an optimal O(√n)
time on a √n x √n MCC with constant storage per processor.


In conclusion, we note that there are several other rectangle
intersection problems similar to those considered in this paper
(e.g. determining for each rectangle the number of rectangles that
intersect it) that can be efficiently solved on an MCC using the
"strip refinement" strategy. One interesting problem needing
research is to determine an efficient means of reporting all pairs of
intersecting rectangles using a Mesh-Connected Computer.
Abstract -- An architecture called m²-mesh is proposed here. This
architecture augments mesh structure with shared memories and
multiplexed multiple buses. The approach to the design of this
architecture is presented. The routing capability of this architecture
is discussed and its performance is analyzed. Routing algorithms for
two point communication, performing permutation, and broadcasting are
proposed. It is shown that any permutation can be performed on
m²-mesh in 2N^{1/2} steps, as compared to 3(N^{1/2}-1) steps
for mesh structure. Communication between any two
processors, row broadcasting, and column broadcasting can
all be done in 2 steps on this architecture. We also
compare the performance of m²-mesh with mesh, mesh
with single broadcasting, mesh with multiple
broadcasting and tree structures. The applications of
m²-mesh to problems such as semi-group computations,
algorithms with nested loops, and linear algebra problems are
discussed.


1. Introduction 


A parallel computer achieves speedup over a
uniprocessor system by executing several tasks on
different processors simultaneously. The degree of
speedup depends primarily on the parallelism of the
program and the performance of the parallel computer.
The communications between different problem tasks are
performed by the interconnection network of the system.
If the communication requirements of a problem
"match" the system's interconnection structure, then the
problem can be solved quickly. Otherwise, extra time
has to be spent on data routing. Thus, the
performance of a parallel computer often depends primarily on
its data routing capability. Several approaches can be
taken to increase the performance of parallel computers.
The most common are to design a system with a powerful
interconnection network [19, 2], to design efficient
parallel memories [9], and to design effective parallel
algorithms for a particular architecture [21, 15].
Recently researchers proposed the use of small size
systems to balance computation and communication
[10, 12], or finding an efficient way to map algorithms to
computers [1, 6, 11].


One of the most frequently mentioned parallel
computers is the mesh-connected computer (MCC). An
MCC has the advantages of simple topology and good
scalability. Because of this, many results on the
application of MCC have been reported, and among
these are: sorting [22, 15], image processing [18], and
graph algorithms [16]. However, previous research
showed that the communication time in MCC tends to
dominate the total execution time. For example, the
multiplication of two N^{1/2} by N^{1/2} matrices takes
O(N^{1/2}) routing steps when executed on an N^{1/2} x N^{1/2}
MCC [3]. Regarding the routing capability of MCC, it's
shown that in general 3(N^{1/2}-1) steps are required to
perform any permutation [17]. Therefore, for general
application, mesh connected computers suffer from the
limitation of their routing capability.


In this paper, we propose an architecture based on
mesh structure, called multiplexed-multiple-bus mesh, or
simply m²-mesh. The system consists of both local and
shared memory modules. We will show that by using
this structure, many algorithms can be executed
efficiently. The performance of this architecture is
compared with mesh, mesh with broadcasting, and tree
structures. It can be seen that this structure performs
more efficiently for a broader range of applications than
MCC, with reasonable additional cost.


2. m²-mesh Architecture


2.1. Comparison of two models of SIMD
architectures


A mesh connected computer with N processors (N is
a perfect square) is shown in figure 1.


Figure 1: A mesh connected computer. 


The processors are numbered in row major ordering.
PE(i) is connected directly to PE(i±1) and PE(i±N^{1/2}),
except for the boundary PEs. The near neighbor
interconnection is a simple but useful topology.
Algorithms requiring frequent interactions between
problem tasks can be mapped to this architecture
naturally. The other feature of MCC is its good
scalability, i.e., it's easy to expand from small size MCC
to large size. However, for problems requiring global
communication, mesh interconnection does not perform
well. Considering the data routing time only, a
permutation takes 3(N^{1/2}-1) steps on mesh.
Broadcasting from one PE to any other PE takes
2(N^{1/2}-1) steps. In general, since the diameter of mesh is
2N^{1/2}, normally O(N^{1/2}) routing steps are required for
one routing function. So, if an algorithm takes O(f(N))
computation steps, and one routing function has to be
performed between two computation steps, the worst
case execution time could be O(N^{1/2} f(N)), instead of
O(f(N)) (because O(N^{1/2}) routing steps are required for
each of the O(f(N)) data routing functions). Thus, the
performance of an MCC could be greatly impaired by its
communication capability. One way to overcome this
problem without changing the system architecture is
using mapping techniques. In [11] it was shown that any
permutation can be performed on an MCC in at most four
steps if the data are loaded properly. Thus, any
algorithm which requires only a single type of
permutation can always be achieved in constant data
routing time. However, if the permutation requirement
of the algorithm changes from time to time during
execution, this mapping technique may not perform well.


The other way of overcoming the communication
limitation of an MCC is to reconsider the original
problem and to devise a better algorithm for the MCC.
For example, the bitonic sorting algorithm on an MCC
designed in [15] is achieved by this approach. This
approach involves the design of parallel algorithms,
which apparently is not an easy task for many users.
Also, it suffers from lack of universality, because every
algorithm has to be considered separately.


Our approach in this paper is to enhance the 
performance of MCC with reasonable additional cost. 
The idea of this newly proposed m²-mesh architecture
comes from the following observations. A mesh 
connected computer is really a multiprocessor system 
with local memory scheme. That is, every processor has 
its own private memory. Any two processors wishing to 
exchange information with each other have to use the 
mesh interconnection. This type of architecture can be 
modelled by the architecture shown in figure 2. 


Figure 2: An SIMD machine with local memory scheme


However, since the longest path between any two
processors is 2N^{1/2}, long communication delay is
inevitable. This problem is basically caused by the fact
that a local memory scheme is used. Should two
far-separated processors share the same memory module,
then the communication time will not depend on their
physical location. Since mesh structure has its merits,
our strategy here is to include both local memory and
shared memories. First consider the array processor
model in figure 3 as used in [9]. In this architecture,
both local memory and shared memory schemes are used.
Moreover, the interconnection between processors
is separated from the interconnection between memories.


There are four interconnection networks in this
architecture, each of which may have a different topology.


Figure 3: An SIMD with both local memory and shared 
memory system. 

Every PEM contains a processor with its own local
memory. The global memory modules M are shared by
all processors. Thus interprocessor communication can
be done either through processor-processor
interconnection (PP), or through the shared memory.
Compared to the model shown in figure 2, this scheme is
more powerful.


2.2. The m²-mesh architecture


The idea in the last section is adopted in our proposed
architecture, which is shown in figure 4.


The system consists of N processors and N shared
memory modules. Processors are interconnected by an
N^{1/2} by N^{1/2} mesh, with local memory attached to each
processor. Processors are identified by their two
dimensional coordinates. Shared memory modules are
numbered similarly. M(i_1,j_1) and M(i_2,j_2) share a row
bus if i_1 = i_2, and similarly, M(i_1,j_1), M(i_2,j_2), and PE(i_3,j_3)
share a column bus if j_1 = j_2 = j_3. In terms of the model
from figure 3, our architecture uses mesh for
processor-processor communication, and processor-memory and
memory-memory communications are done through
buses.


The memories used here have special organization.
Each memory module is divided into N^{1/2} banks. The
kth bank of M(i,j) is denoted as M_k(i,j). Memory
accesses work in the following fashion:


1. PE(i,j) can read every bank of every memory
module in the jth column, i.e., PE(i,j) can
read M_k(l,j) ∀k,l, 0 ≤ k,l ≤ N^{1/2}-1.


2. PE(i,j) can only write to M_i(j,j), the ith bank
of M(j,j).


3. When a data item is written to M_i(j,j) by
PE(i,j), it's written through to all the ith banks
of the memories in the same row, i.e., M_i(j,k)
∀k, 0 ≤ k ≤ N^{1/2}-1.


Since PE(i,j) can only write data to M_i(j,j), only one
of the processors in the same column can write at a
particular time. From this point of view, the processors
in the same column are time multiplexed on the column
bus. However, processors in different columns can write
to shared memory simultaneously. This means that the
multiple column buses can be used concurrently. So,
this architecture is called m²-mesh. The write through
scheme allows data to be transferred between processors
efficiently, as will be explained in the next section.
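

The three access rules can be modelled in a few lines. The sketch below is
only an illustration under stated assumptions: the class name M2Mesh, the
method names, and the nested-list layout mem[row][col][bank] are all
hypothetical, and bus timing and contention are not modelled. It shows why
a value written by PE(i,j) becomes readable by any other processor after a
single write followed by a single read.

```python
class M2Mesh:
    """Toy model of the shared memories of a side x side m2-mesh.

    mem[r][c][k] stands for bank k of shared module M(r, c).
    All names here are illustrative assumptions.
    """

    def __init__(self, side):
        self.side = side
        self.mem = [[[None] * side for _ in range(side)]
                    for _ in range(side)]

    def write(self, i, j, value):
        # Rule 2: PE(i, j) may only write bank i of M(j, j).
        # Rule 3: the write goes through to bank i of every M(j, k)
        # on the same row bus.
        for k in range(self.side):
            self.mem[j][k][i] = value

    def read(self, i, j, bank, row):
        # Rule 1: PE(i, j) may read any bank of any module M(row, j)
        # in its own column j. The reader's row i does not restrict
        # which bank or module of that column it reads.
        return self.mem[row][j][bank]


mesh = M2Mesh(4)
mesh.write(1, 2, "x")                      # PE(1,2) writes bank 1 of M(2,2)
received = mesh.read(3, 0, bank=1, row=2)  # PE(3,0) reads bank 1 of M(2,0)
```

In this model `received` holds the value written by PE(1,2), so the
transfer costs one write step and one read step regardless of the distance
between the two processors.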


3. Routing strategy and routing scheme 


Since m²-mesh contains mesh structure as a subset,
the advantage of mesh is kept. Our routing strategy is
that if communication is between near-neighbor
processors, local communication through mesh is
exploited. For long distance routing, buses and shared
memory are used. This can be done by the
implementation of two routing instructions: G(i,j) and
L(i,j), where G(i,j) is global communication, and L(i,j) is
local communication. In the following, we will show that
many data routing functions which mesh cannot perform
well can be achieved by m²-mesh efficiently.

1. Two point communication: One way to evaluate
the routing capability of an interconnection network is
the communication time between any two processors.
The communication from PE(i_1,j_1) to PE(i_2,j_2) requires
|i_1-i_2| + |j_1-j_2| steps on mesh, which is 2(N^{1/2}-1) in the
worst case. By using m²-mesh, this can always be
achieved in two steps. First, PE(i_1,j_1) writes to
M_{i_1}(j_1,j_1); then PE(i_2,j_2) reads back from M_{i_1}(j_1,j_2).


2. Column broadcasting: PE(i,j) sends data to
PE(*,j)'s. This can be done by first writing data by
PE(i,j) to M_i(j,j), then by activating all PE's in the
column, all the PE's in the same column read it back
simultaneously. Two steps are sufficient.


3. Row broadcasting: PE(i,j) writes data to PE(i,*)'s.
This can be done in two steps: first, PE(i,j) writes to
M_i(j,j), then every PE(i,k), 0 ≤ k ≤ N^{1/2}-1, reads data
from M_i(j,k).


4. Permutation: Any permutation can be done in
2N^{1/2} steps. Assume permutation P maps PE(i,j) to
PE(p,q). We use the notation p = r(P(i,j)), q = c(P(i,j)),
where r and c indicate the row and the column number
of a PE respectively.
Similarly, i = r(P^{-1}(p,q)), and j = c(P^{-1}(p,q)). The
routing algorithm for permutation P is as follows:


Routing algorithm for permutations


/* Permutation algorithm for P : PE(i,j) --> PE(p,q) */
for n = 1 to N^{1/2}
    cobegin   /* row major loop */
        PE(n,i) -> M_n(i,i)    0 ≤ i ≤ N^{1/2}-1
    coend

for n = 1 to N^{1/2}
    cobegin   /* column major loop */
        /* j = r(P^{-1}(n,i)), k = c(P^{-1}(n,i)) */
        PE(n,i) <- M_j(k,i)    0 ≤ i ≤ N^{1/2}-1
    coend
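

The two loops of the routing algorithm can be simulated sequentially to
check the step count. The sketch below is an illustration, not the
authors' implementation: the function name, the dictionary encoding of the
permutation, the 0-based indices, and the memory layout mem[row][col][bank]
are assumptions. Each iteration of an outer loop stands for one routing
step in which one row of processors is active on all column buses at once.

```python
def route_permutation(data, perm):
    """Simulate permutation routing on a side x side m2-mesh.

    data[i][j] is the item held by PE(i, j); perm maps (i, j) -> (p, q).
    Returns (result, steps) with result[p][q] the item delivered to PE(p, q).
    """
    side = len(data)
    # mem[r][c][k] stands for bank k of shared module M(r, c).
    mem = [[[None] * side for _ in range(side)] for _ in range(side)]
    steps = 0

    # Phase 1: rows write one after another. In step n every PE(n, i)
    # writes bank n of M(i, i); write-through copies it to bank n of
    # every M(i, k). Different i use different column buses.
    for n in range(side):
        for i in range(side):
            for k in range(side):
                mem[i][k][n] = data[n][i]
        steps += 1

    inverse = {dst: src for src, dst in perm.items()}
    result = [[None] * side for _ in range(side)]

    # Phase 2: rows read one after another. PE(n, i) looks up its
    # source PE(j, k) and reads bank j of M(k, i) in its own column i.
    for n in range(side):
        for i in range(side):
            j, k = inverse[(n, i)]
            result[n][i] = mem[k][i][j]
        steps += 1

    return result, steps


side = 3
data = [[(i, j) for j in range(side)] for i in range(side)]
transpose = {(i, j): (j, i) for i in range(side) for j in range(side)}
routed, steps = route_permutation(data, transpose)
```

For the transpose permutation on a 3 by 3 array, `routed[p][q]` equals
`(q, p)` and `steps` is 6, i.e. 2N^{1/2} with N = 9. The routine never
inspects the structure of the permutation, only its inverse lookup.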


Notice that if more than one column broadcasting
occurs concurrently, they can be done simultaneously.


The situation for concurrent row broadcasting is
different. A column bus contention will happen if the
sources (or destinations) of some row broadcastings are in
the same column. The situation is called row
broadcasting conflict. For example, in figure 5 (a) row
broadcasting conflict occurs, because to send data from
PE(0,0) to PE(0,2) through the global communication path,
PE(0,0) has to write to M(0,0), which will conflict with
PE(2,0)'s writing to M(0,0). In general, these
broadcastings should be done sequentially. However,
there are two ways to get around it. The first way is to
skew the source (in case of source conflict) one step to
the right (or left) through local link if possible, then the
row broadcastings can be done concurrently (see figure 5
(b)). However, if the conflict involves too many sources,
then this method is not efficient. The second way is to
map the problem tasks in a transposed manner, then
every row broadcasting changes to column broadcasting,
and no conflicts will occur (see figure 5 (c)). This
method is particularly suitable to algorithms in which
only row broadcasting conflict exists. We will show the
usefulness of this technique later.


Figure 5: Row broadcasting conflict.


4. Performance analysis 


The performance of m²-mesh is studied by
comparing it to other mesh related networks.


4.1. m²-mesh vs. mesh


For problems requiring extensive information
transfers, m²-mesh is not necessarily better than mesh.
For example, Ω(N^{1/2}) time is required to solve sorting
problems on MCC, with or without
broadcasting [20, 8]. However, in general for ordinary
algorithms m²-mesh performs much better than mesh, as
can be seen below.


1. Communication between any two processors: This
requires 2(N^{1/2}-1) steps on mesh, but 2 steps on
m²-mesh.


2. Broadcasting: On mesh, this requires O(N^{1/2}) steps, but
on m²-mesh, 2 steps are sufficient.


3. Permutation: m²-mesh can perform permutation
more easily and faster. The lower bound for any permutation
on mesh is 3(N^{1/2}-1) steps. On m²-mesh, any
permutation can be done in 2N^{1/2} steps. Moreover, the
system overhead is different. For mesh, performing
different permutations may require different routing
algorithms. Besides, in each routing step, a masking
function has to be computed, because some processors
may need to transfer data and some may not. Therefore the
overhead is very large. In the case of m²-mesh, our
routing algorithm proposed in the last section is
universal, i.e., independent of permutations. Also,
system overhead is reduced significantly, because during
routing, processors are activated row by row regularly.
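

The step counts quoted in this comparison can be tabulated directly. The
short sketch below only evaluates the formulas stated in the text,
2(N^{1/2}-1) and 3(N^{1/2}-1) for mesh against 2 and 2N^{1/2} for m²-mesh;
the function names and the dictionary keys are assumptions made for the
illustration, and the mesh broadcasting entry uses the 2(N^{1/2}-1) figure
given in section 2.1.

```python
import math

def _side(n_processors):
    # N must be a perfect square, as assumed throughout the paper.
    side = math.isqrt(n_processors)
    if side * side != n_processors:
        raise ValueError("N must be a perfect square")
    return side

def mesh_steps(n_processors):
    """Routing steps on a plain mesh, using the formulas in the text."""
    s = _side(n_processors)
    return {"two_point": 2 * (s - 1),
            "broadcast": 2 * (s - 1),
            "permutation": 3 * (s - 1)}

def m2_mesh_steps(n_processors):
    """Routing steps on an m2-mesh, using the formulas in the text."""
    s = _side(n_processors)
    return {"two_point": 2, "broadcast": 2, "permutation": 2 * s}
```

With N = 64 the permutation counts are 21 for mesh and 16 for m²-mesh.
Since 3(N^{1/2}-1) > 2N^{1/2} only when N^{1/2} > 3, the permutation
advantage in raw step count appears for arrays larger than 3 by 3; the
overhead argument above is separate from this count.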


4.2. m²-mesh vs. mesh with single broadcasting


The mesh with single broadcasting interconnection is
proposed in [20]. It's easy to see that m²-mesh can
simulate mesh with single broadcasting. So we have the
following lemma.


Lemma 1: Any algorithm taking O(n) time on mesh
with single broadcasting can be executed on m²-mesh in
O(n) time.


For some algorithms, m²-mesh performs better.
This can be seen in the next section.


4.3. m²-mesh vs. mesh with multiple broadcasting


The comparison between the two architectures is
done in two aspects.


1. Permutation: Mesh with broadcasting structure
doesn't perform better than mesh in terms of
permutations. So, in general 3(N^{1/2}-1) steps are required
to perform permutation on mesh with multiple
broadcasting. Also, the control overhead is large and the
routing algorithms are complex. For m²-mesh, 2N^{1/2}
steps are required. Also, the overhead is low.


2. Broadcasting: For row broadcasting, column
broadcasting, or one to all broadcasting, both networks
perform equally well. When row broadcasting conflict
happens, m²-mesh is worse. In the worst case, if row
broadcasting conflict involves N^{1/2} rows, then m²-mesh
could be N^{1/2} steps slower. However, by using the
technique mentioned in section 3, this can be prevented
in some cases.


4.4. m²-mesh vs. tree


We can use row broadcasting or column
broadcasting to simulate tree operations. Therefore,
using the same argument as in [8], we have the following
lemma:


Lemma 2: If n data items are distributed in n
different rows with one in each row, then any non-trivial
semi-group computation of n data items can be
performed in O(log n) steps.


5. Applications 


The m²-mesh is an architecture with a variety of
data routing capabilities, so its potential is promising.
For algorithms requiring lots of information flow (such as
sorting), m²-mesh can perform at least as well as mesh.
So any parallel algorithm designed for mesh can be
adopted by m²-mesh. However, if global
communications are required, m²-mesh is a much better
architecture. In the following, we will discuss some
application examples. Other applications can be explored
by using the technique described here.


5.1. Semi-group computations


This type of computation includes finding the
maximum, minimum, or sum of N data items, etc. We
have the following theorem for these problems.


Theorem 1: A semi-group operation of N data
items can be performed in O(N^{1/6}) time in an m²-mesh.


Proof: An algorithm for semi-group operations
taking O(N^{1/6}) time in 2-MCC with multiple broadcasting was
proposed in [8]. Our proof exploits the same algorithm,
except that we have to prove that every step of that
algorithm can be simulated here. We will not list the
original algorithm here. Based on that algorithm, the
following two arguments are sufficient to prove this
theorem.


(1) The operation of finding the Max of a block of
data within N^{1/6} x N^{1/6} PE's using local communication
can be executed on m²-mesh in exactly the same way as
in a mesh with multiple broadcasting architecture.


(2) The algorithm exploits concurrent row
broadcasting operations, but no concurrent column
broadcasting. Therefore we can transpose the original
problem tasks so that no row broadcasting conflict will
occur. For example, the following communications are
required in that algorithm:

    [diagram: row broadcasting]

By transposing the assignment of problem tasks, the
communications are as follows:

    [diagram: column broadcasting]

This way the row broadcasting conflict can be prevented.
So from that algorithm, O(N^{1/6}) time is sufficient.


5.2. Algorithms with nested loops


Algorithms with nested loops cover a broad range of
algorithms in signal processing, linear algebra, graph
theory, and others. Examples of this type of algorithm
include the Finite Impulse Response (FIR) filtering
algorithm, matrix multiplication, LU decomposition,
transitive closure algorithm, etc. The mapping of this
type of algorithm to systolic arrays and array processors
has drawn a lot of attention recently [7, 14, 13, 5]. One
of the most effective ways to map this type of algorithm
is via algorithm transformation [14, 13]. In [10], the
algorithm transformation technique is applied to
mapping algorithms to array processors with mesh
interconnection. One important conclusion there is that
if a given array processor has broadcasting capability,
then sometimes algorithms can be mapped to the
architecture with shorter execution time. For example,
consider an algorithm with dependence matrix D. The
way to map the algorithm to mesh architecture is by finding
a transformation T such that the transformed algorithm
has a shorter execution time. The transformation T
transforms D to D' (= TD). One of the constraints in
solving T is that every element in the first row of D'
should be greater than zero. However, if the system has
broadcasting capability, this constraint can be released.
(For details of the reason for this argument, please see [10].)
Besides, the criteria for considering the tradeoff among
computation time, communication time and local
memory size of each processor will be different if
broadcasting capability is provided by the architecture.
Therefore, m²-mesh can be used to solve algorithms
with nested loops more efficiently than by using mesh
architecture.


Example 1: Convolution computation. Given two
sequences h(k) and x(k), k = 1, 2, ..., n, the convolution
of h and x is defined as

    y(j) = Σ_{k=1}^{n} h(k) x(j-k),    j = 1, ..., n.

This algorithm can be implemented by the following
Fortran-like program:

       DO 10 j = 1,n
       DO 10 k = 1,n
         y(j) = h(k) * x(j-k) + y(j)
    10 CONTINUE


The dependence matrix for this algorithm is

    D = | 0 1 1 |
        | 1 0 1 |

Without using broadcasting, we found that the
transformation

    T_1 = | 1 1 |
          | 0 1 |

is optimal, and the transformed matrix D_1 is

    D_1 = | 1 1 2 |
          | 1 0 1 |

The number of computation steps M_1 and the
number of data routing steps R_1 using this
transformation are

    M_1 = 2n
    R_1 = 2n-1

respectively.


Now, consider another transformation

    T_2 = | 0 1 |
          | 1 0 |

Using this transformation, the transformed dependence
matrix is

    D_2 = | 1 0 1 |
          | 0 1 1 |


The number of computation steps and data routing
steps are

    M_2 = n
    R_2 = n-1

respectively.


Therefore, this transformation results in a faster
algorithm. However, from D_2 it can be seen that the
value of the h vector must be broadcast during
computation. Therefore, this is not a valid
transformation for mesh without broadcasting capability,
but is a good one for the m²-mesh.
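

The validity test on the first row of D' = TD is easy to check numerically.
The sketch below assumes the dependence matrix D = [[0, 1, 1], [1, 0, 1]]
(one column each for y, h and x in the index order (j, k)) and the two
transformations T_1 = [[1, 1], [0, 1]] and T_2 = [[0, 1], [1, 0]]. These
values are assumptions chosen to be consistent with the step counts and
the transformed matrix quoted in the text, and the helper names are
hypothetical.

```python
def transform(t, d):
    """Return the matrix product T * D for lists of rows."""
    return [[sum(t[r][m] * d[m][c] for m in range(len(d)))
             for c in range(len(d[0]))]
            for r in range(len(t))]

def needs_broadcast(d_prime):
    """True if some entry of the first (time) row is not positive.

    A zero entry means that variable must reach several index points
    in the same time step, which requires broadcasting.
    """
    return any(entry <= 0 for entry in d_prime[0])

D = [[0, 1, 1],
     [1, 0, 1]]           # assumed dependence matrix, columns y, h, x
T1 = [[1, 1], [0, 1]]     # assumed transformation without broadcasting
T2 = [[0, 1], [1, 0]]     # assumed transformation using broadcasting

D1 = transform(T1, D)     # [[1, 1, 2], [1, 0, 1]]
D2 = transform(T2, D)     # [[1, 0, 1], [0, 1, 1]]
```

Under these assumptions `needs_broadcast(D1)` is false and
`needs_broadcast(D2)` is true; the zero in the first row of D2 sits in the
h column, which matches the statement that h must be broadcast.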


5.3. Parallel linear algebra algorithms 


From the analysis in previous sections, there are
mainly two things at which m²-mesh performs better than
mesh architecture:


1. The communication between any two
processors on m²-mesh requires only two
steps. However, it could be as bad as
2(N^{1/2}-1) steps on mesh.


2. m²-mesh has broadcasting capability.


In the following, we will show that these two
characteristics match directly with the requirements of
many parallel linear algebra algorithms.


Example 2: Gauss-Jordan algorithm for computing
a solution X of AX = B. Assume A is N by N, and B is
N by M; this algorithm can be described as follows:


    for j = 1 step 1 until N do
        row i <-- row i - (a_ij/a_jj) row j, (1 ≤ i ≤ N, i ≠ j);

    x_ij <-- a_ij/a_ii, (1 ≤ i ≤ N, N+1 ≤ j ≤ N+M).


In the jth step of the algorithm, row j is broadcast to
all other rows. This can be done in m²-mesh in two
routing steps only. Then all rows can perform
computations simultaneously. Therefore this algorithm
can be executed on m²-mesh efficiently.
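

A sequential rendering of this elimination makes the broadcast pattern
visible. The sketch below is a plain-Python illustration, assuming nonzero
pivots (no pivoting is performed) and storing the solution with a column
index shifted by N; the function name and list-of-rows representation are
assumptions. The copy of row j taken at the start of each step plays the
role of the row that would be broadcast to all other rows.

```python
def gauss_jordan(a, b):
    """Solve A X = B by Gauss-Jordan elimination without pivoting.

    a is an N x N list of rows, b is an N x M list of rows.
    Assumes every pivot a[j][j] is nonzero when it is used.
    """
    n, m = len(a), len(b[0])
    # Work on the augmented matrix [A | B].
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]]
           for i in range(n)]

    for j in range(n):
        pivot_row = aug[j][:]          # row j, "broadcast" to all rows
        for i in range(n):             # all rows i != j update in parallel
            if i != j:
                factor = aug[i][j] / pivot_row[j]
                aug[i] = [x - factor * p for x, p in zip(aug[i], pivot_row)]

    # A is now diagonal; divide the right-hand columns by the diagonal.
    return [[aug[i][n + c] / aug[i][i] for c in range(m)] for i in range(n)]


solution = gauss_jordan([[2, 1], [1, 3]], [[3], [5]])
```

For this 2 by 2 system (2x + y = 3, x + 3y = 5) the routine returns
approximately [[0.8], [1.4]].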


Example 3: C-K algorithm for triangular system
problem [4]. This algorithm, proposed by Chen and
Kuck, can be described as follows:


    row i = (row i)/a_ii,  1 ≤ i ≤ N
    for j = 1 step 1 until N-1 do
        row i = row i - Σ_{k=j}^{2j-1} a_{i,i-k} row (i-k),  j+1 ≤ i ≤ N
        [second statement illegible in the source],  j+1 ≤ i ≤ N


Assume the elements in one row of the matrix are
mapped to one row of processors. When j = j_1, the
processors at the ith row need information from
processors at row i-j_1, row i-j_1-1, up to row i-2j_1+1.
Hence, the communication cost is O(j_1). However, on
m²-mesh, this can be achieved in 2 steps by using
broadcasting. Overall, the communication cost of this
algorithm on mesh is


    1 + 2 + ... + (N-1) = O(N^2)


while on m²-mesh, only O(N) steps are required.


6. Conclusion 


In this paper we proposed a multiplexed-multiple-bus
mesh parallel computer system. The system not only
retains most of the useful characteristics of mesh
connected computers, but also augments them with
efficient broadcasting and permutation capabilities. This
architecture doesn't increase too much the hardware of
each processor. The only significant cost is that N
shared memory modules are required. However, with
today's VLSI technology memories are rather
inexpensive. We have shown that m²-mesh has a much
more powerful routing capability than mesh. We also
showed that a wide range of applications are possible.
More applications can be explored by using the
architecture characteristics of m²-mesh and the idea of
the mapping technique discussed in this paper.
Abstract -- We present a simple analytic model for the blocking
probability of circuit switched multistage interconnection networks
(MINs) when the set of requests submitted during each cycle
is a randomly selected permutation. This model provides a
quantitative measure of a network's permutation capability. We present
an analytic model for the number of conflict free permutations
realizable in a multipath network, and we present an analytic
model for the expected number of cycles required to realize an
arbitrary permutation in a network (i.e., the universality of a
network). These models and techniques solve many interesting
theoretical problems in the realm of interconnection network
modelling.


However, these models also have practical aspects; they give 
a quantitative measure of the connectivity of a network, and they 
give an approximate figure of merit of a network’s ability to sup- 
ply data vectors to array processors in SIMD array processor archi- 
tectures. 


1. Introduction


The permutation capability of a multistage interconnection
network (MIN) can be loosely defined as the blocking probability
of the network when the requests submitted during each cycle
form a randomly selected permutation. The rearrangeability of a
MIN can be defined as the ability to realize an arbitrary
permutation in one cycle [4]. The universality of a MIN can be loosely
defined as its ability to realize an arbitrary permutation in
multiple network passes (where each pass requires one cycle) [21].


Non-blocking (rearrangeable) switching networks have been
the subject of extensive theoretical research. Of the known
rearrangeable networks, large crossbars are far too expensive, and the
various MINs that meet this criterion ([4][14][29]) require
precomputed global routing for each individual permutation. The
necessity for precomputed global routing makes these networks
impractical, especially if the permutations are not known at compile
time.


Many suboptimal (nonrearrangeable) MINs with simple dis- 
tributed routing algorithms, such as banyan networks [8] and the 
Augmented Data Manipulator (ADM) [1], have been analyzed for 
aspects of their permutation capabilities or universalities. How- 
ever, to date “there does not exist any satisfactory technique 
which compares and quantifies permutability of ADM, Banyan and 
other MINs” [2]. 


An analytic model that yields the blocking probability of a 
banyan network when the request pattern submitted during each 
cycle is a randomly selected permutation (and when the usual 
decentralized routing algorithm for banyan networks is used) has 
been presented in [27]. We use the techniques presented in [27] to 
develop an alternate analytic model, which will be used 
throughout this paper. The development of this alternate analytic 
model illustrates the generality of the technique described in [27]. 


Using this analytic model, we derive an accurate approximation
for the number of conflict-free permutations realizable by a
multipath MIN (i.e., the number of permutations realizable in one
pass). Previously, no general techniques have been proposed for
estimating the number of conflict free permutations in multipath
MINs. Lower and upper bounds on the number of conflict free
permutations in the ADM have been presented in [1], and a complex
analysis for the exact number of conflict free permutations in
the ADM has been presented in [15][16]. Our technique is simple
and general, and can be used to obtain models for arbitrary
networks such as the ADM and the many variations of banyan
networks, using various decentralized routing algorithms.


We present an accurate analytic model for the expected 
number of network passes required to realize an arbitrary permuta- 
tion, i.e., the universality of a MIN, when simple distributed rout- 
ing algorithms are used. Upper bounds for the expected number of 
network cycles required to realize an arbitrary permutation in cer- 
tain banyan networks (when precomputed global routing is used) 
have been presented [2][28]. It has been shown that by using
precomputed global routing, the omega network [12] can realize an 
arbitrary permutation in 3 passes [21][25]. However, the fastest 
known algorithm for the routing requires O (log*,N ) time on an 
array processor [25]. A simpler routing algorithm that results in a
universality of $6\log_2 N - 1$ passes for a variation of the omega
network has been presented in [25]. Our analysis is fundamentally
different from these in that we assume a simple distributed routing
algorithm is used; in the first pass, all requests in the permutation
are submitted to the network. Requests that are not established in
one pass must be resubmitted in the next pass, etc. The modelling
technique we use is very general, and can be used to create models
for the universality of other MINs using various decentralized
routing algorithms.


While these models and techniques solve many interesting 
theoretical problems, they also have practical aspects. The 
number of simultaneous, conflict free connections possible in a net- 
work (when the requests form a permutation) is a quantitative 
measure of the connectivity of a network. “The connectivity of a 
MIN is critical with respect to the overall performance of a large 
parallel system” [2]. 


The universality of a network has often been used as an 
approximate figure of merit for the suitability of the network in 
Single Instruction Multiple Data (SIMD) array processor architec- 
tures. These machines operate in synchronization on numeric 
applications that can be “vectorized”. When the processor array
accesses a vector in unison, the resulting request distribution is
(usually) a permutation. “One of the most significant factors in
determining the performance of an array processor is the system’s
capability for providing data vectors at a rate matched to the
processor rate” [12]. The system’s capability for providing data
vectors is closely related to its universality (or indirectly to its
permutation capability), and hence “the permutation capability of a
network is extremely important for the efficient operation of a
supersystem” [2].

Section 2 includes the necessary definitions. Section 3 
presents a simple analytic model for the blocking probability of 
MINs when the set of requests submitted during each cycle are a 
randomly selected permutation. Section 4 presents an analytic 
approximation for the expected number of conflict-free permuta- 
tions realizable by a multipath MIN (in one cycle). Section 5 
presents an analytic approximation for the expected number of 
network cycles required to realize an arbitrary permutation, i.e., 
the universality of a MIN. Section 6 summarizes the paper and
contains some concluding remarks.


Fig. 1: a) an interconnection network connecting N
processors and M memories. b) a $2^3 \times 2^3$ banyan
network connecting 8 processors and 8 memories.


2. Definitions 


In a multiprocessor system, an interconnection network is
used to connect N processors to M memories, as shown in fig. 1a.
The crossbar network is non-blocking for all permutations, and
hence is a candidate for array processor systems, but it is far too
expensive for large N. Unique path multistage interconnection
networks (unipath MINs or banyan networks [8]) provide a reasonable
alternative to large crossbars. A square $k^n \times k^n$ banyan
network of size N consists of $\log_k N$ stages, where each stage
consists of N/k crossbar switches of size $k \times k$, as shown in
fig. 1b. Banyan networks have a number of desirable features; their
cost grows only moderately as the network size increases and they
use very simple, distributed routing algorithms [8].

Two fundamental criteria in the selection of an interconnection
network are fault tolerance and performance. Banyan networks
can be extended in many ways to significantly improve their
fault tolerance, and such improvements may result in performance
improvements. Since any practical multiprocessor system will
almost certainly require a fault tolerant interconnection network,
it is important to have analytic models that are general, and that
can be used on these extended networks, as well as on the unipath
networks.


2.1. Fault Tolerance 


Banyan networks can be extended in many ways to 
significantly improve their fault tolerance. Assuming link failures 
are the predominant failure mechanism [19], each link in a net- 
work can be replaced by 2 links, as shown in fig. 2a. When each 
link is replaced by d links, we call the resulting network a d- 
dilated network [10]. If total switch failures are a predominant 
failure mechanism, then the d links in a dilated network that 
replace one link in the unipath network can actually be distributed 
among a number of the switches in the next stage [23]. In a 2- 
augmented network, one link leads to the same switch as it would 
in a unipath network, and the other link leads to a functionally 
equivalent switch. These networks are called augmented networks 
[23]. A 2-augmented delta network is shown in fig. 2b. The solid 
lines represent links leading to the same successor as they normally 
would, and the dotted lines represent links leading to equivalent 
successors. (The regular distributed routing algorithm is slightly
modified; see [23]). Throughout this paper, we do not distinguish
between dilated and augmented networks (their performances are
similar [23,26]); we denote either of these multipath networks as
p-path MINs, where p is the path multiplicity.


Banyan networks can also be replicated to improve their 
fault tolerance. We consider only dilated (or augmented) networks 
in this paper; see [26] for an analysis of replicated networks. 


2.2. Performance Under Random Distributions 


We now define the operating environment of MINs con- 
sidered in this paper. We assume that the network is synchro- 
nously circuit switched; a system clock defines a cycle in the net- 
work; during the beginning of any cycle, a processor issues a 
request with probability u (0). Assume (for now) that requests are 
randomly and evenly distributed over the memories, and are 
independent from cycle to cycle. For each request, the network 
tries to establish a circuit switched connection from the processor 
to the desired memory. The routing algorithm is very simple; in 
every stage, a switch that receives a request examines a particular 
bit in the destination, selects an appropriate output link, and for- 
wards the request out on that link [8]. If the network has multiple 
paths (i.e., p > 1), then assume the switch randomly selects a free 
link from the p appropriate links, and forwards the request on 
that link. One reason for the popularity of these networks is their 
simple distributed routing algorithms. 


A connection may not be established because p+1 or more
requests compete for p output links at a particular switch; we say
that link contention has occurred; p requests are selected randomly
and forwarded, and the others are blocked, and must be resubmitted
during the next cycle. Hence, during one cycle only a fraction
of the submitted requests will have their connections established.
Let pb be the blocking probability of a network; if N requests are
submitted in one cycle, then $N \cdot pb$ requests are expected to be
blocked during that cycle. Under these operating assumptions
(i.e., circuit switched MINs with random request distributions),
numerous models exist for the blocking probability of a network
[10][22].
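

The operating environment described above can be exercised with a small Monte Carlo sketch. The code below is illustrative only: it assumes a unipath omega-style wiring (a perfect k-shuffle in front of every stage) as one concrete banyan network, and it resolves link contention by a random choice. The names `simulate_pass` and `estimate_pb` are hypothetical and do not appear in the paper.

```python
import random


def simulate_pass(dests, n, k=2, rng=random):
    """Route one cycle of requests through a unipath k^n x k^n omega-style network.

    dests[src] is the memory requested by processor src (None means idle).
    Returns the set of sources whose connections were established.
    """
    N = k ** n
    # live maps each still-unblocked source to its current line position
    live = {src: src for src, d in enumerate(dests) if d is not None}
    for stage in range(n):
        contenders = {}
        for src, pos in live.items():
            shuffled = (pos * k) % N + (pos * k) // N          # perfect k-shuffle
            digit = (dests[src] // k ** (n - 1 - stage)) % k   # destination digit examined at this stage
            out = (shuffled // k) * k + digit                  # output link selected at the switch
            contenders.setdefault(out, []).append(src)
        # link contention: one request per output link is forwarded, chosen at random
        live = {rng.choice(srcs): out for out, srcs in contenders.items()}
    return set(live)


def estimate_pb(n, k=2, unique=False, trials=2000, seed=0):
    """Estimate the blocking probability when every processor issues a request.

    unique=False draws destinations independently and uniformly;
    unique=True submits a randomly selected permutation.
    """
    rng = random.Random(seed)
    N = k ** n
    blocked = 0
    for _ in range(trials):
        if unique:
            dests = list(range(N))
            rng.shuffle(dests)
        else:
            dests = [rng.randrange(N) for _ in range(N)]
        blocked += N - len(simulate_pass(dests, n, k, rng))
    return blocked / (trials * N)
```

Calling `estimate_pb(n, unique=True)` gives a simulated counterpart to the permutation case studied in the rest of the paper; only the unipath case (p = 1) is covered by this sketch.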


2.3. The Universality of an Interconnection Network 


A very important practical operating assumption is to sup- 
pose N requests are submitted during each cycle, where these N 
requests form a permutation of the memory module indices (i.e., 
the requests are uniquely distributed). This situation occurs in 
Single Instruction Multiple Data (SIMD) array processors executing 
numeric algorithms. Data skewing algorithms are used to map the 
data (i.e., vectors or matrices) into the memories in such a way so 
that when the set of N processors issues a request for a vector, the 
resulting request pattern is a permutation (of the memory module 
indices). One approximate figure of merit for a network’s ability
to supply data vectors is its universality. However, most
permutations generated by an array processor are highly structured
(i.e., they are not randomly selected permutations) [12] and a better
figure of merit would be a model that yields the universality of a
MIN when certain permutation classes are used. Lawrie has shown
that the omega network can realize many permutation classes
frequently arising in array processors [12]. Other architectures
where the permutation patterns are not necessarily highly structured
apparently include MIMD parallel reduction machines and logic
machines [9].


Fig. 2: a) a 2-dilated $2^3 \times 2^3$ delta network. b) a
2-augmented $2^3 \times 2^3$ delta network. (Delta networks
[22] are a subclass of banyan networks [8]).


In this paper, we present a simple analytic approximation for 
the blocking probability of a network when the requests submitted 
during each cycle form a randomly selected permutation. We 
present an analytic approximation for the number of conflict free 
permutations realizable by a network (i.e., the number of permuta- 
tions that can be realized in one cycle), and we present an analytic 
approximation for the expected number of cycles required to real- 
ize an arbitrary permutation, i.e., the universality of a MIN. 


3. The Permutation Capability of Multistage Interconnec- 
tion Networks 


We present a conceptually simple analytic model for the
performance of generalized p-path $k^n \times k^n$ banyan networks, under
either random or unique memory request distributions, in this
section. It is first necessary to derive an analytic model for the
blocking probability of generalized p-path $k^n \times k^n$ MINs under the
assumption of random, uniform request distributions. This model
is then modified (in the next sub-section) to yield the blocking
probability under the assumption of unique memory request
distributions (or simply under unique distributions).


3.1. Analysis of Multistage Interconnection Networks 
Under Random Distributions 


In this section, an analytic model for the blocking probability
of generalized p-path $k^n \times k^n$ MINs under random distributions is
presented.


The assumptions for the analysis are as follows: 1) during the 
beginning of each cycle every processor issues a request with pro- 
bability u (0); 2) requests are randomly and uniformly distributed 
among the memories; and 3) unsatisfied requests in any cycle are 
ignored, and a new set of requests is issued during the next cycle 
subject to 1) and 2). 


We assume that any physical link between stages m and
m+1 carries one request with probability u(m), the aggregate link
utilization. Hence, the probability that a switch in a particular
stage receives i requests can be modelled as a simple binomial
distribution. Since each p-path $k \times k$ switch has pk physical
inputs, the probability that i requests arrive at the switch in stage
m+1 is given by

$$\binom{pk}{i}\, u(m)^{i}\, (1-u(m))^{pk-i}$$

The probability that j of these requests select a particular logical
output port is given by

$$\binom{i}{j} \left[\frac{1}{k}\right]^{j} \left[1-\frac{1}{k}\right]^{i-j}$$

In general, due to path multiplicity the first f stages will
not block, where f is given by

$$f = \lfloor \log_k p \rfloor$$

The aggregate link utilization u(f) after stage f is simply
u(0)/p.


In general, after any stage m+1 we are interested in p(j),
the probability that p logically equivalent links carry j requests.
p(j), for $0 \le j \le p$, can be calculated as follows


For j < p:

$$p(j) = \sum_{i=j}^{pk} \binom{pk}{i}\, u(m)^{i}\, (1-u(m))^{pk-i} \binom{i}{j} \left[\frac{1}{k}\right]^{j} \left[1-\frac{1}{k}\right]^{i-j} \tag{1.1}$$

For j = p:

$$p(j) = \sum_{i=p}^{pk} \binom{pk}{i}\, u(m)^{i}\, (1-u(m))^{pk-i} \sum_{l=p}^{i} \binom{i}{l} \left[\frac{1}{k}\right]^{l} \left[1-\frac{1}{k}\right]^{i-l} \tag{1.2}$$


After each stage m+1, we need to calculate the aggregate
link utilization u(m+1), as follows

$$u(m+1) = \sum_{j=0}^{p} \frac{p(j) \cdot j}{p} \tag{1.3}$$

The blocking probability of a network of size $N > 2^{f+1}$
(assuming each memory services at most 1 request in each cycle) is
then

$$pb = 1 - \frac{1 - p(0)}{u(0)} \tag{1.4}$$


The comparison of this model with simulations of various
augmented and dilated networks can be found in [26]. The model
is exact for unipath networks, but is only an approximation for
multipath networks. Simple refinements of this model can also be
found in [26].
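

The stage-by-stage iteration of eqs. (1.1) to (1.4) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: f is taken as the floor of $\log_k p$, stages f+1 through n are iterated, and the p(0) used in the final step is the value obtained at the last stage. The function names are hypothetical.

```python
from math import comb


def stage_distribution(u, p, k):
    """Return [p(0), ..., p(p)] for one stage, following eqs. (1.1) and (1.2)."""
    dist = [0.0] * (p + 1)
    for i in range(p * k + 1):
        # probability that i requests arrive on the pk physical inputs
        arrive = comb(p * k, i) * u ** i * (1 - u) ** (p * k - i)
        for j in range(i + 1):
            # probability that j of them select one particular logical output
            select = comb(i, j) * (1 / k) ** j * (1 - 1 / k) ** (i - j)
            # at most p requests can be carried by the p equivalent links
            dist[min(j, p)] += arrive * select
    return dist


def blocking_probability_random(n, k, p, u0=1.0):
    """Blocking probability of a p-path k^n x k^n MIN under random requests."""
    f = 0
    while k ** (f + 1) <= p:      # f = floor(log_k p), computed with integers
        f += 1
    if n < f + 1:
        raise ValueError("network too small: need at least one blocking stage")
    u = u0 / p                    # utilization after the non-blocking stages
    for _ in range(f, n):         # stages f+1 .. n
        dist = stage_distribution(u, p, k)
        u = sum(j * pj for j, pj in enumerate(dist)) / p   # eq. (1.3)
    # each memory serves at most one request per cycle, eq. (1.4)
    return 1 - (1 - dist[0]) / u0
```

For p = 1 the recurrence reduces to $u(m+1) = 1 - (1 - u(m)/k)^k$, which is a convenient sanity check on the code.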


3.2. Analysis of Multistage Interconnection Networks 
Under Unique Distributions 


We can now adapt this model to yield the blocking probabil- 
ity of the network under the assumption of unique memory request 
distributions, using the technique presented in [27]. The blocking 
probability under unique distributions is the probability that an 
issued request will block, when the requests form a randomly 
selected permutation. 


The assumptions for the analysis are as follows: 1) during the 
beginning of each cycle every processor issues a request with pro- 
bability u(0); 2) requests are uniquely and uniformly distributed 
among the memories; and 3) unsatisfied requests in any cycle are 
ignored, and a new set of requests is issued during the next cycle 
subject to 1) and 2). 


Consider the $2^3 \times 2^3$ banyan network shown in fig. 1b. An
output link from the first stage can reach 4 different memories, 
and a switch in the first stage can reach 8 different memories. The 
first request arriving at a switch in the first stage selects an output 
link with probability 4/8. A second request arriving at the same 
switch selects the same output link with probability 3/7, and 
selects the other output link with probability 4/7. 


The preceding technique can be generalized; consider the
probabilities of requests selecting outputs at a particular p-path
$k \times k$ crossbar switch in stage m of the banyan network. Each
output link leads to $k^{n-m}$ memory modules. The first request
selects a particular output link with probability
$k^{n-m}/k^{n-m+1}$. Given that the first request selects a
particular output link, then the second request selects that same
output link with probability $(k^{n-m}-1)/(k^{n-m+1}-1)$.

As a notational convenience, define function p_s(s,r,m) as
the probability that a request at a switch in stage m will select a
particular output link given that s requests have already selected
that link, and that r requests have already selected other links.

$$p\_s(s,r,m) = \begin{cases} \dfrac{k^{n-m} - s}{k^{n-m+1} - s - r}, & \text{for } s < k^{n-m},\ r+s < k^{n-m+1} \\ 0, & \text{otherwise} \end{cases}$$

In a similar manner, define function p_ns(s,r,m) as the
probability that a request at a switch in stage m will not select a
particular output link given that s requests have already selected
that link, and that r requests have already selected other links.

$$p\_ns(s,r,m) = \begin{cases} \dfrac{(k-1)k^{n-m} - r}{k^{n-m+1} - s - r}, & \text{for } r < (k-1)k^{n-m},\ s+r < k^{n-m+1} \\ 0, & \text{otherwise} \end{cases}$$
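

The two selection probabilities can be checked against the 8-processor example with exact rational arithmetic. The sketch below is only an illustration; passing k and n as explicit arguments is a convenience introduced here, not notation from the paper.

```python
from fractions import Fraction


def p_s(s, r, m, k, n):
    """Probability that the next request at a stage-m switch selects a given
    output link, when s earlier requests chose it and r chose other links."""
    reach_link = k ** (n - m)          # memories behind one output link
    reach_switch = k ** (n - m + 1)    # memories reachable from the switch
    if s < reach_link and r + s < reach_switch:
        return Fraction(reach_link - s, reach_switch - s - r)
    return Fraction(0)


def p_ns(s, r, m, k, n):
    """Probability that the next request does not select that output link."""
    reach_others = (k - 1) * k ** (n - m)   # memories behind the other links
    reach_switch = k ** (n - m + 1)
    if r < reach_others and s + r < reach_switch:
        return Fraction(reach_others - r, reach_switch - s - r)
    return Fraction(0)
```

With k = 2, n = 3 and m = 1 these functions reproduce the values 4/8, 3/7 and 4/7 quoted for the first stage of the network in fig. 1b.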


Assuming that the events occurring at the inputs of each
switch in stage m+1 of the banyan network are independent
(which will result in a slightly pessimistic blocking probability),
then the probability that i requests arrive on the pk physical input
links of a switch in stage m+1 is given by

$$\binom{pk}{i}\, u(m)^{i}\, (1-u(m))^{pk-i}$$

Given that i requests arrive at a switch in stage m+1, the
probability that j of these requests select a particular output and
that i-j of these requests do not select that same output is given
by

$$\binom{i}{j} \prod_{s=0}^{j-1} p\_s(s,0,m) \prod_{r=0}^{i-j-1} p\_ns(j,r,m)$$

Hence, eq. (1) can easily be modified to yield the blocking
probability of p-path $k^n \times k^n$ banyans under unique distributions.
Due to path multiplicity, the first f stages will not block, where
f is given by

$$f = \lfloor \log_k p \rfloor$$
Under unique distributions, the last l stages will not block, where
l is given by

$$l = \lfloor \log_k p \rfloor + 1$$

We need a value for u(f), the aggregate utilization on links
leaving stage f; since the first f stages did not block, let
u(f) = u(0)/p. In general, after any stage m+1 we are
interested in p(j), the probability that p logically equivalent links
carry j requests. p(j), for $0 \le j \le p$, can be calculated as follows


For j < p:

$$p(j) = \sum_{i=j}^{pk} \binom{pk}{i}\, u(m)^{i}\, (1-u(m))^{pk-i} \binom{i}{j} \prod_{x=0}^{j-1} p\_s(x,0,m) \prod_{x=0}^{i-j-1} p\_ns(j,x,m) \tag{2.1}$$

For j = p:

$$p(j) = \sum_{i=p}^{pk} \binom{pk}{i}\, u(m)^{i}\, (1-u(m))^{pk-i} \sum_{l=p}^{i} \binom{i}{l} \prod_{x=0}^{l-1} p\_s(x,0,m) \prod_{x=0}^{i-l-1} p\_ns(l,x,m) \tag{2.2}$$


After each stage m+1, we need to calculate the aggregate
link utilization u(m+1), as follows

$$u(m+1) = \sum_{j=0}^{p} \frac{p(j) \cdot j}{p} \tag{2.3}$$

The blocking probability of the network is then

$$pb = 1 - \frac{u(n-l)}{u(f)} \tag{2.4}$$
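

A compact sketch of the unique-distribution iteration, eqs. (2.1) to (2.4), is given below. It rests on assumptions that should be read as ours rather than the paper's: stages are indexed by the switch stage t = 1..n, so that a stage-t output link reaches $k^{n-t}$ memories; the products of p_s and p_ns terms are evaluated in their equivalent closed (hypergeometric) form; and only stages f+1 through n-l are treated as blocking. The function names are hypothetical.

```python
from math import comb


def floor_log(k, p):
    """Integer floor of log_k(p), avoiding floating point logarithms."""
    f = 0
    while k ** (f + 1) <= p:
        f += 1
    return f


def blocking_probability_unique(n, k, p, u0=1.0):
    """Blocking probability of a p-path k^n x k^n banyan when the requests
    submitted in a cycle form a randomly selected permutation."""
    f = floor_log(k, p)       # first f stages do not block
    l = f + 1                 # last l stages do not block
    if n - l < f:
        raise ValueError("network too small for this path multiplicity")
    u_f = u0 / p
    u = u_f
    for t in range(f + 1, n - l + 1):        # blocking stages only
        behind_link = k ** (n - t)           # memories behind one output link
        behind_switch = k ** (n - t + 1)     # memories reachable from the switch
        dist = [0.0] * (p + 1)
        for i in range(p * k + 1):
            arrive = comb(p * k, i) * u ** i * (1 - u) ** (p * k - i)
            for j in range(i + 1):
                # sampling destinations without replacement: j of the i
                # requests fall behind the chosen output link
                select = (comb(behind_link, j)
                          * comb(behind_switch - behind_link, i - j)
                          / comb(behind_switch, i))
                dist[min(j, p)] += arrive * select
        u = sum(j * pj for j, pj in enumerate(dist)) / p    # eq. (2.3)
    return 1 - u / u_f                                      # eq. (2.4)
```

For a unipath $2^2 \times 2^2$ network the sketch returns 1/6, which agrees with a direct count: two requests entering a first-stage switch collide with probability 1/3, and one of the two is then blocked.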


This model is compared with simulations in the next sub- 
section. 


3.2.1. Comparison of Model with Simulations 


We now compare this simple analytic model with a number 
of simulations, for various unipath and multipath MINs. 


Fig. 3 illustrates the blocking probabilities for various
augmentations of $k^n \times k^n$ banyans of size N under unique
distributions. (The pb of dilated and augmented networks, with the
same path multiplicity, differ by only a few percent [26]; these
simulations happen to be for augmented networks, although the model is
more accurate for dilated networks). These figures indicate that
the model is reasonably accurate.


There are two major sources of error in this model. First, 
the analytic model for the pb under the assumption of random, 
independent requests is only an approximation itself, and has an 
error of a few percent. Secondly, when the model is adapted for 
unique distributions, the assumption that requests arriving at the 
inputs of a switch are statistically independent introduces some 
inaccuracies. However, more refined analyses can take these
considerations into account (see [26]).


Hence, eq. (2) can be used as a quantitative measure of a p-path
$k^n \times k^n$ banyan network’s permutation capability. The
techniques presented here are general, and hence models for other 
MINs are easily created. 


4. On the Expected Number of Conflict-Free Permuta- 
tions in Multipath MINs 


A simple technique for determining the exact number of
conflict free permutations in unipath $2^n \times 2^n$ MINs has been
presented in [1]. The extension to unipath $k^n \times k^n$ MINs is
simple and is presented here.


In a unipath MIN with N processors and N memories, there
are exactly N links between each stage. If a permutation (of N
requests) is conflict-free, every link must carry one request. Hence,
each $k \times k$ switch will have k requests arriving at its inputs, and
must have k requests leaving on its outputs (so that no requests
block). Each switch can therefore perform k! mappings of its
inputs to its outputs, i.e., there are k! distinct settings for each
switch. Since there are N/k switches per stage, and $\log_k N$
stages, then the exact number of conflict free permutations in a
unipath network is simply

$$(k!)^{\frac{N}{k} \log_k N} \tag{3}$$


Fig. 3: (a) pb for various augmentations of $2^n \times 2^n$
networks with N sources under unique distributions.
(b) pb for various augmentations of $4^n \times 4^n$
networks with N sources under unique distributions.
(solid curves are simulations with 95% confidence
intervals shown; dashed curves are analytic.)
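

As a small worked instance of eq. (3), the count can be evaluated exactly with integer arithmetic. The function name below is hypothetical and the sketch assumes N is an exact power of k.

```python
from math import factorial


def unipath_conflict_free_permutations(k, n):
    """Number of conflict free permutations of a unipath k^n x k^n MIN:
    k! settings per switch, N/k switches per stage, n = log_k N stages."""
    N = k ** n
    switches_per_stage = N // k
    return factorial(k) ** (switches_per_stage * n)
```

For the 8-processor network of fig. 1b (k = 2, n = 3) there are 4 switches in each of 3 stages, giving $2^{12} = 4096$ conflict free permutations out of 8! = 40320.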

In a multipath network, the analysis is much more difficult. 
Since there are more than N links between each stage, each link 
does not necessarily carry a request, and each switch does not 
necessarily have k requests arriving at its inputs. The previous 
model can be adapted to estimate the number of distinct switch 
settings in a multipath MIN (see [20]), however this estimate “does
not come close to estimating the number of permutations realizable
by the [gamma] network. Estimating the actual number of
permutations seems to be much harder, and is left as a challenging
open problem” [20]. In this section, we present an analytic model
for the expected number of conflict free permutations in multipath
MINs.


Let $X_i$ be the event that i requests block in one cycle, given
that N requests which form a permutation are submitted initially.
In this section, we wish to find an analytic model for the
probability distribution function pdf [6] of $X_i$, for $0 \le i \le N$.
Let $P(X_i)$ represent the pdf of $X_i$.

The simplest approach for the pdf is to assume that all
events are independent, and that $P(X_i)$ is simply binomially
distributed;

$$P(X_i) = \binom{N}{i}\, pb^{\,i}\, (1-pb)^{N-i} \tag{4}$$


Fig. 4 illustrates the observed $P(X_i)$ versus the computed
values for various networks. (Note that we used the pb obtained
from simulations when computing these pdf’s.)

From fig. 4, we observe that the independence assumption is
very accurate for the multipath banyans, and reasonably close for
the unipath banyans. We can actually test the quality of the pdf
by using the $\chi^2$ goodness of fit test [6]. At a 5% level of
significance, the hypothesis is rejected for most unipath networks,
and accepted for all multipath networks that we tried.


Given that the pdf is reasonable for multipath networks, we
can accurately estimate the number of conflict-free permutations
in these networks, $N_c$, as follows;

$$N_c = N! \cdot (1-pb)^{N} \tag{5}$$

For unipath networks the pdf is less accurate, primarily
because the independence assumption is not valid. A slightly more
complicated multinomial distribution that explicitly models the
effects of the statistical dependence is required for the pdf of
unipath networks (see [26]).


In general, we are not interested in simulating a network to 
obtain a measurement for the pb used in eq. (5). Eq. (2.4) gives a 
reasonably accurate estimate of pb , which can be used in eq. (5). 
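

Because N! overflows ordinary floating point for networks of practical size, the estimate of eq. (5) is easier to handle in logarithmic form. The sketch below is an illustration under that design choice; the function name is hypothetical, and pb is expected to come from eq. (2.4) or from simulation.

```python
from math import lgamma, log, log10


def log10_conflict_free_estimate(N, pb):
    """Base-10 logarithm of N! * (1 - pb)^N, the estimate of eq. (5)."""
    log10_factorial = lgamma(N + 1) / log(10)   # log10(N!) via the log-gamma function
    return log10_factorial + N * log10(1.0 - pb)
```

The fraction of all N! permutations that are conflict free is then $(1-pb)^N$, i.e. $P(X_0)$ in eq. (4), which shows how quickly the count shrinks as N grows even when pb is small.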


5. The Expected Number of Network Cycles Required to
Realize an Arbitrary Permutation


In this section, we develop an analytic model for the
expected number of network cycles required to realize an arbitrary
permutation, i.e., the universality of a network. The universality
of various networks has been examined by numerous researchers.
When global (i.e., precomputed) routing algorithms are used, the
universalities of some networks have been established, and upper
or lower bounds for others have been established. When distributed
routing algorithms are used, the only results to date are
simulations for the universality [15][31].


A proof for an upper bound for the number of cycles required
to realize any arbitrary permutation in unipath $2^n \times 2^n$ banyans
has been presented in [2]. The basic idea was to count the number
of requests that must use a specific link in the network in the
worst-case, i.e., if the worst case permutation would require m
requests to use a specific link, then the upper bound for the
number of cycles is m. The upper bound on the number of cycles
in a $2^n \times 2^n$ banyan was shown to be $2^{\lfloor n/2 \rfloor}$. An identical bound
(but using a slightly different technique) has been derived in [28].


However, the implied assumption in this bound is that the 
critical link will be utilized by a request in each cycle, and that 
the request that used it will indeed be established. To ensure this 
assumption would require precomputed routing for each individual 
permutation to optimally schedule requests over the critical link.


If the usual simple, distributed banyan routing algorithm is
used, this upper bound does not apply (it will be exceeded
occasionally). This situation occurs when a request uses the critical
link during a cycle and is blocked at a later stage in the same
cycle, hence requiring the use of the critical link again. Numerous
examples of this situation can be found for networks of size N > 8.
The correct upper bound (assuming a distributed routing algorithm)
must also consider the conflicts that may occur at the
switch after the critical link, and is simply $2 \cdot 2^{\lfloor n/2 \rfloor}$ [26].


Fig. 4: a) $P(X_i)$ for unipath $2^n \times 2^n$ banyans. b)
$P(X_i)$ for unipath $4^n \times 4^n$ banyans. c) $P(X_i)$ for
2-augmented $2^n \times 2^n$ banyans. d) $P(X_i)$ for
2-augmented $4^n \times 4^n$ banyans. (solid curves are
simulations; dashed curves are analytic.)


In this section, we present an analytic model for the expected
number of (synchronous, circuit switched) network cycles required
to realize an arbitrary permutation. The accuracy of this model
depends on the accuracy of the pdf for $X_i$. As we will show, the
model is very accurate for multipath networks, and optimistic for
unipath networks.


The analytic model is very simple; we use a time-varying
markov model with N+1 states, denoted $S_i$, for $0 \le i \le N$, as
shown in fig. 5. We also maintain a counter c, which represents
how many network cycles have been executed; initially, c = 0.
Each state $S_i$ represents the probability that i requests are still
unsatisfied after c cycles and must be resubmitted in the next
network cycle. This model has one source state, $S_N$, and one sink
state, $S_0$. Initially, all N requests in a permutation are
unsatisfied, so $S_N = 1$, and all other states are 0. A variable $E_c$
represents the expected number of cycles required to realize any
permutation, and $E_c = 0$ initially.


For each network cycle, we update every state, as follows.
Consider updating state $S_i$; this state is the probability that i
requests remain unsatisfied, and this state can be reached from any
state $S_j$, where $j \ge i$ and where $S_j > 0$. The physical events that
correspond to this state transition are simple; the processor array
submits j requests in one cycle, i of them are blocked, and the
rest are accepted. The blocking probability of the network, when
j requests are submitted, can be computed from eq. (2.4). The
probability that i of these requests are blocked can be computed
from eq. (4). Hence, the state transition rates for the entire
markov model can easily be computed during each network cycle.


During each cycle c (c = 1, 2, 3, ...), some number of
requests j are satisfied (by reaching $S_0$); these contribute the
value $c \cdot j/N$ to the expected number of cycles $E_c$. After a
number of cycles (iterations), the model will converge to a stable
state (in absence of numeric errors, the steady state will have
$S_0 = 1$, and all other states equalling 0). The process can be
terminated when $E_c$ remains relatively constant, and the resulting
$E_c$ is the expected number of network cycles required to complete
an arbitrary permutation.
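

The state update described above can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the blocking probability for j submitted requests is supplied by a caller-provided function `pb_of(j)` (in the paper it would come from eq. (2.4)), transitions use the binomial pdf of eq. (4), and the expected value is accumulated as c times the probability mass that first reaches $S_0$ in cycle c, which is one reading of the accumulation step.

```python
from math import comb


def expected_passes(N, pb_of, tol=1e-12, max_cycles=10_000):
    """Expected number of cycles until all N requests of a permutation
    are established, using an (N+1)-state markov chain."""
    state = [0.0] * (N + 1)     # state[i] = P(i requests still unsatisfied)
    state[N] = 1.0              # source state: all N requests pending
    expected = 0.0
    for c in range(1, max_cycles + 1):
        new = [0.0] * (N + 1)
        new[0] = state[0]       # sink state keeps its mass
        for j in range(1, N + 1):
            if state[j] == 0.0:
                continue
            pb = pb_of(j)       # blocking probability with j requests submitted
            for i in range(j + 1):
                # i of the j submitted requests are blocked, eq. (4)
                new[i] += state[j] * comb(j, i) * pb ** i * (1 - pb) ** (j - i)
        expected += c * (new[0] - state[0])   # mass absorbed in this cycle
        state = new
        if 1.0 - state[0] < tol:              # effectively converged to S_0 = 1
            break
    return expected
```

A caller might pass `pb_of = lambda j: some_model(j / N)` where `some_model` (hypothetical) evaluates the blocking probability for an initial utilization of j/N. Because `comb` returns exact integers that are converted to floating point, this sketch is only suitable for moderate N.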


Fig. 6 illustrates the analytic results and simulations for vari- 
ous networks. 


Fig. 5: markov model for an $N \times N$ MIN.


Fig. 6: a) $E_c$ for various p-path $2^n \times 2^n$ banyans.
b) $E_c$ for various p-path $4^n \times 4^n$ banyans. (solid
curves are simulations with 95% confidence intervals
shown; dashed curves are analytic.)


Fig. 6 indicates that the model is very accurate for multipath 
networks, and optimistic for large unipath networks. However, 
most practical networks will almost certainly have some path mul- 
tiplicity, hence these models can be used. 


A more accurate analytic model for the probability distribu- 
tion function of unipath networks can be found in [26] (this model 
accounts for the effects of the statistical dependence of requests 
arriving at one switch). This more refined pdf model can be used 
in this markov model to create a more accurate approximation for
the unipath case [26]. 


Perhaps the most surprising observation is the performance of
the 4-path networks. The 4-path $2^n \times 2^n$ network has a blocking
probability of about .007 at N = 1024; with such a low blocking
probability, we would expect that the average number of passes
required to realize a permutation would be about 1. However, fig.
6 indicates that two passes (on average) are required at N = 1024.
Similarly, the 4-path $4^n \times 4^n$ banyan has a blocking probability of
about .004 at N = 1024, and yet still 2 passes (on average) are
required to realize an arbitrary permutation.


The true performances of p-path networks must reflect on 
the bandwidth of the links used in the network. We assume that 
each link in a p-path $k^n \times k^n$ network has 1/p times the
bandwidth compared to a link in the unipath $k^n \times k^n$ network. (The
number of pins available on an integrated circuit is limited; by
doubling the number of logical input/output links, each link will 
have half as many pins, and hence will have approximately half 
the bandwidth.) Hence, assume that each pass in a p-path net- 
work takes p times as long as a pass in the unipath network 
(when the switch degree remains the same). 


From fig. 6 it is clear that a 2-path network actually outper- 
forms a unipath network by about 25% (for switches of fixed 
degree 2 or 4). Consider a network of size N=1024 made with 
2X2 switches. The unipath network will require about 8 cycles to 
realize an arbitrary permutation, and the 2-path network will 
require about 3 cycles (each about twice as slow) to realize an arbi- 
trary permutation. However, this estimate can be pessimistic: The 
unipath network will require about 8 memory access times 
whereas the 2-path network will require only about 3 memory 
access times. If the path establishment and data transfer delays 
are insignificant components of the network cycle time (i.e., the
network cycle time is dominated by the memory access time), then 
the performance improvement could be much higher. Hence, the 
true performance improvement should also consider the memory 
access times, plus time spent on path establishment and error
checking, etc. 
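

The 25% figure quoted above follows from a short calculation under the stated assumption that a pass in a p-path network takes p times as long as a unipath pass. The symbol t, denoting the duration of one unipath pass, is introduced here only for illustration:

```latex
T_{\text{unipath}} \approx 8\,t, \qquad
T_{\text{2-path}} \approx 3 \cdot 2\,t = 6\,t, \qquad
1 - \frac{T_{\text{2-path}}}{T_{\text{unipath}}} = 1 - \frac{6}{8} = 0.25
```

If instead the cycle time is dominated by the memory access time, the two pass durations are nearly equal and the comparison approaches 8 versus 3 memory access times.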


In addition to offering significant performance improvements 
(when the realization of arbitrary permutations is the performance 
criterion), the 2-path network is also significantly more fault- 
tolerant [23]. 


6. Conclusions 


We have presented a simple analytic model for the blocking 
probability of circuit switched MINs when the set of requests sub- 
mitted during each cycle is a randomly selected permutation. In 
general, such models are created by adapting an analytic model of 
a network’s blocking probability when the requests are assumed to 
be random and independent. The techniques are general, and 
hence analytic models for the performance of arbitrary MINs, 
under the assumption of unique distributions, are relatively easy to 
create. 


We have presented a general technique to determine an ana- 
lytic approximation for the expected number of conflict free per- 
mutations in multipath MINs (when simple distributed routing 
algorithms are used). This technique can be used for multipath 
networks such as the Gamma network [20] and the ADM [1]. No 
such general technique has been presented in the literature
previously, although lower and upper bounds (and in one case an exact
number) on the number of conflict free permutations realizable in 
certain MINs have been presented. The number of conflict-free 
permutations realizable by a MIN when distributed routing algo- 
rithms are used is a quantitative measure of the connectivity of 
the network. 


We have presented a general technique to determine an accu- 
rate analytic approximation for the expected number of passes 
required to realize an arbitrary permutation in a network, i.e., the 
universality of a network, when simple distributed routing algo- 
rithms are used. 


An interesting result concerning network performance was 
observed. A 2-path network (of degree 2 or 4) will require 3 or 
fewer passes on average to realize an arbitrary permutation for 
reasonably sized networks (i.e., network size < 1024). A 2-path 
network will outperform the corresponding unipath network 
significantly, and offer significant fault-tolerance improvements 
as well. In addition, the performance of a 2-path MIN compares
very well with the best known universality of 3 passes through a 
unipath omega network using global, precomputed routing 
[21][25][30] (where both networks have switches of degree 2). 


These models solve many interesting theoretical problems in
the realm of interconnection network modelling. However, these
models also have a very practical aspect. They give a quantitative
measure of the connectivity of a MIN. Previously, when compar-
ing interconnection networks for their suitability in SIMD array
processors, numerous techniques have been used. One such tech-
nique was to estimate the number of cycles one network requires
to simulate another [24]. Using the models presented in this paper,
one can evaluate the average performance of a particular network
analytically (when realization of arbitrary permutations is the per-
formance criterion). However, it must be pointed out that real
SIMD machines typically use a small number of permutation
classes where the permutations tend to be highly structured, and a
better figure of merit would be the network's ability to realize
those frequently occurring permutations.


7. References 


[1] G.B. Adams III and H.J. Siegel, "On the Number of Permuta-
tions Performable by The Augmented Data Manipulator",
IEEE Trans. Comput., Vol C-31, April 1982, pp. 270-277

[2] D.P. Agrawal, "Graph Theoretical Analysis and Design of
Multistage Interconnection Networks", IEEE Trans. Com-
put., Vol C-32, July 1983, pp. 637-648

[3] G.H. Barnes and S.F. Lundstrom, "Design and Validation of
a Connection Network for Many-Processor Multiprocessor
Systems", IEEE Computer, Dec. 1981, pp. 31-41

[4] V.E. Benes, "On Rearrangeable Three-Stage Connecting Net-
works", Bell Sys. Tech. Journal, Sept. 1962, pp. 1481-1492

[5] L.N. Bhuyan and D.P. Agrawal, "Design and Performance of
Generalized Interconnection Networks", IEEE Trans. Com-
put., Vol C-32, Dec. 1983, pp. 1081-1090

[6] I.F. Blake, "An Introduction to Applied Probability", Wiley,
1979

[7] C. Clos, "A Study of Nonblocking Switching Networks", Bell
Sys. Tech. Journal, March 1953, pp. 406-424

[8] G.R. Goke and G.J. Lipovski, "Banyan Networks for Parti-
tioning Multiprocessor Systems", Proc. 1st Annual Symp.
Computer Architecture, 1973, pp. 21-28

[9] T. Hsiao-Nan, "RSESS Interconnection Network", Proc.
1985 Intl. Conf. on Parallel Processing, pp. 466-473

[10] C.P. Kruskal and M. Snir, "The Performance of Multistage
Interconnection Networks for Multiprocessors", IEEE Trans.
Comput., Vol C-32, Dec. 1983, pp. 1091-1098

[11] T. Lang, "Interconnections Between Processors and Memory
Modules Using the Shuffle-Exchange Network", IEEE Trans.
Comput., Vol C-25, May 1976, pp. 496-503

[12] D.H. Lawrie, "Access and Alignment of Data in an Array
Processor", IEEE Trans. Comput., Vol C-24, Dec. 1975, pp.
1145-1155

[13] D.H. Lawrie, "The Prime Memory System for Array Access",
IEEE Trans. Comput., Vol C-31, May 1982, pp. 435-442

[14] K.Y. Lee, "On the Rearrangeability of 2(log2 N)-1 Stage
Permutation Networks", IEEE Trans. Comput., Vol C-34,
May 1985, pp. 412-425

[15] M-D. P. Leland, "Properties and Comparisons of Multistage
Interconnection Networks for SIMD Machines", Ph.D. Thesis,
Univ. Wisconsin-Madison, 1983

[16] M-D. P. Leland, "On the Power of the Augmented Data
Manipulator Network", Proc. 1985 Intl. Conf. on Parallel
Processing, pp. 74-78

[17] G.F. Lev, N. Pippenger and L.G. Valiant, "A Fast Parallel
Algorithm for Routing in Permutation Networks", IEEE
Trans. Comput., Vol C-30, Feb. 1981, pp. 93-100

[18] D. Nassimi and S. Sahni, "A Self-Routing Benes Network
and Parallel Permutation Algorithms", IEEE Trans. Com-
put., Vol C-30, May 1981, pp. 332-340

[19] K. Padmanabhan and D.H. Lawrie, "A Class of Redundant
Path Multistage Interconnection Networks", IEEE Trans.
Comput., Vol C-32, Dec. 1983, pp. 1099-1108

[20] D.S. Parker and C.S. Raghavendra, "The Gamma Network:
A Multiprocessor Interconnection Network With Redundant
Paths", Proc. 9th Annual Symp. on Computer Architecture,
1982, pp. 73-80

[21] D.S. Parker, "Notes on Shuffle/Exchange-Type Switching
Networks", IEEE Trans. Comput., Vol C-29, March 1980,
pp. 213-222

[22] J.H. Patel, "Performance of Processor-Memory Interconnec-
tions for Multiprocessors", IEEE Trans. Comput., Vol C-30,
Oct. 1981, pp. 771-780

[23] S.M. Reddy and V.P. Kumar, "On Fault-Tolerant Multistage
Interconnection Networks", Proc. 1984 Intl. Conf. on Parallel
Processing, pp. 155-164

[24] H.J. Siegel, "Analysis Techniques for SIMD Machine Inter-
connection Networks and the Effects of Processor Address
Masks", IEEE Trans. Comput., Vol C-26, Feb. 1977, pp.
153-161

[25] D. Steinberg, "Invariant Properties of the Shuffle-Exchange
and a Simplified Cost-Effective Version of the Omega Net-
work", IEEE Trans. Comput., Vol C-32, May 1983, pp. 444-
450

[26] T.H. Szymanski, Ph.D. Thesis, in preparation, Univ. of
Toronto

[27] T.H. Szymanski and V.C. Hamacher, "On the Permutation
Capability of Multistage Interconnection Networks", submit-
ted for publication

[28] A. Varma and C.S. Raghavendra, "Realization of Permuta-
tions on Generalized Indra Networks", Proc. 1985 Intl. Conf.
on Parallel Processing, pp. 328-333

[29] C-L Wu and T-Y Feng, "The Reverse-Exchange Interconnection
Network", IEEE Trans. Comput., Vol C-29, Sept. 1980, pp.
801-811

[30] C-L Wu and T-Y Feng, "The Universality of the Shuffle-
Exchange Network", IEEE Trans. Comput., Vol C-30, May
1981, pp. 324-331

[31] P-C. Yew, "On the Design of Interconnection Networks for
Parallel and Multiprocessor Systems", Ph.D. Thesis, Univ.
Illinois at Urbana-Champaign, 1981


Synthesis of a Family of Cellular Permutation Arrays 


By 


A. Yavuz Oruc 


Electrical, Computer and Systems Engineering Department 
Rensselaer Polytechnic Institute, Troy, NY 12180 


Abstract 


Group theory provides a convenient means for
formulating and solving some nontrivial problems
in permutation network theory. This paper uses
group theoretic techniques to provide a unified
approach for cellular permutation array design.
More specifically, the networks of Kautz et al.
and Bandyopadhyay et al. are shown to be
geometric representations of iterative coset
decompositions of symmetric groups. This result
was used by Oruc and Oruc to develop a
linear-time set up algorithm for cellular
permutation arrays. This paper examines another
application, namely, the design of a family of
cellular permutation arrays.


1. Introduction 


Permutation networks have been extensively
investigated in the parallel processing
literature. By now, researchers in the field
know how to design such networks with
asymptotically optimal cell counts[1], and
extensive research results have also appeared
about how these networks are incorporated into
parallel processing[2]. The objective of the
present paper is to exhibit some interesting
relations between a family of permutation
networks known as cellular permutation arrays,
and coset decompositions of symmetric groups.
Cellular permutation arrays were investigated
primarily by Kautz et al.[3] and Bandyopadhyay et
al.[4]. Unlike optimal permutation networks,
cellular permutation arrays use redundant cells,
and thus they are not optimal in terms of cell
counts. Nevertheless, they have certain
attractive features. One such feature is that
the links or wires connecting the programmable
cells in these networks are local and require no
crossing. More significantly, it was recently
shown[5] that an n-input cellular permutation
array can be set up in $O(n)$ time whereas the
best known set up time for an n-input Benes
network is $O(n \log_2 n)$.

These properties of cellular permutation
arrays call for a careful investigation of such
networks. As an effort in this direction, we
show that cellular permutation arrays proposed
in the papers [3,4] can all be characterized as
iterative decompositions of symmetric groups
into cosets. A direct consequence of this fact
is a parametrized design technique for cellular
permutation arrays, that is, a design technique
by which it is possible to specify the cell
sizes and permutations based on the design
constraints in question. In the remainder of the
paper we shall formalize these ideas and
elaborate on them in detail.


2. Network Characterization 


We first establish the relation between the
coset decompositions of symmetric groups and
cellular permutation arrays. Let $\Sigma_i$ denote the
symmetric group over the set of symbols
$\{1,2,\ldots,i\}$; $1 \le i \le n$. It can be shown that[6]:

$$\Sigma_i = \Sigma_{i-1}e + \Sigma_{i-1}(1\ i) + \Sigma_{i-1}(2\ i) + \cdots + \Sigma_{i-1}(i-1\ i) \qquad (1)$$

for all i; $3 \le i \le n$ where $\Sigma_2 = \{e, (1\ 2)\}$, and e
denotes the identity permutation. Similarly,

$$\Sigma_i = e\Sigma_{i-1} + (1\ i)\Sigma_{i-1} + (2\ i)\Sigma_{i-1} + \cdots + (i-1\ i)\Sigma_{i-1} \qquad (2)$$

(1) and (2) are, respectively, the iterative
right and left coset decompositions of $\Sigma_n$ into
$\Sigma_{n-1}, \Sigma_{n-2}, \ldots, \Sigma_2$. In both cases, the map
e and transpositions $(i\ j)$; $1 \le i \le j-1$ are the
right (or left) coset leaders of $\Sigma_{j-1}$ in $\Sigma_j$;
$3 \le j \le n$. Moreover, these two decompositions are
the group-theoretic representations of what are
known as regular and reverse KLW permutation
arrays[3,6]. Fig. 1 depicts the two networks for
n=4. If each cell in either network can
perform the only two permutation maps possible
over its inputs as indicated, then one easily
verifies the firm equivalence between regular
and reverse KLW permutation arrays and the
iterative decompositions given in (1) and (2).
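
The coset decomposition above can be checked by brute force for a small case. The following Python sketch is only an illustration, not part of the original construction: the function names and the tuple encoding of permutations are assumptions made here, and products are read as "apply the left factor first". It builds the right cosets of the subgroup fixing the last symbol, using the identity and the transpositions (k i) as leaders, so that one can confirm they partition the symmetric group.

```python
from itertools import permutations

def compose(p, q):
    """Return the map 'apply p, then q' on symbols 1..n (tuples of images)."""
    return tuple(q[p[x] - 1] for x in range(len(p)))

def transposition(n, a, b):
    """Return the transposition (a b) on symbols 1..n as a tuple of images."""
    images = list(range(1, n + 1))
    images[a - 1], images[b - 1] = b, a
    return tuple(images)

def right_cosets(i):
    """Cosets Sigma_{i-1} g inside Sigma_i for g in {e, (1 i), ..., (i-1 i)}."""
    # Sigma_{i-1} embedded in Sigma_i: permutations that fix the symbol i.
    subgroup = [p + (i,) for p in permutations(range(1, i))]
    identity = tuple(range(1, i + 1))
    leaders = [identity] + [transposition(i, k, i) for k in range(1, i)]
    return [{compose(h, g) for h in subgroup} for g in leaders]

if __name__ == "__main__":
    cosets = right_cosets(4)
    covered = set().union(*cosets)
    print(len(cosets), [len(c) for c in cosets], len(covered))
```

For i=4 the four cosets each hold six permutations and together cover all 24 elements, which is the partition that equation (1) expresses.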

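One can also decompose $\Sigma_n$ into $\Sigma_i$;
$2 \le i \le n-1$ by using cycle maps as coset leaders
instead of transpositions. That is, it can be
shown that[6]:

$$\Sigma_i = \Sigma_{i-1}e + \Sigma_{i-1}(i-1\ i) + \Sigma_{i-1}(i-2\ i-1\ i) + \cdots + \Sigma_{i-1}(1\ 2\ \ldots\ i-1\ i) \qquad (3)$$

and,

$$\Sigma_i = e\Sigma_{i-1} + (i\ i-1)\Sigma_{i-1} + (i\ i-1\ i-2)\Sigma_{i-1} + \cdots + (i\ i-1\ \ldots\ 2\ 1)\Sigma_{i-1} \qquad (4)$$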


Equations (3) and (4) are the group-theoretic
representations of what are known, respectively,
as regular and reverse BBC networks[4,6]. As an
example, we depict these two networks for n=5 in
Fig. 2. The cycle maps shown inside the cells
with three or more inputs are coset leaders
while the cell with two inputs in each network
corresponds to $\Sigma_2 = \{e, (1\ 2)\}$.

A direct implication of the four
decompositions stated above is that the
corresponding networks are symmetric, i.e., they
can realize all of $\Sigma_n$, and that the networks
can be set up for an arbitrary permutation in
linear time. The reader may refer to [5] for a
detailed description of the set up procedures.
Another consequence of these characterizations
of cellular permutation arrays is a parametrized
design technique which we shall consider in the
following section.


3. Network Design


It should be noted that the KLW and BBC
networks described above are two extreme
examples of realizing symmetric groups by
iterative coset decompositions. They are extreme
in the sense that while the KLW networks use
$n(n-1)/2 = O(n^2)$ cells, BBC networks need n-1
or $O(n)$ cells to construct. Of course, the
complexity of the cells in the two cases is
quite different. In any event, however, we may
be subject to certain physical constraints in
realizing $\Sigma_n$ by a cellular permutation array.
For example, it may be cost ineffective to place
each cell of a KLW network in a separate chip,
and yet it may be infeasible to contain all of
the permutations of a large cell of a BBC
network within a single chip. Thus, what is
needed is a design technique to realize
symmetric groups by cellular permutation arrays
subject to the specifications of the coset
leaders, and the size and the number of cells to
be used by the network in question.

We can formalize the above problem as
follows. Given positive integers n and m; $2 \le m \le n$;
design an n-input cellular permutation array
consisting of N(n,m) k-input cells where $2 \le k \le m$
and $n-1 \le N(n,m) \le n(n-1)/2$. We shall describe two
constructions both with pseudo triangular
geometries.

In the first construction[7], we shall
permit cells with 2, 3, 4, ..., up to k inputs.
The pseudo triangular permutation network is
then directly obtained by grouping together the
permutations of 2-input cells of a KLW network
columnwise into 2-, 3-, ..., up to k-input cells.
As an example, such a network for n=11 is
constructed as shown in Fig. 3. It is easily
verified that the column of cells with the jth
vertical input in the network can perform all of
the transpositions $(i\ j)$; $1 \le i \le j-1$, and hence the
symmetricity of the network immediately follows.
It can also be shown that the number of cells in
an n-input pseudo triangular permutation array
which is so constructed is

$$N(n,m) = (f+1)\left(n-1-(m-1)f/2\right) \qquad (5)$$

where $f = \lfloor (n-1)/(m-1) \rfloor$. Finally, we note that,
instead of transpositions, one can easily map
the cycles of BBC networks into the columns of
this type of cellular permutation array after
some algebraic manipulations. Whether one uses
transpositions or cycles is an implementation
problem which can be investigated independently
from the considerations discussed here.
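
As a quick numerical check on the cell count of the first construction, the sketch below evaluates the closed form N(n,m) = (f+1)(n-1-(m-1)f/2) with f = floor((n-1)/(m-1)) and compares it with a direct column-by-column count. The column count rests on an assumption made here for illustration, not on a statement in the text: a column that must realize t transpositions is taken to need ceil(t/(m-1)) cells, since a k-input cell is read as absorbing k-1 of the original 2-input cells. The function names are hypothetical.

```python
def cells_formula(n, m):
    """Closed form (f+1)(n-1-(m-1)f/2), evaluated in integer arithmetic."""
    f = (n - 1) // (m - 1)
    # Multiply the second factor by 2 first so the division by 2 is exact.
    return (f + 1) * (2 * (n - 1) - (m - 1) * f) // 2

def cells_by_columns(n, m):
    """Assumed reading: a column with t transpositions needs ceil(t/(m-1)) cells."""
    return sum(-(-t // (m - 1)) for t in range(1, n))

if __name__ == "__main__":
    for n, m in [(11, 4), (4, 2), (5, 5)]:
        print(n, m, cells_formula(n, m), cells_by_columns(n, m))
```

Under this reading the two counts agree, and the extremes behave as the text describes: m=2 gives n(n-1)/2 cells (the KLW case) and m=n gives n-1 cells (the BBC case).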

In the second construction we shall assume
that all cells have the same number of inputs.
This assumption leads to a pseudo triangular
cellular permutation array which is depicted in
Fig. 4 for n=11. In this case, we must let the
leftmost cell generate $\Sigma_m$ over its inputs. The
cells in subsequent columns will then need only
to realize the required coset leaders. The
reader can again verify that the column with the
jth vertical input can be made to realize all of
the transpositions $(i\ j)$; $1 \le i \le j-1$. Thus, the
symmetricity of the network is guaranteed. The
construction is easily generalized to n inputs,
and furthermore it can be shown that such a
network consists of N(n,m) cells where:

$$N(n,m) = (f+1)\left(n-m+1-(m-1)f/2\right) + g \qquad (6)$$

where $f = \lfloor (n-m+1)/(m-1) \rfloor$ and g=1 if $(n-m+1) \bmod (m-1) \ne 0$
and g=0 otherwise.

The above network constructions underscore
the degrees of freedom in cellular permutation
array design. We shall not describe the detailed
implications of these constructions here. It
should be obvious how one can manipulate the
formulas in (5) and (6) to explore various
design alternatives subject to constraints on n,
m and N(n,m).


4. Summary and Conclusions

The paper has established the structural
equivalence between cellular permutation arrays
and coset decompositions of symmetric groups.
Based on this equivalence, we have constructed
two new families of cellular permutation arrays,
and shown that BBC and KLW networks are special
cases of these networks. We have also outlined a
parametrized design technique which can be used
to synthesize cellular permutation arrays when
there are certain constraints on the cell sizes
and permutations. We note that the networks
described here all have triangular geometries, a
fact that follows from the linear iterative
nature of the symmetric group decompositions
considered in the paper. It will be worthwhile
to determine whether there exist other
decompositions of symmetric groups which may
lead to new families of cellular permutation
arrays.

(b) Reverse KLW network

Figure 1. 4-input Triangular Networks

Figure 2. 5-input Cascade Networks

Figure 3. Pseudo Triangular Permutation Network,
Construction 1; n=11, m=4.

Figure 4. Pseudo Triangular Permutation Network,
Construction 2; n=11, m=4.

ABSTRACT


Fault-tolerant network architectures for multiprocessor 
systems are emerging as an important area of study. Theoreti- 
cal studies of algorithms, simulation of the NYU Ultracom- 
puter architecture and construction of a small prototype 
machine led to a joint proposal by NYU and IBM (Yorktown 
Heights) to implement the architecture in a system with up to 
512 processors, including an implementation of the "Fetch-
and-Add" shuffle network that is key to the success of this
architecture. Techniques proposed so far to achieve fault-
tolerance in interconnection networks cannot support the
"Fetch-and-Add" primitive satisfactorily. In this paper, we
present a multistage interconnection network based on the
Omega network that supports the "Fetch-and-Add" instruction
and also provides fault-tolerance. Basically, the approach uses 
4X4 switches as switching elements in a multistage network, 
and uses an extra stage of such switches to enable four 
independent paths to be set up between any source and any 
destination. Conventional approaches to fault tolerance in 
interconnection networks involve using only one of the paths 
at a particular time for propagation of a message, with the 
redundant paths being used only if failure is detected in the 
first path. In our scheme, we propose transmission of four
copies of a message through the network simultaneously, with 
voting being performed on the copies at the memory network 
interface. The design of the switching elements constituting 
the network is described. Properties of the Omega network and 
a rigorous scheduling discipline enforced in the implementation 
of the switches in the proposed network make it possible to 
send four copies of every message synchronously through the 
network. This allows concurrent correction and detection of 
message transmission errors. 


1. INTRODUCTION 


The current interest in large scale parallel processors has 
motivated a large amount of research on multistage intercon- 
nection networks for parallel processing. Buffered interconnec- 
tion networks are an integral component of several machines 
currently under development, including the Cedar machine at 
the University of Illinois [1], the NYU Ultracomputer [2] at 
the New York University and the RP3 at IBM [3]. 


With the advent of VLSI technology, it has become cost-
effective to include a large number of processors and memory 
modules in a multiprocessing system. The hardware system 
organization is determined primarily by the interconnection 
structure used between the memory modules and the proces-
sors. Many multistage interconnection networks have been
proposed such as the baseline [4], delta [5], Generalized Cube 
[6], Omega [7], STARAN flip [8], etc. for parallel/distributed 
systems. The Generalized Cube is representative of these net- 
works in that they are topologically equivalent to it. 


An important criterion for estimating the performance of 
a multiprocessor system is its reliability. To a large extent the 
reliability of the system depends on that of the interconnection 
network. A fault-tolerant interconnection network can
tolerate faults to some degree and still provide reliable com- 
munication between any input-output pair. Some amount of 
redundancy has to be present to achieve fault-tolerance. In 
what is known as information redundancy, error
detecting/correcting codes are used [9]. Such schemes require 
minimal additional hardware but fault-tolerance is limited to 
the data being transferred. Faults in the control portion of the 
network may not be tolerated. Another approach is hardware 
redundancy in which multiple paths are created between the 
inputs and outputs of the network. Multiple paths can be 
created by the addition of an extra stage to the multistage net- 
work as in [10], [11], [12], or by providing redundant links as 
in [13], and [14]. The INDRA network [15] uses a redundant 
stage as well as redundant links.


The schemes proposed so far consider the design of the 
interconnection network at a very high level or on a totally 
theoretical basis. Also, the designs proposed so far are not 
aimed at any specific multiprocessor system. In this paper we 
propose an interconnection network suited for the Research 
Parallel Processor Prototype (RP3) project [3] initiated in the 
IBM Research Division in cooperation with the Ultracomputer 
Project of NYU [2].


The proposed network supports the Fetch-and-Add prim- 
itive of the NYU Ultracomputer. The Fetch-and-Add instruc- 
tion is an interprocessor synchronization operation. It permits 
highly concurrent execution of operating system primitives 
and application programs. The advantage of Fetch-and-Add 
over Test-and-Set and Compare-and-Swap to serialize locking 
and reservation operations in database applications has been 
discussed in [16]. The format of this instruction is F&A(X,a),
where X is an integer variable and a is an integer expression.
This is an indivisible operation and it is defined to return the
old value of X and to replace X by the sum X+a. The
Fetch-and-Add operation follows the serialization principle: If
X is a shared variable and many Fetch-and-Add operations
simultaneously address X, the effect of these operations is as
though they occurred in some (unspecified) serial order.
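
The semantics of F&A(X,a) can be illustrated in software. The sketch below is only a model of the definition and of the serialization principle, using a lock to make the operation indivisible; it says nothing about how the network combines requests, and the class and function names are hypothetical.

```python
import threading

class SharedCell:
    """A shared integer X supporting an indivisible Fetch-and-Add."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, a):
        """Return the old value of X and replace X by X + a, atomically."""
        with self._lock:
            old = self._value
            self._value = old + a
            return old

    def read(self):
        with self._lock:
            return self._value

def run_concurrent(cell, increments):
    """Issue one F&A per thread and collect the values each thread got back."""
    results = [None] * len(increments)

    def worker(k):
        results[k] = cell.fetch_and_add(increments[k])

    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(len(increments))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    x = SharedCell(0)
    print(sorted(run_concurrent(x, [1] * 8)), x.read())
```

When every request adds 1, the returned values are the distinct integers 0 through 7 in some order and the final value is 8, which is what one would see for some serial ordering of the eight operations.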


In Section 3, we present in detail the design of the switch- 
ing elements constituting the proposed interconnection net- 
work. In Section 4 we prove certain properties of the network 
which make it possible to achieve fault-tolerance.


2. OVERVIEW OF PROPOSED NETWORK 


The proposed interconnection network is based on the
Omega network [7]. An Omega network with N inputs and N
outputs consists of $n = \log_B N$ stages of $B \times B$ switching ele-
ments. A $B^n \times B^n$ Omega network is constructed using $B \times B$
switching elements and $B * B^{n-1}$ shuffles interconnecting the
stages. A $P * Q$ shuffle is the permutation of $PQ$ elements
defined as

$$\pi(i) = \left( P\,i + \left\lfloor \frac{i}{Q} \right\rfloor \right) \bmod PQ, \qquad 0 \le i \le PQ-1$$

In the proposed network, 4x4 switching elements are used 
(B=4). By adding an extra stage to the Omega network 
obtained as defined above, redundant paths between the proces- 
sors and memory modules are created. 4x4 switches are used 
instead of 2x2 switches so that four redundant paths can be 
obtained. Hence, the network consists of ($\log_4 N$ + 1) stages. It has been
proven in [17] that the four paths created are unique, i.e., they
use independent links and independent switching elements at
every stage of the network except at stage 0 and stage n (n =
$\log_4 N$) of the network. To make the first and last stages also
fault-tolerant, the 4x4 switch can be modified as explained in 
Sections 3.2 and 3.3. In our fault model, a fault constitutes: 


(1) Complete failure of one switching element.

(2) Failure of one link between any two stages of the network.


Conventional approaches to fault-tolerance in intercon- 
nection networks [10, 13, 14,17] involve the use of disjoint 
redundant paths between any source/destination pair. Once a 
failure of a switch/link in the path of a message is detected, an 
alternate path is selected. The mechanism of detection of a 
failure can be performed either off-line through diagnosis [4] or 
on-line by using error correcting codes [9]. Off-line diagnosis is 
not applicable in case of transient or intermittent failures 
which have been shown to occur more frequently than 
permanent failures [18,19]. An on-line fault 
detection/location mechanism may use coding techniques. 
Simple algebraic codes such as Hamming codes or cyclic codes 
[20] can be used for detecting errors in the address portion X 
of a message (X,a) but not for the data portion a because the
Fetch-and-Add primitive requires arithmetic operation to be 
performed on the data portion of the message. For the data 
portion, arithmetic codes such as the residue codes and the AN 
codes would have to be used [20]. However, these codes may 
be used for detecting errors but not for correcting them. Even 
if the codes are used only for error detection, implementation 
of the encoding/decoding circuitry is very complex. Besides, 
such circuits would have to be designed to be totally self- 
checking. Considering that such circuits would have to be pro- 
vided in each switching element of the network, it is not cost- 
efficient to adopt this scheme. A less costly technique would be 
to place the encoding circuitry at the PNI and the decoding cir- 
cuitry at the MNI. However, in this technique, once an error is 
detected, the message has to be sent through an alternate path 
and it may be impossible to recover from the deleterious effects 
produced by the erroneous message. Consider what happens if 
message (X,a) changes to (Y,c). This could have several
effects: 


(1) A wrong value would be returned to the processor from
which (X,a) originated.

(2) Memory location X will not be updated.

(3) Memory location Y will be written into when it should
not have been written into.

(4) If there is a message (Y,b) also propagating through the
network, combining of (X,a) (changed to (Y,c)) and
(Y,b) may take place when no combining should have
taken place. As a result, a wrong value would be
returned to the processor from which (Y,b) originated
and a wrong value would be written into memory loca-
tion Y.

(5) In case there are several messages propagating through the
network with destination Y, the effect of combining the
erroneous message (X,a) (changed to (Y,c)) with all
these messages may be disastrous. Recovery from the
effects of combining in that case may be extremely
difficult or even impossible since combining of the errone-
ous message (X,a) with non-erroneous messages may
have taken place at various stages of the network.


Thus, due to a single fault, several errors may be gen- 
erated. In the proposed scheme, all four paths are used to send 
four copies of a message simultaneously through the network. 
Voting is carried out on the four copies in the Memory Net- 
work Interface. This enables correction of message 
transmission errors due to any number of faults along a single 
path, or, detection of message transmission errors due to any 
number of faults along two paths, as messages propagate 
through the network.
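
The correct-one, detect-two behaviour of voting on four copies can be sketched as follows. This is an illustration under an assumed decision rule, not the voter design of the paper: a value is accepted when at least three of the four copies agree, and an error is signalled otherwise. The function name and the status strings are hypothetical, and messages are modelled as (address, data) tuples.

```python
from collections import Counter

def vote(copies):
    """Vote on four copies of a message; return (status, message or None)."""
    if len(copies) != 4:
        raise ValueError("exactly four copies are expected")
    message, count = Counter(copies).most_common(1)[0]
    if count == 4:
        return "ok", message          # all paths delivered the same message
    if count == 3:
        return "corrected", message   # faults confined to a single path
    return "error", None              # faults on two or more paths: detect only

if __name__ == "__main__":
    good = (5, 1)  # message (X, a) with X=5, a=1
    print(vote([good, good, good, good]))
    print(vote([good, good, (9, 9), good]))
    print(vote([good, (7, 3), (9, 9), good]))
```

If faults lie along one path, the three fault-free copies still agree and the message is recovered; if faults lie along two paths, at most two copies agree, so no value reaches the threshold and the error is detected rather than silently accepted.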


3. DETAILED NETWORK DESIGN 


This section describes in detail the design of the switching 
elements constituting the network. First, message transfers
within and between stages are described for stages 1 through
(n-1) and then for stages 0 and n.


3.1. Stages 1 through (n-1)


Each switch has four input and four output ports. One 
queue is associated with each input-output port pair. Hence, 
there are four IN queues associated with each output port 
(Fig.1). These are labeled as INO, IN1, IN2, and IN3. INO is 
connected to input port 0, IN1 to port 1 and so on. When a 
message arrives at an input port it is first determined which 
output port the message is bound for by decoding the relevant 
two bits in the destination address of the message. Then the 
message is placed in the appropriate IN queue of the particular 
output port. Messages are sent to the next stage of the net- 
work depending upon their priority. The priority is esta- 
blished by following a round-robin discipline, starting with 
the queue having the smallest label. This round-robin schedul- 
ing scheme is illustrated by the following example. 


EXAMPLE 1: Messages a, b, c, d, e, f, g, h, i, j arrive
at various input ports of a switch over a period of five cycles
and are routed to output port p as shown in Fig.2. For the
purpose of this example assume that none of these messages
can be combined. Messages a, b and c arrive in the same cycle
at input ports 0, 2 and 3 respectively. Suppose this is cycle C.
Messages d and e arrive in cycle C+1 at ports 1 and 3 respec-
tively. In cycle C+2 messages f, g, h, and i arrive at input
ports 0, 1, 2, and 3 respectively and are routed to output port
p. In cycle C+3 message j, arriving at input port 3, is routed
to output port p. These messages are sent to the next stage in
ten cycles in the order a, b, c, d, e, f, g, h, i, and j. Note
that after message c is sent to the next stage, message d is sent
to the next stage and not message f. Any blank slots in the IN
queues are skipped while following the round-robin scheduling
of messages.
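
The ordering in Example 1 can be reproduced with a few lines of code. The sketch below is a simplified model written for illustration, with hypothetical names: it ignores combining and timing, places each message in the IN queue of its input port at a level given by its arrival cycle, and then scans level by level in increasing queue label, skipping blank slots.

```python
def round_robin_order(arrivals, ports=4):
    """arrivals: iterable of (cycle, input_port, message) for one output port.

    Returns the messages in the order they would leave under the simplified
    round-robin reading described above.
    """
    levels = {}
    for cycle, port, message in arrivals:
        # One slot per IN queue at each arrival cycle; None marks a blank slot.
        levels.setdefault(cycle, [None] * ports)[port] = message
    order = []
    for cycle in sorted(levels):
        for slot in levels[cycle]:      # IN0, IN1, IN2, IN3
            if slot is not None:        # blank slots are skipped
                order.append(slot)
    return order

if __name__ == "__main__":
    example_1 = [
        (0, 0, "a"), (0, 2, "b"), (0, 3, "c"),               # cycle C
        (1, 1, "d"), (1, 3, "e"),                            # cycle C+1
        (2, 0, "f"), (2, 1, "g"), (2, 2, "h"), (2, 3, "i"),  # cycle C+2
        (3, 3, "j"),                                         # cycle C+3
    ]
    print("".join(round_robin_order(example_1)))
```

Note that d leaves before f even though f sits in the lowest-numbered queue, because the slot of IN0 at the level of d is blank and is skipped.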


* The possibility of there being several requests to the same memory loca-
tion simultaneously has been discussed in [21].
Fig. 1. A 4x4 Switch in an Intermediate Stage of the Network

Fig. 2. Round-Robin Scheduling Discipline at Output Port of a Switch in the Network


Implementation of the correct ordering of the messages 
and possible combining of messages is based on the scheme pro- 
posed in [22]. For the benefit of the reader, the scheme pro- 
posed in [22] is now described briefly. The scheme has been 
used for a network consisting of 2x2 switching elements called 
Ultraswitches. Three columns of shift registers called the IN 
column, the OUT column and the CHUTE column are associ- 
ated with each output port of a switch. These columns of shift 
registers are connected as shown in Fig.3. Messages arrive at 
the IN column and shift up one position at each cycle. Simi- 
larly, messages shift down one position in the OUT column at 
each cycle. If a message in the IN column is adjacent to a slot 
on the OUT column that is empty, then it shifts to that slot. In 
addition, this scheme detects a message in the IN queue going to
the same address as another message already in the OUT queue.
The message in the IN queue is then placed in the CHUTE
column. The two messages move synchronously and arrive at
the combine logic simultaneously. The combine logic detects
the possibility of combining and combines the two messages so
that only one message is sent to the next stage of the network.


Fig. 3. Design of the Output Port of the Ultraswitch


Fig. 4. Design of the Output Port of a Switching Element in the Proposed Scheme


In the proposed scheme, at each output port there are four
IN queues labeled 0, 1, 2, and 3 as described earlier. In addi-
tion, at each output port there is one OUT queue and one
CHUTE queue. These queues are connected as shown in Fig.4.
Possible movements of messages are indicated by the arrows.
Control signals have been omitted from the figure for the sake
of clarity. A message arriving at an input port is placed in the
appropriate IN queue of an output port at level 0. Messages in
the IN queues then move to the OUT queue, to the CHUTE (if
combining is possible), or one level up in the IN queue. The
movements of messages in one cycle can be described by divid-
ing the cycle into three phases:

Phase One: The messages at level 0 of the OUT queue and the
CHUTE are sent to the combine logic. The function of the com-
bine logic is the same as in the scheme proposed in [22]. Mes-
sages arriving at the four input ports are placed at level 0 of
the appropriate IN queues. Messages already in the IN queues 
move to the OUT queue, the CHUTE or the next level up in the 
IN queue. The movement of messages is decided according to 
the following conditions: 


(1) If the OUT queue slot is empty at any level, the message 
at that level in the IN queue with the smallest label moves
into the OUT queue at that level. The other messages at 
that level move up one level. 


(2) At every level, if the OUT queue slot is full and the
CHUTE slot is empty, then the messages in the IN queues
are compared with the message in the OUT queue at that
level to find out whether any of them can be combined
with the message in the OUT queue. If none of the
messages in the IN queues can be combined with the
message in the OUT queue at that level, then all messages
in the IN queues at that level move up one level. If
exactly one of the messages can be combined, that message
is moved to the CHUTE at that level, and the rest of the
messages at that level move up one level. If more than one
message can combine with the message in the OUT queue,
then among these the message in the IN queue with the
smallest label is moved to the CHUTE at that level, and
the rest of the messages at that level move up one level.


(3) If the OUT queue slot is full at a level and the CHUTE
slot is also full at that level, then all messages at that
level move up one level.
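The three conditions above can be sketched as a decision made independently at each level. This is only an illustrative sketch: the function name `decide_level`, the move labels "OUT", "CHUTE", and "UP", and the `can_combine` predicate are hypothetical, since the text does not say how combinability is tested.

```python
def decide_level(in_slots, out_slot, chute_slot, can_combine):
    """Decide the moves at one level of an output port.

    in_slots    -- messages (or None) held by IN queues 0..3 at this level
    out_slot    -- message (or None) in the OUT queue at this level
    chute_slot  -- message (or None) in the CHUTE at this level
    can_combine -- hypothetical predicate (in_message, out_message) -> bool
    Returns a dict mapping IN queue label -> "OUT", "CHUTE", or "UP".
    """
    waiting = [q for q, msg in enumerate(in_slots) if msg is not None]
    moves = {q: "UP" for q in waiting}      # default: move up one level
    if not waiting:
        return moves
    if out_slot is None:
        # Condition (1): the smallest label takes the empty OUT slot.
        moves[waiting[0]] = "OUT"
    elif chute_slot is None:
        # Condition (2): the smallest-label combinable message goes to the CHUTE.
        for q in waiting:
            if can_combine(in_slots[q], out_slot):
                moves[q] = "CHUTE"
                break
    # Condition (3): OUT and CHUTE both full, so every message keeps "UP".
    return moves
```

In phase three, where upward movement is not allowed, a caller would interpret "UP" as "stay at the current level".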


Phase Two: Messages in the OUT queue and the CHUTE do
not move. Messages in the IN queues move to the OUT queue,
the CHUTE, or the next level up in the IN queue. Movement of
the messages in the IN queues is decided according to the
conditions described for phase one.


Phase Three: Messages in the OUT queue and the CHUTE do
not move. Messages in the IN queues may move only within
the level they are in; they cannot move to a higher level. If
any message in an IN queue can be combined or moved to the
OUT queue, it is moved to the CHUTE or the OUT queue at its
own level.


EXAMPLE 2: The movement of messages in the three
phases is illustrated in Figs. 5a and 5b. During four clock
cycles, messages a, b, c, d, e, f, g, h, i, and j reach an
output port p. Assume that none of these messages can be
combined; it is easy to extend the example to the case where
combining is possible. Hence, only the IN queues and the OUT
queue at output port p are shown. Messages a, b, c, d arrive
at the same time at the four input ports of the switch and are
all bound for output port p. They are placed in the four IN
queues associated with output port p at level 0, in phase one
of cycle one. In phase two of cycle one, message a moves to
the OUT queue, and messages b, c, d move up one level. In
phase three of cycle one, message b moves to the OUT queue.
Messages c and d do not move. This is the end of the first
cycle. In phase one of the second cycle, message a is sent to
the next stage; message b moves one level down in the OUT
queue; message c moves to the OUT queue; message d moves up
one level; messages e and f at input ports 0 and 3 of the
switch are placed in IN queues 0 and 3 at level 0. Further
movements of messages are also depicted in Fig. 5. □

Fig. 5a. Movement of Messages in the Three Phases of a Cycle.

Fig. 5b. Movement of Messages in the Three Phases of a Cycle.
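The first steps of Example 2 can be replayed with a small simulation of one output port. This is a sketch under stated assumptions, not the authors' hardware design: combining is left out (so there is no CHUTE), the queue depth `DEPTH` is an arbitrary choice because the text gives none, and the class and method names are hypothetical.

```python
NUM_IN_QUEUES = 4   # IN queues are labeled 0..3, as in the text
DEPTH = 8           # queue depth is an assumption; the text gives none


class OutputPort:
    """One output port without combining (so the CHUTE is omitted)."""

    def __init__(self):
        # in_q[label][level] and out_q[level] hold a message name or None.
        self.in_q = [[None] * DEPTH for _ in range(NUM_IN_QUEUES)]
        self.out_q = [None] * DEPTH

    def _move_in_queues(self, allow_up):
        """Apply condition (1) at every level; losers move up or stay put."""
        new_in = [[None] * DEPTH for _ in range(NUM_IN_QUEUES)]
        for level in range(DEPTH):
            waiting = [q for q in range(NUM_IN_QUEUES)
                       if self.in_q[q][level] is not None]
            if waiting and self.out_q[level] is None:
                winner = waiting.pop(0)          # smallest label wins
                self.out_q[level] = self.in_q[winner][level]
            target = level + 1 if allow_up else level
            if waiting and target >= DEPTH:
                raise OverflowError("assumed queue depth exceeded")
            for q in waiting:
                new_in[q][target] = self.in_q[q][level]
        self.in_q = new_in

    def phase_one(self, arrivals):
        """Send OUT level 0 onward, shift OUT down, move IN, accept arrivals."""
        sent = self.out_q[0]
        self.out_q = self.out_q[1:] + [None]
        self._move_in_queues(allow_up=True)
        for label, message in arrivals.items():
            self.in_q[label][0] = message
        return sent

    def phase_two(self):
        self._move_in_queues(allow_up=True)

    def phase_three(self):
        self._move_in_queues(allow_up=False)

    def cycle(self, arrivals):
        """Run the three phases; return the message sent to the next stage."""
        sent = self.phase_one(arrivals)
        self.phase_two()
        self.phase_three()
        return sent


port = OutputPort()
port.cycle({0: "a", 1: "b", 2: "c", 3: "d"})
print(port.out_q[:2])   # OUT levels 0 and 1 after cycle one: ['a', 'b']
```

After cycle one, c and d wait at level 1 of IN queues 2 and 3, which matches the description in the example.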


Synchronization of the movement of messages in the
three phases can be achieved by using a suitable clock. When
different switches receive clock signals by different paths, they
should receive clocking events at the same time. Synchronization
errors due to clock skews can be avoided by lowering
clock rates and/or adding delay to the circuits. Another way
is to take advantage of the propagation delay down a long wire
by having several clock cycles in progress along its length.
This behaviour can be simulated by replacing long wires with
strings of buffers. This will restore signal levels and prevent
backward noise propagation [23].


3.2. Stage 0 


Stage zero is the extra stage of the network. Logically,
stage zero switches have four inputs and four outputs. A
message arriving at any input port in this stage is broadcast to
all four output ports of the switch. Thus a message arriving at
input port 2 of a switch in stage zero is put into IN queue 2 of
each output port of the switch. It is then sent to stage one as
described above.


Physically, the replication of messages can be done by the
Processor Network Interface (PNI) logic. Thus stage zero
receives 4N inputs and consists of N 4x1 switches, i.e., one
switch for each output of stage zero. The four IN queues
associated with each output port are in one switch. Thus, each
switch has four IN queues, one OUT queue, and one CHUTE
queue, as shown in Fig. 6. The four copies of a message enter
stage zero at different 4x1 switches and are placed in the
appropriate IN queues. Movement of messages and any possible
combining can go on as in stages 1 through n-1. Essentially,
the 4x4 switch has been broken up into four 4x1
switches so that each is on a separate chip. Failure of any one
of these chips can be tolerated.

Fig. 6. Physical Implementation of a 4x4 Switch in Stage 0 of the Network.


[Figure graphics: voters for output ports 1 to 3; COMPARE logic with CHUTE, IN, and OUT queues.]

Fig. 8. Output Port of a Switch in Stage n of the Network. 


3.3. Stage n 


Logically, each switch in the last stage of the network
(stage n) receives the four copies of a message, one copy
arriving at each input port. The topology of the network and the
scheduling of the movement of messages guarantee that the
four copies arrive simultaneously at the last stage. Voting is
carried out on the four copies. If at most one copy is
erroneous, the correct message is routed to the appropriate output
port. If two copies are erroneous, the error is detected.
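The voting rule can be illustrated with a short sketch. It assumes that copies are compared for exact equality and that "at most one erroneous copy" corresponds to at least three identical copies; the function name `vote` and the status strings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def vote(copies):
    """Return (message, status) for the four copies arriving at stage n."""
    if len(copies) != 4:
        raise ValueError("expected exactly four copies")
    value, count = Counter(copies).most_common(1)[0]
    if count == 4:
        return value, "ok"            # all copies agree
    if count == 3:
        return value, "corrected"     # one erroneous copy is outvoted
    return None, "error detected"     # no three copies agree


print(vote(["m", "m", "x", "m"]))     # ('m', 'corrected')
```

With two erroneous copies no value reaches three votes, so the sketch reports a detected error instead of delivering a message.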


In the physical design of the network, voting can be
carried out by the Memory Network Interface (MNI). Each
switch in the last stage has one input and four outputs. The
physical implementation of a logical 4x4 switch in the last
stage is depicted in Fig. 7. There are N such 1x4 switches. In
each switch, there are four OUT queues, one for each output
port. Each output port of a switch has one OUT queue, one IN
queue, and one CHUTE queue. These queues are connected as
shown in Fig. 8. The four copies of a message arrive at four
1x4 switches. A message arriving at the input port of a switch
is placed in one of the IN queues depending upon the address of
its destination. Two messages in the same queue may combine
if their destination addresses within a module are the same. Each
output of the N 1x4 switches is hardwired to a voting unit in
the MNI. There are N such voting units, each being hardwired
to four outputs of stage n on the network side and to one
memory module on the other side. The MNI receives the four
copies, carries out the voting, and routes the message to the
appropriate memory module.


4. NETWORK PROPERTIES 


In this section we will show that the four copies of every
message move synchronously through the network. This will
be proved by making use of the topology of the network and
the scheduling discipline described in Section 3. For the
purpose of this section, the switches in stages 0 and (n-1) will be
considered to be 4x4 switches.


NOTATION: To distinguish between the four copies of a
message, we introduce the notation $(X,a)^0$, $(X,a)^1$, $(X,a)^2$,
$(X,a)^3$ to denote the four copies of a message $(X,a)$.


NOTATION: A switch i in stage s of the network will be
denoted by $(s,i)$.

LEMMA 1: All copies of a message enter different switches
in stages 1 through (n-1) at the same input port of the
switches.

PROOF: The inputs and outputs of all the stages are
labeled from 0 to N-1 in binary as m-bit addresses
($m = \log_2 N$). Suppose a message originates at processor

$a_0 a_1 a_2 \ldots a_{m-2} a_{m-1}$

and the address of its destination is

$x_0 x_1 x_2 \ldots x_{m-2} x_{m-1}$.

The path of the message through the network is traced in Table
1. It can be seen from Table 1 that for every stage from 1
through (n-1) of the network, the two least significant bits of
the address in the input port address column are the same for all
the copies. Since these two bits specify the input port within a
switch, the four copies of a message do enter at the same input
port of different switches in the same stage. □


LEMMA 2: In an intermediate stage, if the four copies of a
message $(X,a)$ enter a stage at switches $(s,i)$, $(s,j)$, $(s,k)$, and
$(s,l)$ at port p and one copy of a message $(Y,b)$ enters switch
$(s,i)$ at port q, then the other copies of $(Y,b)$ must enter at
switches $(s,j)$, $(s,k)$, and $(s,l)$ at port q.

[Table 1: columns "Address of Input Port" and "Address of Output Port".]

Fig. 9. Tracing the Path of a Message from the Address of its Source and Destination.


PROOF: This can be proved by first determining the
condition under which two messages originating from different
processors will meet at a switch in a stage s. Consider a standard
Omega network (no extra stage). Suppose a message originates
at processor

$a_0 a_1 a_2 \ldots a_{m-2} a_{m-1}$

and the address of its destination is

$x_0 x_1 x_2 \ldots x_{m-2} x_{m-1}$.

The path of the message can be traced by selecting an m-bit
window at each stage from the concatenated source and
destination address bits (Fig. 9). At successive stages, the window is
shifted right by two bits. For two messages to meet at a stage,
the most significant m-2 bits in the windows of the two
messages for that stage must match. These bits specify the switch
at which the two messages meet in the stage. From Fig. 9 and
Table 1 ("Address of Output Port" column) it can be seen that
when an extra stage is added to the network, the m-bit
window is chosen from

$a_0 a_1 \ldots a_{m-1} \; {**} \; x_0 x_1 \ldots x_{m-1}$

in any intermediate stage, where ** is 00, 01, 10, or 11. The
four windows corresponding to the four values of ** identify
the particular copy of a message. Thus, if one copy of each of two
messages meets at a switch, the other copies must also meet. □
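The window argument can be tried out on small addresses. The sketch below covers only the standard Omega network (no extra stage) and represents addresses as bit strings; the function names `window` and `meet` are hypothetical, and the indexing convention (shift k, for k >= 1, gives the output-port address after the k-th switching stage) is an assumption, since Fig. 9 is not available here.

```python
def window(src, dst, shift):
    """m-bit window over src+dst after moving right by 2 bits per shift."""
    m = len(src)
    bits = src + dst
    return bits[2 * shift: 2 * shift + m]


def meet(msg1, msg2, shift):
    """True if the top m-2 window bits agree, i.e. both use the same switch."""
    w1 = window(msg1[0], msg1[1], shift)
    w2 = window(msg2[0], msg2[1], shift)
    return w1[:-2] == w2[:-2]


# N = 16 (m = 4): source 0110 to destination 1011.
print(window("0110", "1011", 1))   # 1010
print(window("0110", "1011", 2))   # 1011, the destination address
```

Two messages whose windows share the top m-2 bits use the same switch at that stage; the last two bits then select the output port within it.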


LEMMA 3: The four copies of every message exit stage 0
of the network simultaneously.


PROOF: From the design of stage 0 switches discussed in
Section 3.2 we know that for a message $(X,a)$ arriving at an
input port p of switch $(0,i)$, its copies $(X,a)^0$, $(X,a)^1$, $(X,a)^2$,
and $(X,a)^3$ will be placed in IN queue p at each output port of
switch $(0,i)$ at the same time. The round-robin scheduling
discipline ensures that all the copies have the same priority in
being sent to the OUT queue or to the CHUTE of the respective
output ports. Hence, they will exit stage zero and enter stage
one simultaneously. □


LEMMA 4: If there is a message at level l of the OUT
queue at an output port, then there will be messages at all
levels less than l in the OUT queue.


PROOF: This is proved by considering the movement of
messages in the IN queues during the three phases of a cycle as
described in Section 3.1. The following points are noted about
the messages in the IN queues:


(1) In all phases of a cycle, messages can move to the OUT
queue or to the CHUTE.


(2) In phases 1 and 2, messages may move to a higher
level in the IN queue. However, movement of a message
to the OUT queue or to the CHUTE is given priority over
moving to a higher level. Thus, if a message can move to
the OUT queue, it is sent to the OUT queue instead of one
level higher in the IN queue.


(3) During phase 3, messages may not move to a higher level.
As a result, when messages in the OUT queue and the
CHUTE move in phase 1 of the next cycle, an empty slot
between two messages in an OUT queue will never be
created.


Hence, if l is the highest level in the OUT queue in which
there is a message, then there are messages in all levels from 0
through (l-1) of the OUT queue. □


THEOREM 1: All copies of a message enter/exit any stage
between 1 and (n-1) of the network simultaneously.


PROOF: We prove the result by induction on the stage
number at which copy 0 of a message enters. Consider a
message $(X,a)$ entering stage 1 at switches $(1,i)$, $(1,j)$, $(1,k)$, and
$(1,l)$. From Lemma 3 we know that the four copies of $(X,a)$
enter stage 1 of the network simultaneously. Suppose $(X,a)^0$
enters stage 1 at switch $(1,i)$ at port p at time t. From Lemma
1 we know that the other copies of $(X,a)$ will enter switches
$(1,j)$, $(1,k)$, and $(1,l)$ at port p. Since the destination of all the
copies is the same, all copies will be routed to the same output
port, say q. $(X,a)^0$, $(X,a)^1$, $(X,a)^2$, and $(X,a)^3$ will be
placed in IN queue p at output port q of switches
$(1,i)$, $(1,j)$, $(1,k)$, and $(1,l)$ simultaneously. Since all copies are
in IN queues with the same label, they have equal priority in
being sent to the OUT queue or to the CHUTE.


From Lemma 4 we know that there are no empty slots
between messages in an OUT queue. Hence, one message can be
sent to the next stage in every cycle if there is any message at
an output port. Suppose at the time when $(X,a)^0$ arrives at
$(1,i)$ there are d messages waiting in the IN queues and the
OUT queue at output port q. Suppose none of these d
messages can be combined with each other. (The case when
combining is possible is considered separately in Theorem 2.) If
$(X,a)^0$ does not combine with any of the d messages, then
$(X,a)^0$ will experience a delay of d cycles before it is sent to
the next stage. From Lemma 1 and Lemma 2 we know that if
there are d messages in the IN queues and the OUT queue at
output port q of switch $(1,i)$, then there must be d messages
at output port q of switches $(1,j)$, $(1,k)$, and $(1,l)$. Hence, all
copies of $(X,a)$ will experience a delay of d cycles before
being sent to the next stage.


Thus, in all stages, all copies move synchronously, i.e.,
delays for all copies of a message in any stage will be the same. □


THEOREM 2: If one copy of a message combines with
another message, then all the other copies of the two messages
are guaranteed to combine.


PROOF: It has been proved in Theorem 1 that all copies of
a message are placed at level 0 of the IN queues with the same
label. It has also been proved that the four copies are placed in
the IN queues simultaneously. Initially, when all queues are
empty, if one copy of a message in the IN queue moves to the
OUT queue, the other copies would also move to the OUT
queue. When there are messages waiting in the OUT queue, if
one copy of a message moves up one level in the IN queue, the
other copies would also move up one level. Hence, all copies
are always at the same level of the IN queue or the OUT
queue. All copies of a message in the IN queue will be
compared to copies of the same message in the OUT queue. Hence,
if one copy of a message moves to the CHUTE, all the copies
must also move to the CHUTE at the same time. So if one
copy of a message combines, then all other copies of the
message must combine. □


5. CONCLUSION 


Clearly, by sending four copies of a message instead of 
one, we have increased the network traffic. It may seem that 
this would greatly degrade the performance of the network. 
However, it should be noted that in the IBM RP3 computer the 
Fetch-and-Add instructions are routed to the combining net- 
work and all others to the non-combining network. Of all the 
requests generated by a processor, it has been experimentally 
determined that the percentage of Fetch-and-Add instructions 
is less than 25% [24]. Hence the network traffic is not very 
high when a single copy of every message is sent through the 
network and increasing the traffic by a factor of four is 
justified. 

The design of a fault-tolerant multistage interconnection 
network based on the Omega network has been presented in 
this paper. The design of the switching elements constituting 
the network was described in detail. It supports the Fetch- 
and-Add instruction of the NYU Ultracomputer and the IBM 
RP3 computer. It also provides concurrent correction and 
detection of message transmission errors. The only additional 
hardware needed is an extra stage. Compared to the Omega 
network, this amounts to N/4 extra switching elements if 
there are N processors. Some properties of the proposed network
were proven that are also applicable to the Omega
network. The proposed design considers message transmission
from the processors to the memory modules. It is assumed
that a similar network exists for the return path.


REFERENCES 


[1] D. Gajski et al., “Cedar Construction of a Large Scale
Multiprocessor,” Report No. UIUCDCS-R-83-1123,
Department of Computer Science, University of Illinois,
Urbana, February 1983.

[2] A. Gottlieb et al., “The NYU Ultracomputer-Designing
an MIMD Shared Memory Parallel Computer,” IEEE
Trans. Comput., vol. C-32, pp. 175-189, February 1983.

[3] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey,
W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A.
Norton, and J. Weiss, “The IBM Research Parallel Processor
Prototype (RP3): Introduction and Architecture,”
1985 International Conference on Parallel Processing,
pp. 764-771, 1985.

[4] C. Wu and T. Feng, “On a Class of Multistage
Interconnection Networks,” IEEE Trans. Comput., vol. C-29,
pp. 694-702, Aug. 1980.

[5] J. H. Patel, “Performance of Processor Memory
Interconnections for Multiprocessors,” IEEE Trans. Comput.,
vol. C-30, pp. 771-780, Oct. 1981.

[6] H. J. Siegel and S. D. Smith, “Study of Multistage SIMD
Interconnection Networks,” Proc. 5th Symp. Comput.
Architecture, pp. 223-229, April 1978.

[7] D. H. Lawrie, “Access and Alignment of Data in an
Array Processor,” IEEE Trans. Comput., vol. C-24,
pp. 1145-1155, December 1975.

[8] K. E. Batcher, “The Flip Network in STARAN,” Proc.
1976 International Conf. on Parallel Processing,
pp. 65-71, Aug. 1976.

[9] J. E. Lilienkamp, D. H. Lawrie, and P. Yew, “A Fault
Tolerant Interconnection Network Using Error Correcting
Codes,” Proc. International Conf. on Parallel Processing,
pp. 123-125, 1982.

[10] G. B. Adams, III and H. J. Siegel, “The Extra Stage Cube:
A Fault Tolerant Interconnection Network for
Supersystems,” IEEE Trans. Comput., vol. C-31, May 1982.

[11] C. L. Wu, T. Y. Feng, and M. C. Lin, “STAR: A Local
Network System for Real-Time Management of Imagery
Data,” IEEE Trans. Comput., vol. C-31, pp. 923-933,
Oct. 1982.

[12] R. J. McMillen and H. J. Siegel, “Routing Schemes for
Augmented Data Manipulator Network in an MIMD
System,” IEEE Trans. Comput., vol. C-31, pp. 1202-1214,
Dec. 1982.

[13] N-F Tzeng, P-C Yew, and C-Q Zhu, “A Fault Tolerant
Scheme For Multistage Interconnection Networks,” 12th
Computer Architecture Conference, pp. 368-375, June
1985.

[14] V. P. Kumar and S. M. Reddy, “Design and Analysis of
Fault-Tolerant Multistage Interconnection Networks
with Low Link Complexity,” Proc. 12th Comp. Arch.
Conf., pp. 376-386, 1985.

[15] C. S. Raghavendra and A. Varma, “INDRA: A Class of
Interconnection Networks with Redundant Paths,” Real
Time Systems Symposium, pp. 153-165, May 1984.

[16] H. S. Stone, “Database Applications of the
FETCH-AND-ADD Instruction,” IEEE Trans. Comput., vol. C-33,
pp. 604-612, July 1984.

[17] K. Padmanabhan and D. H. Lawrie, “A Class of
Redundant Path Multistage Interconnection Networks,” IEEE
Trans. Comput., vol. C-32, pp. 1099-1108, December
1983.

[18] R. K. Iyer and D. J. Rossetti, “Permanent CPU Errors
and System Activity: Measurement and Modelling,” in
Proc. Real-Time Systems Symp., 1983.

[19] X. Castillo, S. R. McConnel, and D. P. Siewiorek,
“Derivation and Calibration of a Transient Error
Reliability Model,” IEEE Trans. Comput., vol. C-31,
pp. 658-671, July 1982.

[20] J. Wakerly, Error Detecting Codes, Self-Checking
Circuits and Applications. New York, New York:
Elsevier North Holland Inc., 1978.

[21] G. F. Pfister and V. A. Norton, “Hot Spot Contention
and Combining in Multistage Interconnection Networks,”
1985 International Conference on Parallel Processing,
pp. 790-797, 1985.

[22] M. Snir and J. Solworth, “The Ultraswitch-A VLSI
Network Node for Parallel Processing,” Courant Institute,
NYU, NY, Ultracomputer Note 39, 1982.

[23] A. L. Fisher and H. T. Kung, “Synchronizing Large
Systolic Arrays,” Real Time Signal Processing, vol. 341,
pp. 44-52, 1982.

[24] A. Gottlieb, Private Communication.




ANALYSIS OF A KIND OF FAULT-TOLERANT INTERCONNECTION NETWORK 


Lan Jin 


Department of Electrical Engineering 
The Pennsylvania State University 
University Park, PA 16802 


Abstract -- With the aim of designing a highly
available distributed computer system, a kind of
interconnection network with mixed static and dynamic
topologies has been proposed and developed.
This paper gives a quantitative analysis of the
fault-tolerance performance of the network, including
the processor connectivity and the worst-case
diameter. The method used in the analysis is constructive
in nature, so that the result not only
shows a higher fault-tolerance capability of the
proposed interconnection network than that of other
existing schemes, but also may serve as the basis
of a distributed fault-tolerant routing algorithm
with low space and time overheads.


INTRODUCTION 


A distributed computer system may be judged
by many different criteria. Reliability and availability
are among the attributes of the system that deserve
the most emphasis. With the aim of designing a highly
available distributed computer system, a kind of
interconnection network with mixed static and dynamic
topologies has been proposed and implemented
[1] - [2]. This network connects a great number of
geographically dispersed processors relying on the
principle of static topology, which is characterized
by point-to-point links between processor nodes
[3]. So any two processors which are not directly
connected by a link need to have their messages
relayed by intervening processors. However, in contrast
to the ordinary interconnection network
with static topology, the proposed network does not
use fixed, passive, and dedicated links between
processors, but rather establishes these links
through a set of explicit switching elements. Therefore,
the system can be dynamically reconfigured
by properly setting these switching elements to
make the communication links active, changeable, and
sharable among a number of processor pairs. Introduction
of dynamic topology into the static interconnection
network significantly enhances its performance,
especially the fault-tolerance capability.

As summarized in [4], fault-tolerance is an
inherent characteristic of the interconnection network.
Fault-tolerance can be achieved only if the
network can facilitate multiple-path, multiple-pass,
and fault-tolerant switching elements. Redundant
communication paths are allowed in most static
interconnection topologies. Some static interconnection
networks can tolerate multiple node/link failures
or failures of subnetworks into which the
network is partitionable. More attention has been
paid to the improvement of fault-tolerance performance
of dynamic interconnection topology.
Extra-stage, multiple routing-pass, redundant link,
error-correcting code, and fault-tolerant switching
elements are various schemes which have been proposed.
To take advantage of both static and dynamic
topologies, it is, therefore, reasonable to combine
them in developing a new kind of fault-tolerant
interconnection network.

Compared with similar fault-tolerant networks
most recently published in the literature [5] - [6],
the network analyzed in this paper differs in many
special features, such as low connection complexity,
high fault-tolerance, short diameter, ease of
routing, as well as suitability for distributed
computer architecture. All these are general
requirements which should be satisfied by a
fault-tolerant interconnection network.

In this paper, after a brief description of
the proposed interconnection network in the next
section, a qualitative analysis of its fault tolerance
will be given in the third section. This
analysis contains a few typical examples which may
help in deriving a quantitative evaluation of the
network in the fourth section. The method of analysis
adopted is constructive in nature, so that the proof
of the theorem directly implies the basic idea of a
routing strategy on which a distributed fault-tolerant
routing algorithm with dynamic reconfiguration of
the system could be based. Finally, as an evaluation
of the proposed network, a comparison of
its fault-tolerance capability with that of
other existing networks will be given.


DESCRIPTION OF THE MIXED GROUP-SHUFFLE 
INTERCONNECTION NETWORK 


An interconnection network with mixed static and
dynamic topologies combines the idea of interconnecting
geographically dispersed processors through
point-to-point links and the idea of sharing these
links among processors through dynamically
reconfigurable switching elements. Such an interconnection
network may be constructed on the basis
of any existing multi-stage dynamic interconnection
network by distributing the switching elements
to the processors. In this way, the resulting
mixed interconnection network becomes more adaptable
to matching the distributed computer system
architecture. The dynamic interconnection network
which has been chosen as the basis of implementation
in our work of developing a highly available
distributed computer system is the multi-stage
shuffle-exchange network [4].

The resultant network developed in this way
is called the mixed group-shuffle interconnection
network due to the following structural principle.
It consists of m stages, each containing $r^m$ processors,
where r is an integer denoting the number of
processors in each group. Each processor can be
labeled by the combination of two numbers C.P,
where C is the stage number for $0 \le C \le m-1$ with
C = 0 corresponding to the leftmost stage, and P
is the processor number within the stage for
$0 \le P \le r^m - 1$ with P = 0 corresponding to the top row.
The integer r is also the radix in which the
processor number P can be represented as

$P = P_{m-1} P_{m-2} \ldots P_1 P_0, \quad 0 \le P_i \le r-1, \quad 0 \le i \le m-1.$

The network takes r processors with the labels

$C.P_{m-1} P_{m-2} \ldots P_2 P_1 X, \quad X = 0, 1, \ldots, r-1$

in the same stage as a group and establishes
connections between the groups in the successive stages
according to the group-shuffle interconnection
function defined as follows:

$\text{group-shuffle}(P_{m-1} P_{m-2} \ldots P_1 X) = P_{m-2} \ldots P_1 X P_{m-1}.$

That means: the group of r processors labeled
$C.P_{m-1} P_{m-2} \ldots P_1 X$ in each stage can be conceptually
viewed as connected to the group of r processors
labeled $(C+1).P_{m-2} \ldots P_1 X P_{m-1}$ in the succeeding
stage, and all the groups of processors in the last
stage are connected to the corresponding groups in
the first stage. All the connections may be thought
of as performed by r×r switches. Thus the
whole network forms a closed loop of m stages, each
containing $r^{m-1}$ groups with r processors in each
group. Therefore the total number of r×r switching
elements for interconnecting $m r^m$ processors is
$m r^{m-1}$.
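The counts stated above are easy to tabulate for a concrete configuration. The helper below is only an arithmetic sketch; the function name `network_size` and the dictionary keys are hypothetical.

```python
def network_size(m, r):
    """Sizes of an m-stage mixed group-shuffle network with radix r."""
    return {
        "processors_per_stage": r ** m,
        "groups_per_stage": r ** (m - 1),
        "total_processors": m * r ** m,
        "total_switches": m * r ** (m - 1),   # one r x r switch per group
    }


print(network_size(3, 4))
```

For m = 3 and r = 4 this gives 64 processors and 16 groups per stage, hence 192 processors and 48 switches of size 4×4 in total.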

Although the radix r can be chosen arbitrarily,
we have chosen r = 4 in our design for simplicity
of implementation based on 2×2 switching elements
only. The interconnection of four 2×2 switching
elements for performing 4-to-4 crossbar
interconnection is shown in Fig. 1. Using four 2×2
switching elements to replace one 4×4 switching
element brings the following additional advantages:

-The in-degree (number of incident arcs) and
the out-degree (number of outgoing arcs) of each
processor are both equal to 2 instead of 4, but
every processor can still send messages directly
to any one of the four processors in the succeeding
stage, and every processor can still receive
messages from any one of the four processors in
the preceding stage.

-Each processor can use two switching elements
to communicate with the corresponding processors
in the neighboring stage. This guarantees a certain
degree of redundancy of switching elements for a
higher fault tolerance.

The overall structure of a mixed group-shuf- 
fled interconnection network with r = 4 is shown 
in Fig.2. For simplicity, the details of implemen- 
tation of switching elements have been neglected. 

Another noteworthy property of the proposed
interconnection network is its short diameter,
defined as the maximum of the minimum distances
between all pairs of processors measured in
the number of links. In normal operation, when no
failed links, switches or processors exist in the
network, the diameter of an m-stage network with
N processors is equal to $2m - 1 = O(\log_r N)$, where
N is the total number of processors in the network.
This can be derived by the following reasoning:
every processor can take m steps to send messages
to any processor in the same stage, and if a
destination processor in an intermediate stage on
the way cannot be reached within the first m
steps, then it must be reached within an additional
m - 1 steps from the source stage.
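The diameter claim can be checked by brute force on small networks. The sketch below assumes the successor rule given in the routing description of this paper (shift the base-r label left by one digit and fill the least significant digit with any new digit, moving to the next stage of the loop); the function names and the tuple representation of labels are hypothetical.

```python
from collections import deque
from itertools import product


def successors(node, m, r):
    """Next-stage processors reachable in one step (assumed rule)."""
    stage, digits = node
    shifted = digits[1:]                      # drop the most significant digit
    return [((stage + 1) % m, shifted + (y,)) for y in range(r)]


def diameter(m, r):
    """Largest shortest-path length over all ordered processor pairs."""
    nodes = [(c, p) for c in range(m) for p in product(range(r), repeat=m)]
    worst = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:                          # breadth-first search from src
            u = queue.popleft()
            for v in successors(u, m, r):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        assert len(dist) == len(nodes)        # fault-free network is connected
        worst = max(worst, max(dist.values()))
    return worst


print(diameter(2, 2), diameter(3, 2))         # 3 5
```

Both values agree with 2m - 1 for the small cases tried; the search is exhaustive, so it is only practical for small m and r.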


QUALITATIVE ANALYSIS OF FAULT-TOLERANCE 


The large number of redundant paths existing
between any pair of processors makes the mixed
group-shuffle interconnection network highly
fault-tolerant. To evaluate this performance,
we will use the following criteria in our analysis:

-Processor connectivity -- maximum number of
faulty processors (with any worst-case distribution)
which can be removed without danger of isolating
any working processor from the rest of the
network.

-Worst-case diameter -- diameter of the network
that survives after removing the maximum number of
tolerable faulty processors as determined above.

-Simplicity of the routing algorithm adaptable
to the requirement of fault-tolerance.

By a qualitative analysis, it is easy to determine
the upper bound of the processor connectivity
from the following straightforward argument:

First, no stage can tolerate more than
r - 1 faulty processors, because r or more
faulty processors may contain all the
successors of a processor in the preceding stage
and thus completely isolate their parent from the
network.

Secondly, the network cannot tolerate r - 1
faulty processors in every stage. In other words,
if r - 1 faulty processors are removed from each
stage, some of the working processors could be
isolated from the surviving network, and
communication paths could not be guaranteed between
every pair of processors. A special case suffices
to prove this conclusion. Suppose we have the
following worst-case distribution of faulty
processors. All the r - 1 faulty processors in the
second stage are successors of the source node in
the first stage, so only one good processor can be
reached in the second stage from the source
processor. This good processor may, in turn, have
all its successors except one malfunctioning, so
that only one good processor in the third stage can
be reached from it. This situation may propagate
from stage to stage until, at last, the single
reachable good processor in the last stage has as
its single good successor the source processor
itself. In consequence, all these good processors,
one from each stage, form a closed loop which is
isolated from the remaining processors of the
network. This obviously contradicts the requirement
in the definition of processor connectivity.

The network, however, can survive this worst
situation if there is just one fewer faulty processor
in any one stage than in the above-mentioned case.
This gives the upper bound of the processor
connectivity: r - 2 faulty processors in any one
stage and r - 1 faulty processors in every other
stage. This constitutes the condition of the Theorem
which will be proved in the next section in the
quantitative analysis of the network.


It was stated above that the diameter of the
network under normal conditions without node/link
failures is equal to 2m - 1, i.e. any destination
processor can be reached from any source processor
within two passes. This is true even when there
exists some limited number of faulty nodes in the
network. To find the corresponding fault-tolerant
condition we will follow the routing strategy
indicated in Table 1. By this routing strategy, a
shortest path is established between any two
processors $0.P_{m-1} \cdots P_1 P_0$ in stage 0 and
$k.Q_{m-1} \cdots Q_1 Q_0$ in stage k. As the message
traverses the selected path from stage to stage, the
processor number, taken as a base-r code, is
cyclically shifted left digit by digit, and during
each shift the least significant digit of the current
code is replaced by a new digit. Thus the whole
routing process can be turned into a procedure of
generating a replacing vector R formed by these new
digits, denoted $R_0, R_1, \ldots, R_{k-1}$ in
Table 1. To guarantee maximum flexibility and
fault-tolerance, we leave the selection of the
communication path free during the first k steps of
the traversal, and only during the last m steps let
it be fixed by the destination digits
$Q_0, Q_{m-1}, \ldots, Q_2, Q_1$. Thus we have some
freedom in selecting the digits
$R_0, R_1, \ldots, R_{k-1}$. The more selectable
digits $R_i$, $0 \le i \le k-1$, we have, the higher
the fault-tolerance that can be realized. Therefore
it is the value of k which
determines the fault-tolerant condition.
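
The shift-and-replace rule described above can be illustrated with a short sketch. It assumes one particular reading of Table 1: at every step the least significant digit of the current code is overwritten by the new digit and the code is then rotated left by one position, and the new digits are taken in the order $R_0, \ldots, R_{k-1}, Q_0, Q_{m-1}, \ldots, Q_1$. The function name `route` and the list-of-digits representation (most significant digit first) are illustrative choices, not part of the paper.

```python
def route(source, replacing, dest):
    """Return the (stage, code) pairs visited on an (m + k)-step path.

    source, dest: digit sequences, most significant digit first.
    replacing:    the k free digits R_0 .. R_{k-1} (an assumption of this
                  sketch is that they are applied before the Q digits).
    """
    m = len(source)
    # New digits in the order R_0..R_{k-1}, then Q_0, Q_{m-1}, ..., Q_1.
    new_digits = list(replacing) + [dest[-1]] + list(dest[:-1])
    code = list(source)
    path = [(0, tuple(code))]
    for step, digit in enumerate(new_digits, start=1):
        code[-1] = digit             # replace the least significant digit
        code = code[1:] + code[:1]   # cyclic left shift by one digit
        path.append((step % m, tuple(code)))
    return path


if __name__ == "__main__":
    # Symbolic digits make the pattern of Table 1 visible (m = 4, k = 3).
    for stage, code in route(("P3", "P2", "P1", "P0"),
                             ("R0", "R1", "R2"),
                             ("Q3", "Q2", "Q1", "Q0")):
        print(stage, " ".join(code))
```

With symbolic digits the printed codes can be compared row by row against the processor numbers of Table 1; the last code is the destination in stage k.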

From the architectural point of view, the network
analyzed in this paper can be conceptually thought
of as a closed multiple-rooted r-nary tree, in which
every node may be viewed as a root and, at the same
time, may serve as a leaf of other nodes. Stages 1
and k-1 are therefore the bottlenecks of the
equivalent tree structure, and they can tolerate the
smallest number of faulty nodes. The problem can
thus be stated as follows: A path is to be selected
from stage 0 to stage k with two passes through an
m-stage network. If in stage 1 there exist r-1
faulty nodes, none of which is a child of the
starting node; in stage k-1 there exist r-2 faulty
nodes; and in each of the remaining stages
(including stages 0 and k) there exist r-1 faulty
nodes; then what is the minimum value of k which
makes the fault-tolerance realizable? This question
is answered by Lemma 2, whose proof is given in the
next section.

Here we just give two examples as shown in
Figs. 3 and 4. The first example in Fig. 3 shows a
successful case with r = 4, m = 4, k = 3. Among the
total $r^k = 64$ codings of the digits
$R_0, R_1, R_2$, only 50 codings are needed to mask
all the possible faulty nodes (with worst-case
distribution), so that there still remain
64 - 50 = 14 codings which can be used to select
free paths containing only good processors. For the
second example with r = 4, m = 8, k = 3 as shown in
Fig. 4, k is so small that its $r^k = 64$ possible
codings are just enough to mask all the possible
faulty processor numbers, and no coding remains to
allow any valid path to be selected. Therefore the
minimum value of k for r = 4 and m = 8 for two-pass
fault-tolerant routing under the specified
conditions is equal to 4.

It should be noted that the two-pass fault-tolerant
conditions specified here do not hold for every pair
of source and destination nodes. We will investigate
the general case in the next section.

Table 1 Routing Strategy From Node 0.P_{m-1}...P_1P_0
to Node k.Q_{m-1}...Q_1Q_0 With Distance m+k

Stage  Processor Number                             Replacing  No. of      No. of
                                                    digit      free proc.  equiv. codes
First pass
0      P_{m-1}P_{m-2}...P_1P_0                      -          1           r^k
1      P_{m-2}P_{m-3}...P_1R_0P_{m-1}               R_0        r           r^{k-1}
2      P_{m-3}P_{m-4}...P_1R_0R_1P_{m-2}            R_1        r^2         r^{k-2}
...
i      P_{m-(i+1)}...P_1R_0...R_{i-1}P_{m-i}        R_{i-1}    r^i         r^{k-i}
...
k-1    P_{m-k}...P_1R_0...R_{k-2}P_{m-k+1}          R_{k-2}    r^{k-1}     r
k      P_{m-(k+1)}...P_1R_0...R_{k-1}P_{m-k}        R_{k-1}    r^k         1
k+1    P_{m-k-2}...P_1R_0...R_{k-1}Q_0P_{m-k-1}     Q_0        r^k         1
...
m-1    R_0R_1...R_{k-1}Q_0Q_{m-1}...Q_{k+2}P_1      Q_{k+2}    r^k         1
Second pass
0      R_1R_2...R_{k-1}Q_0Q_{m-1}...Q_{k+1}R_0      Q_{k+1}    r^k         1
1      R_2R_3...R_{k-1}Q_0Q_{m-1}...Q_kR_1          Q_k        r^{k-1}     r
...
i      R_{i+1}...R_{k-1}Q_0Q_{m-1}...Q_{k-i+1}R_i   Q_{k-i+1}  r^{k-i}     r^i
...
k-1    Q_0Q_{m-1}...Q_2R_{k-1}                      Q_2        r           r^{k-1}
k      Q_{m-1}Q_{m-2}...Q_1Q_0                      Q_1        1           r^k


QUANTITATIVE ANALYSIS OF FAULT-TOLERANCE 


The final purpose of the quantitative analysis 
is to prove the following theorem. As a preliminary 
step, three lemmas will be proved first. 

Theorem  In an m-stage ($m \ge 2$) mixed r-nary
group-shuffle interconnection network, if there
exist no more than r - 2 faulty processors in at
least one stage, and no more than r - 1 faulty
processors in every other stage, communication can
be performed between any pair of the remaining good
processors; and the diameter of the surviving
network is equal to

$$
D = \begin{cases}
3m, & m = 2; \\
4m - 1, & 3 \le m \le 5; \\
3m + \lceil 2\log_r m \rceil + 2, & m > 5.
\end{cases}
$$

Lemma 1  Under the condition specified by the
Theorem, if some node has r - 2 faulty children,
then no more than 2 steps are needed to reach a node
all of whose children are good nodes.

Proof. Starting from a node in the first stage with
r - 2 faulty children, the message can be sent in
one step to one of the two good nodes in the second
stage, and then, from either of these two nodes,
which together have at most r - 1 faulty children,
the message can be sent further to one of

$$2r - (r - 1) = r + 1$$

nodes in the third stage. Since the fourth stage has
at most r - 1 faulty nodes which could be children
of these r + 1 predecessors, at least two of these
predecessors have no faulty children.

For the case of m = 2, the children of the r + 1
good nodes in the third stage (i.e. the first stage)
may coincide with each other, thus resulting in
fewer effective good nodes in the first stage. But
Lemma 1 still holds since all the r - 2 faulty nodes
in the fourth stage (i.e. the second stage) are
children of the same source node.

Lemma 2  Under the condition specified by the
Theorem, if there are at most r - 2 faulty nodes in
stage k - 1, then starting from any node in stage 0
with no faulty children, the message can be sent to
any destination node in stage k through m + k steps,
where

$$
k = \begin{cases}
0, & m = 2; \\
m - 1, & 3 \le m \le 5; \\
\lceil 2\log_r m \rceil + 2, & m > 5.
\end{cases}
$$

Proof.

Case 1  m = 2

Starting from any node in stage 0 with no faulty
children, in the first step the message can be sent
to all r groups in stage 1, so that in the second
step all good nodes in stage 2 (i.e. stage 0) can be
reached. Therefore, k = 0.

Case 2  $m \ge 3$

The routing paths followed for sending a message
from the node $0.P_{m-1}P_{m-2} \cdots P_1P_0$ to the
node $k.Q_{m-1}Q_{m-2} \cdots Q_1Q_0$ through m + k
steps are listed in Table 1. The replacing digits
$R_0, R_1, \ldots, R_{k-1}$ of the replacing vector R
must be selected in such a manner that all the
processors to be passed through are distinct from
the faulty nodes. This requirement can be expressed
in the form of a set of inequalities for all stages
0 through m - 1 and for both passes. From stage i in
Table 1 we have

$$P_{m-i-1} \cdots P_1 R_0 \cdots R_{i-1} P_{m-i} \notin F \quad \text{for the first pass}$$

$$R_{i+1} \cdots R_{k-1} Q_0 Q_{m-1} \cdots Q_{k-i+1} R_i \notin F \quad \text{for the second pass}$$

where F denotes the set of all faulty node numbers
in stage i. According to the condition specified by
Lemma 2, this set should contain no more than r - 2
elements for i = k - 1 and no more than r - 1
elements for all other i, $0 \le i \le m-1$.

Great flexibility in selecting routing paths is
provided by the R vector, but it has only k digits
$R_0, R_1, \ldots, R_{k-1}$ which can be used to
serve this purpose. The total $r^k$ different
codings of these k digits fall into two parts:

- Part of the codings of the R vector must be
avoided or excluded in order to mask all faulty
nodes in the network;

- The remaining part of the codings can be used to
choose the good nodes to establish the valid
communication paths.

The goal of the proof of this Lemma is to determine
the minimum value of k with respect to the values of
m and r such that after subtracting the first part
of the codings from $r^k$ the difference still
remains positive, i.e. there remains at least one
coding of the R vector for setting up a good
connection between any source node and any
destination node.

From Table 1 it can be seen that since processor
numbers in different stages contain different
numbers of R digits, the numbers of equivalent
R-codings for masking one faulty node in different
stages are also different, as indicated in the last
column of Table 1.

From the condition of the Lemma, in each of the
stages 0 and 1 a maximum of r - 1 faulty nodes may
exist and must be excluded from the R-codings only
for the second pass. Therefore, with the worst
distribution of faulty nodes, r - 1 different
codings of $R_0R_1 \cdots R_{k-1}$ must be avoided
for stage 0; and r - 1 different codings of
$XR_1 \cdots R_{k-1}$ must be avoided for stage 1,
where X = 0, 1, ..., r-1. This means that an
equivalent number of r(r-1) codings of
$R_0R_1 \cdots R_{k-1}$ must be excluded from $r^k$
for stage 1.

For stages 2 through k - 2, faulty nodes may exist
in both passes. Since no digits of the R vector
appear at the same digit positions for any stage in
the two passes, the worst-case distribution is as
follows:

- r - 1 faulty nodes exist in each stage i,
$2 \le i \le \lfloor k/2 \rfloor$, for the first
pass, which is equivalent to $(r-1)r^{k-i}$
different codings of the R vector;

- r - 1 faulty nodes exist in each stage i,
$\lfloor k/2 \rfloor + 1 \le i \le k-2$, for the
second pass, which is equivalent to $(r-1)r^{i}$
different codings of the R vector.

Since the digit positions of the R vector in each
stage i, $2 \le i \le \lfloor k/2 \rfloor$, for the
first pass overlap with the corresponding digit
positions of the code of the destination node in the
same stage for the second pass, the worst-case
distribution may be that one of the r - 1 faulty
codings for the first pass just coincides with the
faulty coding for the second pass (see the example
in Fig. 4, where coding 20321310 coincides between
the two passes in stage 2). This is equivalent to
$r^{i}$ additional codings of the R vector which
must be excluded from $r^k$ for each stage i,
$2 \le i \le \lfloor k/2 \rfloor$.

Similarly, $r^{k-i}$ additional codings of the R
vector must be excluded from $r^k$ for each stage i,
$\lfloor k/2 \rfloor + 1 \le i \le k-2$, in the
first pass, because its source code digits may
coincide with the corresponding R digits of one of
the r - 1 faulty nodes for the second pass.

However, between the first pass and the second pass
for each stage i, $2 \le i \le k-2$, r - 1 excluded
codings of the R vector are duplicated. They should
be added back to the $r^k$ total codings.

A similar argument holds for stage k - 1, with the
only difference that $(r-2)r^{k-1}$ equivalent
faulty codings of the R vector may exist in the
second pass, r additional equivalent faulty codings
of the R vector may exist in the first pass, and
between them r - 2 codings are duplicated.

For all the remaining stages from stage k through
stage m - 1, r - 1 codings of the R vector must be
excluded for each first-pass expression.

After all the different faulty codings stated above
are subtracted from the total $r^k$ possible codings
of the R vector, the difference obtained should
still be positive, indicating that there remains
some coding available for selecting good nodes along
the communication path. This leads to inequality (1).

$$
\begin{aligned}
r^k &- \sum_{i=2}^{\lfloor k/2 \rfloor} (r-1)r^{k-i}
     - \sum_{i=\lfloor k/2 \rfloor+1}^{k-2} (r-1)r^{i}
     - \sum_{i=2}^{\lfloor k/2 \rfloor} r^{i}
     - \sum_{i=\lfloor k/2 \rfloor+1}^{k-2} r^{k-i} \\
    &- (r-2)r^{k-1} - r - (r-1) - (r-1)r - (r-1)(m-k) \\
    &+ (r-1)(k-3) + (r-2) > 0
\end{aligned}
\tag{1}
$$

In fact, inequality (1) involves two inequalities,
one for k odd and the other for k even.
Transformation of expression (1) yields the two
inequalities (2) and (3), which give k as an
implicit function of the radix r and the number of
stages m of the network.

$$
\frac{2r^{(k+1)/2}(r-2)}{r-1} + \frac{2r^2}{r-1} - (r-1)(m-2k+r+4) - 2 > 0
\quad \text{for } k \text{ odd}; \tag{2}
$$

$$
\frac{r^{k/2}(r+1)(r-2)}{r-1} + \frac{2r^2}{r-1} - (r-1)(m-2k+r+4) - 2 > 0
\quad \text{for } k \text{ even}. \tag{3}
$$

Numerical solutions to the last two inequalities (2)
and (3) are listed in Table 2, which indicates that
for practical values of m and r, the value of k
equals 3.


Table 2 Values of k as a function of r and m

        k   |  2    3       4        5         6
   ---------+--------------------------------------
       r=4  |  3   4--7    8--20   21--33    34--77
   m   r=8  |  3   4--11  12--68   69--111   112--
       r=16 |  3   4--19  20--90   91--225   226--
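
The counting argument behind inequality (1) can be checked numerically. The sketch below evaluates the left-hand side of (1) term by term and searches for the smallest k that leaves at least one coding of the R vector. It assumes the form of (1) given above and is only meaningful for k >= 3 (for k = 2, stage 1 and stage k - 1 coincide and the counting is different); the function names are illustrative.

```python
def remaining_codings(r, m, k):
    """Left-hand side of inequality (1): codings of R left after masking faults."""
    if k < 3:
        raise ValueError("this sketch evaluates inequality (1) for k >= 3 only")
    h = k // 2
    total = r ** k
    # Stages 2..floor(k/2): first-pass faults and the extra second-pass codings.
    total -= sum((r - 1) * r ** (k - i) for i in range(2, h + 1))
    total -= sum(r ** i for i in range(2, h + 1))
    # Stages floor(k/2)+1..k-2: second-pass faults and the extra first-pass codings.
    total -= sum((r - 1) * r ** i for i in range(h + 1, k - 1))
    total -= sum(r ** (k - i) for i in range(h + 1, k - 1))
    # Stage k-1, stage 0, stage 1, and stages k..m-1.
    total -= (r - 2) * r ** (k - 1) + r
    total -= (r - 1) + (r - 1) * r
    total -= (r - 1) * (m - k)
    # Codings counted twice are added back.
    total += (r - 1) * (k - 3) + (r - 2)
    return total


def min_k(r, m):
    """Smallest k in 3..m-1 for which inequality (1) holds, or None."""
    for k in range(3, m):
        if remaining_codings(r, m, k) > 0:
            return k
    return None


if __name__ == "__main__":
    for m in (4, 7, 8, 20, 21, 33, 34, 77):
        print("r=4, m=%d -> k=%s" % (m, min_k(4, m)))
```

For r = 4 and m = 8, `remaining_codings(4, 8, 3)` evaluates to 0, which agrees with the second example above, where the 64 codings are just enough to mask all faulty nodes.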


For a rough order-of-magnitude estimate, it can be
proved that the above inequalities hold true when we
take either of the two values of k:

$$k = \lceil 2\log_r m \rceil + 2 \quad \text{or} \quad k = m - 1.$$

Therefore we have the following solution

$$k = \min\{\lceil 2\log_r m \rceil + 2,\; m - 1\}$$

or

$$
k = \begin{cases}
m - 1, & 3 \le m \le 5; \\
\lceil 2\log_r m \rceil + 2, & m > 5.
\end{cases}
\tag{4}
$$


From the solution we conclude that when k is larger
than or equal to the values listed in Table 2 or
calculated from (4), there must exist at least one
coding of the reconfigurating vector R for
establishing a communication path between any source
node $0.P_{m-1} \cdots P_1P_0$ and any destination
node $k.Q_{m-1} \cdots Q_1Q_0$ through m + k steps.

The proof of the above Lemma also shows that the
condition can be relaxed to permit more than r - 2
faulty nodes in stage k - 1, so long as the codes of
these faulty nodes take at most r - 2 different
values at the LSD position. This can be seen from
Table 1 when we look at the processor number in
stage k - 1 for the second pass, which contains the
digit $R_{k-1}$ at the least significant digit (LSD)
position, so that the excluded codings of the R
vector can be expressed as
$R_0R_1 \cdots R_{k-1} = XX \cdots XR_{k-1}$, where X
stands for any digit from 0 through r - 1.

Lemma 3  Under the condition specified by the
Theorem, starting from any node in stage 0 with no
faulty children, i steps are enough to send a
message to another node in stage i with no faulty
children, where $1 \le i \le m-1$.

Proof. Since the source node in stage 0 has no
faulty children, the first step will lead to r good
nodes in stage 1. Among these r good nodes, at most
r - 1 can have faulty children, therefore at least
one node is free of faulty children. The same
argument can be extended to the following stages.

Now, having proved the above three Lemmas, we are
ready to prove the Theorem stated at the beginning
of this section. For simplicity of description, we
will label the source stage as stage 0. Of course,
this stage 0 may be different from the stage 0 which
we labeled in the proof of Lemma 2.


Proof of the Theorem.

Case 1. All the children of $0.P_{m-1} \cdots P_1P_0$
are good nodes.

a) The faulty nodes in stage k - 1 have r - 1
different values at the LSD position.

The condition of the Theorem tells us that there
must be at least one stage with no more than r - 2
faulty nodes. Assume this occurs in stage
$(k - 1 + i) \bmod m$, where $1 \le i \le m-1$.
According to Lemma 3, a traversal of i steps will
lead to a node in stage i with no faulty children;
then, according to Lemma 2, taking an additional
m + k steps will lead further to any node in stage
$(k + i) \bmod m$. Thus the maximum traversing
distance will be equal to

$$(m - 1) + (m + k) + (m - 1) = 3m + k - 2.$$

b) The faulty nodes in stage k - 1 have r - 2
different values at the LSD position.

According to Lemma 2, traversing through the first
m + k steps can lead directly to any node in stage
k, so that the maximum distance will be

$$(m + k) + (m - 1) = 2m + k - 1.$$

Case 2. The node $0.P_{m-1} \cdots P_1P_0$ has faulty
children.

a) The number of faulty children of the node is at
most r - 2.

According to Lemma 1, traversing through the first
two steps can lead to a node with no faulty
children; then the routing can follow the path as in
case 1, so that the maximum distance is

$$(3m + k - 2) + 2 = 3m + k.$$

b) The number of faulty children of the node
$0.P_{m-1} \cdots P_1P_0$ is r - 1.

Under the condition of the Theorem, there must be at
least one stage with a node which has at most r - 2
faulty children. Assume it is stage i,
$1 \le i \le m-1$, say the nearest such stage to
stage 0. First, the route starts from the node
$0.P_{m-1} \cdots P_1P_0$ and arrives at this stage i
through i steps; then, according to Lemma 1, it can
reach a node with no faulty children through 2
additional steps.

It should be noticed, however, that if any node in
some stage has r - 1 faulty children, then the
faulty nodes of the next stage which must be masked
will have codes of the form $X \cdots XF_0$, where
X = 0, 1, ..., r-1. The LSD can have only one value
$F_0$, determined by the MSD of the parent node in
the preceding stage. Similarly, from stage 1 to
stage i, each stage will have faulty nodes with only
one value at the LSD position.

After the node in stage i + 2 with no faulty
children has been reached via the first i + 2 steps
from the source node $0.P_{m-1} \cdots P_1P_0$, the
route will be continued under the condition of case
1(b). If k + i < m, then an additional m - (k + i)
steps are needed to reach stage (m - k + 2), so that
$[(m - k + 2) + (k - 1)] \bmod m = 1$. If k + i = m,
then $[(i + 2) + (k - 1)] \bmod m = 1$. Both cases
will have the (k - 1)th stage whose faulty nodes
take no more than r - 2 values at the LSD, as
required by the condition of case 1(b). Therefore,
the maximum distance is equal to

$$(m - 1) + 2 + (2m + k - 1) = 3m + k.$$


In summary, communication between any pair of good
nodes can be performed under the fault conditions
specified by the Theorem, and the diameter of the
network may be increased to 3m + k, where k is
determined by Lemma 2. After substitution for k, the
diameter is

$$
D = \begin{cases}
3m, & m = 2; \\
4m - 1, & 3 \le m \le 5; \\
3m + \lceil 2\log_r m \rceil + 2, & m > 5.
\end{cases}
$$
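
As a quick numerical illustration of the summary above, the following sketch evaluates k from Lemma 2 and the worst-case diameter D = 3m + k. It assumes that the bracketed logarithm in the formulas is a ceiling, and it computes that ceiling with integer arithmetic to avoid floating-point rounding; the function names are illustrative.

```python
def ceil_two_log(r, m):
    """ceil(2 * log_r(m)), i.e. the smallest integer t with r**t >= m**2."""
    t = 0
    while r ** t < m * m:
        t += 1
    return t


def k_value(m, r):
    """Value of k from Lemma 2 for an m-stage, radix-r network."""
    if m < 2:
        raise ValueError("the network needs at least two stages")
    if m == 2:
        return 0
    if m <= 5:
        return m - 1
    return ceil_two_log(r, m) + 2


def worst_case_diameter(m, r):
    """Worst-case diameter 3m + k under the fault condition of the Theorem."""
    return 3 * m + k_value(m, r)


if __name__ == "__main__":
    for m, r in ((2, 4), (4, 4), (8, 4), (16, 16)):
        print("m=%d r=%d k=%d D=%d" % (m, r, k_value(m, r), worst_case_diameter(m, r)))
```

The closed form is an upper estimate of k; Table 2 lists the smaller values obtained by solving inequalities (2) and (3) numerically.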


EVALUATION OF THE FAULT-TOLERANCE PERFORMANCE 


For a comparative evaluation of the fault-tolerance
capabilities of the proposed interconnection
network, we take as references two similar networks
recently published in the literature [5], [6]. The
result of the comparison is listed in Table 3.


Table 3 Comparative Evaluation of the Fault-Tolerance Performance

Network                    Kumar-Reddy (84)  Pradhan (85)  Mixed group-shuffle
Number of processors       r^m + r^{m-1}     r^m           m r^m
Degree                     2r                r or r+1      r (logical)
Fault-tolerant capability  2r - 1            r-1 or r      m(r-1) - 1
Normal-case diameter       (illegible)       (illegible)   2m - 1
Worst-case diameter        (illegible)       (illegible)   3m + ceil(2 log_r m) + 2,
                                                           approx. 3m + 3


The number of faulty processors which the three
networks can tolerate in each stage is roughly the
same, but if we relate the processor connectivity to
the total number of processors and the processor
degree, then the mixed group-shuffle interconnection
network appears to be more advantageous. It can
accommodate m times more processors and connects
them with a smaller physical degree.

The normal-case diameter of the three networks
relative to the degree of the topology is roughly
the same, but if again we relate it to the total
number of processors, then the effective diameter of
the mixed group-shuffle interconnection network is
the shortest.

The worst-case diameter of the mixed group-shuffle
interconnection network is also the shortest,
especially when its value relative to the
normal-case diameter is taken. Its maximum distance
of communication under faulty conditions is only
1.5 times that under normal conditions, whilst for
the other two networks the faulty condition would
increase the diameter by a factor of 3.

The mixed group-shuffle interconnection network is
simple to implement. The basic idea of the
fault-tolerant routing algorithm can be derived from
the Theorem just proved. From Table 1 it can be seen
that under normal operation a reconfigurating or
replacing vector R should be found and used to
replace the code of the source processor digit by
digit on the way forward from stage 0 to stage k,
and then the remaining digits of the code of the
source node as well as the digits of the R vector
are gradually replaced by the code of the
destination node digit by digit on the way through
stages k+1, ..., m-1, 0, 1, ... until the
destination node in stage k is reached. Thus the
combination of the digits $R_i$, $0 \le i \le k-1$,
and the destination code digits in the form of

$$R_0R_1R_2 \cdots R_{k-1}Q_0Q_{m-1} \cdots Q_2Q_1$$

can be used as the routing vector for the purpose of
routing control. Since the reconfigurating vector is
independent of the destination and transparent to
the intermediate processors, it needs to be
calculated only once and kept in the source
processor for later use. Under faulty operation, the
only work which the source processor has to do is to
calculate a modified reconfigurating vector and
attach it to the destination code digits. In order
to accomplish this, each processor must keep a
complete list of the faulty nodes of the system.
Each time this list is updated to add or delete any
faulty node(s), the reconfigurating vector should be
updated. Since the length of each modified
reconfigurating vector is at most 3m, the memory
space overhead is small. Since the recalculation of
the reconfigurating vector needs to be done only
when the faulty conditions change, the
time overhead is small too.
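
The source-side computation described above can be sketched as a search over the codings of the R vector. This is a brute-force illustration under stated assumptions, not the algorithm of the paper: it assumes the shift-and-replace reading of Table 1 (replace the least significant digit, then rotate left), represents the fault list as a set of hypothetical `(stage, code)` pairs, and simply tries the r^k codings in order until one yields a path free of faulty nodes.

```python
from itertools import product


def path_nodes(source, replacing, dest):
    """Yield the (stage, code) pairs visited after leaving the source."""
    m = len(source)
    # New digits in the order R_0..R_{k-1}, then Q_0, Q_{m-1}, ..., Q_1.
    new_digits = list(replacing) + [dest[-1]] + list(dest[:-1])
    code = list(source)
    for step, digit in enumerate(new_digits, start=1):
        code[-1] = digit             # replace the least significant digit
        code = code[1:] + code[:1]   # cyclic left shift by one digit
        yield step % m, tuple(code)


def find_replacing_vector(source, dest, k, r, faulty):
    """Return the first coding of R whose path avoids all faulty nodes.

    faulty: set of (stage, code) pairs, a hypothetical form of the fault
    list that each processor is said to keep. Returns None if every coding
    is masked.
    """
    for candidate in product(range(r), repeat=k):
        if all(node not in faulty for node in path_nodes(source, candidate, dest)):
            return candidate
    return None


if __name__ == "__main__":
    source, dest = (0, 0, 0), (1, 1, 1)      # m = 3, destination in stage k = 2
    print(find_replacing_vector(source, dest, 2, 3, set()))
    print(find_replacing_vector(source, dest, 2, 3, {(1, (0, 0, 0))}))
```

Because the result depends only on the fault list, it would be recomputed only when that list changes, in line with the overhead argument above.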


CONCLUSION 


The basic principle of constructing a multistage
interconnection network with mixed static and
dynamic topologies is given, and the group-shuffle
interconnection function is defined to develop its
typical structure. The resulting m-stage
($m \ge 2$) r-nary mixed group-shuffle
interconnection network is the object of analysis of
this paper, where r is the number of processors per
group.

Three criteria of fault-tolerance performance of the
interconnection network are observed. The processor
connectivity allows no more than r - 2 faulty
processors in at least one stage and no more than
r - 1 faulty processors in every other stage. In
total, the maximum number of allowable faulty
processors with any worst-case distribution is
m(r - 1) - 1. The worst-case diameter under the
above faulty condition is shown to be

$$3m + \min\{m - 1,\; \lceil 2\log_r m \rceil + 2\}$$

or practically

$$3m + 3,$$

which is approximately 1.5 times the diameter of the
network under normal conditions. The proofs of the
Lemmas and the Theorem for the processor
connectivity and the worst-case diameter give the
basic idea of the fault-tolerant routing control,
which helps in deriving a simple distributed
fault-tolerant routing algorithm for the proposed
interconnection network.


Comparative evaluation of the fault-tolerance
performance among three similar interconnection
networks reveals the advantageous characteristics
of the proposed network.

The proposed mixed interconnection network has been
implemented in an experimental distributed computer
system, THUDS, developed at Tsinghua University,
Beijing, China. Though analysis has shown some
potential for achieving high system performance,
only experience in its application will finally
prove its appropriateness and suitability for a
variety of distributed processing needs.


Abstract -- A massively fault-tolerant cellular array is an
array of identical cells with connections only to immediate
neighbors, where the cells and the connections may be defective
with high probability. A cell can function as a processing
element, as a memory, or as a switching element that connects to
other cells. On the defective array, a large cluster of
interconnected working cells is formed, and the cells without an
adequate number of working neighbors are pruned out from the
cluster. The working cells in the cluster are configured into a
graph that determines the function of the array. The algorithms
for forming the cluster, pruning the cluster, and configuring
the cells into a linear array, a two-dimensional array, and a
binary tree are described, and simulation results are presented.


Introduction 


With the progress in VLSI technology, we will be able to
manufacture a huge number of devices on a chip, but we may be
unable to avoid many defects. Under current integrated circuit
architectures, we can tolerate few defects on a chip. With a huge
number of devices on a chip, we could afford the redundancy
necessary to provide fault-tolerant operation. With enough
redundancy, we should be able to devise a fault-tolerant
architecture which allows efficient computation in spite of many
defects.


Fault-tolerant multiprocessors and the configuration of cells on
a defective array in the context of wafer-scale integration have
been studied by many researchers [6,12,1,14,9,17,7,4,11,15,13].
In fault-tolerant multiprocessor systems, the introduction of
spare parts and reconfiguration has been used on a limited scale.
In VLSI architectures, where huge numbers of devices are
available, connections between them are difficult, and defects
cannot be eliminated, several different approaches have been
tried.


In technology-oriented approaches, discretionary wiring and
laser personalization have been used for fault-tolerance in VLSI.
In these methods, separate nonreversible processing steps are
employed for each chip to reconfigure circuits on the defective
integrated circuit. In another approach, programmable switches
are provided between cells [17,7], and the switches are used to
configure working cells in the defective circuit. Here electrical
configuration and reconfiguration are possible, but the switches
must be fault-free, and some means of programming the switches
must be provided. In the cellular array approach, cells can be
used as switches as well as processing elements [12,1,9,4]. This
allows distributed self-configuration, and maintains the general
advantages of a cellular array, which derive from its
homogeneous and regular structure.


To take full advantage of a reconfigurable defective cellular
array, we need distributed self-configuration, where the global
defect pattern of the cellular array is not needed. However,
earlier self-configuration schemes were not efficient when many
cells were defective. With adequate computing power in each
cell, distributed self-configuration can be done efficiently. In
this paper we describe a cellular array that can be efficiently
configured despite many defects. The cellular array, called the
massively fault-tolerant cellular array, is an array of identical
cells with connections only to immediate neighbors, where each
cell can function as a processing element, as a memory, or as a
switching element that connects to other cells. Input and output
terminals are connected only at the boundaries of the array. This
cellular array anticipates the occurrence of massive defects in
the cells and in the interconnections. The cellular array is
designed in such a way that there exists a set of working cells
that maintains a desired processing capability despite many
defective cells and defective interconnections.


To maintain processing capability despite many defective
cells and interconnections, cells in the massively fault-tolerant
cellular array need some mechanism to identify the defective
cells and interconnections, and to configure themselves around
the defective cells. By employing built-in self-testing
techniques, we should be able to devise a testing mechanism by
introducing additional testing hardware. In this work, we assume
that, by some mechanism, each cell knows whether each neighbor
and the connection to that neighbor are working or defective.


After the defective cells are identified, we form a big cluster 
of interconnected working cells from a subset of all working cells 
in the array. The cells without an adequate number of working 
neighbors are pruned out from the cluster. We then configure 
the cells in the cluster into a graph that specifies the function 
of the array. 
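
The cluster formation and pruning steps just outlined can be illustrated with a small centralized sketch. It is not the distributed procedure of the paper: it assumes a square (four-neighbor) array, ignores defective connections, takes the largest connected component as the cluster, and prunes repeatedly with a hypothetical threshold `min_neighbors`; all names are illustrative.

```python
from collections import deque


def neighbors(cell):
    """Four immediate neighbors of a cell in a square array."""
    row, col = cell
    return ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1))


def largest_cluster(working):
    """Largest set of interconnected working cells (breadth-first search)."""
    seen, best = set(), set()
    for start in working:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            for nb in neighbors(queue.popleft()):
                if nb in working and nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    return best


def prune(cluster, min_neighbors=2):
    """Repeatedly drop cells with fewer than min_neighbors cluster neighbors."""
    cluster = set(cluster)
    while True:
        weak = {cell for cell in cluster
                if sum(nb in cluster for nb in neighbors(cell)) < min_neighbors}
        if not weak:
            return cluster
        cluster -= weak


if __name__ == "__main__":
    working = {(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (5, 5)}
    cluster = largest_cluster(working)
    print(sorted(cluster))
    print(sorted(prune(cluster)))
```

In the example, the isolated cell (5, 5) never joins the cluster, and the dangling cell (1, 2) is pruned because it has only one neighbor in the cluster.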


In the following sections, we describe the architecture of
the massively fault-tolerant cellular array, and the algorithms
for the formation of the cluster, the pruning of the cluster,
and the configuration of the cells into a linear array, a
two-dimensional mesh, and a complete binary tree. Simulations of
the massively fault-tolerant array are described, and simulation
results for the formation of the cluster, pruning, and
configuration of cells into a linear array, a two-dimensional
mesh, and a complete binary tree are shown for arrays of size
40 by 40, 80 by 80, and 120 by 120.


Massively Fault-tolerant Cellular Array 


The three regular interconnection patterns studied in this 
paper are shown in Figure 1. They are selected to study the 
effect of the number of the neighbors on the behaviour of the 
cellular array. The three arrays with the three interconnection 
patterns of Figure 1 are called square array, hexagonal array, 
and octal array respectively. Figure 2 shows a square array 
with defective cells and defective interconnections. Note that 
although the initial array is regular, the ensuing array is not 
(see Figure 2), as the faults cause breakdown in the regularity 
of the array. 


Though there are many defects, there are still many working
cells in the array so that useful computation can be performed.
The computations that the array is intended to perform will
determine how the working cells are configured. The logical
interconnection of cells for any particular computation can be
represented by a graph, called the computation graph in this
paper. The configuration of cells into a computation graph is
the process of embedding the computation graph in the defective
array. The embedding of the computation graph in the defective
array is represented by a graph, called the connection graph.
For example, the tree in Figure 3(a) is a computation graph;
Figure 3(b) shows an embedding of the tree on the defective
square array; Figure 3(c) is the connection graph of the
embedding of Figure 3(b).


Figure 1. Interconnection patterns of the cellular array:
(a) square array, (b) hexagonal array, (c) octal array


In Figure 3(b), some cells are mapped to the nodes of the 
computation graph, while other cells are used to connect the 
cells that are mapped into the nodes of the graph. The cells 
mapped into the nodes of the graph are called the computation 
cells, and the cells used to connect the computation cells are 
called the connection cells. The connection cells are represented 
by dots on the connection graph. 


A cell in the massively fault-tolerant cellular array is con- 
figured into a computation graph by identifying the logical con- 
nections of the cell. For example, a cell is configured into a 
binary tree by saving in special registers the directions of the 
neighbors as a father, a left child, and a right child of the cell. 
The configuration can be changed by changing the contents of 
the registers. The configuration is performed in a distributed 
fashion: each cell makes a decision only with the information 
that is available at the cell. 


Architecture of a Cell 


Each cell has a processor, local memory, Communication 
Registers, an ID Register, Neighbor Status Registers, a Status 
Register, a Pattern Register, and Connection Registers. Figure 
4 shows the architecture of a cell in the square array. The 
registers are explained below: 


ID Register: Stores the row and column indices of the cell in 
the array. 


Neighbor Status Register, NS[0..N-1]: Stores the status 
of the neighbors. The neighbor can be defective, working, or 
pruned out. Here N is the number of immediate neighbors. 


Status Register, SR: Stores the current status of the cell.
The states of a cell are “idle”, “live”, “pruned”, “comp”, “conn”,
and “mem”. The status of all working cells is initially “idle”.
The clustering procedure changes the status of the cells belonging
to a big interconnected cluster to “live”, and the pruning
procedure changes the status of the working cells without an
adequate number of working neighbors to “pruned”, so that they
are not used in the configuration procedure. The configuration
procedure changes the status of the “live” cells to “comp” or
“conn”, depending upon whether the cell is used as a computation
cell or a connection cell. Finally, the status of a cell is “mem”
when the cell is used as a memory.


Figure 2. A defective square cellular array (legend: working cells, defective cells, working connections, defective connections)


Figure 3. A computation graph, its embedding and a connection graph: (a) computation graph, (b) embedding of a tree on a defective array, (c) connection graph


Figure 4. Architecture of a cell and communication register
of the cellular array


Pattern Register, PR: Stores the current configuration pat- 
tern. The configuration patterns that are recognized are linear 
array, two-dimensional mesh, complete binary tree, and span- 
ning tree. 


Connection Register, CR[0..N-1]: Stores the directions of
neighbors in the logical configuration specified by the Pattern
Register, PR. The meanings of CR[0..N-1] are different for
each configuration. When PR specifies that the configuration
is a linear array, CR[0] is the direction of the predecessor, and
CR[1] is the direction of the successor. When PR specifies
a binary tree, CR[0..2] are the directions of the father, the
left child, and the right child, and CR[3] is the level of the cell.
When PR specifies a two-dimensional mesh, CR[0..3] are the
directions of the up, right, down, and left neighbors of the cell, and
CR[4..5] are the row and column indices of the cell in the
two-dimensional mesh. When PR specifies a graph, CR[0..3] are
the directions of the two sources and two destinations. Finally,
when PR specifies a spanning tree, CR[0] is the direction of the
father, and CR[1..N-1] are the directions of the sons.


Communication Registers, Comm[0..N-1]: These are used
for communication between the cells. Cells communicate with
other cells by sending and receiving messages through Communication
Registers. Each Communication Register provides
one-way communication between two cells. Across the boundary
between two cells, two Communication Registers provide two-way
communication. Figure 4 shows a pair of Communication Registers.
The arrows in Figure 4 show the directions in which
information flows. The fields of the Communication Register are as
follows.


Full Bit, FB: Shows if the Communication Register is
full or empty.

Enable Bit, EB: Shows if the other cell is willing to receive
the message. If EB is 0, the other cell may not read the
Communication Register. EB is used to prevent deadlock.

Messages, Mesg: Contains the messages.


FB is used to synchronize the communication between
two cells. A cell can read a Communication Register when the
Communication Register is full, and after the cell reads the
Communication Register, FB is reset to 0. If a cell wants to
read a message from an empty Communication Register, the
cell waits until FB is set again. A cell can write to an empty
Communication Register, and if a cell initiates a write to a full
Communication Register, the cell has to wait until EB is set.
This provides the synchronization between two cells.
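
The handshake described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the class name `CommRegister`, the method names `try_send` and `try_recv`, and the choice to return a failure value instead of blocking are all assumptions made for the example. The exact role of EB is also assumed here to be a gate on reading, following the field description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommRegister:
    """Hypothetical model of a one-way Communication Register."""
    fb: bool = False             # Full Bit: True while a message is waiting
    eb: bool = True              # Enable Bit: assumed to gate reading
    mesg: Optional[str] = None   # the message field

    def try_send(self, message: str) -> bool:
        """Writer side: succeeds only when the register is empty."""
        if self.fb:
            return False         # register full, so the sender must wait
        self.mesg = message
        self.fb = True
        return True

    def try_recv(self) -> Optional[str]:
        """Reader side: returns the message and resets FB, or None."""
        if not self.eb or not self.fb:
            return None          # disabled or empty, so the reader must wait
        message, self.mesg = self.mesg, None
        self.fb = False
        return message
```

A real cell would wait rather than poll; the non-blocking form is used only so the behavior is easy to exercise step by step.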


Processor: The processor is an instruction set processor with
a small instruction set. The instructions include the usual
arithmetic, logical, data transfer, and program control operations, and
send and receive for input and output. “Send (dir, message)”
writes the “message” to the Communication Register in the “dir”
direction; return from “Send” acknowledges that the message
has been written to the Communication Register. “Recv (dir)” reads
a message either from the Communication Register in the “dir”
direction, or from any full Communication Register if “dir”
is “any”; return from “Recv” acknowledges that a message is
available at the Communication Register in the “dir” direction.


Controller: In the massively fault-tolerant cellular array, an 
external machine is attached to the boundary cells to control 
the operation of the array. The external machine initiates the 
operations of the array, and provides the data to the array by 
sending messages to the array. The external machine will be 
called a Controller. 


When the massively fault-tolerant cellular array is pow- 
ered on, Neighbor Status Registers, NS[0..N-1], of each cell are 
set to indicate the status of the neighbors of each cell (work- 
ing or defective) by some testing mechanism. The hardware 
and algorithms for the testing have not been devised yet. We 
assumed that the testing can be done by some mechanism. 


After the testing is finished, the controller initiates the 
clustering procedure which identifies the largest cluster of in- 
terconnected working cells in the array. After all the cells in 
the cluster are given identification numbers, the controller ini- 
tiates the pruning procedure which prunes out the cells in the 
cluster without an adequate number of working neighbors. The 
controller then initiates the configuration procedure to config- 
ure the cells in the cluster into a desired computation graph. 
When the Controller sends a message to a cell at the boundary
of the array, the cell relays the message to its neighbor cell,
and the message propagates through the working cells. The
cells are configured into the graph specified by the operation
field of the message. The configuration of the cells is finished
when all the cells have set their Connection Registers correctly
and the boundary cell connected to the Controller returns an
appropriate message to the Controller.


Formation of the Cluster of Cells 


Since the cells in the cellular array connect only to the 
immediate neighbors, when a cell wants to communicate with 
another cell which is not directly connected to it, the message 
should be relayed by intervening working cells to the target 
cell. Therefore, a small cluster of working cells surrounded by 
defective cells cannot be used. The array can be useful only 
when a large cluster of interconnected working cells is formed
in the array. 


We can use percolation theory [16,3] to predict whether a large
cluster appears on the array and, if it does, to predict
the size of the cluster. According to percolation theory,
there exists a critical probability such that when the probability
that a cell is working exceeds the critical probability,
an infinite cluster of interconnected working cells appears
in the defective infinite array of cells.


Percolation Theory and the Cellular Array


Consider a lattice L defined as a graph of N sites (or vertices)
and M bonds (or lines). In most cases of practical interest,
L will be a regular two- or three-dimensional lattice of
finite or infinite extent. In the bond problem, each bond of L
is occupied (or open, or working, etc.) with probability p or
vacant (or blocked, or defective, etc.) with probability 1 - p.
Occupied bonds are connected if they meet at a common site,
and a connected set of s bonds forms a bond cluster of size s. In
the site problem, each site is occupied with probability p and
vacant with probability 1 - p. Occupied sites are connected to
form site clusters if they are adjacent through the bonds of L.


For an infinite lattice L there is a critical probability
$p_c = p_c(b, L)$ or $p_c(s, L)$ for the bond or site problems such that
for $p < p_c$ all clusters will be finite while for $p > p_c$ there
will, with positive probability, be an infinite cluster in L. The
infinite cluster is called a percolation cluster. We can define
the percolation probability, $P(p)$, as the probability that a site,
chosen at random, belongs to an infinite cluster. One defines
the critical probability, $p_c$, as


$$p_c = \sup\{\, p \mid P(p) = 0 \,\}.$$


The sites in the percolation model correspond to the cells
of the massively fault-tolerant cellular array, and the bonds in
the percolation model correspond to the connections between
cells. In the site percolation problem, all the bonds are assumed
to be occupied and only the sites can be vacant; this
corresponds to the assumption that all connections are working
and only cells can be defective. In the bond percolation
problem, all the sites are assumed to be occupied and only the
bonds can be vacant; this corresponds to the assumption that
all cells are working and only connections can be defective.


Since the area of a cell is much bigger than the area of 
a connection in the cellular array, and the defect probability 
of an integrated circuit is at least proportional to the area of 
the integrated circuit, the defect probability of a cell is much 
greater than that of a connection. Therefore, as a first ap- 
proximation, the array can be modeled by the site percolation 
model. In the site percolation model of the defective array, we
can account for defective connections by associating
connections with neighboring cells. When a connection
is defective, the cells associated with the defective connection
are considered defective. When we need to use the working
cells connected to the defective connections, the defective array
should be modeled by a combination of the site and bond
percolation problems [8].


In percolation theory the percolation cluster is an infinite
cluster that appears on the infinite lattice. On the cellular array
of finite size, we define the percolation cluster as the largest
cluster that is connected to all four borders of the array when
the array is rectangular. Using percolation theory, we can
predict that the percolation cluster will appear in the defective
cellular array when the probability that a cell is working, p, is
more than the critical probability, $p_c$, of percolation theory.
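
The finite-array definition above can be checked mechanically. The following is a minimal sketch for the square array with four neighbors per cell; the function names and the grid representation (a list of rows with truthy entries for working cells) are assumptions made for illustration.

```python
from collections import deque


def cluster_from(grid, start):
    """Set of working cells reachable from start through working neighbors."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def touches_all_borders(cluster, rows, cols):
    """True if the cluster has a cell on each of the four borders."""
    return (any(r == 0 for r, _ in cluster)
            and any(r == rows - 1 for r, _ in cluster)
            and any(c == 0 for _, c in cluster)
            and any(c == cols - 1 for _, c in cluster))


def percolation_cluster(grid):
    """Largest cluster connected to all four borders, or None if there is none."""
    rows, cols = len(grid), len(grid[0])
    best, visited = None, set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and (r, c) not in visited:
                cluster = cluster_from(grid, (r, c))
                visited |= cluster
                if touches_all_borders(cluster, rows, cols) and (
                        best is None or len(cluster) > len(best)):
                    best = cluster
    return best
```

This uses global knowledge of the array, so it is a way to evaluate the definition in simulation rather than something a cell could run on its own.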


Figure 5 shows the formation of the clusters in a square 
array, where the critical probability is 0.59. Here cells in solid 
boxes belong to the largest cluster. When p is 0.5, the perco- 
lation cluster does not appear (Figure 5(a)); when p is 0.61, a 
thin percolation cluster appears (Figure 5(b)); when p is 0.75, 
a thick percolation cluster appears (Figure 5(c)). 


However, as we can see in Figure 5(b), even though the percolation
cluster appears in the array, when p is not high enough,
the percolation cluster is “thin”, contains many branches, and
does not include many boundary cells. The cells on the thin
percolation cluster may not be used effectively because communication
between two cells in the thin cluster is difficult, and
configuration of cells into a computation graph is not efficient.
Furthermore, input and output on the thin percolation cluster
are difficult due to the lack of boundary cells. Therefore
the thin percolation cluster in Figure 5(b) may not be useful.
When p is high enough, the percolation cluster that appears in
the array is “thick”, and has many boundary cells. The cells in
the thick cluster may be used effectively for configuration and
computation. The cluster in this case is shown in Figure 5(c).


Figure 5. The formation of clusters on the square array: (a) p = 0.5, a percolation cluster does not appear; (b) p = 0.61, a thin percolation cluster appears; (c) p = 0.75, a thick percolation cluster appears.


The usefulness of the percolation cluster depends on the 
computation graph that will be embedded on the percolation 
cluster. For example, a percolation cluster may be considered 
adequate enough to embed a linear array but the same percola- 
tion cluster may not be adequate to embed a two-dimensional 
array. 


Formation of the Cluster 


When p is more than $p_c$, we can form a percolation cluster
of working cells in the defective array. The cluster is formed by 
connecting the working cells into a spanning tree which spans 
all the working cells connected to a certain boundary cell. The 
controller sends a message to a working cell at the boundary, 
where the message specifies that a spanning tree of working 
cells be formed. The cell that received the message from the 
controller becomes the root of the spanning tree. 


After a spanning tree is formed with the cell that received 
a message from the controller as the root of the spanning tree, 
the cell returns the number of cells connected into a spanning 
tree to the controller. If the number is large enough, the cells 
in the spanning tree are taken as the percolation cluster. If the 
number is not large enough, another message is sent to some 
other working cell on the boundary, and a new spanning tree is 
formed with the new cell as the root of the spanning tree. This 
process continues until a percolation cluster is found, or the 
controller gives up finding a percolation cluster in the defective 
array. 


Figure 6. Percentage of working cells that are in the largest
cluster (horizontal axis: working probability (%))


When a cell receives a message from a neighbor, the cell
changes its state by setting the Status Register, SR, to “live”,
and it saves the direction of the neighbor in CR[0] as the father
of the cell. The cell then sends the message to its working
neighbors. If a neighbor returns the message that the neighbor
is a part of the spanning tree, the direction of the neighbor
is saved in CR[1..N-1] as a son. The spanning tree grows in
depth-first order. The enabling and disabling of communication
registers is necessary to prevent deadlock.
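
A centralized simulation of this depth-first growth might look as follows. It is a sketch under stated assumptions: the direction encoding (0 = up, 1 = right, 2 = down, 3 = left), the dictionaries standing in for SR and CR, and the function name are all hypothetical, and recursion takes the place of the message exchange between cells.

```python
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # assumed encoding: up, right, down, left


def form_spanning_tree(grid, root):
    """Grow a depth-first spanning tree from root on a square array.

    Returns (father, sons, size): father[cell] is the direction stored in
    CR[0], sons[cell] lists the directions stored in CR[1..N-1], and size is
    the cell count the root would report to the Controller.
    """
    rows, cols = len(grid), len(grid[0])
    father = {root: None}     # None: the root received its message from the Controller
    sons = {root: []}
    status = {root: "live"}   # stands in for the Status Register

    def visit(cell):
        count = 1
        for d, (dr, dc) in enumerate(DIRS):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if not grid[nxt[0]][nxt[1]] or nxt in status:
                continue      # defective, or already part of the tree
            status[nxt] = "live"
            father[nxt] = (d + 2) % 4   # direction pointing back to the sender
            sons[nxt] = []
            sons[cell].append(d)
            count += visit(nxt)
        return count

    return father, sons, visit(root)
```

Because the sketch recurses once per cell, a large array would need an explicit stack or a raised recursion limit.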


Figure 6 shows the percentage of working cells that are
in the largest cluster in the 120 by 120 square, hexagonal, and
octal arrays. The experimental results are generally
in agreement with percolation theory. From Figure 6 we can see 
that more than 90% of working cells belong to the percolation 
cluster when p is more than 0.7 on the square array, when p 
is more than 0.6 on the hexagonal array, and when p is more 
than 0.5 on the octal array. 
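
An experiment of this kind can be approximated with a short Monte Carlo sketch. The version below handles only the square array and uses a union-find structure to label clusters; the function name, the seed parameter, and the use of Python's `random` module are assumptions for illustration, not the simulator used for Figure 6.

```python
import random


def largest_cluster_fraction(n, p, seed=0):
    """Fraction of working cells that lie in the largest cluster of an
    n-by-n square array where each cell works independently with probability p."""
    rng = random.Random(seed)
    working = [[rng.random() < p for _ in range(n)] for _ in range(n)]
    parent = list(range(n * n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for r in range(n):
        for c in range(n):
            if not working[r][c]:
                continue
            # Joining with the right and down neighbors covers every adjacent pair once.
            if c + 1 < n and working[r][c + 1]:
                parent[find(r * n + c)] = find(r * n + c + 1)
            if r + 1 < n and working[r + 1][c]:
                parent[find(r * n + c)] = find((r + 1) * n + c)

    sizes = {}
    for r in range(n):
        for c in range(n):
            if working[r][c]:
                root = find(r * n + c)
                sizes[root] = sizes.get(root, 0) + 1
    total = sum(sizes.values())
    return max(sizes.values()) / total if total else 0.0
```

Averaging `largest_cluster_fraction(120, p, seed)` over several seeds for a range of p values would give a curve comparable in spirit to the square-array curve of Figure 6.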


Assignment of Id 


After the working cells are connected into a spanning tree,
the cells on the spanning tree are assigned identification numbers.
The identification number of a cell is the row and column
indices of the cell in the array. The identification number is
saved in the ID Register.


The Controller sends a message to the root cell of the spanning
tree, where the first field of the message specifies the operation,
and the next fields are the row and column indices of the
root cell. When a cell receives the message, the cell saves the
row and column indices in the ID Register, and computes the ID
of a son. Then the cell sends the message with the computed
ID to that son. When the cell receives the message back from the son,
it repeats the same operations for the next son.
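
The ID computation can be illustrated with a small sketch. It assumes that rows increase downward and columns increase to the right, that directions are encoded 0 = up, 1 = right, 2 = down, 3 = left, and that the spanning tree is given as nested dictionaries mapping a son's direction to that son's subtree; none of these conventions come from the text.

```python
OFFSETS = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}  # assumed direction encoding


def assign_ids(tree, cell_id, out=None):
    """Walk a spanning tree, giving each cell its (row, column) ID.

    tree maps the direction of each son to that son's own subtree.
    Returns the IDs in the order in which the cells receive them.
    """
    if out is None:
        out = []
    out.append(cell_id)                  # the cell saves its own ID
    for direction, subtree in tree.items():
        dr, dc = OFFSETS[direction]
        son_id = (cell_id[0] + dr, cell_id[1] + dc)   # ID computed for the son
        assign_ids(subtree, son_id, out)              # sons are handled one at a time
    return out
```

For example, a root at (0, 0) with a son to its right, which in turn has a son below it, plus a second son below the root, is written `{1: {2: {}}, 2: {}}`.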


Pruning 


After a percolation cluster of working cells is formed in 
the defective array, the cells in the percolation cluster can be 
configured into a computation graph. However, some cells in 
the percolation cluster form a single-width dead-end branch, 
and they may not be configured effectively. Furthermore, they 
may slow down communications between cells. To facilitate 
the configuration of working cells and the communication be- 
tween cells, we may prune out the dead-end branch of cells from 
the percolation cluster. Figure 7 shows an example of such a 
pruning. 

Pruning of the dead-end branches from the cluster can
be generalized to pruning-to-k. The pruning-to-k operation
prunes out the cells which are connected to k or fewer
working neighbors. Here k is called the level of the pruning.
Pruning of the dead-end branch corresponds to pruning-to-1:
the cells connected to only one working neighbor are pruned
out from the cluster. Pruning is applied repetitively until no
more cells are pruned.


Figure 7. Pruning of dead-end branches from a cluster


By pruning the cells from the cluster, we can have a cluster 
of tightly connected cells. After the pruning-to-k operation, all 
the cells in the cluster are connected to at least k+1 working 
neighbors. This can facilitate the configuration of cells into a 
graph and the communication among the cells in the cluster. 


The Controller initiates the pruning operation by sending
a message to a cell in the percolation cluster. The message
consists of a field specifying the pruning operation and the level of
the pruning, k. When a cell receives the message from a neighbor,
the cell counts its working neighbors and compares
the number of working neighbors with the pruning level
k specified in the message. If the number of working neighbors is not
greater than the pruning level, the cell prunes itself from the
cluster by changing its Status Register, SR, from “live” to
“pruned”. Then the cell sends messages to its working neighbors
that it has been pruned out. The neighbors set their Neighbor
Status Registers accordingly. The cell which received the message
from the Controller returns a message containing the
number of pruned cells to the Controller. The Controller sends
the pruning message again until no more cells are pruned out
from the cluster.
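
The repeated pruning passes can be sketched centrally as follows. The function and parameter names are hypothetical, and each pass removes all currently under-connected cells at once instead of following the message order, which does not change the final result: repeatedly deleting cells with k or fewer neighbors leaves what graph theory calls the (k+1)-core.

```python
def square_neighbors(cell):
    """Four immediate neighbors of a cell on the square array."""
    r, c = cell
    return [(r - 1, c), (r, c + 1), (r + 1, c), (r, c - 1)]


def prune_to_k(cells, neighbors, k):
    """Repeatedly remove cells that have k or fewer working neighbors.

    cells: set of (row, col) positions in the cluster.
    neighbors: function returning the adjacent positions of a cell.
    Returns (remaining cells, number of passes that pruned something).
    """
    live = set(cells)
    passes = 0
    while True:
        doomed = {c for c in live
                  if sum(n in live for n in neighbors(c)) <= k}
        if not doomed:
            return live, passes
        live -= doomed
        passes += 1
```

With a 2 by 2 block of cells and a two-cell tail attached to it, pruning-to-1 removes the tail tip in the first pass and the rest of the tail in the second, leaving the block.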


Figure 8 shows the pruning of cells from the percolation 
cluster on the square, hexagonal, and octal array of size 40 by 
40, 80 by 80, and 120 by 120. The pruning level k was restricted 
to less than half the number of neighbors. When the pruning 
level is more than half the number of neighbors on the network, 
most of the cells are pruned out leaving only a small isolated 
cluster of cells. 


From the experiments, we can see that when the working
probability is high enough, most of the cells in the percolation
cluster are connected to several working neighbors. When the
working probability is 80%, more than 95% of the cells in the
cluster on the square array have two or more working neighbors;
on the hexagonal array, more than 95% of the cells in the cluster have
three or more working neighbors; and on the octal array, more
than 95% of the cells in the cluster have four or more working
neighbors.


Configuration of Cells 


The cells have to be configured into a general computation 
graph which specifies the function of the array. Before the 
configuration of the cells into a general computation graph, we 
studied the configurations into three particular graphs: linear 
array, complete binary tree, and two-dimensional array (mesh). 
Many computations can be done efficiently on these graphs. 


The configuration of cells into a computation graph is the 
process of embedding the computation graph on the defective 
array. Since the cells that do not belong to the percolation
cluster cannot be used, the computation graph is embedded on
the percolation cluster. Note that when the percolation cluster 
appears on the defective array, most of the working cells belong 
to the percolation cluster. 


The efficiency of the configuration into a graph G, $e_G$, is
defined as


$$e_G = \frac{\text{number of cells used as computation cells}}{\text{number of working cells in the cluster}} \times 100$$


The delay $d_G(C_1, C_2)$ between two cells, $C_1$ and $C_2$, in a
configuration into a graph G is defined as one plus the number
of connection cells between the two cells $C_1$ and $C_2$. Therefore,
the delay between two directly connected cells is 1, and
the delay is 2 when there is one connection cell between two
cells. The maximum delay of the configuration is the maximum
delay over all pairs of adjacent computation cells, and the
average delay of the configuration is the average of the delays
over all pairs of adjacent computation cells.
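
These two measures are simple enough to state as code. The sketch below assumes that a configuration is summarized by the number of connection cells on each link between adjacent computation cells; the function names are made up for the example.

```python
def efficiency(num_computation_cells, num_working_cells_in_cluster):
    """Efficiency of a configuration, in percent."""
    return 100.0 * num_computation_cells / num_working_cells_in_cluster


def delay(num_connection_cells_between):
    """Delay between two adjacent computation cells."""
    return 1 + num_connection_cells_between


def max_and_average_delay(connection_cells_per_link):
    """Maximum and average delay over all links of the configured graph."""
    delays = [delay(n) for n in connection_cells_per_link]
    return max(delays), sum(delays) / len(delays)
```

For instance, four links with 0, 1, 0, and 2 connection cells have delays 1, 2, 1, and 3, so the maximum delay is 3 and the average delay is 1.75.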


We define the degree of a graph G, $d_G$, as the average
degree of the vertices of the graph. To configure the cells in
the defective array into a computation graph of degree $d_G$
efficiently, the number of neighbors on the array, or the degree
of the array, $d_A$, needs to be greater than $d_G$. When $d_A$ is less
than $d_G$, the efficiency of the configuration becomes low. We
can expect that cells can be configured into a linear array ($d_G$
= 2), and a tree ($d_G$ = 3) efficiently on the square array ($d_A$
= 4), the hexagonal array ($d_A$ = 6), and the octal array ($d_A$ = 8).
But the configuration of cells into a mesh ($d_G$ = 4) on the
defective square array may not be as efficient as the configuration
on the hexagonal array or on the octal array.


The following sections give an overview of the configuration
procedures for a linear array, a tree, and a mesh. The
detailed algorithms can be found in [10].


Figure 8. Percentage of cells in the largest cluster that are pruned out


Linear Array


In the linear array, every cell has two neighbors: the pre- 
decessor, and the successor. The configuration of cells into 
a linear array is the process of identifying predecessors and 
successors and saving the directions of the predecessors and 
successors in the Connection Registers, CR[0], and CR[1]. 


The configuration procedure consists of three parts: Linear,
Extend, and Join. Procedure Linear grows the linear array
through the defective array of cells. Once the linear array has been grown
in the defective array by Procedure Linear, Procedure Extend
finds the cells which are not in the linear array, but which
can be connected into the linear array. Then Procedure Join
connects the cells identified by Procedure Extend into the linear
array. By combining the three procedures, Linear, Extend,
and Join, most of the cells in the cluster are connected into the
linear array.


The Controller initiates the configuration by sending a message
to a cell at the boundary of the array. The message
consists of fields specifying the operation, the number of
cells to be connected into the linear array, and the direction of the
successor neighbor. When a cell B receives a message from a
neighboring cell A, cell B sets the Pattern Register PR to
“linear array” and saves the direction of cell A in CR[0] as its
predecessor. Then cell B tries to grow the array by adding
a neighbor as its successor. First, if the neighbor C specified
as the successor in the message is working, the message is sent
to C. If C is connected to the linear array, the linear array
grows from C again. Cell C returns the message to B with
the number of cells on the linear array after C. Then cell B
saves the direction of cell C in CR[1] as its successor,
increases the number of cells by one, and returns the message
to the predecessor, A. If C fails to be connected into the linear
array, then the neighbor in the direction of growth of the
linear array as specified in the message is tried. If this fails
too, then any working neighbor is tried. If all fail, the linear
array retracts to cell B, and growth of the linear array is tried
at cell B again. Since the cells do not know the global state of
the network, the linear array can grow into a dead end,
and the cells may have to backtrack often.
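
The grow-and-retract behavior can be illustrated with a simplified backtracking search. This is not Procedure Linear itself (the detailed algorithm is in [10]); it is a centralized sketch that assumes a square array, a hypothetical direction encoding (0 = up, 1 = right, 2 = down, 3 = left), and a fixed target length, and it may take exponential time on large arrays.

```python
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # assumed encoding: up, right, down, left


def grow_linear(grid, start, wanted, heading=1):
    """Try to grow a linear array of `wanted` working cells from `start`.

    Returns the cells in predecessor-to-successor order, or None when no
    such linear array exists.
    """
    rows, cols = len(grid), len(grid[0])
    path, used = [start], {start}

    def extend(cell, heading):
        if len(path) == wanted:
            return True
        # Try the current direction of growth first, then any other neighbor.
        order = [heading] + [d for d in range(4) if d != heading]
        for d in order:
            nxt = (cell[0] + DIRS[d][0], cell[1] + DIRS[d][1])
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if not grid[nxt[0]][nxt[1]] or nxt in used:
                continue
            path.append(nxt)
            used.add(nxt)
            if extend(nxt, d):
                return True
            path.pop()            # retract: this branch was a dead end
            used.discard(nxt)
        return False

    return path if extend(start, heading) else None
```

On a small array with one blocked row segment, the path simply bends around the defective cells; asking for more cells than can be chained makes the search retract all the way and report failure.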


When Procedure Linear is finished, the cell at the bound- 
ary which received the message from the Controller returns the 
number of cells connected into the linear array to the controller. 
If the number of cells is less than the number the Controller 
wants, the controller sends a new message to the cell. The new 
message consists of the fields specifying Procedure Extend, and 
the number of cells to be joined to the linear array. The cells 
which were not part of the linear array but adjacent to the 
linear array are identified and joined into the linear array. 


With the three configuration procedures, Linear, Extend,
and Join, most of the cells are configured into a linear array
when the working probability of a cell is adequate. Figure 9
shows the cells connected into a linear array on the square
array. Figure 10 shows the efficiency of the configuration of
a linear array on the square array, on the hexagonal array, and
on the octal array of size 120 by 120.


Figure 9. Configuration of cells into a linear array in the
defective square array, p = 0.8


Figure 10. Efficiency of linear array configuration


From Figure 10, we can see that most working cells are
connected into a linear array. Since the degree of the linear array
is two, the configuration of cells into a linear array should be
efficient on all arrays even when the working probability of a cell is
not high. As the coordination number of the array increases,
the efficiency of the configuration increases rapidly. On the square
array, when the working probability is 80%, more than 85% of
the working cells are connected into the linear array. On the
octal array, with a working probability of 60%, about 90% of
the working cells are connected into the linear array.


Since all computation cells are connected directly to the 
other computation cells without intervening connection cells, 
no delay has been introduced, and the average delay is 1. 


Tree 


In the complete binary tree, every cell has three neighbors:
a father, a left child, and a right child. The working cells
of the defective array are configured into a complete binary tree
by setting their Connection Registers CR[0..2] to the directions
of the father, the left child, and the right child of each cell, respectively,
and saving the level of the cell in the tree in CR[3]. (The level
of the leaf node is defined as one, and the level of a father
is one more than that of its child.) Some working cells are
used as the nodes of the tree, which are computation cells, and
some are used as connection cells, which are used to connect
computation cells.


Since the topology of the binary tree and that of the array
of cells do not match, we need to use many working cells as
connection cells even when there are no defects in the network.
Koren [9] studied embedding of a tree in a defective square
array, but his procedure allows few defects, and its efficiency is
very low when there are many defects. The algorithm sets all
working cells in the row and the column of a defective cell
as connection cells, thereby making a reduced, defect-free array
with one less row and one less column for each defect.
Note that when there are many defects, none of the rows and
columns will be without defective cells.


The algorithm we devised allows efficient configuration
even when many cells are defective. The algorithm has two
parts: Tree and Retract. Procedure Tree connects the cells
into a tree, and Procedure Retract retracts a subtree when
its level cannot be increased.


The Controller initiates the configuration into a tree by
sending a message to a cell at the boundary of the array. The
message consists of a field specifying the operation, and the
level of the tree desired. If the cells are successfully configured
into a tree of the level specified in the message, the cell returns
the level of the tree. The Controller then increases the level of
the tree by one, and sends the message again to the cell until
the desired level is achieved.


Figure 11. Configuration of cells into a tree in the defective
octal array, p = 0.7


When a cell receives the message from the father cell, the 
cell tries to increase the level of the left subtree by one. If 
it is successful, the cell tries to increase the level of the right 
subtree by one. If it is successful, the level of the tree has been 
increased by one, and the cell returns the message to its father. 
But if it fails, the cell is changed into a connection cell, the 
right subtree is retracted, and the tree expansion is tried at 
the left subtree again. If the level of the left subtree cannot be 
increased, the left subtree is retracted, and the cell is changed 
to the connection cell, and the tree expansion is tried at the 
right subtree again. The tree is expanded in breadth-first order. 
Figure 11 shows the trees embedded in an octal array. 


Since the average degree of the tree graph is three, con- 
figuration of cells into the tree in the square, hexagonal, and 
octal array could be efficient even when working probability of 
a cell is not high. Table 1 shows the maximum level of the 
tree into which cells are configured. Note that to increase the 
level of a tree by one, the number of computation cells in the 
tree should be multiplied by two. Table 2 shows the efficiency 
of the configuration, and Table 3 shows the average delay of 
the configuration. As the number of neighbors in the array in- 
creases, and as the working probability increases, efficiency of 
the configuration increases, and average delay decreases. 
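
As a quick check of the doubling remark above, using the standard node count for a complete binary tree with leaves at level 1 and the root at level L (the symbol N(L) is introduced here only for illustration):

```latex
N(L) = 2^{L} - 1, \qquad N(L+1) = 2^{L+1} - 1 = 2\,N(L) + 1
```

so each additional level requires roughly twice as many computation cells, plus one for the new root.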


We compared the efficiency of our configuration algorithm
with the embedding of an H-tree in a defectless square lattice. An
H-tree is a complete binary tree embedded in a recursive pattern
that looks like the letter ‘H’ [2]. The H-tree is known to be the most
efficient way of embedding a complete binary tree in a square
lattice. The maximum level of an H-tree that can be embedded in a
defectless square lattice is 9 in a 40 by 40 lattice, and 11 in an 80
by 80 lattice. The efficiency of embedding the largest H-tree
on a lattice of size 40 by 40 or 80 by 80 is 32%, and the delay
on the H-tree is about 3.4 [10]. Comparing these numbers with
Table 2 and Table 3, we see that we can configure cells into
a tree on a defective array without much penalty even when there are
many defective cells in the array.


Mesh 


On the mesh of cells, every cell has four neighbors: left,
right, up, and down. The working cells on the defective array are
configured into a mesh by setting their Connection Registers
CR[0..3] to the directions of the neighbors connected as the up,
right, down, and left neighbors of the cell.


Manning [12] describes an algorithm for embedding a mesh
on the defective square array, and Green [5] describes embedding
a mesh using the channel between the cells. Both use
knowledge of the global state of the defective array. Here
a distributed algorithm where each cell knows only the state of
its neighbors (working or defective) is described.


Table 1. Levels of the embedded trees


60%   |  9.2 | 14.6 | 20.3 ||  3.1 |  6.8 | 13.4

Table 2. Efficiencies of the tree configurations


working probability | size 40X40
50%   |      | 2.43 | 1.95 ||      | 2.54 | 2.17
60%   | 2.79 | 2.25 | 1.77 || 2.01 | 9.43 | 2.07
70%   | 3.03 | 2.06 | 1.88 || 2.86 | 2.22 | 1.90
80%   | 2.16 | 1.88 | 1.72 || 2.45 | 1.93 | 1.79
90%   | 2.19 | 1.75 | 1.69 || 2.49 | 2.02 | 1.75
100%  | 2.00 | 1.60 | 1.63 || 2.24 | 1.85 | 1.75

Table 3. Delays on the embedded trees


The Controller sends a message to the cell at a boundary of
the massively fault-tolerant cellular array, where the message
tells the cell to grow a horizontal line of the mesh. If it is
successful, the Controller sends a message to the cell at the
other boundary to grow a vertical line of the mesh. Growth
of horizontal and vertical lines alternates until no more lines
of the mesh can be grown on the array. The cells at the junction
of a horizontal line and a vertical line become the nodes of
the mesh. The cells at the nodes of the mesh are computation
cells, and the cells connecting the nodes are connection cells.
The computation cells are given their coordinates in the mesh.


Since the complexity of the mesh is four, and the degree
of the defective square array is less than four, the efficiency of
the configuration may be low on a square array. When a growing
horizontal line comes across a defective cell, the line must veer
around the cell, and this uses cells which could otherwise be
used for a vertical line. Veering around a defective cell on the
square array while growing a horizontal line blocks the growth
of a vertical line, and vice versa. Therefore, bending the line
should be done sparingly. On the hexagonal and octal arrays, the
growing horizontal line can use a connection without occupying
the cell in the other direction. This increases the efficiency of
configuring the cells into a mesh on hexagonal and octal arrays.


Figure 12. Configuration of cells into a mesh in the defective 
hexagonal array, p = 0.9 
Figure 13. Efficiency of mesh configuration


Table 4. Delays on the embedded meshes 


When a cell receives the message, it tries to grow the line in
the direction of the line. When the neighbor in the direction of
the line is defective, it tries to grow in the direction specified in
the message. If the cell cannot grow the line, it backtracks to
its predecessor cell. Figure 12 shows the cells configured into a
mesh.
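

The growth rule above can be illustrated with a small centralized simulation on a square array. This is only a sketch under stated assumptions: the paper's algorithm is distributed and message-driven, whereas the code below walks the grid from one place, and the veering order (down before up) is a hypothetical stand-in for "the direction specified in the message". The name `grow_horizontal_line` is illustrative.

```python
def grow_horizontal_line(working, start_row, veer=((1, 0), (-1, 0))):
    """Grow a line from the left boundary to the right boundary.

    working   -- 2-D list of booleans, True for a working cell
    start_row -- row of the boundary cell that starts the line
    veer      -- fallback directions tried when the cell to the right
                 is unavailable (assumed order, not from the paper)

    Returns the list of (row, col) cells on the line, or None if no
    line can be grown from the starting cell.
    """
    rows, cols = len(working), len(working[0])
    if not working[start_row][0]:
        return None
    path = [(start_row, 0)]
    visited = {(start_row, 0)}
    while path:
        r, c = path[-1]
        if c == cols - 1:
            return path  # reached the opposite boundary
        # Prefer the direction of the line, then the veering directions.
        for dr, dc in ((0, 1),) + tuple(veer):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and working[nr][nc] and (nr, nc) not in visited):
                visited.add((nr, nc))
                path.append((nr, nc))
                break
        else:
            path.pop()  # cannot grow: backtrack to the predecessor cell
    return None


if __name__ == "__main__":
    T, F = True, False
    grid = [[T, T, T, T],
            [T, F, T, T],
            [T, T, T, T]]
    print(grow_horizontal_line(grid, 1))
```

In the example grid the line starting in the middle row veers down around the defective cell and then continues to the right boundary.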


Figure 13 shows the size of the mesh as a percentage of
the array size, and Table 4 shows the delay of the configuration.
As shown in Figure 13 and Table 4, the efficiency of the
configuration increases rapidly and the delay decreases rapidly
as the number of neighbors increases.


Conclusion 


As shown in this paper, self-configuration of cells into var- 
ious computation graphs can be done efficiently when cells are 
adequately powerful. When we use the massively fault-tolerant 
cellular array for a particular application, we need to configure 
cells into a particular computation graph, and we can deter- 
mine the defect rate of cells which allows acceptable efficiency 
of configuration from the data in this paper. We can change 
this acceptable defect rate by changing the interconnection pat- 
terns or the size of a cell to find the feasible implementation on 


[size [40X40—~«YT—=«80X80—~«Y 120X120 
5 a oo 
80% 95 [52 [32] 13 [70[33 | 12 | 77/35.
[90% 42 [2.65 [21 [5.0[29|21[48|s.0[22 
95% 25 [17 [1632 [20/17] 31] 2a [17 
[100% 1.0 [1.0 [10 [t0l10[10 [10/10 [1.0 
available technology. Currently we are designing a wafer-scale
signal processing chip using the massively fault-tolerant cellular
array. The architecture is very homogeneous and simple,
and shows the potential for high performance.


Abstract


Several approaches to the design of fault-tolerant
arrays of processors, with a view towards wafer-scale
integration, have been proposed in the past. Using these
approaches it is possible to design linear arrays that have
several desirable features of an effective fault-tolerant
design, that is, they facilitate testability, are easily
reconfigurable, have short inter-PE wire lengths and make
one hundred percent utilization of non-faulty processors. In
contrast, no single scheme or combination of previously
proposed schemes results in an effective fault-tolerant
implementation of 2-D arrays.


In [19] we presented a computing structure that has all 
the attractive implementation features of a linear array and 
can multiply matrices with a better time performance than a 
linear array. In this paper we examine the inter-relationship 
among the number of processors, the storage within a pro- 
cessor, internal bandwidth and the time complexity of 
matrix multiplication on such a model and establish lower 
bounds on the time complexity and the number of proces- 
sors required as a function of the storage within a processor 
and the internal bandwidth. We then present a generalized 
algorithm which matches these bounds for arbitrary storage 
within a processor and arbitrary bandwidth. 


An interesting side effect of our result is that the area
and time complexities and the asymptotic complexity of the
number of processors used by the algorithm in [19] and all
previously known linear-array algorithms can be obtained as
special cases of our generalized algorithm. Additionally, the
techniques and the results of this paper can be readily
adapted to obtain such a family of optimal algorithms for
several other 2-D systolic algorithms.


1. Introduction 


The ever growing demand for high performance computing,
coupled with advances in integrated circuit
fabrication technology, has led to considerable interest in the
design of VLSI architectures that realize high-performance
parallel algorithms directly in silicon at modest cost. One
such direction, pioneered by Kung and Leiserson [12], is the
concept of systolic arrays as a VLSI computing structure to
implement high-performance parallel algorithms.


A systolic array consists of a collection of processing 
elements (PE’s), interconnected in regular nearest-neighbor 
geometrical structures like a one-dimensional (1-D) linear 
array, or two-dimensional (2-D) rectangular or hexagonal 
array. A data item that has been retrieved from the 
memory by a systolic array, passes through several PE’s 
before returning to the memory. This feature of using a 
datum from memory many times over without having to 
store and retrieve intermediate results gives rise to high 
computational throughput. Simplicity and regularity of sys- 
tolic arrays make them suitable for VLSI implementation at 
minimal design costs. Systolic arrays find natural applications
in signal and image processing where large matrix
computations need to be performed rapidly.


Still higher performance can be had by application of
wafer-scale integration (WSI) to the implementation of sys- 
tolic arrays. Rather than dice a silicon wafer into chips as is 
usually done, the idea behind WSI is to assemble an entire 
system on a single wafer. Such an increased level of integra- 
tion avoids the costs and performance loss associated with 
individual packaging of chips. However, since fabrication 
flaws in a wafer-sized circuit are inevitable it is necessary for 
these designs to be fault-tolerant so that wafers with defec- 
tive components can still be used. The homogeneity of the 
PE’s that make up a systolic array and their regular inter- 
connection make systolic arrays attractive candidates for the 
application of WSI. Thus, several approaches to the design
of fault-tolerant arrays of processors, with a view towards
WSI, have been proposed in the recent past [6,9,11,13,17,21].
The designs resulting from all of these varied techniques, 
however, either ignore or fall short of meeting some of the 
desirable features of an effective fault-tolerant design - that 
the design facilitate testability, be easily reconfigurable, have 
short inter-PE wire lengths and make good utilization of 
non-faulty PE’s. We will briefly highlight two of these
approaches [11,17] as they form the basis for the research
presented in this paper.


In [17] Rosenberg proposed the Diogenes methodology 
for the design of fault-tolerant VLSI arrays. The essence of
this methodology is linearization of processor networks 
which are mapped onto collinear layouts of processors 
configured into the desired structure by appropriate switch 
settings on buses running parallel to the PE’s. It provides 
scan-in/scan-out capability to enhance testability and 
configuration is achieved using a few control lines per pro- 
cessor. The PE’s are scanned serially and are hooked into 
the network as and when they are determined to be fault 
free. Hence use of the Diogenes technique results in one hun- 
dred percent utilization of the good PE’s. The switches can 
be set dynamically through the control lines which are acces- 
sible to the outside world thereby providing dynamic fault 
tolerance capability. Furthermore, Diogenes designs are 
modular, that is, chips designed in this manner can be cas- 
caded together by connecting their corresponding buses. 
This feature can render feasible the use of a chip having 
only a few fault-free PE’s, thereby increasing the effective 
yield. 


A second approach, tailored to the fault-tolerant design of
systolic arrays, attempts to alleviate the problem of clock
rate degradation caused by the fact that wires connecting
logically adjacent fault-free PE’s may span a large physical
distance. The essence of this approach involves retiming [14]
of the systolic algorithm. We say that a systolic design is a
retimed version of another design if the former differs from 
the latter by having additional delays on some of the com- 
munication links. Kung and Lam [11] and Varman [21] gain- 
fully employed retiming to run systolic algorithms correctly 
without any degradation in their throughput even in the 
presence of faulty PE’s. Specifically, for a linear-array sys- 
tolic algorithm that is comprised of unidirectional data 
streams, retiming requires that the additional delay encoun- 
tered by the data elements in each stream when bypassing 
a faulty PE be identical. This ensures that data elements 
that met at a PE in the fault-free array will indeed meet 
even in the presence of faulty PE’s. 


By combining the Diogenes approach to restructuring
processor arrays with the retiming schemes described in
[11,21], effective fault-tolerant implementation of linear
arrays meeting all of the desiderata enumerated earlier can 
be achieved, that is, the resulting designs are modular and 
can be easily tested and configured in the presence of faults. 
In addition, all the working cells can be utilized and more 
importantly signals need travel only short wire lengths in 
every clock cycle. 


In contrast, no single scheme or combination of previously
proposed schemes for fault-tolerant implementation of
2-D arrays simultaneously achieves all the desirable
characteristics of a fault-tolerant linear array implementation.
For instance, the Diogenes scheme results in long wire lengths
between logically adjacent PE’s (even in the absence of
faulty PE’s) and requires significantly larger area than a
two-dimensional implementation. For matrix computations
like multiplication of two $n \times n$ matrices, such a design
makes use of $O(n^3)$* area (rather than the optimal $O(n^2)$ of
a 2-D array implementation). Furthermore, if we assume
linear propagation delay for signals (not an unrealistic
assumption for the large circuits being considered in a
wafer-scale environment [2,3]) then matrix multiplication
requires $O(n^2)$ time (rather than the optimal $O(n)$ of a 2-D
array implementation). On the other hand, incorporating
retiming techniques for fault-tolerant implementation of 2-D
arrays requires the use of numerous programmable delay
registers and extra wiring channel capacity in the wafer in
order to achieve high PE utilization [11]. This makes the
technique cumbersome to use.


While a linear array is easy to implement and possesses
excellent fault-tolerant properties, simulating 2-D systolic
algorithms (like matrix multiplication) on it results in
significant degradation in time performance. For instance,
multiplication of two $n \times n$ matrices requires $O(n)$ time on a
2-D systolic array [12] whereas it requires $O(n^2)$ time on a
linear array [1,8,10].


A question that naturally arises is whether there exists
a computing structure that has all the attractive implementation
and fault-tolerant features of a linear array and can
simulate 2-D systolic algorithms with a better time performance
than a linear array.


In [19] we gave an encouraging answer to this question
by presenting a collinear VLSI array that retains all the
desirable fault-tolerant characteristics of Diogenes designs
but obviates the need for signals to travel more than a fixed
physical distance in any clock cycle. On such an array we
established a lower bound on time of $\Omega(n\sqrt{n})$ to multiply
two $n \times n$ matrices. We also presented an optimal $O(n\sqrt{n})$
time matrix multiplication algorithm that used $O(n\sqrt{n})$
processors, $O(\sqrt{n})$ storage per processor and $O(\sqrt{n})$ internal
bandwidth (that is, the number of data items that may be
simultaneously transferred in unit time across any vertical
line passing through the array). The $O(n^2)$ area and $O(n\sqrt{n})$
time required by our algorithm are significantly lower than
the $O(n^3)$ area and $O(n^2)$ time requirement of a straightforward
linearization of the 2-D systolic matrix multiplication
algorithm on our array.


In this paper we examine the inter-relationship among 
processors, storage within a processor, internal bandwidth 
and time complexity of matrix multiplication on our model. 
We establish lower bounds on the time and processor com- 
plexity for matrix multiplication as a function of the storage 
within a processor and the internal bandwidth. We then 
present a generalized algorithm that matches these bounds 
for arbitrary storage within a processor and arbitrary 
bandwidth. An interesting side effect of our result is that
the time and area complexities and the asymptotic complexity
of the number of processors used by the algorithm in [19]
and all previously described linear systolic array matrix
multiplication algorithms in [1,4,8,10,15,16] can be obtained as
special cases of our generalized algorithm.


* $f(n)=O(g(n))$ iff there exist constants $c$ and $n_0$ such that
$f(n) \le c\,g(n)$ for $n \ge n_0$, and $f(n)=\Omega(g(n))$ iff
$f(n) \ge c\,g(n)$ for $n \ge n_0$.


To illustrate the key ideas in this paper we focus only
on matrix multiplication. However, the techniques and
results of this paper can be readily adapted to obtain such a
family of optimal algorithms for other 2-D systolic algorithms
like LU decomposition, QR factorization, solution of
linear systems of equations, etc.


The rest of this paper is organized as follows. In the
next section we lay the theoretical framework for the
research reported in this paper. In section 3 we elaborate
on our computing structure and outline the design of
matrix multiplication algorithms on this model. In section 4
we present some closing remarks. The details of our algorithm,
its proof of correctness and issues regarding fault
tolerance and modularity appear in [20].


2. Theoretical Framework 


One of our major objectives is to retain the excellent
fault-tolerant characteristics afforded by collinear
implementations of processor networks but avoid the degradation in
throughput (caused by a lower clock rate) that long inter-PE
wires would impose. Making the clock rate independent of
the inter-PE wire length in a collinear implementation may
be achieved by introducing buffers in the wires at fixed physical
separation (say between every pair of physically adjacent
PE’s) so that signals are clocked in and out of these
buffers at every clock cycle. Fig. 2.1(a) is a collinear
implementation of a $3 \times 3$ 2-D array obtained using the
Diogenes methodology. Fig. 2.1(b) is the same implementation
with buffers on the connecting wires (each buffer is marked
by a symbol in the figure).


While incorporating buffers in the interconnecting wires
may appear to be a relatively minor change, it drastically
changes the nature of computations on the architecture.
What were plain interconnecting wires in the non-buffered case
are now actually pipelines through which several data items may
simultaneously be in transit from a source PE to a destination
PE. As discussed below, algorithms on such an architecture
need to be carefully designed to exploit the potential
afforded by having pipelined buses.
Fig. 2.1 (a): Linearization of 3x3 Mesh (without buffers)
Fig. 2.1 (b): Linearization of 3x3 Mesh (with buffers)


First, consider a naive simulation of the well-known 2-D
systolic algorithm for matrix multiplication [12] on our
collinear pipelined implementation. It is easy to see that
every communication step that transferred data elements
between adjacent PE’s along a column in the 2-D mesh
would now require (n-1) clock cycles to move the element
along the (n-1) buffers separating the corresponding PE’s in
the collinear implementation. Thus, the naive simulation
would require $O(n^2)$ time rather than the $O(n)$ required on a
2-D systolic array implementation. Even more interestingly,
a theoretical result due to Gentleman [5] implies that as long
as the PE’s in the collinear implementation have only a
constant ($O(1)$) amount of storage, any matrix multiplication
algorithm would require $O(n^2)$ time.


Fig. 2.2 
A consequence of the above result is that a minimum
requirement for improving the time performance is to 
increase the storage within every PE. The intuitive reason 
for doing so is to reduce the number of PE’s that take part 
in the computation and thereby decrease the maximum path 
length that any data item has to traverse during the compu- 
tation. However, decreasing the number of PE’s beyond a 
certain limit causes the computation to become compute 
bound, as we have too few PE’s to perform the computation. 


Thus, there exists an interesting tradeoff between the 
storage per PE and the execution time of the algorithm. In
addition, as we shall show, the number of parallel wires in 
the collinear layout (the number of such wires is a measure 
of the number of data items that may be simultaneously 
transferred across any vertical line passing through the 
array, and will be denoted by the term internal bandwidth) 
also affects the time complexity of the algorithm in an 
interesting way. Thus we have a four-way tradeoff between 
the execution time, the storage per PE, the internal 
bandwidth and the number of PE’s. 


Our main contribution in this paper is to fully charac- 
terize the tradeoffs involved in such a collinear model. We 
present a generalized family of matrix multiplication algo- 
rithms on this model parameterized by the storage per PE 
(s) and internal bandwidth (k). All the algorithms in our 
family are optimal in that they make most efficient use of
the given hardware resources and perform the computation 
in the minimum time possible. 


The interplay between these resources (execution time,
the storage per PE, the internal bandwidth and the number
of PE’s) can be summarized by the following graph (see Fig.
2.2).


[Fig. 2.2: graph with points a, b and c; an arrow is labeled
"increasing s"]


The curve ac denotes a lower bound on time of $n^3/p$ (where
$p$ is the number of PE’s), whereas the line bc denotes a lower
bound of $p$, which is the time it takes for an element to travel
from one end of the collinear array to the other. Observe that
the lower bound on processing time ($n^3/p$) matches the lower
bound on the communication time ($p$) when $p$ is $n\sqrt{n}$. For $p$
either less or greater than this, one of the two times
dominates. A rigorous proof of this $\Omega(n\sqrt{n})$ lower bound on
time for multiplication of two $n \times n$ matrices appears in [19].
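

As a rough numerical illustration of how the two bounds balance, the sketch below evaluates the larger of the processing bound and the communication bound for a given number of PE's. Constant factors are ignored, and the function names are illustrative, not taken from the paper.

```python
import math


def time_lower_bound(n: int, p: float) -> float:
    """Larger of the processing bound n^3/p and the communication bound p."""
    processing = n ** 3 / p   # n^3 scalar products shared among p PE's
    communication = p         # one end of the collinear array to the other
    return max(processing, communication)


def balanced_processor_count(n: int) -> float:
    """Value of p at which the two bounds coincide: p = n * sqrt(n)."""
    return n * math.sqrt(n)


if __name__ == "__main__":
    n = 16
    for p in (n, balanced_processor_count(n), n * n):
        print(p, time_lower_bound(n, p))
```

For n = 16 the bound is smallest at p = 64 = n√n, and it rises to n² = 256 both at p = n and at p = n².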


Note that to store $n^2$ elements in $n\sqrt{n}$ PE’s we require
$\sqrt{n}$ storage per PE. Moreover, $\sqrt{n}$ internal bandwidth
suffices to accomplish transfer of the $n^2$ elements across any
point in the array in $n\sqrt{n}$ time. Therefore both $s$ and $k$ are
$\sqrt{n}$ at point c.


Feasible algorithms (that is, those that do not violate
the time lower bound) are bounded within the region
enclosed by the curve ac and the lines bc and ab for $p$ in the
range $O(n)$ to $O(n^2)$ and $T$ in the range $O(n\sqrt{n})$ to $O(n^2)$
(the shaded region in the figure).


From information-flow arguments [18] it can be easily
established that for any internal bandwidth $k$, the time
required to multiply two $n \times n$ matrices must be at least $n^2/k$.
The intuitive reason for this is that in any (approximately)
equal-sized partition of the chip at least $n^2$ bits of information
must flow between the two partitions and hence for any
internal bandwidth $k$ the lower bound follows. To store the
$n^2$ elements we require at least $n^2$ storage (see [15,22] for a
proof). Therefore, within the feasible region, the lower
bounds on time and number of processors are $\Omega(n^2/k)$ and
$\Omega(n^2/s)$ respectively.


Observe that any point x $(p_x,t_x)$ in the region below ac
represents the situation where we have too few processors
($p_x$) to compute the $n^3$ scalar products in time $t_x$
($t_x < n^3/p_x$). Moving horizontally to the right (that is,
keeping $k$ fixed) would correspond to increasing the number
of PE’s and underutilizing the storage per PE. On the other
hand, moving vertically up (that is, keeping $s$ fixed and
therefore the number of PE’s fixed) we underutilize the
inter-PE bandwidth ($k$) and also incur a penalty in time.


Observe also that any point y $(p_y,t_y)$ in the region
below line bc corresponds to a situation where in time $t_y$
($t_y < p_y$) a data item has to be distributed to $p_y$ PE’s
($p_y = n^2/s$). Moving horizontally to the left corresponds to
decreasing the maximum path length by reducing the
number of PE’s. This can only be achieved by increasing the
storage per processor, which may not always be possible for
PE’s embedded in a chip. We would therefore pay the time
penalty of moving vertically to the top.


In this paper we will describe the design of a generalized
family of systolic algorithms in the feasible region,
parameterized by $s$ and $k$, that simultaneously meets both
the time and processor lower bounds. Now the algorithms
in [1,8,10] use $O(n)$ PE’s, $O(n)$ storage per PE and require
$O(n^2)$ time. These algorithms operate at point a in Fig. 2.2.
When $s$ is $n$ our generalized algorithm will have the same
processor, storage and time complexities. The modular
algorithm in [15] requires $O(n^2)$ PE’s, uses $O(1)$ storage per PE
and also requires $O(n^2)$ time. It operates at point b in Fig.
2.2. Our generalized algorithm will have identical processor,
storage and time complexities when $s$ is 1. Using a single
bus ($k=1$) in our generalized algorithm results in the family
of algorithms in [4,16] that operate along the line ab in Fig.
2.2. Lastly, the algorithm in [19] that operates at point c is
obtained from our generalized algorithm when both $s$ and $k$
are $\sqrt{n}$.
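

The special cases listed above can be tabulated with a short sketch. It assumes the processor count and time are exactly n²/s and n²/k (constant factors dropped). The feasibility test uses k ≤ s, which follows from T ≥ p, and s·k ≤ n, which is obtained here by substituting p = n²/s and T = n²/k into T ≥ n³/p. The second condition is a derivation made for this illustration rather than a statement from the text, and the function names are hypothetical.

```python
def is_feasible(n: int, s: int, k: int) -> bool:
    """True if (s, k) lies in the feasible region (derived conditions)."""
    return 1 <= k <= s <= n and s * k <= n


def operating_point(n: int, s: int, k: int):
    """Return (processors, time) up to constant factors for storage s, bandwidth k."""
    if not is_feasible(n, s, k):
        raise ValueError("(s, k) is outside the feasible region")
    return n * n / s, n * n / k


if __name__ == "__main__":
    n = 16
    print("point a:", operating_point(n, s=n, k=1))   # algorithms in [1,8,10]
    print("point b:", operating_point(n, s=1, k=1))   # algorithm in [15]
    print("point c:", operating_point(n, s=4, k=4))   # balanced, s = k = sqrt(n)
```

With n = 16 this gives (16, 256) at point a, (256, 256) at point b and (64, 64) at point c, matching the orders quoted for those algorithms.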


3. Computing Model


Let A and B be two $n \times n$ input matrices and let C be
the result matrix. Let $s$ ($1 \le s \le n$) be the storage per PE and
$k$ ($1 \le k \le \sqrt{n}$) be the number of buses. (Observe in Fig. 2.2
that we never require more than $\sqrt{n}$ buses in the feasible
region.) The cells used for multiplication are arranged on a
straight line and communicate by multiple buses as shown
in Fig. 3.1 below for the case $n=4$ and $k=s=2$.


The $n^2$ elements of matrix C are stored in the local
memories of the PE’s ($s$ of them in each PE) and updated
in situ as elements of matrices A and B are made available
to the PE’s. There are four distinct types of buses which
transport elements of matrix A (ABUS), elements of matrix
B (BBUS), control signals (CNTRLBUS) and address signals
(ADDRBUS) from the I/O port at the left end of the array
to the different PE’s. There are $k$ buses of each type and
each PE is hooked to exactly one ABUS, BBUS, CNTRLBUS
and ADDRBUS. This makes our structure attractive for
implementation purposes.


We can visualize all the cells in the layout as being
subdivided into $k^2$ contiguous blocks where each block is a
contiguous sequence of $w$ cells ($w$ is $O(n^2/(k^2 s))$; see
[20] for more details). Each ABUS is connected to $kw$
(physically) consecutive PE’s. The first ABUS is connected to
PE’s 1,2,..,w,w+1,..,2w,..,kw, the second one is connected to
PE’s kw+1,kw+2,..,2kw and so on. Each BBUS is connected
to each of the cells in $k$ different blocks where each
block is separated by a run of (k-1)w PE’s that are connected
to the remaining k-1 BBUS’s. Thus the first BBUS is
connected to each of the PE’s in the first block comprising
PE’s 1,2,..,w and then to each of the PE’s in the block
consisting of PE’s kw+1,kw+2,..,(k+1)w and so on. The
second BBUS is connected to the block with PE’s
w+1,w+2,..,2w and then to the block with PE’s
(k+1)w+1,(k+1)w+2,..,(k+2)w and so on. The organizations
of the CNTRLBUS and ADDRBUS are similar to the
ABUS. Observe that the above interconnection transforms
each of the $k^2$ blocks into a linear-array block. As an
illustration, observe in Fig. 3.1 ($w$ is 3 as $n$ is 4 and $s$ is 2) that
(ABUS)_1 is connected to the first six PE’s whereas (ABUS)_2
is connected to the remaining six PE’s. (BBUS)_1 is first
connected to the block consisting of PE’s 1, 2 and 3 then to the
block with PE’s 7, 8 and 9 whereas (BBUS)_2 is connected
first to the block with PE’s 4, 5 and 6 followed by the block
consisting of PE’s 10, 11 and 12.
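

The bus connectivity described above can be written down as a small index computation. The sketch below is only an illustration of the numbering scheme in the text (PE's, blocks and buses are all numbered from 1); the function name `bus_assignment` is hypothetical.

```python
def bus_assignment(k: int, w: int) -> dict:
    """Map each PE index to (block, ABUS index, BBUS index), all 1-based.

    There are k*k blocks of w consecutive PE's. Each ABUS serves k
    adjacent blocks; each BBUS serves blocks whose indices differ by k.
    """
    table = {}
    for pe in range(1, k * k * w + 1):
        block = (pe - 1) // w + 1
        abus = (block - 1) // k + 1
        bbus = (block - 1) % k + 1
        table[pe] = (block, abus, bbus)
    return table


if __name__ == "__main__":
    # The case illustrated in the text: k = 2 and w = 3 (12 PE's).
    for pe, (block, abus, bbus) in bus_assignment(2, 3).items():
        print(f"PE {pe:2d}: block {block}, ABUS {abus}, BBUS {bbus}")
```

For k = 2 and w = 3 this reproduces the example in the text: the first ABUS serves PE's 1 to 6, the first BBUS serves PE's 1 to 3 and 7 to 9, and the second BBUS serves PE's 4 to 6 and 10 to 12.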


Finally, all the processors operate synchronously and 
are driven by a global clock. 


Associated with each bus is a delay equal to the
number of clock ticks required to move a data element traveling
along the bus between consecutive PE’s. The delay
associated with the ABUS and CNTRLBUS is 1, as shown by
the "unit-delay" buffers on these buses in Fig. 3.1,
whereas the delay encountered on the BBUS is 2 (also marked
in Fig. 3.1). Therefore, in our model no element has to traverse
more than a fixed physical distance in one clock cycle and
this distance is independent of the size of the array. In our
model, therefore, signal delay is proportional to the distance
it has to propagate. Such a delay model appears appropriate
for the large circuits produced by wafer-scale integration.
See [3] for a justification of this delay model and [2] for a
detailed discussion of the various delay models.


Fig. 3.1: Collinear Network of Processors (n=4) 
Fig. 3.2: Modular Arrangement


Fig. 3.2 is a logical arrangement of our computing
structure. The linear array blocks are indexed $m_1, m_2, .., m_{k^2}$.


Note that we connect together the BBUS of blocks
whose indices differ by $k$ whereas the ABUS, CNTRLBUS
and ADDRBUS of $k$ adjacent blocks are connected. The
difference in indices between the last processor in block $m_i$
and the first processor in block $m_{i+k}$ is $(k-1)w+1$. Recall
that a data element that is travelling along the BBUS
requires two ticks to move between consecutively indexed
processors. Hence an element of matrix B that emerges from
the last cell in $m_i$ reaches the first cell in block $m_{i+k}$ in
$2((k-1)w+1)$ clock cycles. The delay element on the BBUS
between blocks models this delay. By distributing the delay along
the BBUS (as shown in Fig. 3.1) the clock rate is made
independent of the physical separation between the blocks.
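

The inter-block delay on the BBUS can be checked directly from the PE numbering. The sketch below assumes blocks of w consecutive PE's numbered from 1 and two clock ticks per hop on the BBUS, as stated in the text; the helper names are illustrative.

```python
def first_pe(block: int, w: int) -> int:
    """Index of the first PE in a 1-based block of w PE's."""
    return (block - 1) * w + 1


def last_pe(block: int, w: int) -> int:
    """Index of the last PE in a 1-based block of w PE's."""
    return block * w


def bbus_interblock_delay(k: int, w: int, block: int = 1,
                          ticks_per_hop: int = 2) -> int:
    """Ticks for a B element to go from the last cell of block i to the
    first cell of block i + k along the BBUS."""
    index_gap = first_pe(block + k, w) - last_pe(block, w)
    return ticks_per_hop * index_gap


if __name__ == "__main__":
    # k = 2, w = 3: from PE 3 (end of block 1) to PE 7 (start of block 3).
    print(bbus_interblock_delay(2, 3))
```

The index gap computed this way equals (k-1)w+1 for every starting block, so the delay is 2((k-1)w+1) ticks; for k = 2 and w = 3 it is 8.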


3.1. Multiplication Algorithm 


We will now outline our algorithm for multiplying two
$n \times n$ matrices whose elements are pipelined through the
buses in our computing structure. The elements of the
result matrix C are stored in the local storage of PE’s, one
per word, and updated in situ as elements of matrices A and
B are made available to the PE’s.


We first divide matrix C into $k^2$ submatrices each of
size $\frac{n}{k} \times \frac{n}{k}$. Each of these submatrices is computed in a
linear-array block of $w$ cells. The first block of $k$ submatrices
is computed in the first $k$ linear-array blocks, the second
block in the next set of $k$ linear-array blocks and so on.


If $s = n/k$ then we can compute the product of these
submatrices using a linear array of $O(n/k)$ PE’s and $O(n/k)$
storage per PE [1,8,10]. These algorithms store one of the
input matrices (one row per PE) and pipeline the columns of
the other matrix through the array. After all these columns
pass through a PE, an entire row of the result matrix gets
computed in that PE and there is sufficient storage within it
to store the computed row. However, the difficult and
interesting case arises when $s < n/k$. This is because we have
too little storage within a PE to store an entire row of the
input or result matrix. In contrast to the previous case
(wherein a PE must use all the elements of the second input
matrix passing through it in order to compute an entire row of
the result matrix) a PE now will be required to use only certain
of these elements in order to compute part of an entire row
of the result matrix. We can handle both these cases by
using the linear-array algorithm in [16] to compute the
submatrices. Note that the $a_{ij}$’s and $b_{ij}$’s are required in
different linear-array blocks. The final details of our algorithm
involve carefully scheduling their arrival at the
different blocks so that the correct product terms meet at
the correct time in the appropriate PE and accomplish the
entire computation in optimal $O(n^2/k)$ time.


We estimate the time and processor complexity of our
algorithm as follows (a detailed analysis appears in the
appendix).


Recall that in the feasible region $\Omega(n^2/k)$ and $\Omega(n^2/s)$ are
the lower bounds on time and processor complexity
respectively.


We use $k^2 w$ PE’s and $w$ is $O(n^2/(k^2 s))$ and hence the
processor complexity is $O(n^2/s)$.


For estimating the time complexity of our algorithm,
first note that we need to compute the $n^3/k^2$ scalar products
of a submatrix in each linear-array block. This requires time
which is at most $O(n^2/k)$. Secondly, an element from matrix B
need travel a maximum path length of $O(n^2/s)$. Now $k \le s$ in
the feasible region, and hence the time complexity is
bounded by $O(n^2/k)$.


Thus, for any $s$ and any $k$, our algorithm achieves
optimal time and processor bounds in the feasible region.


An interesting byproduct of our generalization is that
the area and time complexities and the asymptotic complexity
of the number of processors used by all known
linear-array matrix multiplication algorithms [1,4,8,10,15,16]
and the algorithm in [19] appear as special cases of our
generalized algorithm. We show this as follows.


The algorithms in [1,8,10] use $O(n)$ PE’s, $O(n)$ storage
per PE and require $O(n^2)$ time. These algorithms operate at
point a in Fig. 2.2. The algorithm resulting from our
generalized algorithm when $s$ is $n$ will have the same processor,
storage and time complexities. The modular algorithm
in [15] requires $O(n^2)$ PE’s, uses $O(1)$ storage per PE and
also requires $O(n^2)$ time. It operates at point b in Fig. 2.2.
Our generalized algorithm will have identical asymptotic
complexities for time, area and number of processors when $s$
is 1. Using a single bus ($k=1$) in our generalized algorithm
would result in the family of algorithms in [4,16] that
operate along the line ab in Fig. 2.2.


Lastly, when s and k are both √n (that is, they are
balanced), our generalized algorithm achieves the optimal
time bound of O(n√n) using √n buses, √n storage per PE and
O(n√n) PE's. All these complexities are identical to the
algorithm in [19]. Some features of this algorithm are worth
noting. Our balanced algorithm operates at point c in Fig.
2.2. Observe in the figure that the lower bound on
communication time matches that of the processing time when
s and k are both √n (point c in the figure). We compare this
balanced algorithm with the naive simulation of a 2-D
systolic matrix multiplication algorithm on the collinear
structure in Fig. 2.1(b).
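
The special cases above can be checked numerically. The short sketch below is only an illustration: it assumes the bounds have the form n^2/k for time and n^2/s for the number of processors (which agrees with every case quoted in the text), it ignores constant factors, and the function name `bounds` and the feasibility check are assumptions introduced here.

```python
import math

def bounds(n, s, k):
    """Return (time, processors), ignoring constant factors.

    Assumed forms: time ~ n^2 / k and processors ~ n^2 / s, where s is
    the storage per PE and k is the number of buses.
    """
    # Assumed feasible region for this sketch: 1 <= k <= s <= n.
    if not (1 <= k <= s <= n):
        raise ValueError("outside the assumed feasible region")
    return n * n / k, n * n / s

n = 256
r = math.isqrt(n)            # sqrt(n), exact because n is a perfect square
print(bounds(n, n, 1))       # point a: s = n, single bus
print(bounds(n, 1, 1))       # point b: s = 1, single bus
print(bounds(n, r, r))       # point c: balanced, s = k = sqrt(n)
```

For the balanced point both values equal n√n, which matches the O(n√n) time and O(n√n) PE's stated for the balanced algorithm.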




As mentioned earlier, the primary drawback with such
a simulation is that n×n matrix multiplication on it requires
O(n^3) area, 2n buses and O(n^2) time. Note that this is the
case despite the buffers on the buses to pipeline several
elements along them.


Fig. 2.1(a) is the collinear implementation obtained 
using the Diogenes design methodology. However, a major 
drawback with these designs is the long wire lengths 
between logically adjacent PE's (even in the absence of
faulty PE's) resulting in significant degradation in
throughput (due to a slower clock speed). Simulating two 
nXn matrices on this structure requires O(n*) area. Furth- 
ermore, if we assume that signals propagate a fixed physical 
distance in every cycle, then the time required for multipli- 
cation becomes O(n‘). 

In contrast, the time complexity of our balanced
algorithm is only O(n√n). Furthermore, we require only n√n
processors (rather than the n^2 PE's required for simulation)
and only √n buses (rather than the 2n required for
simulation). The total area occupied by the n√n processors is
O(n^2) (as each processor requires √n storage). Also, each
bus is n√n long and hence the wiring area occupied by all the
√n buses is O(n^2). Thus the total area required by our
balanced algorithm is O(n^2). In contrast, the simulated 2-D
systolic algorithm requires O(n^3) area. Thus our balanced
algorithm succinctly demonstrates that we can preserve the
ease of testability and configurability that are
characteristics of the linearization approach and also have
better resource utilization and better time performance.
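
The area accounting in the paragraph above can be reproduced with a few lines of arithmetic. This is a minimal sketch that ignores constant factors and assumes n is a perfect square; the function name `balanced_area` is hypothetical.

```python
import math

def balanced_area(n):
    """Area terms for the balanced case (s = k = sqrt(n)), constants ignored."""
    r = math.isqrt(n)
    if r * r != n:
        raise ValueError("this sketch assumes n is a perfect square")
    processors = n * r           # n*sqrt(n) PE's
    pe_area = processors * r     # each PE holds sqrt(n) words of storage
    bus_area = r * (n * r)       # sqrt(n) buses, each n*sqrt(n) long
    return pe_area, bus_area, pe_area + bus_area

print(balanced_area(64))         # both terms equal n^2 = 4096
```

Both the processor area and the wiring area come out as n^2, so their sum stays within O(n^2).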


4. Conclusions 


Several approaches to the design of fault-tolerant 
arrays of processors, with a view towards wafer-scale 
integration, have been proposed in the past. The designs 
resulting from all of these varied techniques, however, either 
ignore or fall short of meeting some of the desirable features 
of an effective fault-tolerant design, namely, that the design 
facilitate testability, be easily reconfigurable, have short
inter-PE wire lengths and make good utilization of
non-faulty PE's.


Some of these methodologies can be combined to pro- 
duce effective fault-tolerant implementation of linear arrays 
meeting all of the desiderata stated earlier. In contrast, no 
single scheme or combination of previously proposed 
schemes for fault-tolerant implementation of 2-D arrays 
simultaneously achieves all the desirable characteristics of a 
fault-tolerant linear array implementation. 


Diogenes designs proposed by Rosenberg are very 
attractive for fault-tolerant implementation of 2-D systolic 
algorithms. However, a major drawback with Diogenes 
designs is the long wire length between logically adjacent
PE's (even in the absence of faulty PE's) resulting in
significant degradation in throughput. Moreover, these 
designs require significantly large area for simulating 2-D 
systolic array algorithms. For matrix computations like mul- 
tiplication of two nXn matrices, such a design makes use of 
O(n”) area. Furthermore, if we assume that signals pro- 
pagate a fixed physical distance in every cycle then matrix 
multiplication requires O(n“) time. 


In [19] we presented a collinear VLSI array that retains 
all the desirable fault-tolerant characteristics of Diogenes 
designs but obviates the need for signals to travel more than 
a fixed physical distance in any clock cycle. In our model, 
all signals travel a fixed physical distance in any clock cycle, 
even in the presence of faulty processors (that is, the clock 
rate is independent of both the number of faulty and non- 
faulty PE’s). This feature of our model requires us to use 
retiming to ensure the correctness of our systolic algorithm 
despite the presence of faulty PE's. Their presence only
contributes an additive increase (equal to the number of
bypassed faulty PE's) to the time complexity of our
algorithm. We also described an algorithm to multiply two
n×n matrices in optimal O(n√n) time and O(n^2) area.


In this paper we examined the inter-relationship among 
processors, storage within a processor, internal bandwidth 
and time complexity of matrix multiplication on our model. 
We established lower bounds on the time complexity and 
the number of processors required for matrix multiplication 
on our model as a function of the storage within a processor 
and the internal bandwidth. We then presented a generalized
algorithm that matches these bounds for arbitrary
storage within a processor and arbitrary bandwidth. An
interesting side effect of our result is that the time and area
complexities and the asymptotic complexity of the number
of processors used by the algorithm in [19] and all previously
described linear systolic array matrix multiplication
algorithms in [1,4,8,10,15,16] can be obtained as special
cases of our generalized algorithm.


To illustrate the key ideas in this paper we focussed 
only on one matrix computation (namely dense matrix mul- 
tiplication) although our techniques can be easily extended 
to handle other important problems including LU decompo- 
sition, QR factorization and solution of linear systems of 
equations. 


References 


[1] J. Bentley and T. Ottmann, "The Power of One-Dimensional
Vector of Processors," Universität Karlsruhe, Bericht 89,
(April 1980).

[2] G. Bilardi, M. Pracchi and F.P. Preparata, "A Critique
and Appraisal of VLSI Models of Computation," VLSI Systems
and Computations, Computer Science Press, (1981), pp. 81-88.

[3] B. Chazelle and L. Monier, "A Model of Computation for
VLSI with Related Complexity Results," JACM, Vol. 32, No. 3,
(July 1985), pp. 573-588.

[4] A.L. Fisher, "Memory and Modularity in Systolic Array
Implementations," Proceedings of the 1985 International
Conference on Parallel Processing, (August 1985), pp. 99-101.

[5] W.M. Gentleman, "Some Complexity Results for Matrix
Computations on Parallel Processors," JACM, 25(1978),
pp. 112-115.

[6] J.W. Greene and A. Gamal, "Area and Delay Penalties in
Restructurable Wafer-Scale Arrays," JACM, (October 1984).

[7] G.H. Hardy and E.M. Wright, "Introduction to the Theory
of Numbers," Oxford University Press, Fifth Edition, (1978).

[8] A.V. Kulkarni and D.W.L. Yen, "Systolic Processing and an
Implementation for Signal and Image Processing," IEEE
Transactions on Computers, C-31, No. 10, (October 1982),
pp. 1000-1009.

[9] I. Koren, "A Reconfigurable and Fault-Tolerant VLSI
Multiprocessor Array," Proceedings of the Eighth Annual
Symposium on Computer Architecture, (August 1982),
pp. 262-264.

[10] H.T. Kung, "Systolic Algorithms for the CMU WARP
Processor," Seventh International Conference on Pattern
Recognition, (July 1984), pp. 570-577.

[11] H.T. Kung and M. Lam, "Wafer-Scale Integration and
Two-Level Pipelined Implementation of Systolic Arrays,"
Proceedings of the MIT Conference on Advanced Research in
VLSI, (January 1984).

[12] H.T. Kung and C.E. Leiserson, "Systolic Arrays (for
VLSI)," Sparse Matrix Proceedings 1978, I.S. Duff and
G.W. Stewart (editors), SIAM (1979), pp. 256-282.

[13] F.T. Leighton and C.E. Leiserson, "Wafer-Scale
Integration of Systolic Arrays," IEEE Transactions on
Computers, C-34, No. 5, (May 1985), pp. 448-461.

[14] C.E. Leiserson and J.B. Saxe, "Optimizing Synchronous
Systems," Journal of VLSI and Computer Systems, Vol. 1,
No. 1, (1983), pp. 41-68.

[15] I.V. Ramakrishnan and P.J. Varman, "Modular Matrix
Multiplication on a Linear Array," IEEE Transactions on
Computers, C-33, No. 10, (November 1984), pp. 952-958.

[16] I.V. Ramakrishnan and P.J. Varman, "An Optimal Family of
Matrix Multiplication Algorithms on Linear Arrays," 1985
International Conference on Parallel Processing, (August
1985), pp. 376-383, (accepted for publication in the IEEE
Transactions on Computers).

[17] A. Rosenberg, "The Diogenes Approach to Testable
Fault-Tolerant Networks of Processors," IEEE Transactions on
Computers, C-32, No. 10, (October 1983), pp. 902-910.

[18] J.E. Savage, "Area-time Tradeoffs for Matrix
Multiplication and Related Problems in VLSI Models," Journal
of Computer and System Sciences, 20:3, pp. 230-242.

[19] P.J. Varman and I.V. Ramakrishnan, "On Matrix
Multiplication Using Array Processors," Twelfth International
Conference on Automata, Languages and Programming, Lecture
Notes in Computer Science, Springer-Verlag, Vol. 194, (July
1985), pp. 487-496.

[20] P.J. Varman and I.V. Ramakrishnan, "A Fault-Tolerant
VLSI Matrix Multiplier," Technical Report 85/29, Department
of Computer Science, SUNY at Stony Brook, (November 1985).

[21] P.J. Varman, "Wafer-Scale Integration of Linear Arrays,"
PhD Thesis, University of Texas at Austin, (August 1983).

[22] J.E. Vuillemin, "A Combinatorial Limit to the Computing
Power of VLSI Circuits," Proceedings of the Twenty-First
Annual IEEE Symposium on Foundations of Computer Science,
pp. 294-300.



FAIL SAFE DISTRIBUTED FAULT DIAGNOSIS 
OF MULTIPROCESSOR SYSTEMS* 
Chung Yang Chiang and Chuan-lin Wu 
Department of Electrical and Computer Engineering 
The University of Texas at Austin 
Austin, TX 78712 


Abstract -- Techniques for distributed, reliable
diagnosis of multiprocessor systems are presented in
this paper. Based on these techniques, trustworthy
diagnostic results on the status (faulty or fault free) of
every processor node are revealed by the fault free
processors. An assumption that is commonly made in
most papers, and that might lead to totally incorrect
results, is eliminated for the purpose of fail safety.
The assumption puts a limit on the number of existing
faulty processors. The capabilities of the diagnosis
system are analyzed in terms of trustability,
diagnosability, coverage and mean diagnosability. A
comparison with an existing diagnosis approach is also
provided for the diagnosis of a system using a
multistage topology. The results single out the
uniqueness of our approach to fail safe diagnosis.


1. Introduction


A general guideline has been established to
avoid introducing extra faults, which can lead a fault
tolerant system to earlier destruction, whenever
system recovery schemes (such as fault detection,
diagnosis, isolation and reconfiguration) for
restoring the system to operating mode are called
for [1].

A lot of research has been done on molding
processor nodes together to form parallel processing
systems. Yet only a little effort has been spent on
how to reliably manage the system whenever faults
occur. Fault detection and location are the first two
steps to take to avoid further damage to the system.
Basically there are two kinds of fault diagnosis
methods. The first is the centralized one, in which an
external processor takes the responsibility of locating
the faulty processors. The second is the distributed
one, in which every processor is involved in the
execution of the diagnosis procedure. For a large and/or
widely dispersed system, centralized diagnosis is not
feasible due to the limitation of communication links.
Besides, those diagnosis processors will form the
"hardcore" of the system. For nonrepairable and/or
autonomous systems, only distributed diagnosis is
feasible.

Valid fault detection and location methods
should imply one hundred percent reliability in
carrying out the job. They should never come up with
false diagnostic results. Otherwise, these methods
would be as harmful as the faults which call for
them, since they could introduce more faults
into the system. Previous papers in fault diagnosis
tend to put a limit on the number of existing
faults [2-8]. If more faults than the assumed limit
occur, it is quite possible, depending on
the reliability and size of the system, that the system
will simply yield false results. This is especially true
for a large, distributed system in which only distributed
fault diagnosis is feasible. There should be no room
at all for uncertainty. A very small percentage
increase in uncertainty will result in a dramatic
decrease of system reliability, as can be observed in
[9]. This paper reports new techniques and concepts
for reliably diagnosing multiprocessor systems.

* This work is supported in part by a grant from IBM
Corporation.

In section 2, we present a system graph model
and a fault model which form the basis of a new fault
diagnosis algorithm. Section 3 consists of the
diagnosis algorithm and a proof of the validity of this
algorithm. Comparisons of coverage, trustability and
mean diagnosability of systems with different
diagnosability are presented in section 4. Section 5
concludes the work in this paper.


2. System Graphs and Fault Model

2.1 System Graph

A system interconnection graph is an
undirected graph showing the interconnections
among the processors of the system. A test graph, in
contrast, is a graph which shows the relationship
between testing and tested processors. The
interconnection graph is the underlying graph of the
test graph. An example of these
graphs is shown in Fig. 1. A directed test graph is
depicted in Fig. 1.a, in which the two processors
joined by an arrowheaded link are associated with
each other: the processor to which the arrowhead
is incident is the tested processor and the
other is the testing processor. Fig. 1.b is an
undirected test graph in which every processor is
both a testing and a tested processor to its neighboring
processors. In this particular case, the system test
graph is the same as the interconnection graph.

A processor has two roles: it is a tester at one
time, and a testee at another time. As a testing
processor, it initiates the self test of its tested
processors and monitors the results from its
tested processors. As a tested processor, it receives
start signals from its testing processors to start its self
test, reports its own status and reroutes received fault
vectors from its testees to its testers.
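
The tester and testee relationships described above map naturally onto a directed graph. The following is a minimal sketch, not the paper's data structure: the class name `SystemTestGraph`, its attributes and the small example graphs are all assumptions made for illustration.

```python
from collections import defaultdict

class SystemTestGraph:
    """Directed test graph: an edge (a, b) means p(a) tests p(b)."""

    def __init__(self, edges):
        self.testees = defaultdict(set)   # tester -> processors it tests
        self.testers = defaultdict(set)   # testee -> processors testing it
        for a, b in edges:
            self.testees[a].add(b)
            self.testers[b].add(a)

    @classmethod
    def undirected(cls, links):
        """Use every link in both directions, as in an undirected test graph."""
        return cls([(a, b) for a, b in links] + [(b, a) for a, b in links])

# Hypothetical examples, not taken from the paper's figures.
g = SystemTestGraph([(0, 1), (1, 2), (2, 3), (3, 0)])
print(sorted(g.testees[0]), sorted(g.testers[0]))     # [1] [3]
u = SystemTestGraph.undirected([(0, 1), (1, 2)])
print(sorted(u.testees[1]) == sorted(u.testers[1]))   # True
```

In the undirected case the tester set and the testee set of a processor coincide, which is the property the proof in section 3 relies on.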
2.2 Fault Model 

The fault model is similar to most of the ones
presented in previous papers [2-8] except for one
point: no assumption on the number of
allowable faults is ever made, in order to avoid false
diagnostic results and therefore possible system
catastrophe.


The fault model is defined as follows:

(1) A faulty processor might modify the fault
vectors sent to it by its testees.

(2) Only permanent faults are considered. That
is, once a processor is identified as faulty
by any fault free processor, it will be regarded
as faulty even though other processors might
identify it as fault free.

(3) A fault free processor can become faulty
during the diagnosis period.

Link faults will be treated as processor faults,
since it is technically difficult to differentiate between
link faults and processor faults. If that happens, the
system simply loses those fault free processors which
are associated with the faulty links.

Since the diagnosis system is based on a
distributed algorithm and no global clock is
provided as a system wide synchronization
mechanism, it is impossible for processors to know
the exact time of occurrence of any link or processor
fault. The only condition under which the processors can
differentiate link faults from processor faults is that
they know exactly when faults have occurred. That
implies that a global, nonskewed clock should be
implemented, which is impossible. The following
paragraphs explain a circumstance in which link
faults and processor faults could be differentiated by
systems adopting distributed diagnosis algorithms.


For instance, suppose we have 2 processors testing a
common processor; the first tester identifies it as fault
free and the second one identifies it as faulty due to
a broken or stuck link between them. The following
situations are possible:

(1) If the second tester identifies the link fault before
the first one completes its test, then every processor will
receive time-stamped messages from both the second tester,
indicating the tested processor as faulty, and the first
tester, indicating it as fault free. Since only permanent
faults are considered and the fault condition was located
first in terms of the global clock, processors shall resolve
this situation as a link fault. The faulty link is the
one that connects the second tester and the tested
processor. In this case, the useful resource in this
tested processor will not be discarded. Only the
faulty link will be isolated.

(2) Yet, if the first tester identifies the processor as
fault free first and the second one identifies the
tested processor as faulty (due to the broken or stuck link
between them) after the first tester, then it is
impossible to tell whether it is a link fault or a processor
fault, since the situation is the same as that of a fault free
processor becoming faulty before it is tested by
the second tester. The exception is when the first tester,
after receiving this conflicting message, can test this
tested processor again and issue a later-time-stamped
message indicating that this supposedly faulty processor
is still fault free.
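
The two situations above amount to a comparison of time stamps. The sketch below assumes, as the text does for the sake of argument, that globally comparable time stamps exist; the function name `classify_conflict` and its return strings are hypothetical.

```python
def classify_conflict(t_fault_free, t_faulty):
    """Resolve two conflicting reports about the same tested processor.

    t_fault_free: time stamp of the report calling it fault free.
    t_faulty:     time stamp of the report calling it faulty.
    """
    if t_faulty < t_fault_free:
        # Only permanent faults are considered, so a processor reported
        # faulty first and fault free later points to the link instead.
        return "link fault"
    # The processor may have failed after passing the first test.
    return "ambiguous"

print(classify_conflict(t_fault_free=5, t_faulty=3))   # link fault
print(classify_conflict(t_fault_free=3, t_faulty=5))   # ambiguous
```

Without a global, nonskewed clock the comparison itself is unavailable, which is why the fault model treats link faults as processor faults.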


3. Reliable Distributed Diagnosis Algorithm


3.1 Fail Safe Distributed Algorithm 


The initiation and execution of the system fault
diagnosis are assumed to be fully distributed. The following
algorithm is for fault diagnosis only. Fault detection at
each individual processor is assumed. In this diagnosis
system, we assume a certain number of processors
can periodically initiate fault diagnosis cycles to
locate faulty processors. The number of these
watchdog-type processors should be greater than the
expected degree of system fault tolerance in order to
provide the necessary fault tolerance. Whenever the
first processor initiates the current cycle, every other
processor which receives the initiation message, in
addition to starting the diagnosis cycle, should also
adjust its own internal clock to prepare for initiating
the next cycle. For the case in which faults occur
before the cycle is started, the effect of the faults will be
confined to only one cycle.

Definition 1: Diagnosability -- A system is said
to be t-fault self diagnosable if and only if each
processor in the system can correctly identify all t
faulty processors.


Diagnosability should not be taken as the sole
gauge of the system diagnosis capability. The degree
of confidence in the yielded results should be
seriously taken into consideration in addition to
diagnosability. Some might argue that we can always
construct a system with diagnosability as high as
possible. Yet no matter how complicated the system,
and therefore how high the diagnosability, might be, it
is always possible that the number of faults might
exceed the diagnosability. On the other hand, for
systems in which the diagnosability is low due to
the physically restricted structure of the system, the
possibility of more fatal faults is even higher.

The algorithm is equally applicable to every
processor in the system. The algorithm is written in
a C-like language.


Notations
(1) f[i] == fault vector maintained locally by p(i)
(processor with logical id of i).

(a) f[i][k] == kth element in f[i], representing the
status of p(k), with the following possible
values if k is not equal to i:

0 : if p(k) is fault free.
1 : if p(k) is faulty.
x : status of p(k) is undetermined.

(b) f[i][i] is the status of p(i) itself, with only two
possible values, 0 or 1.

(2) F[i] == fault vector sent by p(i) == f[i]; F[i][k] == kth
element in F[i].


Node p(j), j = 0,1,...,n-1, after receiving diagnosis
initiation messages, will perform the following algorithm:

{
INITIALIZATION:
    /* self test and initialization */
    broadcast diagnosis initiation message to
        neighboring (both tested and testing) processors;
SELF_TEST:
    if (self test passed) {
        f[j] = [f[j][n-1],..,f[j][j],..,f[j][1],f[j][0]]
             = [x,x,..,0,..,x,x];
        send self test status to p(j)'s testers;
    }
    else {
        f[j] = [x,x,..,1,..,x,x];
        send self test status to p(j)'s testers;
        break;
        /* faulty processor makes self dormant if possible */
    }
    while ((at least one element in f[j] undetermined) and
           (time not up yet)) {
TEST:   /* testing testees p(i)'s of p(j) */
        for (every p(i) tested by p(j))
            switch (status of p(i)) {
            case FAULT_FREE:
                f[j][i] = 0;
                break;
            case FAULTY:
            case NO_RESPONSE and TIMED_OUT:
                f[j][i] = 1;
                break;
            case NO_RESPONSE and TIME_NOT_UP:
                break;
            } /*switch*/
BROADCAST: /* broadcast newly modified fault vector
              to testers p(k)'s of p(j) if any */
        while (any F[i] received from fault free p(i)) {
            if (F[i] results in changes of any element in f[j]
                from x to 1 or from 0 to 1) {
                initiate self test of p(i);
                if (p(i) is found fault free and F[i][i] == 0)
                    update f[j] with F[i] from p(i);
                else /* p(i) found faulty */
                    f[j][i] = 1;
            } /*if*/
            broadcast newly changed element in f[j] to
                nonfaulty p(k)'s which test p(j);
        } /*while*/
    } /*while*/
    if (at least one element in f[j] is undefined)
FAILED:
        for (nonfaulty p(k)'s which test p(j))
            broadcast failure message to p(k);
    else {
        for (nonfaulty p(k)'s which test p(j))
            broadcast done message to p(k);
        while (at least one done message not yet received
               from fault free processors and time not up yet
               and no failure message received) ;
        if (timed out)
            go to INITIALIZATION;
        else
            if (failure message received)
                go to FAILED;
    } /*else*/
} /*main*/


The failure message indicates some of the fault 
free processors fail to generate decisive and 
unanimous result,although other fault free processors 
can locate all faulty processors. 
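
The outcome of one diagnosis cycle can be imitated with a much simpler, synchronous model. The sketch below is not the message-passing algorithm itself: it assumes that status information about p(b) reaches p(j) exactly when there is a chain of tests from p(j) to p(b) whose intermediate processors are all fault free, and that one fault free processor with an undetermined element causes the whole cycle to end as FAILED. The function name `diagnose` and the 5-node example graph are hypothetical.

```python
def diagnose(n, testees, faulty):
    """Simplified model of one diagnosis cycle.

    testees[j] lists the processors tested by p(j); faulty is the set of
    faulty processors. Returns the agreed set of faulty processors, or
    None when some fault free processor cannot determine every status.
    """
    views = {}
    for j in range(n):
        if j in faulty:
            continue                       # faulty nodes stay dormant
        known = {j: 0}                     # f[j][j] = 0 after the self test
        frontier = [j]
        while frontier:
            a = frontier.pop()
            for b in testees[a]:
                if b in known:
                    continue
                known[b] = 1 if b in faulty else 0
                if known[b] == 0:
                    frontier.append(b)     # only fault free nodes relay
        if len(known) < n:
            return None                    # an element of f[j] stays x
        views[j] = {i for i, v in known.items() if v == 1}
    # Every fault free processor now holds the same complete vector.
    return next(iter(views.values())) if views else None

# Hypothetical test graph: p(i) tests p(i+1) and p(i+2) (mod 5).
ring = {i: [(i + 1) % 5, (i + 2) % 5] for i in range(5)}
print(diagnose(5, ring, {3}))       # {3}
print(diagnose(5, ring, {1, 3}))    # {1, 3}
print(diagnose(5, ring, {1, 2}))    # None
```

In the last call both testees of p(0) are faulty, so p(0) cannot learn the status of p(3) or p(4) and the cycle fails rather than guessing.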


3.2 Definition and Theorem

Definition: Connectivity -- For an
undirected (directed) test graph, the (vertex)
connectivity K_v is the minimum number of vertices
whose removal from the graph results in a
disconnected (weakly connected or
disconnected) graph [10].

The test graph of Fig. 1.a is a directed graph
with connectivity of 2. The test graph of Fig. 1.b is
an undirected graph with connectivity of 4.

When the number of faults is equal to the
connectivity of the test graph, the diagnosis
system will fail us in some cases, as the following
paragraphs show. By failing us, we mean that the
diagnosis system finds that the current fault situation is
undiagnosable due to the lack of a unanimous and
decisive diagnostic result.

Assume the test graph of a multiprocessor
system has connectivity t. Then the following cases
are the situations in which the diagnosis system fails
when the number of faults is not less than t.

case 1: Any processor with all its t testers being faulty.
In this situation, the status of this particular processor
will be unknown to the other processors in the system.

case 2: Any processor with all its t testees being
faulty. This particular processor will be totally
blocked from the diagnostic information flowing
around the system.

case 3: There are some special cases with not less
than t faults which also result in a failed system.

As long as the number of faults is less than t, the
diagnosis system always comes up with a decisive and
unanimous result.

From the above cases, we observe that the
diagnosability of the system proposed in this paper is
always 1 less than the connectivity. The main reason
behind this decrease of diagnosability, as compared
with that of the diagnosis systems proposed in other
papers, is to achieve fail safety. Yet it does not at all
imply that the system has an inferior capability of
locating existing faults. On the contrary, it has a better
capability.

The following theorem verifies that when the
number of faults is less than the connectivity of the
system, the system can always locate all faults by
adopting the previous algorithm, and that when the
number of faults is equal to or greater than the system
connectivity, the algorithm never comes up with false
diagnostic results.


Theorem 1 -- Using the above algorithm, a
system with test graph T of K_v equal to t will either
correctly locate all faulty processors or not incorrectly
locate any faulty processors.

Proof: We will prove the validity of the algorithm in
three cases.

(1) If the number of faults is less than t:

As the system has a test graph with K_v equal to
t, the graph will still be connected (or strongly
connected). There will be at least one path between
every pair of vertices according to the definition of
connected graphs. Therefore, for a directed
graph system, every fault free vertex can receive the
diagnostic information from its fault free testees and
broadcast new information to its nonfaulty testers. For
a directed test graph, the set of testers and the set of
testees of a vertex do not have any element in
common. For an undirected test graph, the set of
testers is the same as the set of testees for a vertex.
There is still at least one path which consists of fault
free vertices, and correct diagnostic information can
be received and broadcast along this path.


(2) If the number of faults is greater than t:

According to the definition of a connected
graph, we encounter two situations:

(A) It is possible that the test graph is
disconnected (or weakly connected); then there will be
no path at all between some pairs of vertices. These
vertices will fail either to collect the status of other
vertices or to make their own status known to other
vertices. Therefore, we can sort all fault free vertices
into two categories: those which fail to collect the status
of every vertex and those which can collect the status of
every vertex. Those which fail will broadcast a failure
message to every other fault free vertex according
to the algorithm. Since the other set consists of
those which can collect the status of every vertex, they will
be able to collect this newly broadcast failure
message. The result is that no vertex in the system will
ever go into a faulty state in which it thinks the
system yields decisive and unanimous results. We
conclude that no vertex will ever incorrectly locate
faulty vertices.

(B) If the test graph is still connected (or strongly
connected), then the situation is the same as that of
(1).

(3) If the number of faults is equal to t:

This case is the same as (2) except that the
possibility of the test graph being connected is higher
than that of (2). The only occasions on which the test
graph will be disconnected are: (a) when either all
testers or all testees of a vertex are faulty, and (b) a few
special cases.

Q.E.D.

4. System Coverage and Diagnosability

4.1 Analysis of Diagnosis Trustability

Definition: Trustability of Diagnosis --
Trustability of fault diagnosis is the probability that the
diagnosis system either yields correct or does not
yield incorrect diagnostic results regardless of the
number of faults in the system.


Trustability is actually a reliability measure. A
diagnosis system which guarantees locating up to k
faults when the number of faults is not more than k
will have a diagnosis trustability of 1 when the number
of faults is less than or equal to k. Yet, when the
number of faults is greater than k, the trustability
decreases as the number of faults increases. For
example, in [2], the system will yield false results
when the number of faults is greater than the
assumed one. Take the 5-node completely
connected graph system in Fig. 1.a as an example.
According to the structure of this system, the diagnosis
algorithm in [2] can diagnose up to 2 faults; yet, if the
number of faults is greater than 2, the trustability of the
diagnosis system will be the probability of any 3 or
more existing faults in the system. According to the
algorithm presented in this paper, the diagnosability is
1 instead of 2, which of course is 1 less than that of
[2], yet the trustability of diagnosis is still 1 regardless
of the number of faults in the system. As we can see, the
increase in diagnosis trustability is at the expense of
diagnosability.


Analysis -- Trustability measures for both
fail-safe and non-fail-safe diagnosis systems are
analyzed below. We assume there is an exponential
failure distribution for every processor in the system
and that the failure distributions are independent. The
reliability function is R(t) = e^{-\lambda t}. We will use R
instead of R(t) just for simplicity. The equation for
trustability is stated as follows:

    T(t) = \sum_{i=0}^{n} p_i * C(n,i) R^{n-i} (1-R)^{i}        (1)

where C(n,i) is the number of combinations for
choosing i faulty processors out of a total of n
processors. The product of C(n,i), R^{n-i} and (1-R)^{i} is
the probability of exactly i faulty processors out of n
processors. p_i is the probability of correctly locating
all i faulty processors or not incorrectly locating any
faulty processor.
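
Equation (1) is straightforward to evaluate numerically. The sketch below is an illustration only: the function name `trustability`, the list `p` of per-fault-count probabilities and the values chosen for n, lambda and t are all assumptions.

```python
from math import comb, exp

def trustability(n, R, p):
    """T = sum over i = 0..n of p[i] * C(n,i) * R^(n-i) * (1-R)^i."""
    return sum(p[i] * comb(n, i) * R ** (n - i) * (1 - R) ** i
               for i in range(n + 1))

n = 5
R = exp(-0.01 * 10)     # R(t) = e^(-lambda t), with assumed lambda and t
# When every p_i is 1 the sum is the binomial expansion of (R + (1-R))^n,
# so the result is 1 up to floating-point rounding.
print(trustability(n, R, [1] * (n + 1)))
```

Setting some p_i below 1 lowers the result by exactly the probability mass of the fault counts that are handled incorrectly.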

(1) Fail safe diagnosis system: A system with n
processors can correctly locate up to k faults and
does not incorrectly locate faults if the number of faults
is greater than k. Then the trustability of diagnosis will
be 1, as described below:

    T_{fs}(t) = 1*R^{n} + 1*C(n,1) R^{n-1} (1-R)
              + 1*C(n,2) R^{n-2} (1-R)^{2} + ...
              + 1*C(n,k) R^{n-k} (1-R)^{k}
              + 1*C(n,k+1) R^{n-k-1} (1-R)^{k+1} + ...
              + 1*C(n,i) R^{n-i} (1-R)^{i} + ...
              + 1*(1-R)^{n}                                  (2)

which is the binomial expansion of (R + (1-R))^{n} = 1.
The jth (j <= k) item is the probability of correctly
locating all faults if there are j faults. The ith (i > k) item
is the probability of not incorrectly locating any faults
if there are i faults.

(2) Non-fail-safe diagnosis system: If the system
can correctly locate only up to k faults, and cannot
guarantee the correctness of the results when the
number of faults is greater than k, then the trustability
will be:

    T_{nfs}(t) = 1*R^{n} + 1*C(n,1) R^{n-1} (1-R)
               + 1*C(n,2) R^{n-2} (1-R)^{2} + ...
               + 1*C(n,k) R^{n-k} (1-R)^{k}
               + p_{k+1}*C(n,k+1) R^{n-k-1} (1-R)^{k+1} + ...
               + p_i*C(n,i) R^{n-i} (1-R)^{i} + ...
               + p_n*(1-R)^{n}                               (3)

where the p_i's are the probabilities that the diagnosis
system yields correct diagnostic results if there are i
faults (i > k) in the system. The probability of the
system being led to an incorrect state is 1 minus the
trustability, as follows:

    1 - T_{nfs}(t) = (1-p_{k+1})*C(n,k+1) R^{n-k-1} (1-R)^{k+1} + ...
                   + (1-p_n)*(1-R)^{n}                       (4)

We will compare the diagnosis system in [2] and
the one proposed here. The system in [2] is the first
to propose a fully distributed diagnosis system.
We call it the R system and ours the S system. The
comparison is done on the diagnosis of a system
employing a multistage topology [11], which is also
employed for functional language architectures [12].
An example topology is shown in Fig. 2.


(A) Trustability:

(1) R system = \sum_{i=0}^{2} 1*C(n,i) R^{n-i} (1-R)^{i}
             + \sum_{i=3}^{n} 0*C(n,i) R^{n-i} (1-R)^{i}

(2) S system = 1

The p_i's (i >= 3) are 0 in the R system simply because it
assumes there are at most 2 existing faults in the
system. If there are more than 2 faults, then, according
to its diagnosis algorithm, every processor will
conclude that it has already located all faulty
processors whenever there are two 1's in its fault vector,
even though there are still a few don't care terms.
This is especially true for those processors which fail
to determine the status of every processor, since they
have no way of determining the status of some
processors. They can yield decisive results only by
assuming 2 faults in the system. The trustability of the S
system is 1 because it insists on yielding a decisive
and unanimous result; no guessing is ever involved.

Remember that part of the trustability is just the
probability of not incorrectly locating any faults. When
the system can correctly locate all faults no matter
how many faults exist in the system, the trustability is
equal to the probability of correctly locating the faults.
Yet, when the system can locate the faults in some
fault situations and falsely locates faults in
other fault situations, the trustability is simply the
portion in which the system can locate the faults.
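The trustability expressions (2) and (3) can be evaluated numerically. The following Python sketch is illustrative only: the function name `trustability`, its arguments, and the sample values n = 8 and R = 0.9 are assumptions made here and do not come from the paper. It treats the R system as k = 2 with p_i = 0 for i > 2, and a fail-safe system as p_i = 1 for every i.

```python
from math import comb

def trustability(n, R, k, p=lambda i: 0.0):
    """Evaluate sum_i w_i * C(n,i) * R^(n-i) * (1-R)^i with
    w_i = 1 for i <= k and w_i = p(i) for i > k (hypothetical helper)."""
    total = 0.0
    for i in range(n + 1):
        weight = 1.0 if i <= k else p(i)
        total += weight * comb(n, i) * R ** (n - i) * (1.0 - R) ** i
    return total

# R system: correct for at most 2 faults, p_i = 0 beyond that.
r_system = trustability(8, 0.9, k=2)
# Fail-safe case: every term has weight 1, so the sum is (R + (1-R))^n = 1.
s_system = trustability(8, 0.9, k=2, p=lambda i: 1.0)
print(round(r_system, 6), round(s_system, 6))
```

With these sample values the truncated sum falls below 1, while the fail-safe sum collapses to the full binomial expansion.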


(B) Diagnosability : 

According to the PMC model, we have:

(1) R system = 2; and

(2) S system = 1

Even though the diagnosability of the S system
is 1, this does not imply that it cannot locate 2-fault
incidents at all. Given the way in which diagnosability is
defined, it simply means that it cannot locate all 2-fault
incidents, although it is capable of locating most of
them, as we shall see in the next
paragraph. Therefore, diagnosability should be coupled with
coverage to give a better idea of the probability of
correctly locating faults.


(C) Coverage : 


Definition : Coverage -- The
percentage of fault incidents in which the diagnosis
system can correctly locate all faulty processors.


Simulation Features -- The coverage measure for the S
system is acquired by simulation, which is executed in
the following manner:

(1) specify the size of the system, i.e., the number of
stages.

(2) specify the number of faulty processing elements
(PE's).

(3) for every possible combination of that number
of faults, check whether every fault-free PE can
receive the fault vectors broadcast by the other
fault-free PE's.

If at least one PE is incapable of receiving the
broadcast fault vector from at least one other
fault-free PE, then the combination is regarded as a failed
diagnosis situation, since it implies that this particular PE
has a fault vector with at least one element undetermined.
Therefore, the whole system is incapable of coming
up with a decisive and unanimous diagnostic result.

According to the simulation, we have the following
results:

(1) R system =

1 for all system sizes if # faults <= 2;
0 for all system sizes if # faults > 2.

(2) S system =

1.0000 for 3-stage system if # faults = 1;
0.7879 for 3-stage system if # faults = 2;
0.4727 for 3-stage system if # faults = 3.

1.0000 for 4-stage system if # faults = 1;
0.9395 for 4-stage system if # faults = 2;
0.8218 for 4-stage system if # faults = 3.

The numbers shown above for the 2-fault cases can
be estimated quite easily, as the following example
shows.


Example 1 -- For a 4-stage multistage
interconnection network, there are three components
which contribute to undiagnosable fault situations:
(1) all testers of a particular PE are faulty, (2) all testees
of a particular PE are faulty, and (3) special cases, as
stated in the proof of Theorem 1.

For stage 0 PE's there are 2*8 such cases. As
an example, in Fig. 2, assume that the faulty PE is
PE(0,0), where the first 0 is its stage
address and the second 0 is its address within
that stage, counted from the top. If PE(0,1) is also
faulty, then together they will block the status of their two
common testees in stage 1, PE(1,0) and PE(1,4), from
being known to other PE's. Yet, if we assume that PE(0,4)
instead of PE(0,1) is faulty besides PE(0,0), then they
will block their two common testers in stage 3, PE(3,0)
and PE(3,1), from receiving diagnostic information.
Therefore, for PE(0,0), there are 2 cases in which the
diagnosis system fails. It is the same for the other PE's in
stage 0. Therefore, we have 2*8 cases in stage 0.

The numbers for stages 1 and 2 will still be the

get 56 for cases covered by situations (1) and (2)
stated at the beginning of this example. Besides
that, there are 4 special cases.

Therefore, there are 56+4 cases in which 2-fault
situations make the diagnosis system fail. The total
number of combinations for choosing 2 faulty PE's out of
4*8 PE's is 992. Thus, the coverage is (992-60)/992, which
is 0.9395. The same estimation process can be
applied to the 2-fault cases for the 3-stage system.
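The arithmetic of this estimate can be checked directly. The Python lines below are only a sketch, with variable names chosen here for illustration. Note that 992 equals 32*31, the number of ordered pairs of distinct PE's; this seems consistent with the per-PE counting used in the example, but that reading is an interpretation rather than something the text states.

```python
# Quantities quoted in the example for the 4-stage, 2-fault case.
num_pes = 4 * 8                   # 4 stages of 8 PE's each
total_cases = 992                 # total quoted in the text
failing_cases = 56 + 4            # situations (1)/(2) plus special cases

# 992 matches the count of ordered pairs of distinct PE's (assumed reading).
ordered_pairs = num_pes * (num_pes - 1)

coverage = (total_cases - failing_cases) / total_cases
print(ordered_pairs, round(coverage, 4))
```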

As the size of the system grows, the difference in
coverage diminishes, as can be observed in Fig. 3.a
and 3.b. This implies that, as the system grows, the S
system can correctly locate almost the same number of
faulty PE's as the R system, although the
diagnosability of the S system is still 1.

As we pointed out earlier, when one considers
the positive capability of a diagnosis system, one
should take both coverage and diagnosability into
account. We define the next measure accordingly.


Definition 5 : Mean Diagnosability -- The
expected number of faults which can be correctly
located, regardless of the number of faults in the system.

The mean diagnosability is as follows.

E[D] = \sum_{i=0}^{n} i*c_i        (5)

where c_i is the coverage of the diagnosis system
provided there are only i faults.


(D) Mean Diagnosability :

(1) R system = 1*1 + 2*1 + \sum_{i=3}^{n} i*0
             = 3

(2) S system = 1*1 + 2*0.9395 + 3*0.8218 + \sum_{i=4}^{n} i*c_i
             = 5.3444 + \sum_{i=4}^{n} i*c_i

It is obvious that the mean diagnosability of the S
system is better than that of the R system.
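Equation (5) and the two partial sums above can be reproduced in a few lines. This Python sketch is illustrative: the helper name `mean_diagnosability` and the dictionary representation are assumptions made here, and only the coverage values quoted in the text for the 4-stage system (up to 3 faults) are used, so the S-system figure is the same partial sum as in the text.

```python
def mean_diagnosability(coverage):
    """Sum i * c_i over the fault counts i present in `coverage`
    (a dict mapping i to c_i; hypothetical helper for equation (5))."""
    return sum(i * c for i, c in coverage.items())

# Coverage values quoted in the text for the 4-stage system.
r_coverage = {1: 1.0, 2: 1.0, 3: 0.0}           # R system: 0 beyond 2 faults
s_coverage = {1: 1.0, 2: 0.9395, 3: 0.8218}     # S system, first three terms only

print(mean_diagnosability(r_coverage))
print(round(mean_diagnosability(s_coverage), 4))
```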
From the above comparisons, we draw the
following conclusions:

(1) Advantages of the S system are:

(a) It either correctly locates all faulty processors
or does not incorrectly locate any faulty
processor, which is the basic requirement for a truly
fault-tolerant computing machine. This can be
perceived from the trustability of the S system. An implication
of this capability is that the S system can afford to have a
longer diagnosis period, since it does not set an upper
limit on the number of allowable faults. For the R
system, the diagnosis period should be as short as
possible in order to avoid extra faults occurring during
the diagnosis period.

(b) The actual number of fault incidents in which
faults can be located by the S system is far greater
than that of the R system. This is justified by the coverage
and mean diagnosability of the S system.

(2) The disadvantage of the S system is:

(a) Its diagnosability is always 1 less than that of the R
system. Yet this is not an actual disadvantage, as
diagnosability is a false indicator of diagnosis system capability.


5. Conclusion 


We have presented a technique which is
suitable for locating faults in multiprocessor systems
in a fail-safe and distributed manner. That the system is both
competent and fail safe is justified by the measures
defined in this paper, namely mean diagnosability
and trustability. As we observed, the assumption
which has been widely used in several research
papers does not necessarily make diagnosis systems
more competent or more reliable. On the other hand, it is
always fail safe to make as few assumptions as
possible. The simulation and comparison done on a
multistage system show that, by emphasizing fail
safety, we improve not only the reliability but also
the capability of the diagnosis system.
Abstract -- Recent vector supercomputers provide
vector memory access to "randomly" indexed vectors,
whereas early vector supercomputers required
contiguously or regularly indexed vectors. This
additional capability, known as "hardware
gather/scatter", can be used to great effect in general
sparse Gaussian elimination. In this note we present
some examples that show the impact of this change in
hardware on the choice of algorithms for sparse
Gaussian elimination. Common folk wisdom holds that
general sparse Gaussian elimination algorithms do not
perform well on vector computers. Our numerical
results demonstrate that hardware gather/scatter
allows general sparse elimination algorithms to
outperform algorithms based on a band, envelope, or
block structure on such computers.


Gaussian Elimination on Vector Computers 


Early experience with sparse Gaussian elimination on
vector computers /3/ showed none of the dramatic
speed-ups encountered in other linear
algebra computations. This is because
Gaussian elimination with a sparse data structure
requires access to irregularly spaced data. Early
vector computers, such as the CRAY-1 and the
CYBER 205, allow vector transfers to and
from memory only for contiguously or regularly spaced
vectors. Most sparse Gaussian elimination algorithms
spend the vast majority of the factorization execution
time in a loop of the following type:


      INTEGER I, N, M
      INTEGER INDEX(M)
      REAL A, X(M), Y(N)

      DO 10 I = 1, M
         Y(INDEX(I)) = A*X(I) + Y(INDEX(I))
   10 CONTINUE


The difficulty this loop presents for vector computers
is the indexing, or indirect addressing, of the vector Y.
This loop is often referred to as a sparse or indexed
SAXPY. The efficiency of the implementation of this
loop determines the performance of the algorithm.
Because of the importance of this loop, or kernel, a
subroutine called SAXPYI, which follows the spirit and
the notation of the BLAS /6/, has been proposed as a
facility in extensions of the BLAS /2/.


The use of the SAXPYI loop as presented above, in
FORTRAN, would result in no use of the vector
hardware of early supercomputers. The loop would be
executed using scalar instructions only, producing no
speedup at all over equivalent scalar computers. The
indexed SAXPY loop can be rewritten to allow some
use of the vector hardware. Dembart and Neves /1/
analyzed seven different formulations of this loop on a
CDC STAR 100, and determined that there were
combinations of vector length and vector density for
which each of the formulations was fastest. Similar
analyses by these authors for the CYBER 203 and 205
showed equivalent results, although the ratios
changed. For reasonable combinations of vector
length and vector density, the most important of the
six alternatives to the original scalar code was the
same on all three machines. This alternative is first
to gather all components on which operations are to be
performed into a temporary dense vector, then to
perform a dense SAXPY on this short vector, and
finally to scatter the results back to the appropriate
memory locations.


Although this looks far more complicated than the
original loop, it permits the use of the vector
arithmetic units for the SAXPY loop. The
non-vectorizable memory transfers are isolated in
separate loops that could be made more efficient by
using assembly language subroutines (albeit in scalar
mode). This formulation of the indexed SAXPY is
known as a GATHER-SAXPY-SCATTER (GSS)
implementation, for the operations performed in turn
by the three loops.
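The two formulations can be sketched side by side. The Python code below is an illustration only, not the FORTRAN or assembly code discussed in the text; the function names and the sample data are made up here, indices are 0-based, and the entries of `index` are assumed to be distinct (otherwise the gather-scatter version would not reproduce the scalar loop).

```python
def saxpyi_scalar(a, x, index, y):
    """Indexed SAXPY as a single loop with indirect addressing of y."""
    for i in range(len(x)):
        y[index[i]] = a * x[i] + y[index[i]]

def saxpyi_gss(a, x, index, y):
    """Same update split into GATHER, dense SAXPY, and SCATTER loops."""
    # GATHER: copy the addressed components of y into a dense temporary.
    t = [y[j] for j in index]
    # SAXPY: dense update on contiguous data (the vectorizable part).
    t = [a * xi + ti for xi, ti in zip(x, t)]
    # SCATTER: write the results back to their original locations.
    for j, ti in zip(index, t):
        y[j] = ti

x = [1.0, 2.0, 3.0]
index = [4, 0, 2]            # distinct positions in y (assumption)
y1 = [10.0, 20.0, 30.0, 40.0, 50.0]
y2 = list(y1)
saxpyi_scalar(2.0, x, index, y1)
saxpyi_gss(2.0, x, index, y2)
print(y1 == y2, y1)
```

Only the middle loop of `saxpyi_gss` works on contiguous data, which is why the gather and scatter loops were the limiting part on machines without hardware support for indexed memory access.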


On a CRAY-1 computer, the original indexed SAXPY 
loop written in FORTRAN executes at a maximum 
rate of about 4 megaflops, much less than the 
maximum rate of 155 megaflops this machine can 
achieve for other operations. Woo and Levesque /9/ 
analyzed the G-S-S formulation and showed that its 
maximum rate in assembly language was around 8 
megaflops. Alternatively, a good assembly language
implementation of the original loop using only scalar
hardware performs asymptotically at 13 megaflops
(see /8/). This implementation is never slower than
the G-S-S formulation, demonstrating that this is
essentially a scalar computation for the CRAY-1.


In contrast, the standard implementations of banded or
variable banded Gaussian elimination algorithms use
dense dot products or dense SAXPY operations as their
fundamental inner loops. Such implementations can
achieve vector speeds on vector computers. However,
they usually do not approach the asymptotic speeds of
these machines because the vector lengths are limited
by the bandwidth, which should not become very large.
The possibility of using the vector hardware for these
schemes and the inherent performance limitation of
the indexed SAXPY loop have led to the conventional
folk wisdom that (variable) banded factorization
schemes will usually outperform general sparse
Gaussian elimination on vector computers.


The vector supercomputers being produced currently,
in particular the CRAY X-MP/4 and the more recent
models of the CRAY X-MP/2, are equipped with new
facilities that permit memory access according to an
index vector in hardware. That is, these machines
permit the GATHER and SCATTER loops to be
performed using vector memory transfers to and from
a hardware vector register. This gather/scatter
hardware leads to a much faster implementation of
SAXPYI, using the G-S-S formulation. The assembly
language coded implementation in VectorPak reaches
78 megaflops asymptotically. Some detailed SAXPYI
timings are given in Table 1 below /8/:


Table 1. SAXPYI Speed on the CRAY X-MP/24
(1 CPU, rates given in megaflops).

                 M = 10     M = infinity

(ignoring the hardware for gather/scatter)
CFT 1.13          5.0         5.7
VectorPak         6.3        14.5

(using the hardware gather/scatter)
CFT 1.14         16.1        54.6
VectorPak        16.1        78.6


The older CRAY Fortran Compiler CFT 1.13 does not
make use of the hardware for gather/scatter. A
corresponding VectorPak implementation of SAXPYI
has been developed for CRAY X-MP's without
hardware gather/scatter. Both exhibit the scalar
performance characteristic of the CRAY-1. In
contrast, the utilization of hardware gather/scatter,
either via CFT 1.14 or with VectorPak, shows a
dramatic improvement.


Numerical Results. 


Our first series of numerical results demonstrates the
speed-up which can be obtained using hardware
gather/scatter in a general sparse elimination algorithm
on very large real-life applications. We use a modified
minimum degree (MD) algorithm by Liu /7/ to solve the
following seven problems taken from the sparse matrix
collection /4/. A short problem description and
some ordering statistics are given below. All problems with
the exception of LRGPWR are finite element models
of large three-dimensional structures. All matrices are
symmetric and positive definite.


Table 2. Problem Description.

Problem    Description
STK3562    Finite element model of Sports Arena
           N=3562, Anz=78174, Lnz=275360
STK3948    Finite element model of oil platform
           N=3948, Anz=56934, Lnz=647274
STK4884    Corps of Engineers model of dam
           N=4884, Anz=142747, Lnz=736294
LRGPWR     Electric power network of U.S.
           N=5300, Anz=8271, Lnz=22764
ST10974    Elevated pressure vessel
           N=10974, Anz=208838, Lnz=994885
ST11948    Nuclear power station
           N=11948, Anz=68571, Lnz=650777
ST15439    76 story skyscraper
           N=15439, Anz=118401, Lnz=1401129
where

N      order of the matrix,
Anz    nonzeros in lower triangle of the original matrix,
Lnz    nonzeros in the Cholesky factor of A.


The numbers in Table 2 show that with the exception 
of the power network problem LRGPWR all matrices 
considered have a substantial number of nonzeros per 
row. Thus we expect to see the theoretical speedups 
due to hardware gather/scatter realized on some 
practical applications. 


All problems were solved using four different
implementations of the key SAXPYI loop. The original
source code was modified by inserting a few compiler
directives to affect the vectorization of the key loops.
Even though CFT1.14 can vectorize the access of
random elements according to an index vector, as in
the SAXPYI loop, a compiler directive must be
inserted in the original code in order to instruct the
compiler to do so. The Fortran code with inserted
compiler directives was then compiled twice under
CFT1.14. Using a compiler option, code was
generated for a Cray X-MP both with and without
hardware gather/scatter.


Then the code was modified to call an optimized
implementation of SAXPYI from VectorPak /8/. This
modified code was also compiled under CFT1.14, and
then executed twice. First a VectorPak library for
Cray X-MP's without gather/scatter was used, and
then the corresponding library for machines with
gather/scatter. Tables 3 and 4 below present the
execution times obtained for factorization and
solution for the four implementations. All execution
times are listed relative to the VectorPak with
gather/scatter time, which is normalized to one. The
actual execution time in seconds for the VectorPak
with gather/scatter implementation is given in Table
5, together with the execution times for an envelope
factorization based on the reverse Cuthill-McKee
(RCM) algorithm from /5/.


Tables 3 and 4 show clearly the direct benefits of the
hardware gather/scatter feature for general sparse
elimination schemes. Using this feature, factorization
and solution are in some cases almost 8 times
faster. Very sparse problems such as LRGPWR,
however, do not benefit from this speedup, since the
number of nonzeros per row is too small. Note that
even the factored matrix LRGPWR has only about 9
nonzeros per row, too few to obtain the fast
asymptotic rates in Table 1. On the other hand, the
finite element problems do have a large number of
nonzeros per row, and thus almost realize the speedup
potential of the asymptotic rates given in Table 1.
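As a rough check of the "almost 8 times" figure, the normalized factorization times for CFT1.14 without and with hardware gather/scatter (the first two columns of Table 3) can be divided row by row. The Python sketch below does only that; the dictionary layout and variable names are choices made here for illustration.

```python
# (CFT1.14 no g/s, CFT1.14 with g/s), normalized factorization times from Table 3.
table3 = {
    "STK3562": (6.33, 1.14),
    "STK3948": (9.66, 1.24),
    "STK4884": (8.70, 1.20),
    "LRGPWR":  (1.25, 0.93),
    "ST10974": (7.24, 1.16),
    "ST11948": (8.75, 1.22),
    "ST15439": (8.78, 1.25),
}

# Speedup of compiled code attributable to hardware gather/scatter.
speedup = {name: no_gs / with_gs for name, (no_gs, with_gs) in table3.items()}
best = max(speedup, key=speedup.get)
print(best, round(speedup[best], 2), round(speedup["LRGPWR"], 2))
```

The largest ratio comes from STK3948, while the very sparse LRGPWR problem gains comparatively little, in line with the discussion above.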


The numbers for VectorPak demonstrate two facts.
First, it pays more in terms of speedup to use a
carefully assembly coded subroutine on a machine
without hardware gather/scatter. Second, in spite of
advances in the CRAY compilers, VectorPak still
offers a 20-25% improvement in the factorization and
a 30% improvement in the solution as compared to
straightforward CFT1.14.


Table 3. Execution Times for Sparse Matrix
Factorization.
(normalized so that VectorPak with g/s = 1.00)

Problem    CFT1.14    CFT1.14     VectorPak
           no g/s     with g/s    no g/s
STK3562    6.33       1.14        2.54
STK3948    9.66       1.24        3.25
STK4884    8.70       1.20        3.14
LRGPWR     1.25       0.93        1.17
ST10974    7.24       1.16        2.80
ST11948    8.75       1.22        3.12
ST15439    8.78       1.25        3.15


Table 4. Execution Times for Sparse Matrix Solution.

Problem    CFT1.14    CFT1.14     VectorPak
           no g/s     with g/s    no g/s
STK3562    6.94       1.48        2.55
STK3948    9.22       1.47        3.13
STK4884    8.77       1.39        3.00
LRGPWR     1.96       1.57        1.11
ST10974    7.45       1.48        2.68
ST11948    5.94       1.50        2.23
ST15439    7.54       1.49        2.68


Table 5. Execution Times (sec) for Factorization
and Solution Routines.

Problem    Factorization        Solution
           RCM       MD         RCM      MD
STK3562    3.421     1.276      0.047    0.033
STK3948    7.487     4.074      0.064    0.055
STK4884    4.071     4.129      0.065    0.066
LRGPWR     3.642     0.102      0.061    0.028
ST10974    -         4.891      -        0.108
ST11948    -         3.871      -        0.094
ST15439    17.440    7.791      0.213    0.152


The comparison in Table 5 shows that general sparse
methods outperform envelope solvers on vector
computers with hardware gather/scatter. Because of
the natural vectorization of envelope methods and the
essentially scalar performance of general sparse
methods on earlier vector computers, general sparse
methods were generally thought to be non-competitive on
vector computers. Table 5 disproves this assertion, in
particular since some of the problems could not even
be solved by RCM within the 4 Mwords available on the
Cray X-MP/24 at Boeing Computer Services.


A final point worth making concerns a software issue.
Sparse Gaussian elimination software from /5/ had
been modified several years ago, with calls to
computational kernels replacing some inner loops, when it
was installed on the CRAY-1S. As soon as the CRAY
X-MP arrived, some of the above test examples were
run again without any code modifications. By using
computational kernels from a kernel library such as
VectorPak, the application programmer can therefore
reap directly the benefits of hardware improvements,
without concerning himself with the often subtle
details of a new implementation, and without waiting
for the compiler writer to catch up with the hardware
architect. Thus the success in using hardware
gather/scatter in the context of sparse Gaussian
elimination also validates the concept of
computational kernels as a tool for combining
portability and optimality on advanced architectures.
Abstract


In this paper, an algorithm for solving sparse 
sets of equations with an almost tri-diagonal 
structure on Single-Instruction-Multiple-Data 
Stream (SIMD) computers, like the ICL 
Distributed Array Processor (DAP) and the 
Goodyear Aerospace Massively Parallel Processor 
(MPP), is described. Central to this algorithm 
is the partitioning of the original sparse set 
of equations into a number of smaller tri- 
diagonal sets that can be solved in parallel 
using cyclic reduction. A DAP implementation of 
the algorithm, which demonstrates how the 
nearest neighbor connection network can be used 
to keep track of the couplings between unknowns, 
as these (the couplings) are modified by the 
solution procedure, is detailed. The paper 
concludes with a discussion of the algorithm's 
performance on an actual application. 


1. Introduction 


The numerical solution of one-dimensional
differential equations often requires the inversion
of matrices with a sparsity structure that we
will refer to as almost tri-diagonal. Such
problems can, for instance, be found in the
field of nuclear reactor hydrodynamics and other
branches of fluid dynamics. In this paper, we
describe an algorithm for solving almost
tri-diagonal problems on Single-Instruction-
Multiple-Data Stream (SIMD) computers like the
ICL Distributed Array Processor (DAP) and the
Goodyear Aerospace Massively Parallel Processor
(MPP).


These machines derive their high performance
from the lockstep operation of a large number
(4,096 and 16,384 in the case of the DAP and
MPP, respectively) of simple, one-bit Processing
Elements (PEs), rather than from a few powerful
processors as in the case of vector computers
such as the CRAY-XMP and CYBER 205. Thus,
algorithms cannot exploit the full processing
power of the DAP and MPP unless they allow most,
or all, of the PEs to be engaged in productive
work(b) most, or all, of the time. Parallel
cyclic reduction, which is a standard technique
for solving single tri-diagonal sets of
equations, satisfies this criterion. In our
algorithm for inverting almost tri-diagonal
problems, we take advantage of this by
partitioning the original sparse matrix into a
number of smaller matrices with tri-diagonal
structures. As will be shown (see Section
2.2.2), these can all be inverted in parallel
using cyclic reduction, provided the total
number of equations in the almost tri-diagonal
set does not exceed the number of PEs of the
computer.

(a) This work was supported by a grant from
the United Kingdom Atomic Energy
Establishment, Winfrith.

0190-3918/86/0000/0369 $01.00 © 1986 IEEE


The full algorithm for solving almost 
tri-diagonal problems has been successfully 
implemented on the DAP. Following the 
description of the algorithm, details of this 
implementation are given. The paper concludes 
with a discussion of the algorithm's performance 
on an actual application. 


2. An Algorithm for Solving Almost 
Tri-Diagonal Sets of Equations 


Consider undirected graphs or networks of N 
nodes that satisfy the following requirements 
(see Figure 1): 

(i) The majority of nodes [O(N)] have only
two neighbors.

(ii) No nodes have more than three neighbors,
and there are at most m nodes with three
neighbors (0 < m << N); these will be referred
to as junctions.


Figure 1. Sample Network 


Networks like these are frequently used to 
numerically solve one-dimensional differential 
equations. 


(b) We here define productive work as an 
operation that contributes toward completing 
the task of an algorithm. 


In that case, some quantity x is to be computed 
at each node. Typically, this involves solving 
a set of simultaneous, linear equations which, 
given conditions (i) and (ii), can be expressed 
by: 


a_i x_r + b_i x_i + c_i x_\ell + e_i x_p = d_i        (1)
where

r, \ell, p, i = 1..N

e_i = 0;  j_k+1 \le i \le j_{k'}-1        (2)
\ell = i+1

j_k, j_{k'} = numbers of junctions; k, k' = 1..m


From (2) it follows that for each sub-graph or
sub-network, j_k+1 \le i \le j_{k'}-1, (1) may be rewritten
as three equations:

a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i;      j_k+1 < i < j_{k'}-1

a_i x_{j_k} + b_i x_i + c_i x_{i+1} = d_i;      i = j_k+1            (3)

a_i x_{i-1} + b_i x_i + c_i x_{j_{k'}} = d_i;   i = j_{k'}-1

Furthermore, if a_i x_{j_k} and c_i x_{j_{k'}} are treated as
right-hand-side terms, (3) can be compressed to
a single equation:

a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i + \alpha_i x_{j_k} + \beta_i x_{j_{k'}}        (4)
where

a_i x_{i-1} = 0;     i = j_k+1

c_i x_{i+1} = 0;     i = j_{k'}-1

\alpha_i = 0;        j_k+1 < i \le j_{k'}-1

\alpha_i = -a_i;     i = j_k+1

\beta_i = 0;         j_k+1 \le i < j_{k'}-1

\beta_i = -c_i;      i = j_{k'}-1

It can be seen that (4) is of tri-diagonal form
although it has three right-hand-side terms.
The m junctions separate the original network 
(graph) into S [0(2™)] sub-networks. Hence, 
(1) can be partitioned into S tri-diagonal sets
of equations like (4). These can be solved
separately to yield x_i (i \ne j_1...j_m) as a linear
function of x_{j_k} and x_{j_{k'}}:(c)

x_i = p_i + q_i x_{j_k} + r_i x_{j_{k'}}        (5)
where

k, k' = 1..m

i = 1..N (i \ne j_1...j_m)

(c) Obviously, the pair (k, k') is different
for each sub-network.


To complete the solution of the original sparse
set of equations, (1), the m equations for the
junctions (e_i \ne 0; i = j_1...j_m) must now
be used. By substituting the expressions for
x_r, x_\ell, and x_p obtained from (5) in these
junction equations, a closed matrix problem
(size m x m) is created, which can easily be
solved to yield values for x_i, i = j_1...j_m.
Back-substitution in (5) then gives the values
for the remaining N-m unknowns.


An additional step is needed in the above
algorithm when the network (graph) represents a
closed loop. Unless the nodes can be numbered
so that the loop closes back on itself ("wraps
around") at a junction--this is, of course,
impossible if a network does not contain
junctions--(5) will have a slightly different
form for the two sub-networks, s' and s", that
close the loop:

x_i = p_i + q_i x_{s"_n} + r_i x_{j_k};    s'_1 \le i \le s'_n        (6)

x_i = p_i + q_i x_{j_k} + r_i x_{s'_1};    s"_1 \le i \le s"_n        (7)

where

s'_1 = lowest numbered node of sub-network s'

s"_1 = lowest numbered node of sub-network s"

s'_n = highest numbered node of sub-network s'

s"_n = highest numbered node of sub-network s"

Before proceeding to substitute in the equations
for the m junctions, the terms q_i x_{s"_n} and r_i x_{s'_1}
must be eliminated. It will be seen that this
is easily accomplished using the equations
obtained by writing down (6) and (7) for i = s'_1
and i = s"_n, respectively.


A fundamental problem in implementing the
algorithm described in this section on SIMD
computers like the DAP and MPP is how to keep
track of the (x_{j_k}, x_{j_{k'}})-pair to which each x_i is
coupled. The method for accomplishing this on
the DAP is described in Section 2.3.3. A brief
overview of this machine is now given.


2.1 Brief Overview of the DAP(d) and its
High-Level Language (HOL)


The ICL DAP is an SIMD computer consisting of 
64^2 (4096) one-bit Processing Elements
(PEs)--all under control of a Master Control 
Unit (MCU). Communication lines between PEs are 
provided by a nearest-neighbor connection 
network, and each PE is equipped with 16 kbits 
of memory for a total storage capacity of 8 
Mbytes. All of this can be used as a memory 
module of the DAP's front-end, an ICL 2980 
mainframe computer. 


The DAP can operate in two different parallel 
processing modes. The most powerful of these 
performs sequential operations on bit slices 
from 4096 different memory words and is referred 
to as matrix mode. In the less powerful vector 
mode processing, the PEs are made to carry out 
operations on 64 words simultaneously with all 
bits of each word processed in parallel. 


A special, non-portable HOL, called DAP FORTRAN, 
has been developed for the DAP. It differs from 
ordinary FORTRAN mainly in that it has facilities
for expressing the hardware parallelism of
the DAP. More specifically, DAP FORTRAN allows 
an operation to be specified for all elements of 
a vector or matrix simultaneously. There is, 
however, one proviso: the dimensions of a 
vector and matrix declared in DAP FORTRAN must 
match those of the PE array; thus, a vector can 
only have 64 elements and a matrix 4096, 
arranged as a two-dimensional 64^2 array. A
matrix can also be reduced to a one-dimensional 
structure by treating it as a “long” vector 
consisting of 4096 linearly arranged elements. 
Operations on data arrays with more than two 
dimensions must be broken up into sets of either 
vector or matrix operations depending on the 
array declarations. 


To complement the parallel data structures, the 
indexing facilities of ordinary FORTRAN have 
been extended in DAP FORTRAN. Most notably, 
logical expressions, or masks, can be used to 
select data items from vectors and matrices. 


2.2 DAP Implementation Details of the Algorithm 
for Solving Almost Tri-Diagonal Sets of Equations 


2.2.1 Memory Mapping. For algorithms to
efficiently exploit the processing power of the
DAP, they must engage on the order of 4,096 PEs
simultaneously in productive work. It is
demonstrated in the following that the algorithm
described in Section 2 can achieve this,
provided the networks (graphs) are mapped onto
the PE array so that all information associated
with node i is stored in the local memory of PE
i. This information consists of the coefficients
for equation i and the numbers of the
nodes to which this equation couples node i.
Without the latter, neither the substitutions
into the m junction equations nor the
back-substitution step can be performed.

(d) The description is confined to the
installation at Queen Mary College
(University of London), and the reader
should be aware that there are DAP systems
that differ from this one in a number of
respects.


2.2.2 Extending the Use of Parallel Cyclic
Reduction to Multiple Sets of Tri-Diagonal
Equations. Parallel cyclic reduction is a
standard technique for solving tri-diagonal
equations on SIMD machines like the DAP and the
MPP. Formally, for a single system of
equations, a_i x_{i-1} + b_i x_i + c_i x_{i+1} = k_i, the
technique is expressed by the following
recursive formulae:


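Left-Hand Side

a_i^{(\ell)} = -a_i^{(\ell-1)} a_{i-2^{\ell-1}}^{(\ell-1)} / b_{i-2^{\ell-1}}^{(\ell-1)}        (8)

c_i^{(\ell)} = -c_i^{(\ell-1)} c_{i+2^{\ell-1}}^{(\ell-1)} / b_{i+2^{\ell-1}}^{(\ell-1)}        (9)

b_i^{(\ell)} = b_i^{(\ell-1)} - a_i^{(\ell-1)} c_{i-2^{\ell-1}}^{(\ell-1)} / b_{i-2^{\ell-1}}^{(\ell-1)}
               - c_i^{(\ell-1)} a_{i+2^{\ell-1}}^{(\ell-1)} / b_{i+2^{\ell-1}}^{(\ell-1)}        (10)

Right-Hand Side

k_i^{(\ell)} = k_i^{(\ell-1)} - a_i^{(\ell-1)} k_{i-2^{\ell-1}}^{(\ell-1)} / b_{i-2^{\ell-1}}^{(\ell-1)}
               - c_i^{(\ell-1)} k_{i+2^{\ell-1}}^{(\ell-1)} / b_{i+2^{\ell-1}}^{(\ell-1)}        (11)

where

\ell = step number

a_i^{(\ell)} = 0;    i \le 2^{\ell}

c_i^{(\ell)} = 0;    i > n - 2^{\ell}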

After log_2 n steps, where n is the number of
equations in a set, all coefficients a_i and
c_i will have been eliminated, and the values
of the unknowns are given by (11). It will now
be shown how parallel cyclic reduction can be
used to solve S sets of tri-diagonal equations
like (4) simultaneously on the DAP.
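For a single tri-diagonal system, the reduction can be sketched in ordinary sequential code. The Python function below is a textbook form of parallel cyclic reduction written for illustration; it is not the DAP FORTRAN code of Figure 2, its name and 0-based indexing are choices made here, and its normalization may differ from that of the formulae in the text (it divides the final right-hand side by the final diagonal). Each pass builds new arrays from the old ones only, which mimics all PEs updating in lockstep.

```python
def parallel_cyclic_reduction(a, b, c, k):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = k[i] (0-based),
    with a[0] = 0 and c[n-1] = 0, by parallel cyclic reduction."""
    n = len(b)
    a, b, c, k = list(a), list(b), list(c), list(k)
    stride = 1
    while stride < n:
        na, nb, nc, nk = a[:], b[:], c[:], k[:]
        # Every i is updated from the old arrays only, as PEs in lockstep would.
        for i in range(n):
            lo, hi = i - stride, i + stride
            nb[i], nk[i], na[i], nc[i] = b[i], k[i], 0.0, 0.0
            if lo >= 0:
                # Eliminate x[i-stride] using equation i-stride.
                f = a[i] / b[lo]
                na[i] = -f * a[lo]
                nb[i] -= f * c[lo]
                nk[i] -= f * k[lo]
            if hi < n:
                # Eliminate x[i+stride] using equation i+stride.
                g = c[i] / b[hi]
                nc[i] = -g * c[hi]
                nb[i] -= g * a[hi]
                nk[i] -= g * k[hi]
        a, b, c, k = na, nb, nc, nk
        stride *= 2
    # All off-diagonal couplings are now zero, so each equation is b[i]*x[i] = k[i].
    return [k[i] / b[i] for i in range(n)]

# Small diagonally dominant example with known solution x = [1, 2, 3, 4, 5].
a = [0.0, 1.0, 1.0, 1.0, 1.0]
b = [4.0, 4.0, 4.0, 4.0, 4.0]
c = [1.0, 1.0, 1.0, 1.0, 0.0]
k = [6.0, 12.0, 18.0, 24.0, 24.0]
x = parallel_cyclic_reduction(a, b, c, k)
print([round(v, 6) for v in x])
```

The loop over `i` is where the parallelism lies: no iteration reads a value written in the same pass, so on an SIMD array each PE could carry out its own iteration simultaneously.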


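By applying (11) to the right-hand-side terms of
equation (4), it follows that:

\alpha_i^{(\ell)} = \alpha_i^{(\ell-1)};    j_k < i \le j_k + 2^{\ell-1}
                                                                            (12)
\alpha_i^{(\ell)} = -a_i^{(\ell-1)} \alpha_{i-2^{\ell-1}}^{(\ell-1)} / b_{i-2^{\ell-1}}^{(\ell-1)};    j_k + 2^{\ell-1} < i < j_{k'}

\beta_i^{(\ell)} = \beta_i^{(\ell-1)};    j_{k'} - 2^{\ell-1} \le i < j_{k'}
                                                                            (13)
\beta_i^{(\ell)} = -c_i^{(\ell-1)} \beta_{i+2^{\ell-1}}^{(\ell-1)} / b_{i+2^{\ell-1}}^{(\ell-1)};    j_k < i < j_{k'} - 2^{\ell-1}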


Comparing (12) and (13) with equations
(8) and (9), respectively, we observe that:

(A) $\alpha_i^{(\ell)} \ne 0$ for those values of i for which $a_i^{(\ell)} = 0$.

(B) The recursive formula for $\alpha_i^{(\ell)}$ when
$j_k + 2^{\ell-1} < i \le j_k + 2^{\ell}$ is identical to equation (8).

(C) $\beta_i^{(\ell)} \ne 0$ for those values of i for which $c_i^{(\ell)} = 0$.

(D) The recursive formula for $\beta_i^{(\ell)}$ when
$j_{k'} - 2^{\ell} \le i < j_{k'} - 2^{\ell-1}$ is identical to equation (9).

(A) and (C) show that the coefficients $a_i^{(\ell)} \ne 0$ and
$\alpha_i^{(\ell)} \ne 0$, and $c_i^{(\ell)} \ne 0$ and $\beta_i^{(\ell)} \ne 0$, respectively, can
be merged into two vectors, each of length N.
If these are declared as "long" vectors in DAP
FORTRAN, in keeping with the previously specified
memory mapping, it follows from (B) and (D)
that their elements, and hence all S sets of
equations, can be operated on in parallel on the
DAP as long as N ≤ 4,096. Furthermore, no extra
arithmetic is required to compute $\alpha_i^{(\ell)}$
and $\beta_i^{(\ell)}$ compared with that which would be needed to
solve a tri-diagonal problem with a single
right-hand-side term.


2.3.3 Details of the DAP FORTRAN Code.
Figure 2 is a DAP FORTRAN listing of the main
subprogram in the implementation of our
algorithm for solving almost tri-diagonal
problems. SUBROUTINE NET_TRID_SOLVE first
inverts S sets of equations of form (4) using
parallel cyclic reduction as described in the
preceding section. The vectors whose elements
are $a_i^{(\ell)} \ne 0$ and $\alpha_i^{(\ell)} \ne 0$, and $c_i^{(\ell)} \ne 0$ and $\beta_i^{(\ell)} \ne 0$,
respectively, correspond to LDC and UDC of
Figure 2. To perform left-hand-side indexing
operations on these vectors, three Boolean
"long" vectors, or masks, MSK_JCT, RIGHT, and
LEFT are used. Their effect is to disable
writing into memory in those PEs whose logical
positions in the PE array correspond to the
positions of elements with value FALSE in the
masks. For instance, the statement LDC(RIGHT) =
-LDC... computes the coefficients $a_i^{(\ell)} \ne 0$ and
$\alpha_i^{(\ell)} \ne 0$, with RIGHT inhibiting overwriting of the
elements $\alpha_i^{(\ell-1)} \ne 0$; these are needed in the
back-substitution step of the algorithm for solving
sparse problems of form (1). The coefficients
$\beta_i^{(\ell)} \ne 0$ are also needed in this step. Hence,
in the statement UDC(LEFT) = -UDC..., which
calculates $c_i^{(\ell)} \ne 0$ and $\beta_i^{(\ell)} \ne 0$, LEFT inhibits
overwriting of these coefficients. Finally,
MSK_JCT disables processing in those PEs that hold
the coefficients of the m junction equations. (e)
Thus, the parallelism achieved during the cyclic
reduction of S sets of equations is (N-m) [O(N)].
After $n = O[\log_2 \max(j_{k'} - j_k + 1)]$ steps, all solutions
of form (5) have been obtained, and the first
stage of the algorithm is complete.


Now, the number of elements $\alpha_i$ and $\beta_i$ that
have values different from zero doubles at each
step in the cyclic reduction phase. Thus, at
step $\ell$, $x_i$ is coupled to $x_{j_k}$ and $x_{j_{k'}}$ for
$j_k + 1 \le i < j_k + 1 + 2^{\ell}$ and $j_{k'} - 1 - 2^{\ell} < i \le j_{k'} - 1$, respectively.
If the values of $j_k$ and $j_{k'}$ are initially stored
in the elements $(j_k + 1)$ and $(j_{k'} - 1)$, respectively,
of two separate "long" vectors, rc and lc, then
the elements of these vectors must be updated
according to:

$$rc_i^{(\ell)} = j_k; \qquad j_k + 1 \le i < j_k + 1 + 2^{\ell} \qquad (14)$$

$$lc_i^{(\ell)} = j_{k'}; \qquad j_{k'} - 1 - 2^{\ell} < i \le j_{k'} - 1 \qquad (15)$$

where
k, k' = 1..m;
i = 1..N;

In Figure 2, RIGHT_JCT and LEFT_JCT correspond
to rc and lc, respectively. The modification of
the information stored in these "long" vectors,
as specified by equations (14) and (15), is
performed by using DAP FORTRAN built-in shift
functions: SHRP propagates data from a PE to
its "right" neighbor, and SHLP performs the same
task in the "left" direction.
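
The doubling propagation of junction numbers can be sketched serially as follows. This is a Python illustration with 0-based cell indices; `shift_right` is a hypothetical stand-in for SHRP, `None` marks cells that hold no junction number, and the list `active` plays the role of the RIGHT mask (TRUE while a cell may still be overwritten). None of these names come from the original listing.

```python
def shift_right(vec, k, fill):
    """Element i receives the value held by element i - k (a stand-in for SHRP)."""
    return [vec[i - k] if i - k >= 0 else fill for i in range(len(vec))]


def propagate_junction_numbers(n, junctions):
    """Spread each junction number over the cells to its right.

    Returns (rc, steps): rc[i] is the junction to the left of cell i, or None
    for junction cells; steps is the number of doubling steps taken.
    """
    is_jct = [i in junctions for i in range(n)]
    rc = [None] * n
    active = [not is_jct[i] for i in range(n)]
    for j in junctions:
        # The cell just right of a junction knows its junction from the start.
        if j + 1 < n and not is_jct[j + 1]:
            rc[j + 1] = j
            active[j + 1] = False
    k, steps = 1, 0
    while any(active):
        src_rc = shift_right(rc, k, None)
        src_active = shift_right(active, k, False)
        # Masked assignment: only cells that are still active are written.
        rc = [src_rc[i] if active[i] else rc[i] for i in range(n)]
        # A cell stays active only while its source was itself still active.
        active = [active[i] and src_active[i] for i in range(n)]
        k *= 2
        steps += 1
    return rc, steps
```

For a 12-cell array with junctions in cells 0 and 7, three steps suffice: the shift distance doubles each time, and a cell is frozen as soon as it has copied a settled value.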


(e) The reader is reminded that these 
equations are set aside during the first 
stage of the algorithm described in 
Section 2. 


      SUBROUTINE NET_TRID_SOLVE ( DIA, LDC, UDC, BC, C,
     &                            RIGHT_JCT, LEFT_JCT,
     &                            BRANCH, JCT_TABLE, NETJCT,
     &                            RIGHT, LEFT, WRAP, NCELLS )
C
C     ....SOLVES SPARSE, LINEAR MATRIX PROBLEMS, Ax=b,
C         THAT HAVE AN ALMOST TRIDIAGONAL STRUCTURE....
C
C     INPUT ----- DIA       : DIAGONAL COEFFICIENTS
C                 LDC       : LOWER DIAGONAL COEFFICIENTS
C                 UDC       : UPPER DIAGONAL COEFFICIENTS
C                 BC        : BRANCH COEFFICIENTS
C                 C         : RHS OF EQUATIONS
C                 RIGHT_JCT : NODE NUMBERS OF THE RIGHT HAND
C                             JUNCTIONS
C                 LEFT_JCT  : NODE NUMBERS OF THE LEFT HAND
C                             JUNCTIONS
C                 BRANCH    : NODE NUMBERS OF THE BRANCH
C                             CONNECTIONS
C                 JCT_TABLE : TABLE OF JUNCTION NUMBERS
C                 NETJCT    : TOTAL NUMBER OF NETWORK
C                             JUNCTIONS
C                 RIGHT     : BIT PATTERN MASKING RIGHT
C                             HAND JUNCTIONS
C                 LEFT      : BIT PATTERN MASKING LEFT
C                             HAND JUNCTIONS
C                 WRAP      : SWITCH INDICATING THAT THE
C                             NETWORK CONTAINS A LOOP
C                             ( .FALSE.-OFF, .TRUE.-ON )
C                 NCELLS    : TOTAL NUMBER OF NODES
C                             IN NETWORK
C     OUTPUT ---- C         : SOLUTION OF SPARSE SET OF
C                             EQUATIONS
C
      INTEGER RIGHT_JCT(,), LEFT_JCT(,), BRANCH(,),
     &        JCT_TABLE(), NJCT, NCELLS
      LOGICAL RIGHT(,), LEFT(,), LV(), MSK_JCT(,), WRAP
      REAL*8  UDC(,), LDC(,), DIA(,), C(,), BC(,),
     &        SOL_JCT(), X1, X2, X_VAL
      EQUIVALENCE ( K, LV )
C
C     *** EXECUTABLE SECTION ***
C
C     SAVE THE TOTAL NUMBER OF NETWORK JUNCTIONS FROM BEING
C     RE-DEFINED -
C
      NJCT = NETJCT
      K = 1
C
C     ....MASK JUNCTION EQUATIONS....
C
      MSK_JCT = RIGHT .OR. LEFT
C
C     ....CYCLIC REDUCTION LOOP....
C
C     START OF SCOPE OF LOOP
C
  100 CONTINUE
C
      UDC ( MSK_JCT ) = UDC / DIA
      LDC ( MSK_JCT ) = LDC / DIA
      C   ( MSK_JCT ) = C   / DIA
C
C     - DIAGONAL COEFFICIENTS -
C
      DIA ( MSK_JCT ) = 1.0
      DIA ( RIGHT ) = DIA - LDC * SHRP ( UDC, K )
      DIA ( LEFT  ) = DIA - UDC * SHLP ( LDC, K )
C
C     - RIGHT HAND SIDES -
C
      C ( RIGHT ) = C - LDC * SHRP ( C, K )
      C ( LEFT  ) = C - UDC * SHLP ( C, K )
C
C     - LOWER DIAGONAL COEFFICIENTS -
C
      LDC ( RIGHT ) = - LDC * SHRP ( LDC, K )
C
C     - UPPER DIAGONAL COEFFICIENTS -
C
      UDC ( LEFT ) = - UDC * SHLP ( UDC, K )
C
C     - DISTRIBUTE THE CELL NUMBERS OF THE JUNCTION
C       CONNECTIONS -
C
      RIGHT_JCT ( RIGHT ) = SHRP ( RIGHT_JCT, K )
      LEFT_JCT  ( LEFT  ) = SHLP ( LEFT_JCT , K )
C
C     - MASK JUNCTION CONNECTION ( OR "WRAP-AROUND" ) TERMS -
C
      RIGHT = RIGHT .AND. SHRP ( RIGHT, K )
      LEFT  = LEFT  .AND. SHLP ( LEFT , K )
C
      LV = SHLP ( LV )
C
C     - COMPLETED "SOLUTIONS" FOR ALL SUB-NETWORKS ? -
C
      IF ( ANY ( RIGHT ) .OR. ANY ( LEFT ) ) GO TO 100
C
C     END OF SCOPE OF LOOP
C
C     ....NORMALIZE "SOLUTIONS"....
C
      UDC ( MSK_JCT ) = UDC / DIA
      LDC ( MSK_JCT ) = LDC / DIA
      C   ( MSK_JCT ) = C   / DIA
C
C     ....FURTHER PROCESSING DEPENDS ON THE TYPE OF THE
C         NETWORK....
C
C     - NO JUNCTIONS AND NO "WRAP-AROUND" ? -
C
      IF ( NJCT .EQ. 0 .AND. .NOT. WRAP ) RETURN
C
C     - JUNCTIONS BUT NO "WRAP-AROUND" ? -
C
      IF ( .NOT. WRAP ) GO TO 300
C
C     - "WRAP-AROUND" BUT NO JUNCTIONS ? -
C
      IF ( NJCT .NE. 0 ) GO TO 200
C
C     - THE NETWORK HAS "WRAP-AROUND" BUT NO JUNCTIONS -
C
      CALL SOLVE_2X2 ( ..., LDC ( 1 ), ..., ... ( NCELLS ),
     &                 C ( 1 ), C ( NCELLS ), X1, X2 )
      SOL_JCT ( 1 ) = X1
      SOL_JCT ( 2 ) = X2
      NJCT = 2
      GO TO 400
C
  200 CONTINUE
C
C     - THE NETWORK HAS BOTH "WRAP-AROUND" AND JUNCTIONS -
C
      CALL SUB_WRAP_TERMS ( JCT_TABLE ( NJCT + 1 ),
     &                      JCT_TABLE ( NJCT + 2 ), LDC,
     &                      UDC, C, LEFT_JCT, RIGHT_JCT )
C
  300 CONTINUE
C
C     - THE NETWORK HAS ONLY JUNCTIONS ( "WRAP-AROUND" TERMS, IF
C       ANY, HAVE BEEN ELIMINATED ) -
C
      CALL JUNCTIONS ( LDC, DIA, UDC, BC, C, RIGHT_JCT,
     &                 LEFT_JCT, BRANCH, JCT_TABLE, NJCT,
     &                 SOL_JCT )
C
  400 CONTINUE
C
C     ....BACK-SUBSTITUTION....
C
      DO 500 J = 1, NJCT
         J_CELL = JCT_TABLE ( J )
         X_VAL  = SOL_JCT ( J )
         C ( J_CELL .EQ. RIGHT_JCT ) = C - X_VAL * LDC
         C ( J_CELL .EQ. LEFT_JCT  ) = C - X_VAL * UDC
         C ( J_CELL ) = X_VAL
  500 CONTINUE
C
      RETURN
      END

Figure 2. DAP FORTRAN Listing


At the end of the cyclic reduction process, the
coefficients $q_i$, $r_i$, and $p_i$ are held in the
"long" vectors LDC, UDC, and C, respectively.
In the case of a single tri-diagonal system of
equations, $q_i$ and $r_i$ are all zero. Thus, C
represents the solution of the system and no
further processing need take place in
NET_TRID_SOLVE (see Figure 2).


Sub-programs SOLVE_2X2 and SUB_WRAP_TERMS
perform the special eliminations that are
required if a network represents a closed loop,
while the substitutions from (5) into the m
junction equations take place in SUBROUTINE
JUNCTIONS. Both of these are sequential tasks,
and no use of the DAP's parallelism can be made
in performing them. It is thus vital to the
efficiency of our algorithm for solving almost
tri-diagonal problems that the number of
junctions, m, in a network, as well as being
very much smaller than the total number of
nodes, N, also is small in absolute terms.


The subprogram that solves the closed m×m matrix
problem for $x_{j_k}$ (k=1..m) is invoked by SUBROUTINE
JUNCTIONS. On return from JUNCTIONS, the values
of $x_{j_k}$ are stored in the DAP FORTRAN vector SOL_JCT
with element k being the value of $x_{j_k}$.
Finally, the back-substitutions for $x_{j_k}$ and $x_{j_{k'}}$
in (5) are performed by the DO 500 loop. At the
start of each pass through this loop, the values
of $j_k$ and $x_{j_k}$ are placed in J_CELL and X_VAL,
respectively.

This is followed by the broadcasting of X_VAL to
those PEs for which J_CELL is equal to $j_k$ or $j_{k'}$;
the broadcasting is effected by the expressions
J_CELL .EQ. RIGHT_JCT and J_CELL .EQ. LEFT_JCT.
At the end of each pass through the DO 500 loop,
the value of $x_{j_k}$ is placed in the element $j_k$ of
C so that on return from NET_TRID_SOLVE, the
complete solution to a sparse matrix problem of
form (1) is stored in this "long" vector.
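
The back-substitution loop can be paraphrased in a serial sketch. The Python function below is an illustration only: the argument names mirror the roles of C, LDC, UDC, RIGHT_JCT, LEFT_JCT, JCT_TABLE, and SOL_JCT, the indices are 0-based, and the inner loop over cells stands in for what the DAP performs as one masked vector statement.

```python
def back_substitute(p, q, r, rc, lc, jct_table, sol_jct):
    """Complete the solution once the junction unknowns are known.

    p, q, r  : per-cell values left by cyclic reduction (roles of C, LDC, UDC)
    rc, lc   : per-cell junction numbers (roles of RIGHT_JCT, LEFT_JCT)
    jct_table, sol_jct : junction cell numbers and their solved values
    """
    x = list(p)
    for j_cell, x_val in zip(jct_table, sol_jct):
        for i in range(len(x)):  # one masked vector operation on the DAP
            if rc[i] == j_cell:
                x[i] -= x_val * q[i]
            if lc[i] == j_cell:
                x[i] -= x_val * r[i]
        x[j_cell] = x_val  # the junction cell receives its own solution
    return x
```

Only the loop over junctions is sequential, so its cost grows with m rather than with N.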


3. Applications of the Algorithm for Solving 
Almost Tri-Diagonal Sets of Equations 


The algorithm described here has been used to 
solve some fairly simple one-dimensional 
problems in nuclear reactor hydrodynamics on the 
DAP. In one case, a grid depicting a loop with 
two branches connected to it (see Figure 1) was 
used with the total number of nodes being 256. 
The different steps of the algorithm were timed, 
and the results, together with estimates for a 
1000-node case, are summarized in Table 1. It 
will be seen that the scalar operations 
(SUBROUTINE SUB_WRAP_TERMS, JUNCTIONS, and
SOLVE_2X2) consume less than 10 percent of the
total processing time. This demonstrates that
the algorithm uses the DAP's parallel processing
power efficiently, provided the requirement m<<N
is satisfied. Ideally, of course, N should be
4,096, since this allows the full parallelism of
the DAP to be exploited.


Table 1. Breakdown of Processing Time
Spent in Different Parts of Algorithm
for Solving Sparse Linear Problems With
an Almost Tri-Diagonal Structure

                              Number of Network Nodes
Code Section                  256        1000*
SUBROUTINE NET_TRID_SOLVE     118 ms     141 ms
SUBROUTINE SUB_WRAP_TERMS     11 ms      11 ms
SUBROUTINE JUNCTIONS          5 ms       5 ms
SUBROUTINE SOLVE_2X2          1 ms       1 ms

ms = milliseconds
* Figures are estimates


4. Conclusions 


An algorithm for solving almost tri-diagonal 
sets of equations on SIMD computers like the DAP 
and MPP has been described. The algorithm 
exploits the sparsity structure of this class of 
problems, thereby allowing parallel cyclic 
reduction to be used as the main step in the 
solution procedure. It has been demonstrated 
that this results in efficient use of the DAP's 
processing power in solving problems with an 
almost tri-diagonal structure, provided the 
number of equations is comparable to the 
parallelism of the machine (4,096).


ABSTRACT


Experience with the CRAY-2 on the effects of common memory speed and
loading on performance indicates that local-memory-based algorithms have
potentially a large advantage. The performance of a number of common-
and local-memory algorithms is compared for the LU factorization of a
dense system of equations on the CRAY-2. A two-level blocked
algorithm is introduced to allow both pivoting and high asymptotic
performance. Measurements show as much as a 6:1 speedup between
blocked assembly-language and (traditional) vectorized Gauss Fortran
implementations; the contributions of both the algorithm and the
language to this speedup are evaluated.


Introduction 


CRAY-2 Architecture and Algorithm Implications 


The CRAY-2 architecture (Figure 1) has several features relevant to this 
algorithm study. 


(a) Common memory features. The massive common
memory (CM) trades size for access time, so that a
considerable delay is usually encountered in reading
from CM. Also, only one data path connects common
memory to each processor's functional units.

(b) Local memory. The above speed disadvantages are
compensated by a local memory (LM), which serves as
backup vector and scalar storage for the functional
unit's register storage.

(c) Chaining. The CRAY-2 does not have hardware
chaining; this must be achieved by software and/or
algorithm means.


The implications of distributed memory (including hierarchies such as CM
and LM) on linear algebra algorithm organization have been studied since
the existence of paged memory systems [1][2][3][4]. In general,
computations must be arranged so that the number of floating-point
operations on data at the low memory levels is sufficient to warrant data
transfers to these levels. This implies, for example, that a matrix-vector
multiply - which involves only two operations for each matrix data
element - may perform less efficiently than a matrix-matrix multiply.
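
The operations-per-element argument can be made concrete with a small count. The sketch below is an illustration in Python; the function names are hypothetical, and the matrix-matrix figure assumes that all three n-by-n operands are counted as transferred data.

```python
def flops_per_element_mv(n):
    """n-by-n matrix times vector: 2*n*n operations on n*n matrix elements."""
    return (2 * n * n) / (n * n)


def flops_per_element_mm(n):
    """n-by-n matrix times matrix: 2*n**3 operations on 3*n*n elements."""
    return (2 * n ** 3) / (3 * n * n)
```

For n = 64 the matrix-vector ratio stays at 2, while the matrix-matrix ratio is roughly 43, which is why the latter can better amortize transfers into a small fast memory.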


view Vecto 


The asymptotic execution rate (MFLOPS) of a factorization algorithm is
equal to that of the kernel that performs the add-multiplies associated with
reducing rows and columns. Three substantially different such algorithms
deserve consideration.


Gauss vector-scalar multiply. This requires that, in reducing
the rth row, successive operations on preceding rows must be performed
serially since a partial result from each row-operation is used as an
operand in the next one (the reader is assumed familiar with this
procedure). In rows with lengths longer than the vector functional unit
length, this dependency can usually be avoided by assembly coding; it is
then termed the GAXPY kernel. It has the advantage of yielding the
largest average vector length of any of the following kernels and so is a
serious consideration when the matrix size is not significantly larger than
the maximum allowable vector length, and assembly coding is allowed.
This procedure does not lend itself to partitioning when large matrices are
involved, but is a potentially useful subalgorithm in such cases.


Matrix-vector multiply. Early experience with the CRAY-1
indicated that block-oriented algorithm organization had at least a 
pedagogical advantage for large problems[5][6]. Unfortunately, the 
necessary {vector - (matrix*vector)} kernel was not made a part of the 
CRAY scientific library and consequently the kernel was not 
syntactically distinguished from the rest of CAL-coded factorization 
algorithms. The organizational concept of basing factorization on 
matrix-vector multiply subroutines was developed in [7][8], where it was 
illustrated how unrolling Fortran loops could be used to achieve a high 
performance completely from a high-level language. This emphasized 
portability and maintainability. More recently, these kernels - known as 
second-level BLAS - have been proposed as the basis for other common 
linear algebra algorithms[9]. 


Matrix-matrix multiply. Although it has always been clear 


that factorization can be accomplished by a matrix-matrix multiply 
kernel, the CRAY-1 memory hierarchy was not sufficiently distinctive to 
achieve a significant advantage over the above two kernels[3]. The 
additional memory paths of the X-MP made this even less attractive, and 
partially reduced the advantage of the matrix-vector multiply above. The 
disadvantages of basing factorization on matrix multiplies are the 
necessity for other matrix-level kernels to perform reciprocations and 
substitutions and, most important, the difficulty of partial pivoting. 


The memory distribution must be quite distinctive to warrant the 
programming effort of matrix*matrix-based factorization. This paper 
documents this case for the CRAY-2, while laying the groundwork for a 
multiprocessor implementation. 


Pivoting Algorithms 
M*V-based Algorithms 
Given a set of equations*


AX=B 


where A is an nxn matrix and X and B are vectors, the factorized solution 
proceeds by forming lower and upper triangular factors L and U, viz 


A =LU (1) 


and then solving for Y and X in substitution steps 


LY = B, UX = Y

The complexity of the factorization step of Eq. (1) is $O(n^3)$. The
substitutions have complexity $O(n^2)$, with only one add-multiply for each
L and U element; consequently no algorithmic speedup results from
transferring L and U to local memory, so these substitution steps will
not be studied.


*Matrices are in bold, vectors are in upper case, and scalars are in lower 
case type. 


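For algorithms based on a matrix-vector multiply (abbr. M*V-based), the
columns of L and rows of U are indicated as in Figure 2. Here the
diagonal element, the row to its right, and the column below it are
denoted $a_{22}$, $A_{23}$, and $A_{32}$, respectively. Ignoring pivoting for the
moment, the steps to perform the factorization are then

$$\begin{bmatrix} a_{22} \\ A_{32} \end{bmatrix} \leftarrow \begin{bmatrix} a_{22} \\ A_{32} \end{bmatrix} - \begin{bmatrix} A_{21} \\ \mathbf{A}_{31} \end{bmatrix} A_{12} \qquad (2)$$

$$A_{23} \leftarrow A_{23} - A_{21}\, \mathbf{A}_{13} \qquad (3)$$

$$a_{22} \leftarrow 1 / a_{22} \qquad (4)$$

$$A_{32} \leftarrow a_{22}\, A_{32} \qquad (5)$$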


The performance of the matrix*vector kernel of Eqs. (2) and (3) depends 
on the implementation; even Fortran codings: have radically different 
performance (see Appendix). 
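
A minimal serial sketch of steps (2)-(5), without pivoting, may help fix the data flow. It is written in Python for illustration rather than in the Fortran or CAL of the study; the function names are hypothetical, and the choice to keep the reciprocal of each pivot on the diagonal follows step (4).

```python
def mv_factorize(A):
    """In-place style LU factorization following steps (2)-(5), no pivoting.

    Returns a copy of A holding L below the diagonal (unit diagonal implied),
    U above it, and the reciprocals 1/u_rr on the diagonal.
    """
    n = len(A)
    F = [row[:] for row in A]
    for r in range(n):
        # Step (2): update the diagonal element and the column below it.
        for i in range(r, n):
            F[i][r] -= sum(F[i][p] * F[p][r] for p in range(r))
        # Step (3): update the row to the right of the diagonal.
        for j in range(r + 1, n):
            F[r][j] -= sum(F[r][p] * F[p][j] for p in range(r))
        # Step (4): replace the pivot by its reciprocal.
        F[r][r] = 1.0 / F[r][r]
        # Step (5): scale the column below the diagonal.
        for i in range(r + 1, n):
            F[i][r] *= F[r][r]
    return F


def unpack_factors(F):
    """Split the packed result into explicit L and U matrices."""
    n = len(F)
    L = [[F[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[(1.0 / F[i][i]) if i == j else (F[i][j] if j > i else 0.0)
          for j in range(n)] for i in range(n)]
    return L, U
```

Each of steps (2) and (3) is a matrix-vector product over the already factored part, which is where the kernel's performance matters.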


M*M-based Algorithms


Multiply kernel (level 1) 


Another matrix partition permits the factorization to be performed on 
submatrices (Figure 3). The equations equivalent to (2) - (3) are 


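$$\begin{bmatrix} \mathbf{A}_{22} \\ \mathbf{A}_{32} \end{bmatrix} \leftarrow \begin{bmatrix} \mathbf{A}_{22} \\ \mathbf{A}_{32} \end{bmatrix} - \begin{bmatrix} \mathbf{A}_{21} \\ \mathbf{A}_{31} \end{bmatrix} \mathbf{A}_{12} \qquad (6)$$

$$\mathbf{A}_{23} \leftarrow \mathbf{A}_{23} - \mathbf{A}_{21}\, \mathbf{A}_{13} \qquad (7)$$

where $\mathbf{A}_{22}$ is an ng x ng matrix. This multiply kernel has an asymptotic
execution rate of approximately 400 MFLOPS when written in assembly
language (CAL) and executing from LM with ng = 64. This includes time
to transfer to/from CM from/to LM under a daytime memory load
condition.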


On a vector machine such as the CRAY-2, partial column pivoting has 
two components: (1) the search for the maximum element of a column, 
and (2) exchange of two complete rows of the matrix. The latter is 
usually preferred over maintenance of an index pointer in order to avoid 
relatively slow indirect addressing. These two functions are denoted 


a <--- piv { s, V } 


where a is the element of maximum absolute value of scalar s and the 
elements of vector V. 
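
The two pivoting components can be sketched as follows. This Python illustration uses hypothetical helper names; the convention that position 0 refers to the scalar s and position i + 1 to V[i] is an assumption of the sketch.

```python
def piv(s, V):
    """Return (value, position) of the entry of largest absolute value.

    Position 0 refers to the scalar s; position i + 1 refers to V[i].
    """
    best_val, best_pos = s, 0
    for i, v in enumerate(V):
        if abs(v) > abs(best_val):
            best_val, best_pos = v, i + 1
    return best_val, best_pos


def exchange_rows(A, r1, r2):
    """Exchange two complete rows of the matrix A in place."""
    A[r1], A[r2] = A[r2], A[r1]
```

Exchanging whole rows keeps later accesses contiguous, which matches the stated preference over an index pointer.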


In M*V-based factorization this search is routinely performed after Eq. (2)
or (3) by the step

$$a_{22} \leftarrow \mathrm{piv}\{\, a_{22},\, A_{32} \,\} \qquad (3a)$$


However, in the M*M-based version, the granularity of the algorithm 
does not recognize individual matrix elements and columns. The problem 


then becomes to preserve the high performance of the matrix-matrix 
multiply by performing the majority of computations at the block-level, 


yet to occasionally expose individual columns to permit pivoting. The
solution is the following 2-level algorithm (Figure 3).
In level 1, Equations (6) and (7) are first carried out as an $O((n/ng)^3)$
process. The columns of the resulting A, = [Ao2! A32! yt 


block-column matrix are at this point partially reduced, with all the 
accumulations from the columns of A¥3, performed but without 


contributions from the internal columns of Ay. - Level 2 begins by 
reducing A, using either a GAXPY kernel or the M*V method of Eqs. 


(2)-(5), viz (an underline represents components of this second reduction 
level) 


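$$\begin{bmatrix} \underline{a}_{22} \\ \underline{A}_{32} \end{bmatrix} \leftarrow \begin{bmatrix} \underline{a}_{22} \\ \underline{A}_{32} \end{bmatrix} - \begin{bmatrix} \underline{A}_{21} \\ \underline{\mathbf{A}}_{31} \end{bmatrix} \underline{A}_{12} \qquad (8)$$

$$\underline{A}_{23} \leftarrow \underline{A}_{23} - \underline{A}_{21}\, \underline{\mathbf{A}}_{13} \qquad (9)$$

$$\underline{a}_{22} \leftarrow \mathrm{piv}\{\, \underline{a}_{22},\, \underline{A}_{32} \,\} \qquad (10)$$

$$\underline{a}_{22} \leftarrow 1 / \underline{a}_{22} \qquad (11)$$

$$\underline{A}_{32} \leftarrow \underline{a}_{22}\, \underline{A}_{32} \qquad (12)$$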


The computations of Eqs. (8)-(12) are performed from CM and so will be
slowed by memory access delays. The execution rate of this step is
approximately 125 MFLOPS, using a CAL-coded pivot search.


Eqs. (8)-(12) have the effect of performing the block factorization and 
substitution steps 


Aq2 <- 22022 (13) 
-1 
A33 <--- A32U22 (14) 
Level 2 is then completed by the block substitution 
-] : 
A32 <- A32L22 (15) 


This can be carried out in local memory at a speed somewhat less than 
200 MFLOPS. 


With ng fixed, the complexity of level 2 is readily shown to be $O(n^2)$,
whereas the M*M kernel complexity remains $O((n/ng)^3)$. For large n, the
execution rate should therefore approach that of the multiply kernel or
approximately 400 MFLOPS.
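
The interplay of the two levels can be sketched in a few lines of serial Python. This is an illustration under stated assumptions, not the CAL implementation: `blocked_lu` and its block width argument `nb` are hypothetical names, level 2 is written as a simple column-by-column elimination restricted to the block column (rather than a GAXPY or M*V kernel), the block-row update is deferred until after the row exchanges of the current block, and the true pivots (not their reciprocals) are kept on the diagonal.

```python
def blocked_lu(A, nb):
    """Two-level blocked LU with partial pivoting.

    Returns (F, perm): F packs L (unit diagonal implied) and U, and perm[i]
    is the original index of the row now in position i, so that row perm[i]
    of A equals row i of L*U.
    """
    n = len(A)
    F = [row[:] for row in A]
    perm = list(range(n))
    for c0 in range(0, n, nb):
        c1 = min(c0 + nb, n)
        # Level 1: matrix-matrix update of the block column from the
        # blocks already factored to its left and above it.
        for i in range(c0, n):
            for j in range(c0, c1):
                F[i][j] -= sum(F[i][p] * F[p][j] for p in range(c0))
        # Level 2: expose individual columns of the block column so that
        # a pivot search and row exchange can be done for each of them.
        for r in range(c0, c1):
            m = max(range(r, n), key=lambda i: abs(F[i][r]))
            if m != r:
                F[r], F[m] = F[m], F[r]
                perm[r], perm[m] = perm[m], perm[r]
            for i in range(r + 1, n):
                F[i][r] /= F[r][r]
                for j in range(r + 1, c1):
                    F[i][j] -= F[i][r] * F[r][j]
        # Block row: apply the contributions from the left and the
        # substitution with the unit lower triangle of the diagonal block.
        for i in range(c0, c1):
            for j in range(c1, n):
                F[i][j] -= sum(F[i][p] * F[p][j] for p in range(i))
    return F, perm
```

With nb = 1 the sketch degenerates to an ordinary column-oriented factorization; larger nb moves more of the arithmetic into the level-1 update, which is the part that can run from local memory.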


Performance 


Both the M*V- and the M*M-based pivoting algorithms were run on the
MFECC CRAY-2 in November, 1985, using the CIVIC Fortran compiler
and CRAY-2 assembly language (CAL). Figure 4 presents the results of a
number of algorithms and implementations.


The poor performance of the conventional vectorized Fortran Gauss 
algorithm is due to the lack of chaining and the long, single path to 
CM. A significant advantage accrues from use of the M*V algorithm, with 
or without use of CAL. The superior performance of the M*M-based 
solution eventually becomes evident for n > 1000; indeed the latter is the 
best algorithm for all but the smallest n shown. However, it would likely 
not be worth the programming effort for matrices in the order of n = 100. 


Although such large matrices can rarely be factored without pivoting, it is
interesting to observe the degradation due to introduction of the 2-level
algorithm. The CRAY-2 implementation of the piv{ s, V } function of Eq.
(3a) requires a fixed overhead dependent only on the length of V and
independent of the matrix element values. Consequently, it is possible to
distinguish between the pivoting slowdown due to piv{ s, V } and that due
to the 2-level nature of the algorithm. These are presented in Figure 5 for
the M*M algorithm. For n > 256, the larger degradation is the result of
the piv{ s, V } function. Since the latter cannot be avoided, the
algorithmic slowdown from the introduction of a second level does not
appear significant.


Parallel Implementations 


In general, the partitioning of an algorithm into larger computational
tasks favors a parallel implementation, since fewer task startups are
involved. Thus, an M*M-based algorithm seems advisable, with ng
large. However, in a CRAY-2 system dedicated to an equation solution (a
somewhat unlikely event), equalizing the workload among the processors
(load-leveling) also becomes an issue; this favors the smaller tasks
associated with M*V-based factorization or else a smaller ng. These
issues are currently under investigation.


APPENDIX 


Fortran statement complexity has a significant impact on the performance
of linear algebra and other scientific codes. This is illustrated in Figure
6, where the M*V Fortran kernel is unrolled and the resulting performance
plotted. Performance continues to increase significantly for unrollings as
large as 32, in contrast to the X-MP, where a 4-way unrolling is
considered satisfactory. This phenomenon is considered to be the result
of the above-mentioned architectural features. A 16-way unrolling is used in
the M*V kernels cited in Figure 4.

Figure 2. M*V-based factorization algorithm
Figure 5. Influence of pivoting, 2-level algorithm
Figure 3. 2-level M*M-based factorization algorithm


VLSI TIME/SPACE COMPLEXITY OF AN ASSOCIATIVE
PARALLEL JOIN MODULE 


Recent interests in the so called 5th 
generation machines and languages coupled 
with the inefficiency of conventional 
machines in handling large volumes of 
non-numeric data, have given a new dimension 
to the design and implementation of data base 
machines. The majority of these 
architectures are based on the relational 
model, because of its simplicity and the 
mathematical foundation on which this model 
is based. Among the primitive relational 
operations, literature has placed the 
greatest emphasis on join. This is because 
of its complexity and the practical 
applications of the join operation in 
combining relations with common domain(s). 


This paper introduces an associative parallel 


join algorithm and its hardware design. In 
addition, it addresses the VLSI time/space 
complexity of the proposed model, as well as 
its performance evaluation. 


1. INTRODUCTION 


Continuing interest in data base
systems, coupled with advances in technology,
has motivated special purpose architectures.
These architectures are designed to remove
the shortcomings of the conventional
von Neumann design in the efficient handling
of large volumes of non-numeric operations.
In addition, recent discussions about
knowledge base machines [16] and logic data
bases have opened a new door for further
research in this area. Finally, the
definition of 5th generation languages such
as PROLOG [13] calls for the practical and
efficient implementation of these
languages, which necessitates the design and
implementation of data base architectures.


The architecture of these special 
purpose systems is based on constraints which 
are imposed by the data base environment, 
i.e., associating name space with the 
information space at the high level, and 
performing a sequence of simple and 
repetitive non-numeric operations on a 
massive amount of data at the low level. 
Data base machines have met these conditions
by:

i) searching and selecting the data at
or near secondary storage
in an associative fashion, and

ii) hardware implementation of suitable
parallel algorithms for
non-numeric operations.


In contrast to conventional
architectures, such a direction
eliminates the existing address mapping
resolution and reduces the data
transportation and semantic gap. The
underlying structure of the majority of these
architectures is based on the relational
model [4]. This is due to the following
reasons:


i) The simplicity of the relational
model,

ii) The strong mathematical foundation
on which the relational model
is based, and

iii) The one-to-one mapping between the
representation of data and
primitive operations in logic
programming and the relational model.


In practice, the performance of a
relational data base system has been
associated with the join operation. Such an
association is based on the practical
applications of the join in queries and
the complexity of the operation. This
complexity directly affects the time and
space efficiency of the join operation.
Different hardware/software algorithms have
been defined and implemented to improve the
efficiency of this operation. This paper
addresses a hardware module for the join
operation along with its performance
evaluation. Section 2 overviews some of the
proposed join algorithms. Section 3
introduces the overall organization and the
flow of data and control in the proposed
architecture. Section 4 discusses the VLSI
time/space complexities of the proposed
model, and finally, section 5 evaluates the
performance of the model.


2. BACKGROUND 


Join has been defined as a binary
operator which combines two relations over
their common attribute(s). Formally, let
r(R) and s(S) be two relations and let $A \in R$
and $B \in S$ be $\theta$-compatible, where
$\theta \in \{=, \ne, <, \le, >, \ge\}$; then $r[A\,\theta\,B]s$ is defined as:

$$\{\, t \mid t_r(A)\;\theta\;t_s(B) \text{ and } t(R) = t_r \text{ and } t(S) = t_s \,\}.$$
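
The definition translates directly into a nested-loops sketch. In this Python illustration a relation is a list of dictionaries; the function name `theta_join`, the operator spellings, and the assumption that the two relations use disjoint attribute names are choices of the sketch rather than part of the definition.

```python
import operator

# The six comparison operators allowed for theta.
THETA = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def theta_join(r, s, a, b, theta):
    """Return every concatenated tuple t with t_r[a] theta t_s[b]."""
    compare = THETA[theta]
    return [{**tr, **ts} for tr in r for ts in s if compare(tr[a], ts[b])]
```

Every pair of tuples is compared once, so the cost is proportional to the product of the two cardinalities.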


In a uniprocessor environment, three
classes of join algorithms have been studied,
namely nested loops, sort-merge, and hashing,
with respective execution times of $O(n^2)$,
$O(n \lg n)$ and $O(n)$. However, in a thorough
evaluation of these algorithms, parameters
such as communication, simplicity,
regularity, space efficiency, etc., should
also be taken into consideration. In the
data base architecture where a higher
performance and throughput is cited, the
discussion about join should center around a
parallel version of these algorithms.
Although there are some designs which have
ignored such a discussion [2,19], there is no
reason to accept that they would not be able
to perform this function. The assumption is
that either a front-end machine or a
dedicated join module will do the job.
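
The three uniprocessor classes can be compared side by side for the equi-join case. The Python sketch below is illustrative only: relations are lists of dictionaries, the function names are hypothetical, and the attribute names of the two relations are assumed to be disjoint.

```python
from collections import defaultdict


def nested_loops_join(r, s, a, b):
    """Compare every pair of tuples: quadratic work."""
    return [{**tr, **ts} for tr in r for ts in s if tr[a] == ts[b]]


def hash_join(r, s, a, b):
    """Build a hash table on r, then probe it once per tuple of s."""
    buckets = defaultdict(list)
    for tr in r:
        buckets[tr[a]].append(tr)
    return [{**tr, **ts} for ts in s for tr in buckets.get(ts[b], [])]


def sort_merge_join(r, s, a, b):
    """Sort both relations on the join attribute, then merge them."""
    rs = sorted(r, key=lambda t: t[a])
    ss = sorted(s, key=lambda t: t[b])
    out, i, j = [], 0, 0
    while i < len(rs) and j < len(ss):
        if rs[i][a] < ss[j][b]:
            i += 1
        elif rs[i][a] > ss[j][b]:
            j += 1
        else:
            key = rs[i][a]
            # Find the run of s tuples sharing this key.
            j_end = j
            while j_end < len(ss) and ss[j_end][b] == key:
                j_end += 1
            # Pair every r tuple with this key against that run.
            while i < len(rs) and rs[i][a] == key:
                out.extend({**rs[i], **ts} for ts in ss[j:j_end])
                i += 1
            j = j_end
    return out
```

All three produce the same set of joined tuples; they differ in the order of the output and in how the work grows with relation size.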


The literature has studied different
algorithms for join in both uniprocessor and
multiprocessor environments [3]. In the
following discussion, some of the
algorithms proposed for systems based on
the exhaustive search policy will be
overviewed:


CAFS [1] utilizes a hashing scheme. It
can perform the semi-join operation by
marking an array of bits based on the tuples
in r and then using this array to select
proper tuples in s. However, the
architecture has not addressed the
implementation of the θ-join. Based on the
hardware capability of the design, such an
operation is expensive since it requires
several passes over the data file.
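
The bit-array idea can be sketched as follows. This Python illustration is a generic rendering of the technique, not a description of the actual hardware: the function name, the array size `n_bits`, and the use of Python's built-in `hash` are all assumptions of the sketch.

```python
def semi_join_bit_array(r, s, a, b, n_bits=64):
    """Select the tuples of s whose join value may occur in r.

    A bit is marked for each tuple of r; a tuple of s is kept when its
    bit is set. Distinct values that share a bit give false positives.
    """
    bits = [False] * n_bits
    for tr in r:
        bits[hash(tr[a]) % n_bits] = True
    return [ts for ts in s if bits[hash(ts[b]) % n_bits]]
```

In this sketch a small array can let through tuples that do not really match (65 collides with 1 when there are 64 bits), so an exact result needs either a larger array or a further check on the selected tuples.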


RAP [17] has studied two types of join,
namely implicit (e.g. semi-join) and physical
(e.g. θ-join); the latter can be carried out
as a sequence of implicit joins. The
algorithm is based on the concept of nested
loops, with some degree of parallelism, which
is determined according to the number of
read/write heads and the hardware capability
of each read/write head.


DIRECT [5] is capable of performing a
block oriented version of nested loops with
some degree of parallelism (e.g. number of
processors). During each iteration, a block
(page) of the outer relation (say r) will
be joined with entire blocks of the inner
relation(s) which are distributed among
different processors. Within each processor
a simple sort-merge algorithm will be
utilized to join two blocks together.


DELTA [7] has defined a parallel version
of the sort-merge algorithm in its design:
12 sorters in parallel sort the relations in
sequence based on the join attribute, and
then a single merger is utilized to merge the
sorted relations.


The performance of these algorithms has
been formulated and compared against each
other. Bitton et al. [3] have shown that in
a uniprocessor environment, the sort-merge
algorithm has a better performance than
nested loops. However, in a multiprocessor
environment such as DIRECT, the nested loops
algorithm outperforms sort-merge. This is
due to the higher degree of parallelism in
the operation, and better processor
utilization made by the nested loops
algorithm.


Valduriez and Gardarin [21] have compared
the performance of a block oriented version
of nested loops, parallel sort-merge, and
parallel hashing schemes against each other.
They have shown that in general, if the
number of accesses to the hashed file is low,
hashing algorithms offer a better performance
than the other two techniques. Moreover, for
relations of the same size, the sort-merge
algorithm has a better performance. Finally,
if the ratio between the relation sizes is
different from 1, the nested loops algorithm is
the superior method.


The evaluation process described above [3,21]
only partially represents the relative behavior of
these algorithms. A thorough evaluation and
comparison should address:


i) The merit of each approach based on
the environment, and


ii) Equally important parameters such
as space and processor utilization,
ease of implementation from a
hardware/software point of view,
processor-memory intercommunication,
and finally, practical overhead and
practical limitations.


3. PROPOSED ARCHITECTURE 


Associative memory has been defined as:
"a collection or assemblage of elements
having data storage capabilities, and which
are accessed simultaneously and in parallel
on the basis of data content rather than by
specific address or locations" [14]. Content
addressability of data implies that each
basic cell of memory should have the hardware
capability to perform read, write, and search
operations. Such a capability eliminates the
name resolution problem [9] and reduces the
bottleneck between memory and CPU.




Associative memory provides a suitable 
and fast medium for representation of the 
relational model and implementation of the 
basic relational operations. This stems from 
the fact that the representation of data in 
the associative memory is similar to the 
tabular representation of the data in the 
relational model. In addition, attributes in 
the relational data model are equally 
available to be searched, as in the 
associative memory where search can be 
performed on any defined field. Finally, 
there is a one to one mapping between the set 
operations and the associative operations. 


The proposed join module is a collection 
of n identical and independent modules of 
associative memory (Figure 1). Associative 
memories can transfer data among each other 
under the control of a master controller, via 
a common data bus. In addition, they can 
pass control signals through a chained 
acknowledge bus. These modules can be linked 
together to make a memory of size (k*w)*d 
(1 <= k <= n, w is the word length and d is 
the depth) capable of holding d tuples of 
size k*w each, or they can be linked to form 
a memory of size w*(k*d) capable of holding 
k*d tuples of size w. This facility enables 
the system to adjust the available memory 
based on the length of the tuples and the 
cardinality of the relation. The independence 
of the associative memories from each other 
enhances the modularity of the system and, as 
such, its fault tolerance. 
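
The two linking modes described above can be illustrated with a small sketch. It assumes only what the text states (n modules, each w bits wide and d words deep); the function and type names are hypothetical, and the selection policy is one simple possibility rather than the controller logic of the proposed design.

```python
import math
from typing import NamedTuple, Optional


class Layout(NamedTuple):
    k: int          # number of modules linked together
    word_bits: int  # resulting word length
    depth: int      # resulting number of words (tuples)


def link_wide(k: int, w: int, d: int) -> Layout:
    """k modules side by side: a (k*w) x d memory, d tuples of k*w bits."""
    return Layout(k, k * w, d)


def link_deep(k: int, w: int, d: int) -> Layout:
    """k modules stacked: a w x (k*d) memory, k*d tuples of w bits."""
    return Layout(k, w, k * d)


def choose_layout(n: int, w: int, d: int,
                  tuple_bits: int, cardinality: int) -> Optional[Layout]:
    """Pick a linking mode for a relation (illustrative policy only).

    Returns None when neither pure mode can hold the relation
    with at most n modules.
    """
    if cardinality <= d:
        k = max(1, math.ceil(tuple_bits / w))
        return link_wide(k, w, d) if k <= n else None
    if tuple_bits <= w:
        k = math.ceil(cardinality / d)
        return link_deep(k, w, d) if k <= n else None
    return None
```

For example, with n = 8 modules of 64 bits by 512 words, a relation of 300 tuples of 200 bits maps to a wide layout with k = 4, while 2000 tuples of 64 bits map to a deep layout with k = 4.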


Figure 1: Overall Organization of the Proposed Join Module. 


With respect to a query, the domains in a 
relation (say r with relation scheme R) can 
be grouped into three classes: 


i) those which are defined as members 
of the search argument part 
(i.e., $SA = \{R_i \mid R_i \in R$ and $R_i \in$ search 
argument$\}$), 


ii) those which are members of the 
output set part 
(i.e., $OS = \{R_i \mid R_i \in R$ and $R_i \in$ output set$\}$), 
and 


iii) those which are not members of 
SA or OS 
(i.e., $\{R_i \mid R_i \in R$ and $R_i \notin (OS \cup SA)\}$). 


Our assumption is that the relations are 
preprocessed before being routed to the join 
module, and then the selected tuples are 
passed to this module. In other words, a 
subset of the relation ($r_s$) with a relation 
scheme ($R_s$) will be explicitly stored in the 
join module: 


$R_s \subseteq R$ and $\forall\, T_s \in r_s,\ \exists\, T \in r$ such that 


$T_s = T[OS]$ and $T[SA] =$ search argument 


This assumption is in accordance with the 
90-10% rule. 


The proposed join module performs a 
concurrent version of the nested-loops 
algorithm. Concurrency is achieved as a 
result of: 


i) embedded parallelism in the 
associative operations, and 


ii) the independence of associative 
memories from each other. 


Therefore, we have parallelism within each 
associative memory and overlapping across the 
memories. For a join operation, three memory 
modules will be tightly coupled together 
(Figure 2). In our discussion we call them 
source, target, and destination modules 
respectively. Interestingly, one can utilize 
random access memory modules to house the 
source and destination relations. However, 
for the sake of the uniformity and generality 
of the proposed join module, in the rest of 
this paper we assume three associative memory 
modules to house the source, target, and 
destination relations. Tuples from the 
source module (source-tuples) are routed to 
the concatenate-register and target module 
(1) one at a time. A source-tuple is 
searched against the contents of the target 
module (2). In case of a match, the selected 
tuples are routed to the concatenate-register 
in pipeline fashion (3). The result of 
concatenation is then sent to the destination 
module (4). When the selected tuples in the 
target module are about to be exhausted, an 
acknowledge signal is sent to the source 
module, signaling the need for a new tuple (5). 


The order of this algorithm with respect 
to the number of searches is $O(n)$, as opposed 
to $O(n^2)$ in uniprocessor systems. 
Moreover, because of the interaction between 
the source and target memory modules 
(acknowledge signal bus), operations in the 
source memory can be initiated and overlapped 
with operations in the target memory. From 
this discussion, one can conclude that 
steps (1), (4), and (5) are overlapped 
with step (3) (Figure 2). Algorithm 1 
represents the sequence of operations 
according to the notations and the 
organization which have been adopted in this 
paper. The generality and adaptability of 
the associative operations increases the 
versatility of the proposed model in 
performing different variations of the join 
operations (e.g. natural join, semi-join) 
without additional hardware overhead. 


The proposed model can be extended to 
the parallel execution of a sequence of 
interrelated joins. Such a concept is of 
special interest in a PROLOG type logic 
programming environment [13]. 


FOR ALL $t_r \in r$ DO; 
    parallel search $t_r$ against s 
    while there is a $t_s$ such that $t_r(A)\ \theta\ t_s(B)$ 
        $q = t_r \,\|\, t_s$ 
    end; 
end; 


ALGORITHM 1: AN ASSOCIATIVE JOIN 
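
The flow of Algorithm 1 can be mimicked in software. The sketch below is only a sequential model: the list comprehension in `associative_search` stands in for the single parallel search of the target module, and all names (`associative_search`, `associative_join`, the attribute positions `a` and `b`) are illustrative assumptions rather than part of the proposed hardware.

```python
import operator


def associative_search(target, b, theta, value):
    """Model one content-addressed search of the target module.

    In hardware every word is compared at once; here the comparison is
    simulated word by word and the resulting tag bits are returned.
    """
    return [theta(value, t_s[b]) for t_s in target]


def associative_join(source, target, a, b, theta=operator.eq):
    """Nested-loops join of source and target on source[a] theta target[b].

    Returns the destination relation and the number of searches issued,
    which is one per source tuple.
    """
    destination = []
    searches = 0
    for t_r in source:                                  # step (1): next source tuple
        tags = associative_search(target, b, theta, t_r[a])  # step (2): search
        searches += 1
        for t_s, tag in zip(target, tags):              # step (3): read tagged tuples
            if tag:
                destination.append(t_r + t_s)           # step (4): concatenate, store
    return destination, searches


if __name__ == "__main__":
    r = [(1, "a"), (2, "b"), (3, "c")]
    s = [(2, "x"), (3, "y"), (3, "z")]
    print(associative_join(r, s, a=0, b=0))
```

The search count grows with the number of source tuples only, which is the O(n) behavior claimed for the associative version.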
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Figure 2: Flow of Data in a Join Operation. 


4. VLSI TIME AND SPACE COMPLEXITIES OF THE 
PROPOSED MODEL 


In the design and development of a 
topology for current technology, one should 
remember the practical constraints which are 
imposed by the VLSI technology. The idea is 
to reduce the execution time as well as the 
communication through the replication of a 
few basic blocks in time or space. A fully 
parallel associative memory satisfies these 
conditions and hence is suitable for VLSI 
implementation. It is not the intention of 
this article to discuss the organization of 
associative memory; the interested reader is 
referred to [14, 20, 22]. In this section we 
will calculate the VLSI time and space 
complexities of the proposed model according 
to the design rules discussed in Our 
associative module should be able to perform 
read, write, and different variations of 
search operations with respect to the 
contents of the mask and word select 
registers. In our design each cell of the 
mask, comparand, and memory is a dynamic RAM 
cell refreshing itself during the second 
phase of a 2-phase clock system. Figure 3 
depicts the stick diagram of the interaction 
between one bit of the mask and comparand 
registers according to the following 
interpretation: 


si iff 


Z1,=1 
J 


M.=1 and Comp.=1 
J J 


otherwise 


Figure 3: Interconnection between mask and 
comparand registers. (Legend: metal, 
polysilicon, diffusion, contact, ion implant.) 


The $Z1_j$ and $One1_j$ lines are run in metal and 
shared by the jth bit of each memory word. 
The associated circuitry for read/write 
operations and selection of a selected word 
has been discussed elsewhere. In order to 
simplify the memory cell and hence to reduce 
the size of the module: 


i) control signals (i.e., word select, 
search) are used at the word level rather 
than the bit level. In other words, the 
search is carried out for all the words, and 
then the tag bit is set according to the 
control lines and the word select signal. 


ii) the searches on equality, inequality, 
less than or equal, and greater than or equal 
are performed based on searches on less than 
and greater than. 


Figure 4 represents one bit of the 
memory cell and the circuitry associated 
with it, where $C_{i,j}$ ($\bar{C}_{i,j}$) is the content of 
the jth bit of the ith word; $L_{i,j-1}$, $\bar{L}_{i,j-1}$, 
$G_{i,j-1}$, and $\bar{G}_{i,j-1}$ are the less than and 
greater than signals generated by the 
neighboring bit on the left (i.e., bit j-1); 
and finally $L_{i,j}$, $\bar{L}_{i,j}$, $G_{i,j}$, and $\bar{G}_{i,j}$ 
are the less than and greater than signals 
generated by the jth bit of word i. 
Figure 5 depicts the circuitry around the tag 
bit according to the above discussion. LT, 
GT, and EQ are the less than, greater than, 
and equality control lines respectively. 
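
To make the reduction of all comparisons to less-than and greater-than concrete, here is a minimal software sketch. It assumes a bit-serial scan from the most significant bit and that a mask bit of 1 means the bit takes part in the search; both are assumptions for illustration, and the function names are hypothetical rather than taken from the circuit.

```python
def compare_word(word: int, comparand: int, mask: int, n: int):
    """Return (lt, gt) for one n-bit word against the comparand.

    Bits are scanned from the most significant end; the first unmasked
    position where the two differ decides the outcome, and later bits
    cannot change it (as with ripple signals passed along the word).
    """
    lt = gt = False
    for j in range(n - 1, -1, -1):
        if not (mask >> j) & 1:
            continue                      # masked-out bit: ignored
        c = (word >> j) & 1
        k = (comparand >> j) & 1
        if not lt and not gt:
            lt = c < k
            gt = c > k
    return lt, gt


def tag_bit(lt: bool, gt: bool, op: str) -> bool:
    """Derive every search type from the two primitive signals."""
    return {
        "LT": lt,
        "GT": gt,
        "EQ": not lt and not gt,
        "NE": lt or gt,
        "LE": not gt,
        "GE": not lt,
    }[op]
```

For instance, comparing the stored word 1010 with the comparand 1100 under a full mask yields lt = True, so the LT, LE, and NE searches tag the word while EQ, GT, and GE do not.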


Assuming $\Delta\tau$ to be the average delay 
time of an inverter, the search time is 
calculated based on: i) a $4\Delta\tau$ parallel delay 
for a 3-input NAND gate and an inversion, ii) 
a $2(n-1)\Delta\tau$ serial operation of (n-1) NOR 
gates and an inversion, and iii) a $9\Delta\tau$ delay 
for setting the tag bit. This results in a 
total delay time of: 


$13\Delta\tau + 2(n-1)\Delta\tau$   (n is the word length) 


The geometric area of a cell is estimated 
at $200\lambda \times 100\lambda$, including the area needed for 
routing the signals among the cells. Thus, 
an associative memory of m words, each n bits 
long, requires an area of $200n\lambda \times 100m\lambda$. For 
m=512, n=256, $\Delta\tau$ = 1 nsec, and $\lambda$ = 1 $\mu$m we will 
have a search time of less than 1000 nsec for 
an environment of $2^{17}$ bits (including the 
expected complexity due to the read/write 
circuitry around each cell). In addition, 
the geometric area for the above configuration 
is estimated at about 5 cm * 5 cm. 
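
The figures quoted above can be reproduced with a few lines of arithmetic. This is only a check of the stated formulas (a delay of 13 + 2(n-1) inverter delays and a cell of 200 by 100 lambda); the function names are illustrative.

```python
def search_delay_ns(n_bits: int, dt_ns: float = 1.0) -> float:
    """Search time 13*dt + 2*(n-1)*dt for a word of n_bits bits."""
    return (13 + 2 * (n_bits - 1)) * dt_ns


def array_size_cm(m_words: int, n_bits: int, lam_um: float = 1.0):
    """Width and height of an m-word by n-bit array, in centimetres.

    Each cell is taken as 200*lambda wide and 100*lambda tall
    (1 cm = 10**4 micrometres).
    """
    width_cm = 200 * n_bits * lam_um * 1e-4
    height_cm = 100 * m_words * lam_um * 1e-4
    return width_cm, height_cm


if __name__ == "__main__":
    print(search_delay_ns(256))       # raw gate delay, in nsec
    print(512 * 256 == 2 ** 17)       # total number of bits
    print(array_size_cm(512, 256))    # roughly 5 cm by 5 cm
```

The raw gate delay for n = 256 is 523 nsec, which leaves room for the read/write circuitry overhead within the stated bound of 1000 nsec.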
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Figure 4: A Memory Cell. 
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5. PERFORMANCE EVALUATION OF THE PROPOSED MODEL 




The objective of this section is twofold: 
first, the timing analysis of the proposed 
model is discussed, and then its performance 
is calculated and compared against some of 
the available models in the literature. 


Execution time analysis 


Our timing analysis is based on the 
sequence of operations discussed in Algorithm 
1 and the following assumptions and notation. 


i) No ordering is imposed on the storage 
policy, i.e., an unordered data file. 

ii) According to the 90-10% rule, the source 
and target relations are preprocessed 
before being transferred to the join 
module. 

iii) Parallel preprocessors share a common 
cache memory with a hit ratio (H) of 85%. 
The organization of such an intermediate 
memory between the secondary memory and 
the array of preprocessors is similar to 
the model in [3,21]. 

iv) The sizes of the associative memories 
(i.e., associative blocks) are different 
from the page sizes in secondary storage. 
This is due to the reconfigurability of 
the associative memories (Section 3) and 
the technology. 


v) n, m: Number of pages in the source and 
target relations respectively. 


$C_r$: Time to read and pass one page from 
the cache memory to the preprocessors. 
It is defined as: 


$C_r = H(R_c) + (1-H)(R_m + R_c)$   (1) 

where $R_m$ and $R_c$ are the page read times 
of secondary storage and cache memory 
respectively, and H is the hit ratio. 


$C_{select}$: Time to select and project 
tuples in a page. This is proportional 
to the number of tuples in a page. 


$C_{select} = K (T_{select} + T_{project})$   (2) 


$C_f$: Compaction factor defined by the 
90-10% rule [9]. 


P: Number of preprocessors. 


$\delta$: The ratio between the associative 
memory block size and cache memory page size, 
i.e., $\delta = T/K$, where T and K are the number 
of words in an associative memory and a cache 
page respectively. 
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Circuitry around a tag bit. 


$C_{pa}$: Time to fill one associative 
memory by the preprocessors. 


$C_{join}$: Time to join two associative 
memories together (equation 4). 


$C_w$: Time to pass the contents of an 
associative memory to the front-end 
processor. 


S: The join sensitivity factor. It 
is defined as the average number of 
tuples in the target relation which are 
joined with a tuple in the source 
relation. It should be noted that due 
to the preprocessing operation, the data 
in the associative memories 
are more sensitive to the join operation. 
Therefore, the join sensitivity factor 
in our design should be higher than the 
one proposed in 


According to these parameters, the 
execution time of the operation is calculated 
as: 


F n*C | mC, 
nem a +C 
Ty, = Sp? (Ont Ccezect? * | ; (c,,* | 3 | (Coa) * Sjoin 


+ s(c,)) (3) 


In which: 


$C_{join} = C_{initial} + C_{read} + T(C_{search} + S(C_{read})) + C_{ter.}$   (4) 


where: 


$C_{initial}$: represents the initial time 
which one has to spend before initialization 
of the join operation (i.e., setting the mask 
registers). 


$C_{read}$: is the time needed to read 
out a record from an associative memory 
module and pass it to the proper destination. 


$C_{search}$: is the time needed to search 
the contents of the comparand register 
against the contents of the memory in 
associative fashion. This is a function of 
the word length and is independent of the 
depth of the associative memory. 


According to our implementation (Section 
4), the search time is the same for equality 
and greater than/less than operations: 


$C_{search} = aQ + b$, where Q is the 
word length and a, b are constant values which 
are defined by the technology. 


$C_{ter.}$: is the overhead time which is 
needed to terminate the operations (i.e., 
passing the last acknowledge signal to the 
controller, etc.). 
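
A small numeric sketch may help to see how the terms combine. It assumes that equation (4) has the form C_join = C_initial + C_read + T(C_search + S C_read) + C_ter, with T the number of words in an associative memory, and it takes a = 2 and b = 11 (in nsec) from the Section 4 delay 13 + 2(n-1) with a 1 nsec inverter delay. The function names and the sample values of C_read, C_initial, and C_ter are hypothetical placeholders, not measurements from the paper.

```python
def c_search(q_bits: int, a: float = 2.0, b: float = 11.0) -> float:
    """Search time a*Q + b; depends on word length only, not on depth."""
    return a * q_bits + b


def c_join(t_words: int, q_bits: int, s: float,
           c_read: float, c_initial: float, c_ter: float,
           a: float = 2.0, b: float = 11.0) -> float:
    """Assumed form of equation (4); all times in the same unit (nsec)."""
    per_tuple = c_search(q_bits, a, b) + s * c_read
    return c_initial + c_read + t_words * per_tuple + c_ter


if __name__ == "__main__":
    # Hypothetical inputs: 1024 source words of 256 bits, S = 1,
    # 100 nsec per read, 50 nsec of setup and 50 nsec of termination.
    print(c_join(1024, 256, 1, 100.0, 50.0, 50.0))
```

Because the search term grows linearly with the word length while T is fixed, doubling the word size roughly doubles the join time under this model.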


Performance Evaluation 


Our evaluation and comparison are based 
on the time analysis of the basic operations 
as dictated by the current technology. A 
detailed discussion of the time analysis of 
the basic operations is given in 


Table 1 depicts the execution time of 
the join module for different word sizes. In 
addition, Table 1 shows the estimated size of 
the source and target relations with respect 
to the 90-10% rule. As can be seen, the join 
operation between two relations of 256K and 
2048K bytes, excluding the input/output 
operations, takes about 18 msec. 


Table 2 compares the execution time of the 
proposed model with the models in [3, 21]. 
All the models in Table 2 are based on the 
nested loops policy. This algorithm was 
chosen for all models because the nested 
loops algorithm in a parallel environment 
offers better performance and resource 
utilization (Section 2) than the algorithms 
based on sort-merge and hashing. 


6. CONCLUSION AND FURTHER DISCUSSION 


An associative nested loops algorithm 
for the join operation has been discussed. It 
has been shown that in a parallel environment 
TABLE 1: EXECUTION TIME OF THE PROPOSED JOIN MODEL. 

WORD SIZE   EXECUTION      SOURCE RELATION   TARGET RELATION 
(Byte)      TIME (msec.)   (Byte)            (Byte) 

64          4.44           64K 
128         8.8 
256         17.54          256K              2048K 

TABLE 2: EXECUTION TIME OF THE DIFFERENT JOIN MODELS. 

MODEL                      EXECUTION TIME (msec.) 
T (proposed model)         926.93 
T (Bitton et al.)          6593. 
T (Valduriez et al.)       6610.44 


a join operation based on the nested loops 
policy offers a better performance than 
algorithms based on the hashing or sort-merge 
policies. Moreover, through the time analysis 
we have shown that the proposed algorithm 
outperforms the models proposed in [3, 21] by 
a factor of 7. In addition, the generality of 
the associative operations on one hand, and 
the one to one mapping between the relational 
operations and associative operations on the 
other hand, can provide a general purpose 
module for relational operations. This 
increases the resource utilization of our 
model in comparison to other proposed models, 
which are utilized just by the join operation. 
Finally, our model is capable of handling a 
sequence of interrelated join operations 
in parallel, a feature which is missing in 
all other proposed models. 


The regularity and modularity of 
associative memory, and the simplicity of each 
basic cell, provide a suitable ground for 
VLSI implementation of the proposed 
architecture. Our VLSI time and space 
analysis has shown that the proposed model 
is well within the range of the current 
technology. 


In the past, due to the high cost of 
hardware, a wide range of applications of 
associative processing was not feasible. It 
has been estimated that a basic associative 
cell is about 2-3 times more complex than a 
basic cell of random access memory. In 
addition, an associative module requires 
special I/O connections for fast processing. 
However, because of its preprocessing 
capability the proposed model utilizes the 
same I/O connections as the models in [3, 6, 
21]. Moreover, improvement in technology, and 
hence cost reduction of the hardware modules, 
has made it feasible to increase the 
utilization of associative processing. 
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ABSTRACT 

We develop an efficient bidirectional chain VLSI system for 
the adaptive recursive filtering problem. Our design is an 
improvement over previous designs. It matches the performance 
of a broadcast chain but does not use the broadcast 
capability. 


Keywords and Phrases 
VLSI architectures, systolic systems, adaptive recursive filtering 


1. Introduction 


VLSI architectures for a variety of problems have been 
proposed by several authors. A bibliography of over 150 
research papers dealing with this subject appears in 
[KUNG83]. In this paper, we are concerned solely with the 
adaptive recursive filtering problem. The input to this problem 
is an $n \times w$ matrix A of weighting coefficients and a 
$1 \times w$ vector $(x_{-w+1}, \ldots, x_0)$. The output is a $1 \times n$ vector 
$(x_1, \ldots, x_n)$ where: 

$x_i = \sum_{j=1}^{w} a_{ij}\, x_{i+j-w-1}, \quad i = 1, 2, \ldots, n$   (1) 
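
For reference, recurrence (1) can be evaluated directly on a single processor. The sketch below is such a baseline, not the VLSI algorithm; the function name and the convention that `x_init[0]` holds the oldest input value are assumptions made for the example.

```python
def adaptive_recursive_filter(A, x_init):
    """Evaluate x_i = sum_{j=1..w} a_ij * x_{i+j-w-1} for i = 1..n.

    A      : n rows of w coefficients, A[i-1][j-1] holds a_ij
    x_init : the w given values x_{1-w}, ..., x_0 (oldest first)
    Returns [x_1, ..., x_n].
    """
    n, w = len(A), len(x_init)
    # Store x under its mathematical index so negative indices are explicit.
    x = {k - w + 1: x_init[k] for k in range(w)}
    for i in range(1, n + 1):
        x[i] = sum(A[i - 1][j - 1] * x[i + j - w - 1] for j in range(1, w + 1))
    return [x[i] for i in range(1, n + 1)]


if __name__ == "__main__":
    # With w = 2 and all coefficients equal to 1 the filter produces
    # Fibonacci-like sums of the two previous values.
    print(adaptive_recursive_filter([[1, 1]] * 3, [1, 1]))
```

This loop performs nw multiply-add steps, which is the single-processor computation time C used in the efficiency ratios.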


In evaluating a VLSI design, we assume that the VLSI 
system will be attached to the host processor using a bus. 
The evaluation of a VLSI design should take the following 
into account: 


1. Processors --- how many processors are used in the VLSI 
system? This figure is denoted by P. 


2. Bus bandwidth --- the maximum amount of data to be 
transmitted between the host and the VLSI system in any 
cycle. This figure is denoted by B. 


3. Speed --- how much time does the VLSI system need to 
complete its task? This time may be decomposed into the 
times $T_C$ (time for computations) and $T_D$ (time for data 
transmissions both within the VLSI system and between 
the host and the VLSI system). 


Let C denote the time spent for computation by a single 
processor algorithm and D denote the total amount of 
data that needs to be transmitted between the host and the 
VLSI system. For the adaptive recursive filtering problem, 
$C = nw$ and $D = nw + n + w$. 


The ratio 
$R_D = B \cdot T_D / D$ 
measures the effectiveness with which the bandwidth B has 
been used. Clearly, $R_D \ge 1$ for every VLSI design. 
The ratio 
$R_C = P \cdot T_C / C$ 
measures the effectiveness of processor utilization. Once 
again, we see that $R_C \ge 1$ for every VLSI design. 
Finally, we may combine the two efficiency ratios $R_C$ 
and $R_D$ into the single ratio $R = R_C \cdot R_D$. A design that 
makes effective use of the available bandwidth and processors 
will have R close to 1. 


The efficiency measure R as defined here is the same as 
that used in [CHEN84a,b and 85] to evaluate VLSI designs 
for matrix multiplication and back substitution. This measure 
is also quite similar to that proposed in [HUAN82]. In 
fact, the two measures become identical when $T_C = T_D$. 


In comparing different architectures for the same problem, 
one must be wary about overemphasizing the importance 
of $R_C$, $R_D$ and R. Clearly, by using P = 1 and 
B = 1, we get $R_C = R_D = R = 1$ but no speed up at all. 
So, we are really interested in minimizing $T_C$ and $T_D$ while 
keeping R close to 1. 


VLSI architectures for the adaptive recursive filtering 
problem have been proposed earlier in [KUNG78 and 84], 
[LEIS83], [HUAN82] and [ROBE84]. The design of 
[HUAN82] uses a broadcast chain and has $P = w$, 
$B = w + 2$, $T_C = n + w - 1$, $T_D = n + w$, 
$R_C \approx 1 + w/n \approx 1$, $R_D \approx 1 + 1/w + w/n \approx 1$ and 
$R \approx 1$. The design of [KUNG78] uses a bidirectional chain 
of processors. An improved version is described in [LEIS83]. 
For this, $P = \lceil w/2 \rceil$, $B = \lceil w/2 \rceil + 2$, 
$T_C = 2n + w - 2$, $T_D = 2(n + w - 1)$, 
$R_C \approx 1 + w/(2n) \approx 1$, $R_D \approx 1 + 3/w + w/n \approx 1$ and 
$R \approx 1$. The design of [KUNG84] uses a systolic ring 
architecture to solve the simple recurrence problem. It can be 
easily extended to solve the adaptive recursive filtering 
problem. This extension has $P = \lceil w/2 \rceil$, $B = \lceil w/2 \rceil + 1$, 
$T_C = 2(n - 1) + w$, $T_D = 2(n + w - 1) + 1$, 
$R_C \approx 1 + w/(2n) \approx 1$, $R_D \approx 1 + 1/w + w/n \approx 1$ and 
$R \approx 1$. 


While all the above designs have an $R \approx 1$, the broadcast 
chain of [HUAN82] has a $T_C$ and $T_D$ that is about half 
that of the other designs. In this paper, we develop a 
bidirectional chain VLSI system that has the same (actually 
slightly smaller) $T_D$ and $T_C$ as the broadcast chain of 
[HUAN82]. For our design, $P = w$, $B = w + 1$, 
$T_C = n + \lceil w/2 \rceil$, $T_D = n + w + 1$, 
$R_C \approx 1 + w/(2n) \approx 1$, $R_D \approx 1 + w/n \approx 1$ and $R \approx 1$. 
Our design shows that a broadcast chain is not required to 
obtain this $T_C$ and $T_D$ performance. 
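
The efficiency ratios are easy to check numerically. The sketch below evaluates $R_C = P T_C / C$, $R_D = B T_D / D$ and $R = R_C R_D$ with $C = nw$ and $D = nw + n + w$ for the figures quoted for the proposed design; the function names are illustrative, and the bracket in $T_C$ is read as a ceiling, which is an assumption.

```python
import math


def ratios(P, B, T_C, T_D, n, w):
    """Return (R_C, R_D, R) for a design solving an n-by-w problem."""
    C = n * w              # single-processor computation time
    D = n * w + n + w      # total data moved between host and VLSI system
    R_C = P * T_C / C
    R_D = B * T_D / D
    return R_C, R_D, R_C * R_D


def proposed_design(n, w):
    """Figures quoted for the bidirectional chain developed in this paper."""
    return ratios(P=w, B=w + 1,
                  T_C=n + math.ceil(w / 2), T_D=n + w + 1,
                  n=n, w=w)


if __name__ == "__main__":
    print(proposed_design(10_000, 10))   # all three values slightly above 1
    n, w = 100, 4
    # One processor and unit bandwidth: perfect ratios, no speed-up.
    print(ratios(1, 1, n * w, n * w + n + w, n, w))
```

For w much smaller than n the three ratios stay close to 1, in line with the approximations $1 + w/(2n)$ and $1 + w/n$ given above.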


2. o(n) Throughput Bidirectional Chain 


An o(n) throughput bidirectional chain for the adaptive 
recursive filtering problem can be obtained by extending the 
systolic design of [ROBE84] for the nonadaptive recursive 
filtering problem. This extension requires us to recast (1) 
into the following form: 


$x_i = \sum_{j=1}^{w} a_{ij}\, x_{i+j-w-1}$ 

$\quad = \sum_{j=1}^{w-1} a_{ij}\, x_{i+j-w-1} + a_{iw} \sum_{j=1}^{w} a_{i-1,j}\, x_{i+j-w-2}$ 

$\quad = \sum_{j=1}^{w-1} a_{ij}\, x_{i+j-w-1} + a_{iw} \sum_{j=0}^{w-1} a_{i-1,j+1}\, x_{i+j-w-1}$ 

$\quad = \sum_{j=0}^{w-1} b_{ij}\, x_{i+j-w-1}, \quad i > 1$   (2) 

where 

$b_{i0} = a_{iw}\, a_{i-1,1}, \quad b_{ij} = a_{ij} + a_{iw}\, a_{i-1,j+1}, \quad 1 \le j \le w-1$   (3) 


Qn =1, a = 255 Su, ty =7%q (4) 


To calculate the $b_{ij}$'s of (3) dynamically, w PEs in addition 
to the w + 1 PEs used in [ROBE84] are needed. The performance 
figures of the resulting VLSI system are 
$P = 2w + 1$, $B = 2w + 3$, $T_C \approx n + \lceil w/2 \rceil$, 
$T_D \approx n + w$, $R_C \approx 2 + 1/w + w/n \approx 2$, 
$R_D \approx 2 + 1/w + w/n \approx 2$ and $R \approx 4$. 
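
The recast form can be checked against the original recurrence numerically. The sketch below assumes the reading $b_{i0} = a_{iw} a_{i-1,1}$ and $b_{ij} = a_{ij} + a_{iw} a_{i-1,j+1}$ for $1 \le j \le w-1$, and tests the identity only for $i \ge 2$, where it follows by substituting the expression for $x_{i-1}$; the helper names are illustrative.

```python
import random


def solve_by_recurrence_1(A, x_init):
    """x_i = sum_{j=1..w} a_ij x_{i+j-w-1}; returns a dict index -> value."""
    n, w = len(A), len(x_init)
    x = {k - w + 1: x_init[k] for k in range(w)}   # x_{1-w}, ..., x_0
    for i in range(1, n + 1):
        x[i] = sum(A[i - 1][j - 1] * x[i + j - w - 1] for j in range(1, w + 1))
    return x


def b(A, i, j):
    """Coefficient b_ij of the recast recurrence (1-based i, 0-based j)."""
    w = len(A[0])

    def a(r, c):
        return A[r - 1][c - 1]

    if j == 0:
        return a(i, w) * a(i - 1, 1)
    return a(i, j) + a(i, w) * a(i - 1, j + 1)


def recurrence_2_holds(A, x_init):
    """Check x_i = sum_{j=0..w-1} b_ij x_{i+j-w-1} for every i >= 2."""
    n, w = len(A), len(x_init)
    x = solve_by_recurrence_1(A, x_init)
    return all(
        x[i] == sum(b(A, i, j) * x[i + j - w - 1] for j in range(w))
        for i in range(2, n + 1)
    )


if __name__ == "__main__":
    random.seed(0)
    A = [[random.randint(-3, 3) for _ in range(4)] for _ in range(6)]
    x0 = [random.randint(-3, 3) for _ in range(4)]
    print(recurrence_2_holds(A, x0))
```

Integer coefficients are used so that the comparison is exact. The point of the recast form is that $x_i$ no longer depends on $x_{i-1}$, only on $x_{i-2}$ and earlier values.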


Improved performance can be obtained by using the 
bidirectional chain architecture of Figure 2.1. All the even 
numbered PEs are on the left, while all the odd numbered 
PEs are on the right. The output is generated from the middle 
PE, PE(w). The PEs to the left of PE(w) compute all 
terms involving even columns of A, while PEs on the right 
compute all terms involving odd columns of A. The case 
when w is odd is shown in Figure 2.1(a). The case when w 
is even is shown in Figure 2.1(b). 


Figure 2.1: Bidirectional chain architecture. (a) w is odd; (b) w is even. 
The middle processor, PE(w), has five registers: A, 
V, X, Y and Z. The remaining PEs have three registers 
(A, X and Y) each. We use the notation R(i) to denote 
register R, $R \in \{A, V, X, Y, Z\}$, of PE(i). The A register 
of each PE is used to hold an input value from the A 
matrix. PE(i) receives input from column i of A only, 
$1 \le i \le w$. The X register of each PE holds an $x_j$ value 
while the Y registers hold partial sums in the computation 
of an $x_j$. In each cycle, the X(i)s move one step away from 
the center PE, PE(w), while the Y(i)s move one step 
towards this PE. 


The working of the VLSI system is described formally in 
Algorithm 2.1. The first for loop sets up the initial 
configuration. The three steps in the parallel do are 
executed simultaneously. When this for loop terminates, 
PE(w) contains $x_p$ for $p = \lceil (w-1)/2 \rceil - w 
= \lceil -(w+1)/2 \rceil$ in its X register. The X register of a 
PE that is a units away from PE(w) contains $x_{p-a}$. The 
second for loop contains two sets of concurrently executed 
statements. In the first set, i.e. the first parallel do, essentially 
five concurrent activities are performed in each iteration of 
this loop: 


(1) PE(w) either inputs an $x_i$, $i \le 0$, or outputs a newly 
computed $x_i$, $i > 0$. 


(2) All X values move one PE away from the middle PE. 


(3) Each PE inputs an A value. Note that we assume 
$a_{ij} = 0$ for $i \le 0$. 


(4) All Y values move one PE towards the middle PE. 
However, the Y value from PE(w - 1) is moved to the 
Z register of PE(w) rather than to its Y register (this 
latter register receives the Y value from PE(w - 2)). 
The boundary PEs (1 and 2) reset their Y registers to 
zero. 


(5) From the data patterns of Figure 2.1(a) and (b), we 
observe that if the Y value in PE(w - 1) is a partial 
sum for $x_i$, then that in PE(w - 2) is a partial sum for 
$x_{i+1}$. Hence, Y(w) and Z(w) contain incompatible 
partial sums. The partial sum in Y(w) is to be used in 
the next iteration. V(w) is used to save the previous 
value of Y(w). Consequently, V(w) and Z(w) contain 
partial sums for the same $x_i$. 

In the second parallel do set of statements, either a 
new term is added to a partial sum Y(i) or a new $x_i$ is 
computed. PE(w) computes a new $x_i$ by computing 
(V(w) + Z(w)) and A(w) * X(w) in parallel. The two 
results are then added (the operations may also be pipelined). 
Assuming that the time for an addition is no more than that 
for a multiply, the computation performed in PE(w) takes 
the same time as that performed in the other PEs. 


Figure 2.2 is a timing diagram for the case w = 5, 
where j refers to the for loop index of Algorithm 2.1. For 
each PE, the contents of its X and Y registers following the 
execution of the for loop for that j value are shown. The 
notation [i, p] denotes $\sum_{j \le p,\ j\ \mathrm{odd}} a_{ij}\, x_{i+j-w-1}$ for PEs on the right 
of PE(w) and $\sum_{j \le p,\ j\ \mathrm{even}} a_{ij}\, x_{i+j-w-1}$ for PEs on the left. V(w) 
contains the sum of odd terms (as w is odd), while Z(w) 
contains the sum of even terms (as w is odd). 


The performance figures of this design are $P = w$, 
$B = w + 1$, $T_C = n + \lceil w/2 \rceil$, $T_D = n + w + 1$, 
$R_C \approx 1 + w/(2n) \approx 1$, $R_D \approx 1 + w/n \approx 1$ and $R \approx 1$. 


for j+1 to [ (w -—1)/2 | do 
do in parallel 
KW) Bey 
X(w -1)+ X(w) 
X(i)+ X(t + 2), 
end 
end 
forj<—[(w +1)/2 |ton +w do 
do in parallel 
case 
j <wet+l1: X (w)— 2; _» 
j =w+1:X(w)+— 2, 
j > w +1: output X(w) { output 2; _, _,} 
endcase 
X(w -1)+ X(w) 
X(t) X(t +2),1<1 cw -2 
AWN Oe sigs w 
A(t) a3 4 Lw-iye J¢4i-wi 1 St Sw-l 
Y (1) — Y (2) 0 
Y(Q)+- YQ -2),3<i1<w 
V(w)+ Y(w) 
Z(w)+ Y(w —-1) 
end 
do in parallel 
YQa)+ YO)+AQ)* XC@)1I<t Sw -l 
X(w) ++ (V(w)+Z(w))+A (w)*X(w) jg Swi 
end 
end 
output X(w) { output 2, } 


1<i<w-2 


Algorithm 2.1 


Figure 2.2: Timing diagram for w = 5 
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3. Conclusions 


We have developed a VLSI system for the adaptive 
recursive filtering problem that has $T_C$ and $T_D$ that are o(n) 
and also has $R \approx 1$. Previously, this had been done only for 
the case of VLSI systems using the broadcast capability. 
Our design does not employ this capability. The performance 
characteristics for various VLSI systems for the adaptive 
recursive filtering problem are summarized in Table 3.1. 


Perf    Broadcast Chain   Bidirectional Chain   Systolic Ring     Our Design 
        [HUAN82]          [LEIS83]              [KUNG84] 

P       w                 ⌈w/2⌉                 ⌈w/2⌉             w 
B       w + 2             ⌈w/2⌉ + 2             ⌈w/2⌉ + 1         w + 1 
T_C     n + w - 1         2n + w                2(n - 1) + w      n + ⌈w/2⌉ 
T_D     n + w             2(n + w - 1)          2(n + w - 1)      n + w + 1 


Table 3.1 


Further improvement in throughput (at the expense of 
design complexity) is possible. However, this cannot be 
obtained using recurrence (1) as in order to compute $x_i$, we 
need to know $x_{i-1}$. Hence $x_i$ can be computed, at best, one 
cycle after $x_{i-1}$ has been computed. However, we can bring 
both $T_C$ and $T_D$ down to o(n/2) by computing two $x_i$s 
each cycle using recurrence (2) and formulae (3) and (4). The 
idea is the same as used in the back substitution problem in 
[CHEN84b]. The VLSI system that incorporates this uses 
more hardware and is quite a bit more complex. The method 
may be extended to get a $T_C$ and $T_D$ of o(n/k) for any 
fixed k. 
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ABSTRACT 


This paper presents two massively parallel processing 
architectures suitable for solving a wide variety of divide-
and-conquer type algorithms for problems such as the Discrete 
Fourier Transform, Production Systems, Design Automation 
and others. The first architecture, called the Chain-structured 
Butterfly ARchitecture (CBAR), consists of a two-dimensional 
array of $N = L(\log_2(L) + 1)$ processing elements (PE) organized 
as L levels of $\log_2(L) + 1$ stages, and which has the 
butterfly connection between PEs in consecutive stages with 
straight-through feedback between PEs in the last and first 
stages. This connection system has the desirable property of 
allowing thousands of PEs to be connected with O (N ) connec- 


tion cost, O (log o( — 


W )) communication paths and a small 


number (=4) of I/O ports per PE. However, this architecture is 
not fault tolerant. We, therefore, propose a second architecture, 
called the REconfigurable Chain-structured Butterfly 
ARchitecture (RECBAR), which is a modified version of the 
CBAR. The RECBAR possesses all the desirable features of the 
CBAR, with the number of I/O ports per PE increased to six, 
and uses $O(\log N / N)$ overhead in PEs and approximately 50% 
overhead in links to achieve single-level fault tolerance. 
Reliability improvements of the RECBAR over the CBAR are 
studied. 


1. INTRODUCTION 


Recent developments in technology have made it possible 
to interconnect a large number of processing elements in order 
to form an integrated system. Various network architectures 
have been proposed that are suitable for both multiprocessors 
and VLSI systems [1, 2]. In this paper, we present a massively 
parallel processing architecture suitable for solving a wide 
variety of divide-and-conquer type algorithms for problems in 
signal processing, production systems, design automation and 
others. This architecture, called the Chain-structured 
Butterfly ARchitecture (CBAR), has the desirable property of 
allowing thousands of PEs to be connected with O (NW ) connec- 


tion cost, O Clog ( 


number (=4) of I/O ports per PE. 


One attribute that is desirable in any complex parallel 
processing system but missing in the CBAR is fault-tolerance. 
Fault tolerant network architectures are emerging as an important 
area of study [3, 4, 5]. We, therefore, propose a second 
architecture, called the REconfigurable Chain-structured 
Butterfly ARchitecture (RECBAR), which is a modified version 
of the CBAR. The RECBAR possesses all the desirable features 
of the CBAR, with the number of I/O ports per PE increased to 
six, and uses $O(\log N / N)$ overhead in PEs and approximately 
50% overhead in links to achieve fault tolerance. 


2. CHAIN-STRUCTURED BUTTERFLY ARCHITECTURE 


The chain-structured butterfly architecture (CBAR) is a 
novel parallel processing architecture which has processors in 
place of exchange switches in a regular butterfly interconnection 
network. In addition, the output ports of the processors 
in the last stage of the CBAR are fed back to the input ports 
of the first stage. The network resembles the Loop Structured 
Switching Network proposed by Wong and Ito [6] except that 
there is no unshuffle while traversing the feedback path. The 
organization of a processing element (PE) is similar to that in 
the MAN-YO Architecture [7]; each PE consists of the actual 
processor and a router cell, the latter managing packet 
transmission among different PEs. The router cell is basically 
a store and forward crossbar switch with three input and three 
output ports, one input-output pair being solely used for the 
processor, and the remaining two port pairs being network 
ports, used for connection among router cells. 


2.1. Network Topology 


The CBAR is a two-dimensional array of 
$N = L(\log_2(L) + 1)$ PEs arranged in L levels and 
$S = \log_2(L) + 1$ stages. The following description can be easily 
understood if the reader refers to Fig. 1, which shows the network 
for N = 32 and L = 8. The stages are labeled in a 
sequence from 0 to $\log_2(L)$, with 0 for the leftmost stage and 
$\log_2(L)$ for the rightmost stage. Similarly the PE levels are 
labeled in a sequence from 0 to L - 1. Each PE can be 
uniquely identified by the stage and level to which it belongs: 
<i,j> represents a PE in the ith stage and jth level. The 
integers i and j can be represented by their binary equivalents 
with $s = \log_2(S)$ and $l = \log_2(L)$ bits, respectively. Hence the 
processor <i,j> has a binary address $[i_{s-1} \cdots i_0 \,.\, p_{l-1} \cdots p_0]$. An 
output link of a PE is referred to as the '0 link' if it is connected 
to the upper output port, and a '1 link' if it is connected to the 
lower output port. 


The topology describing rules of the network are defined
as follows:
(1) For link 0: CBAR $\langle i, j \rangle = \langle i + 1,\; j - 2^i\,\mathrm{bit}_i(j) \rangle$.
(2) For link 1: CBAR $\langle i, j \rangle = \langle i + 1,\; j + 2^i\,[1 - \mathrm{bit}_i(j)] \rangle$.

for $0 \le i \le \log L - 1$, $0 \le j \le L - 1$,

where $\mathrm{bit}_i(j)$ equals the bit with weight $2^i$ in the binary
representation of $j$.
(3) For the feedback links, 0 and 1:

CBAR $\langle \log L, j \rangle = \langle 0, j \rangle$, for $0 \le j \le L - 1$.


The labeling scheme and interconnection of the PEs can be 
verified in the example shown in Fig. 1. 
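
The three topology rules can be written as a short function. The following is a minimal sketch in Python, not part of the original paper; the names `bit` and `cbar_next` are hypothetical, and PEs are represented as `(stage, level)` tuples.

```python
from math import log2

def bit(i, j):
    """Bit of weight 2**i in the binary representation of j."""
    return (j >> i) & 1

def cbar_next(i, j, link, L):
    """PE reached from <i, j> over output link 0 or 1 in a CBAR with L levels."""
    last = int(log2(L))              # index of the rightmost stage
    if i == last:                    # rule (3): feedback links return to stage 0
        return (0, j)
    if link == 0:                    # rule (1): bit i of the level becomes 0
        return (i + 1, j - (1 << i) * bit(i, j))
    return (i + 1, j + (1 << i) * (1 - bit(i, j)))   # rule (2): bit i becomes 1

# Successors of every stage-0 PE in the 8-level network of Fig. 1.
L = 8
for j in range(L):
    print(j, cbar_next(0, j, 0, L), cbar_next(0, j, 1, L))
```

Taking link 0 always clears bit i of the level address and taking link 1 always sets it, which is the property the routing results of Section 2.2 rely on.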


2.2. Properties of the CBAR 


This section states some useful results concerning the
behavior of the CBAR for a single packet being routed from a
source PE to a destination PE. We define a 'step' in packet
routing as the transfer of a packet from a (network) input port
buffer or the processor output port buffer of a PE router cell to
a (network) input port buffer of the next PE router cell. This
involves setting the router cell to use either the 0 or the 1
link of the PE. The process of choosing the 0 or the 1 link of a
PE in the $i$th stage can be viewed as modifying the $i$th bit of
the address of the level in which the packet currently resides.
For the remaining part of this section we consider a CBAR
which has $L$ levels and a packet which is generated at the PE
$\langle i, p_{l-1} \cdots p_1 p_0 \rangle$ and destined for the PE
$\langle i', p'_{l-1} \cdots p'_1 p'_0 \rangle$, where $l = \log L$ and $0 \le i \le l$.


Given two bit strings, STR1 and STR2, the maximum
length common suffix of STR1 and STR2 will be denoted as
MLCS(STR1, STR2). Given a bit string STR, a cyclic shift
left by $k$ bits followed by a bit reversal will be denoted by
$CSLR_k(STR)$. The length of a string STR is denoted by |STR|.
The Augmented Source Bit String (ASBS) and the Augmented
Destination Bit String (ADBS) are defined as $(p_l\, p_{l-1} \cdots p_0)$ and
$(p'_l\, p'_{l-1} \cdots p'_0)$, respectively, where $p_l$ and $p'_l$ are extra bits
that are added to model the feedback connection of the CBAR
network. The extra bits are equal to each other and can be
assigned either 0 or 1.


It can be shown that the packet will be routed to the
destination PE in exactly $k + [(i' - i - k) \bmod (l + 1)]$ steps
after its generation at the source PE, where
$k = l + 1 - |MLCS(CSLR_{l-i+1}(ASBS),\, CSLR_{l-i+1}(ADBS))|$.
It follows that in a CBAR with $L$ levels, a packet will be
delivered to its destination within $2 \log L + 1$ steps of
routing regardless of where it is generated. This allows us to
conclude that the CBAR network has $O(\log(N / \log N))$
communication between PEs.
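
The step count can be checked numerically. The sketch below (Python, hypothetical helper names) evaluates the formula with the CSLR shift taken as $l - i + 1$ and compares it with a hop-by-hop simulation. The simulation assumes the simplest routing policy consistent with the text: each PE in stage $i < l$ picks the link that copies destination bit $i$ into the level address, and stage $l$ uses the feedback link. The extra bit is fixed to 0 for both strings.

```python
def cslr(bits, k):
    """Cyclic shift left by k positions, then reverse the string."""
    k %= len(bits)
    return (bits[k:] + bits[:k])[::-1]

def mlcs_len(a, b):
    """Length of the maximum-length common suffix of two equal-length strings."""
    n = 0
    while n < len(a) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def steps_formula(i, j, i2, j2, l):
    """k + [(i' - i - k) mod (l + 1)] with k = l + 1 - |MLCS(...)|."""
    asbs = format(j, f"0{l + 1}b")    # extra leading bit p_l fixed to 0
    adbs = format(j2, f"0{l + 1}b")   # same extra bit for the destination
    k = l + 1 - mlcs_len(cslr(asbs, l - i + 1), cslr(adbs, l - i + 1))
    return k + (i2 - i - k) % (l + 1)

def steps_simulated(i, j, i2, j2, l):
    """Count hops from <i, j> to <i2, j2> under the assumed routing policy."""
    steps = 0
    while (i, j) != (i2, j2):
        if i == l:
            i = 0                                    # feedback link
        else:
            j = (j & ~(1 << i)) | (j2 & (1 << i))    # take link 0 or link 1
            i += 1
        steps += 1
    return steps

l = 3  # L = 8 levels, stages 0..3
pes = [(i, j) for i in range(l + 1) for j in range(2 ** l)]
worst = max(steps_formula(i, j, i2, j2, l) for i, j in pes for i2, j2 in pes)
print("worst case:", worst, "bound 2*log(L) + 1:", 2 * l + 1)
```

For example, a packet generated at stage 1 that must change bit 0 and end at stage 0 has to pass stage 0 twice, which is the case that reaches the bound.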


Fig. 1. An 8x4 CBAR with name representation


3. THE RECONFIGURABLE-CHAIN STRUCTURED 
BUTTERFLY ARCHITECTURE 


We now propose a modified version of the CBAR which 
we call the reconfigurable chain-structured butterfly architec- 
ture (RECBAR). 


3.1. Network Topology 


The RECBAR can be formally described as follows. The
network consists of $N = (L + 1)(\log L + 1)$ PEs arranged in
$L + 1$ levels and $\log L + 1$ stages. The stages are numbered
in a sequence from 0 to $\log L$, with 0 for the leftmost stage
and $\log L$ for the rightmost stage. Similarly, the levels are
labeled in a sequence from 0 to $L$. Each PE can be uniquely
identified as $\langle i, j \rangle$, where $i$ denotes the stage and $j$ the
level to which the PE belongs. Each PE has three input and
three output ports, except those in the first and last stages,
which have two input and three output ports, and three input
and two output ports, respectively. An 'upper link' is attached
to the upper output port of a PE, a 'middle link' to the middle
output port and a 'lower link' to the lower output port of the
PE.


The topology describing rules of the network are as follows:
(1) For an upper link:

RECBAR $\langle i, j \rangle = \langle i + 1,\; (j - 2^i) \bmod (L + 1) \rangle$,
(2) For a middle link: RECBAR $\langle i, j \rangle = \langle i + 1,\; j \rangle$,
(3) For a lower link:

RECBAR $\langle i, j \rangle = \langle i + 1,\; (j + 2^i) \bmod (L + 1) \rangle$,

for $0 \le i \le \log L - 1$, $0 \le j \le L$.

(4) The feedback links:

RECBAR $\langle \log L, j \rangle = \langle 0, j \rangle$, for $0 \le j \le L$.
The reader may verify the PE interconnections in the example
of Fig. 2. It should be noted at this point that the RECBAR
network resembles the Inverse Augmented Data Manipulator
(IADM) Network proposed by McMillen and Siegel [8], with the
following three differences. First, the exchange switches in the
IADM are replaced by processing elements. Secondly, the
IADM has $L$ levels, whereas the RECBAR has $L + 1$ levels.
Thirdly, we use only $L$ of the levels during operation, the
extra level being included just for reconfiguration, whereas in
the IADM all $L$ levels are used simultaneously and there is no
notion of reconfiguration.

Fig. 2. A 9x4 RECBAR with name representation


3.2. The RECBAR operation under no fault 


We will show that the RECBAR can emulate the CBAR
under no fault conditions. The level of PEs labeled $L$ is not
considered, so that we have $L(\log L + 1)$ PEs arranged in $L$
levels and $\log L + 1$ stages, as required by the CBAR. Now
any PE $\langle i, j \rangle$ ($i \ne \log L$) in the CBAR is connected to PEs
$\langle i + 1, j \rangle$ and either $\langle i + 1, j - 2^i \rangle$ or $\langle i + 1, j + 2^i \rangle$,
depending on the value of the $i$th bit of the binary word
corresponding to $j$. In the RECBAR, we have the PE $\langle i, j \rangle$ connected
to PEs $\langle i + 1, j \rangle$, $\langle i + 1, j - 2^i \rangle$ and $\langle i + 1, j + 2^i \rangle$, the
modulo operation coming into effect only if $j - 2^i$ is negative
or $j + 2^i$ exceeds $L$, which would not be the case for the
CBAR addressing scheme. The feedback links remain the same
in both the networks. All one needs to do is to select the two
proper output links out of the three available in each PE
(except for the last stage), and the RECBAR would operate as
the CBAR.


The output link selection is based on the CBAR topology
rules developed in Section 2.1. Address each PE $\langle i, j \rangle$
($i \ne \log L$) by the binary representations of the stage and
level to which it belongs. If the $i$th bit of the level address is a
0, choose the lower two output links, assigning a 0 to the middle
link and a 1 to the lower link. If the $i$th bit is a 1, choose
the upper two output links, assigning a 0 to the upper link and
a 1 to the middle link. The significance of the 0 and the 1 links
is the same as in the CBAR. Fig. 2 shows the RECBAR
configured for operation under no fault. The continuous lines
denote the links in use and the dotted lines denote the redundant
links.
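
The selection rule can be checked exhaustively for a small network. The following is a sketch in Python with hypothetical function names; it builds the three RECBAR successors of each PE, applies the selection rule above, and compares the result with CBAR rules (1) and (2) of Section 2.1 for $L = 8$.

```python
from math import log2

def recbar_links(i, j, L):
    """Upper, middle and lower successors of <i, j> for a non-final stage i."""
    m = L + 1
    return {"upper": (i + 1, (j - (1 << i)) % m),
            "middle": (i + 1, j),
            "lower": (i + 1, (j + (1 << i)) % m)}

def select_links(i, j, L):
    """Map the CBAR '0 link' and '1 link' onto two of the three RECBAR links."""
    links = recbar_links(i, j, L)
    if (j >> i) & 1 == 0:
        return links["middle"], links["lower"]   # bit i = 0: lower two links
    return links["upper"], links["middle"]       # bit i = 1: upper two links

def cbar_links(i, j):
    """CBAR rules (1) and (2) for a non-final stage i."""
    b = (j >> i) & 1
    return (i + 1, j - (1 << i) * b), (i + 1, j + (1 << i) * (1 - b))

L = 8
stages = int(log2(L))   # stages 0 .. log(L) - 1 have selectable output links
ok = all(select_links(i, j, L) == cbar_links(i, j)
         for i in range(stages) for j in range(L))
print("RECBAR emulates CBAR with no fault:", ok)
```

Level $L$ is never selected as a target here, which matches the statement that the spare level is unused when there is no fault.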


3.3. The RECBAR operation under a single fault 


We will show that the RECBAR can emulate the CBAR in 
the presence of faults in a single PE level. 


THEOREM 1: The RECBAR can be reconfigured to emulate 
the CBAR under failure of any single level. 


PROOF: Suppose a level $r$ of PEs fails for some
$0 \le r \le L - 1$. We will show that it is possible to reconfigure
the RECBAR such that the level $r$ is removed and the spare
level $L$ is brought in instead. We transform the addresses of
the PEs by the following simple transformation: the PE whose
old address was $\langle i, j \rangle$ is assigned the new address
$(i, j - r - 1)$, where the subtractions are performed
modulo $(L + 1)$. For example, the PE whose old address was
$\langle i, r + 1 \rangle$ is assigned the new address $(i, 0)$ and the PE
whose old address was $\langle i, r - 1 \rangle$ is assigned the new address
$(i, L - 1)$, since $(r - 1 - r - 1) \bmod (L + 1)$
$= (-2) \bmod (L + 1) = L - 1$.


In order that the CBAR structure be realized, we need to
have the following connections for the PEs in the new addressing
scheme:

(1) $\langle i, j \rangle$ to $\langle i + 1, j \rangle$,

(2) $\langle i, j \rangle$ to $\langle i + 1, j - 2^i \rangle$, if $\mathrm{bit}_i(j) = 1$,

(3) $\langle i, j \rangle$ to $\langle i + 1, j + 2^i \rangle$, if $\mathrm{bit}_i(j) = 0$,

(4) $\langle \log L, j \rangle$ to $\langle 0, j \rangle$,

for $0 \le i \le \log L - 1$, $0 \le j \le L - 1$.
This implies that the original network should have the following
connections:
(1) $\langle i, (j + r + 1) \bmod (L + 1) \rangle$
to $\langle i + 1, (j + r + 1) \bmod (L + 1) \rangle$,
(2) $\langle i, (j + r + 1) \bmod (L + 1) \rangle$
to $\langle i + 1, (j + r + 1) \bmod (L + 1) \pm 2^i \rangle$, *
(3) $\langle \log L, (j + r + 1) \bmod (L + 1) \rangle$
to $\langle 0, (j + r + 1) \bmod (L + 1) \rangle$,
for $0 \le i \le \log L - 1$, $0 \le [(j + r + 1) \bmod (L + 1)] \le L - 1$.
From the definition of the RECBAR, these connections are
present, where $j$ is replaced by $(j + r + 1) \bmod (L + 1)$. □


EXAMPLE 1: Consider a RECBAR with $L = 8$. Let level 3
become faulty. We rename level 4 as 0, level 5 as 1, ..., level 8
as 4, level 0 as 5, ..., and level 2 as 7. Let us see whether we
have proper connections for the PE <2,2> in the new addressing
scheme. We need the connections <2,2> to <3,2> and
<3,6>. These correspond to <2,6> to <3,6> and <3,1> in
the original network, since $(j + r + 1) \bmod (L + 1)$
$= (j + 4) \bmod 9$, which is equal to 6 for $j = 2$ and is equal to 1
for $j = 6$. We notice that these connections are present in the
RECBAR of Fig. 2. Fig. 3 illustrates the reconfigured RECBAR
for a fault in level 3.


The selection of the proper output ports in each PE is
straightforward. Let level $r$ be the faulty level, for some
$0 \le r \le L - 1$. Compute $(j - r - 1) \bmod (L + 1)$ for each PE
$\langle i, j \rangle$ and represent the value obtained by its binary
equivalent. Bit $i$ of this binary word then determines the
ports to be selected, the selection process being similar to the
one discussed in Section 3.2. The reader may verify the port
selection on the example of Fig. 3.
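
The port selection under a fault can be verified for every faulty level. The sketch below (Python, hypothetical function names) renames the levels with $(j - r - 1) \bmod (L + 1)$, selects ports from bit $i$ of the renamed level as described above, and checks that the selected links form a CBAR in the new addressing scheme without touching level $r$.

```python
def new_address(j, r, L):
    """Logical level assigned to physical level j when level r is faulty."""
    return (j - r - 1) % (L + 1)

def physical(jn, r, L):
    """Inverse map: physical level that carries logical level jn."""
    return (jn + r + 1) % (L + 1)

def selected_targets(i, j, r, L):
    """Physical successors chosen at physical PE <i, j> (non-final stage)."""
    m = L + 1
    if (new_address(j, r, L) >> i) & 1 == 0:
        return (i + 1, j), (i + 1, (j + (1 << i)) % m)   # middle, lower
    return (i + 1, (j - (1 << i)) % m), (i + 1, j)       # upper, middle

def emulates_cbar(r, L, stages):
    """True if the reconfigured network realizes CBAR rules (1) and (2)."""
    for i in range(stages):
        for jn in range(L):                 # logical levels 0 .. L-1
            j = physical(jn, r, L)
            b = (jn >> i) & 1
            want = ((i + 1, jn - (1 << i) * b),
                    (i + 1, jn + (1 << i) * (1 - b)))
            got = tuple((s, new_address(t, r, L))
                        for s, t in selected_targets(i, j, r, L))
            if got != want or j == r:
                return False
    return True

L, stages = 8, 3
print(all(emulates_cbar(r, L, stages) for r in range(L)))
# Example 1: with r = 3, logical <2,2> is physical <2,6>.
print(selected_targets(2, physical(2, 3, L), 3, L))
```

The second line of output reproduces Example 1: the physical PE <2,6> uses its middle link to <3,6> and its lower link to <3,1>.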
Fig. 3. A reconfigured 9x4 RECBAR for a fault in level 3


* The addition or subtraction of $2^i$ depends on
$\mathrm{bit}_i[(j + r + 1) \bmod (L + 1)]$.


3.4. Structure of the Router Cell 


The router cell has three network input-output port pairs
and one processor input-output port pair, as shown in Fig. 4.
A 4x4 crossbar switch connects the input ports to the output
ports. The output port set selection logic is shown in outline
mode. It consists of two 1-to-2 demultiplexers controlled by a
common port set selection line. The operation of the router
cell is clear from the figure itself and need not be explained.


3.5. Overhead


Consider a CBAR with $L$ levels. It has
$N = L(\log L + 1)$ PEs and $2N$ links. The corresponding
RECBAR has $N' = (L + 1)(\log L + 1)$ PEs and $3N' - (L + 1)$
$= (L + 1)(3 \log L + 2)$ links. The PE overhead ratio comes
out to be equal to $\frac{1}{L}$ and the link overhead ratio comes out to
$\frac{(L + 3) \log L + 2}{2L(\log L + 1)}$. In the limit as $L$ tends to a
large value, the link overhead ratio becomes nearly 50%.
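
The two ratios follow directly from the PE and link counts. As a quick arithmetic check, here is a sketch in Python (the function name `overheads` is hypothetical) that computes both ratios exactly for power-of-two values of $L$.

```python
from fractions import Fraction
from math import log2

def overheads(L):
    """Return (PE overhead ratio, link overhead ratio) of RECBAR over CBAR."""
    s = int(log2(L)) + 1                 # number of stages, log(L) + 1
    pe_cbar, pe_rec = L * s, (L + 1) * s
    link_cbar = 2 * pe_cbar              # 2N links in the CBAR
    link_rec = 3 * pe_rec - (L + 1)      # 3N' - (L + 1) links in the RECBAR
    return (Fraction(pe_rec - pe_cbar, pe_cbar),
            Fraction(link_rec - link_cbar, link_cbar))

for L in (8, 64, 1024, 2 ** 20):
    pe, link = overheads(L)
    print(L, pe, float(link))
```

For $L = 8$ this gives a PE overhead of 1/8 and a link overhead of 35/64. The link ratio behaves like $\log L / (2(\log L + 1))$ for large $L$, so it approaches 50% only slowly.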


4. RELIABILITY ANALYSIS 


We will now estimate the improvement in the reliability
of the system topology. We assume that PE lifetimes are
exponentially distributed with a failure rate of $\lambda_p$, and
link lifetimes with a failure rate of $\lambda_l$. In Fig. 5, we show
the reliabilities of the CBAR and the RECBAR for $L = 8$,
$\lambda_p = 0.1$ failures per unit time and $\lambda_l = 0.01$ failures per
unit time. We assume that the failure rate of a link is much
less than that of a PE because a link is much less complex.
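
The text does not spell out the reliability model behind Fig. 5. The sketch below (Python) is therefore only one plausible model under stated assumptions, not a reproduction of the figure: components fail independently, each level owns its PEs and their output links, the CBAR works only if all $L$ levels are fault free, and the RECBAR works if at least $L$ of its $L + 1$ levels are fault free. All function names are hypothetical.

```python
from math import exp, log2

def level_reliability(t, n_pe, n_link, lam_pe, lam_link):
    """Probability that every PE and link of one level survives to time t."""
    return exp(-(n_pe * lam_pe + n_link * lam_link) * t)

def cbar_reliability(t, L, lam_pe=0.1, lam_link=0.01):
    """All L levels must be fault free (assumed model)."""
    s = int(log2(L)) + 1
    return level_reliability(t, s, 2 * s, lam_pe, lam_link) ** L

def recbar_reliability(t, L, lam_pe=0.1, lam_link=0.01):
    """At most one of the L + 1 levels may contain faults (assumed model)."""
    s = int(log2(L)) + 1
    q = level_reliability(t, s, 3 * s - 1, lam_pe, lam_link)
    return q ** (L + 1) + (L + 1) * q ** L * (1 - q)

for t in (0.0, 0.05, 0.1, 0.2):
    print(t, round(cbar_reliability(t, 8), 4), round(recbar_reliability(t, 8), 4))
```

Under these assumptions the extra links per level slightly lower the per-level reliability of the RECBAR, but the ability to survive one faulty level outweighs that for the parameter values quoted above.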


5. CONCLUSIONS 


In this paper, we have presented a chain-structured
butterfly architecture (CBAR) similar in organization to the
architectures proposed by Wong and Ito [6] and Koike and
Ohmori [7]. The CBAR has the desirable property of being able
to connect an extremely large number of processors with
$O(N)$ connection cost, $O(\log(N / \log N))$ communication
paths, and a small number (= 4) of I/O ports per PE. We have
also discussed a reconfigurable version of the CBAR, the
RECBAR, which can tolerate any number of failures of processing
elements and links associated with a single level. This fault
tolerance is achieved at a processor overhead ratio of
$O(\log N / N)$ and a link overhead ratio of approximately 50%.
The reliability analysis has shown that the RECBAR is much
more reliable than the CBAR.


Fig. 4. The RECBAR router cell block diagram

Fig. 5. Reliability comparisons between CBAR and RECBAR
Abstract


The performance of a parallel algorithm depends on 
the interconnection topology of the target parallel system. 
An interconnection network is called reconfigurable if its
topology can be changed between different algorithm exe- 
cutions. Since communication patterns vary from one 
parallel algorithm to another, a reconfigurable network can 
effectively support algorithms with different communica- 
tion requirements. This paper describes a reconfigurable 
optical network and explains how to generate network 
topologies which are optimized with respect to a given task. 
We describe an algorithm that takes as input a task graph 
and generates as output a topology that closely matches the 
given input graph. The best, worst and average case perfor- 
mance of the algorithm is analyzed and it is shown that on 
the average the optimum topology is generated. 


1. Introduction 


The performance of a parallel algorithm depends on 
how well its communication characteristics match the inter- 
connection topology of the underlying parallel system. 
Since different algorithms exhibit different communication 
patterns, work in the area of general-purpose parallel sys- 
tem design centers on developing a network topology that 
is suitable for various communication requirements. A 
common approach is to statically connect processing ele- 
ments in a regular pattern with certain properties such as 
small diameter, easy routing and expansion, low conges- 
tion, etc. General-purpose parallel systems with static 
interconnection networks however have several limitations 
due to their fixed nature. The first limitation is that a given 
algorithm, which may be viewed as a graph with nodes 
representing processes and edges representing potential 
communications, has to be mapped onto the parallel sys- 
tem. This mapping problem in general is computationally 
intractable even for parallel systems with homogeneous 
processing elements. The second limitation is that although 
one interconnection topology may be ideal for a set of
algorithms, it may introduce unacceptable communication
delays for other algorithms even with the best possible
mapping. The third limitation is that developing parallel 
algorithms that closely match the interconnection topology 
of a target parallel system is a difficult task. A parallel sys- 
tem with a network whose interconnection topology can be 
configured to match the communication characteristics of 
an algorithm remedies these limitations. 


An interconnection network is called reconfigurable 
if its topology can be altered between different algorithm 
executions or even between different phases of the same 
algorithm execution. Examples of architectures that can be 
configured into a limited number of interconnection pat- 
terns include the MPP [1] and the CHiP [2] architecture. In 
contrast to these networks which can only realize a limited 
number of topologies, this paper will investigate the use of 
a reconfigurable network which can realize any r-regular 
graph as a network topology. Such networks are referred
to as r-reconfigurable networks.


2. Reconfigurable Optical Network 


Optics provides a way of implementing r-reconfigurable
networks. We propose using an optical network
that requires only O(nr) components to implement an
r-reconfigurable network for n processors, where n is ~100.
The network consists of nr optical transmitters (laser
diodes), nr deflectors (acousto-optic devices or mirrors) 
and nr photo-sensitive receivers (photodiodes). The deflec- 
tors can be either mirrors mounted on servo motors or 
acousto-optic devices which deflect an incoming beam at 
an angle proportional to an applied frequency. To establish 
a communication channel between processors, the source 
processor directs a control signal to its deflector. The con- 
trol signal’s value determines the deflection angle of the 
incoming laser beam so that it impinges on the photodiode 
associated with the desired processor. The transmitting 
processor then modulates the laser beam with the information
to be transmitted. The receiver detects this light beam,
demodulates it, and processes the resultant data accordingly.
Since laser beams do not interfere with each other,
the network can realize any permutation. Other researchers
have given alternate designs of optical reconfigurable net- 
works [3]. 


In the design of our reconfigurable network, there is 
a tradeoff between the cost of the network and the time 
needed to reconfigure. For example, if mirrors are used as 
deflection elements, the network has a reconfiguration time 
of ~ 1 msec; whereas, if an acousto-optic deflector is used,
the network has a reconfiguration time of ~ 1 μsec. The
two networks differ in cost by at least an order of magni- 
tude due to the cost of the acousto-optic deflector and 
related control electronics. In this paper, we consider the 
use of the network with the slower reconfiguration time. In 
particular, the time needed to make the transition from one 
configuration to another is long in comparison to the time 
between two communication events. Therefore, we assume 
that the topology changes only between different algorithm 
executions or between different phases of the same algo- 
rithm execution. 


Given that the topology remains relatively static 
throughout the execution of an algorithm, the topology 
must be chosen so as to closely match the communication 
requirements of the algorithm. In the next section we 
describe an algorithm that takes as input a task graph and 
generates as output a topology that closely matches the 
given input graph. In section 4, we briefly analyze the best, 
worst and average case performance of the algorithm. 


3. The Synthesis Problem 


We now discuss the problem of generating an inter- 
connection topology that matches closely the interconnec- 
tion patterns of a given task graph. We define an optimum 
topology (and corresponding mapping) as one that maxim- 
izes the number of pairs of communicating tasks that fall on 
pairs of directly connected processors. The definition of 
the synthesis-mapping function can be stated as follows: 

Given a connected and undirected graph $T$, $|V(T)| = n$,
such that each element of $V(T)$ is labeled $t_i$, $1 \le i \le n$, and
a set $V(P)$ of $n$ nodes labeled $v_i$, $1 \le i \le n$, and a function
$g: V(T) \to V(P)$, where $g(t_i) = v_i$, and an integer $r \ge 2$,
then the problem is to find a function
$c: V(P) \times V(P) \to \{0, 1\}$ such that $P = \{V(P), E(P)\}$,
where $E(P) = \{(u, v) : c(u, v) = 1\}$, is a degree bounded
connected graph with $\mathrm{degree}(v_i) \le r$ for all
$v_i \in V(P)$, $1 \le i \le n$, and

cardinality $= |\{(u, v) : (u, v) \in E(T) \text{ and } (g(u), g(v)) \in E(P)\}|$

is maximum.

T is a task graph that models the algorithm to be 
executed on the system. The vertices in the graph 
correspond to individual tasks and an edge between two 
vertices signifies that communication occurs between these 
two tasks. No attempt has been made to quantize the 
amount or cost of communication between tasks. The pro- 
cessor system is also represented as a graph with vertices 
corresponding to processors and edges to communication 
links. The problem is to find the set of edges (communication
links) to maximize the number of pairs of
intercommunicating tasks that fall on pairs of directly connected
processors. At the same time, because of the limited
number of links each processor controls, the degree of each
vertex must be bounded. Also, the resultant topology must be
connected since a message sent to a processor that is not 
directly connected to a source processor must be forwarded 
through intermediate processors. Graphs meeting these 
conditions will be referred to as degree constrained con- 
nected graphs (DCCG). The problem of finding such 
graphs will be referred to as the DCCG problem. 
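
To make the objective concrete, the following sketch (Python, hypothetical function names) evaluates a candidate topology against the definition: it counts the task-graph edges that land on direct links and checks the degree bound and connectivity. It assumes the identity mapping $g(t_i) = v_i$ and nodes numbered 1 to $n$; it only scores a given topology and does not search for an optimal one.

```python
from collections import deque

def cardinality(task_edges, topo_edges):
    """Number of task-graph edges whose endpoints are directly linked."""
    topo = {frozenset(e) for e in topo_edges}
    return sum(1 for e in task_edges if frozenset(e) in topo)

def is_dccg(n, topo_edges, r):
    """True if the topology on nodes 1..n is connected with all degrees <= r."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in topo_edges:
        adj[u].add(v)
        adj[v].add(u)
    if any(len(nb) > r for nb in adj.values()):
        return False
    seen, queue = {1}, deque([1])
    while queue:                       # breadth-first search from node 1
        for w in adj[queue.popleft()]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == n

# A star-shaped task graph cannot be realized directly when r = 2.
task = [(1, 2), (1, 3), (1, 4), (1, 5)]
topo = [(2, 1), (1, 3), (3, 4), (4, 5)]    # a path, so every degree is <= 2
print(is_dccg(5, topo, 2), cardinality(task, topo))
```

In this example the path keeps only two of the four task edges as direct links, which is the best a degree-2 topology can do for the center node.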


We have shown in [4] that the DCCG problem is
NP-complete. Although the DCCG problem is NP-complete,
the synthesis problem can be solved in polynomial
time using graph matching methods if the condition
that the resultant graph be connected is removed [5]. The
outline of an algorithm [4] for finding sub-optimal DCCGs
is as follows: First, use a matching technique to find an
optimal disconnected topology (henceforth referred to as a
maximal deficient r-factor or MDRF). Then, using heuris- 
tics, connect the disconnected components of the MDRF 
together. The connection of disconnected components may 
require that existing communication links interconnecting 
communicating tasks be broken. When links are broken, 
our algorithm ensures that further discontinuities are not 
introduced. The algorithm runs in the time complexity of 
O (n°). 


4. Analysis of the Synthesis Algorithm 

In the best case, our algorithm can generate topologies
with a cardinality of $\lfloor nr/2 \rfloor$. In the worst case, we can
bound the cardinality with respect to the optimum cardinality.
To do this we note that the cardinality of the MDRF is
at least as large as the cardinality of the optimum connected
topology. Therefore, we can use the cardinality of the
MDRF as a bound on the optimum cardinality. It is shown
elsewhere [4] that use of our algorithm to connect the
disconnected components of the MDRF reduces the cardinality
by at most $\lfloor n/(r+1) \rfloor - 1$. Therefore, our algorithm
always generates a connected topology that is within
$\lfloor n/(r+1) \rfloor - 1$ of optimal. Furthermore, for any given n and
r, it is possible to construct a graph with an MDRF such that
the connection of the MDRF by any connection algorithm
results in a cardinality loss equal to the bound. That is, the
performance of our connection algorithm equals that of an
optimal algorithm in the worst case.
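
The two bounds are easy to evaluate for a given system size. A minimal sketch in Python (hypothetical function names), reading the delimiters in the text as floor brackets:

```python
def best_case_cardinality(n, r):
    """Largest number of edges in a graph on n nodes with all degrees <= r."""
    return (n * r) // 2

def worst_case_loss(n, r):
    """Upper bound on the cardinality lost when connecting the MDRF."""
    return n // (r + 1) - 1

n, r = 100, 4
print(best_case_cardinality(n, r), worst_case_loss(n, r))
```

For $n = 100$ processors with $r = 4$ links each, the best case is 200 direct links and the connection step gives up at most 19 of them.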


We use random graphs as processor graphs to 
analyze the average case. Random graphs are used since 
we want to determine the performance of our algorithm 
over a wide range of graphs. We define a random graph as 
a graph where the probability of an edge between two 
nodes is fixed. We say that almost every random graph has 
a given property if the probability of a random graph hav- 
ing that property approaches 1 as the number of nodes in 
the graph approaches infinity [6]. To apply well-known 
results from random graph theory in analyzing our algo- 
rithm for the average case, we assume that the topology 
synthesized by our algorithm can be modeled as a random 
r-regular graph. Since we are investigating the 


performance of our algorithm assuming a random graph 
input and since our algorithm generates a r-regular graph as 
output by deleting some of the edges of the input graph, 
this is a reasonable assumption to make. 


To determine the average case performance, we 
need to determine the expected cardinality of a MDRF and 
also the expected reduction in cardinality due to the con- 
nection process. The cardinality of a MDRF has been 
characterized by Shamir and Upfal [7] who state that if
almost every random graph G, with nr even, has a minimal
degree of r, then almost every graph G has a MDRF with a
cardinality of nr/2. Thus, for almost every random graph
used as input with a minimal degree $\ge r$, a MDRF will be
found with nr/2 edges.

Our algorithm diverges from optimality during the
connection process. In particular, the algorithm does not
connect the disconnected components so that the cardinality
is reduced by a minimal amount. However, as we mentioned
above, this reduction is $\le \lfloor n/(r+1) \rfloor - 1$. On the
average, one would expect that the reduction would be
smaller than $\lfloor n/(r+1) \rfloor - 1$. Indeed, Wormald [8] has
shown that if $r \ge 3$ then almost every random r-regular
graph is r-connected. Since almost every graph generated
by the matching process is an r-regular graph, almost every
graph generated by the graph factoring step of our algorithm
will be r-connected and hence 1-connected. Therefore,
almost every graph with minimal degree of $r \ge 3$ will
have a cardinality of $\lfloor nr/2 \rfloor$ when mapped onto the
topology generated by our algorithm.


To verify the average case prediction, we generated 
a range of random graphs with a varying number of nodes 
and edges. For each of the graphs we found a correspond- 
ing MDRF and then used our algorithm to connect the 
MDRF. It was observed that for almost all of the graphs
the corresponding DCCGs had a cardinality of $\lfloor nr/2 \rfloor$.
The only situation where the average case prediction failed 
occurred when the task graphs approached a completely 
connected graph. In this situation the matching algorithm 
that we were using to generate the MDRFs was producing 
MDRFs with relatively large numbers of disconnected 
components. This occurred since the algorithm used examines
the edges adjacent to a node in sequential order in an
effort to determine which ones to keep in the degree
constrained subgraph (DCS). For
example, let r = 3. The algorithm first examines node 1. It
sees that there are edges (1,2),(1,3),(1,4) and adds them to
the DCS. Similarly, it sees
that there are edges (2,3), (2,4) and (3,4). Since it examines 
edges sequentially, it adds these edges to the DCS and 
thereby forms a component of nodes 1,2,3 and 4 to which 
no further edges can connect. This does not happen until 
the graph is almost completely connected since the proba- 
bility of having all of the needed edges to form a com- 
ponent with consecutive nodes is low until the graph 
becomes almost completely connected. To avoid this prob- 
lem the algorithm needs to be modified so that it examines 
the edges of each node in random as opposed to sequential 
order. 
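
The effect described above can be reproduced with a simplified stand-in for the matching step. The sketch below (Python) is not the matching algorithm of [4] or [5]; it is a plain greedy pass, assumed here only for illustration, that keeps an edge whenever both endpoints still have spare degree. The names `greedy_dcs` and `components`, and the `shuffle` and `seed` parameters, are hypothetical.

```python
from itertools import combinations
import random

def greedy_dcs(n, edges, r, shuffle=False, seed=0):
    """Keep an edge whenever both endpoints still have degree below r."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    rng = random.Random(seed)
    deg = {v: 0 for v in adj}
    kept = set()
    for u in range(1, n + 1):
        nbrs = sorted(adj[u])            # sequential examination order
        if shuffle:
            rng.shuffle(nbrs)            # random examination order
        for v in nbrs:
            e = frozenset((u, v))
            if e not in kept and deg[u] < r and deg[v] < r:
                kept.add(e)
                deg[u] += 1
                deg[v] += 1
    return kept

def components(n, kept):
    """Number of connected components, using a small union-find."""
    parent = list(range(n + 1))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for e in kept:
        u, v = tuple(e)
        parent[find(u)] = find(v)
    return len({find(v) for v in range(1, n + 1)})

n, r = 8, 3
complete = list(combinations(range(1, n + 1), 2))
print("sequential:", components(n, greedy_dcs(n, complete, r)))
print("randomized:", components(n, greedy_dcs(n, complete, r, shuffle=True)))
```

With sequential order on a complete graph of 8 nodes and r = 3, nodes 1 to 4 and nodes 5 to 8 each close into a saturated component, as in the example above. The randomized run depends on the seed and is shown only for comparison.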

5. Summary


We have presented an optical reconfiguration network
that requires only O(nr) components. For this network
to achieve better performance than that available from
conventional static networks, the topology chosen must 
match that of the algorithm to be executed. We define what 
it means to match and mention an algorithm that takes as 
input an arbitrary task graph and generates as output a 
topology that closely matches the given input graph. We 
then analyze the best, worst and average case behavior of 
our algorithm. It is shown that on the average the algo- 
rithm almost always produces optimum topologies. 
Abstract

Throughput and latency in packet communication networks 
are determined to a large extent by throughput and latency in 
the first-in first-out (FIFO) queues used for packet buffering in 
these networks. We describe a design approach for self-timed 
FIFO queues with a novel organization which allows tradeoffs 
between area, throughput and latency in VLSI implementations. 
This flexibility is made possible by the use of asynchronous dis- 
tributed control circuits. These circuits are synthesized directly 
from a graph model called Signal Transition Graphs, and are 
completely hazard-free. A number of nMOS test chips were fab- 
ricated and they worked at a 4 Mbytes/sec throughput rate. 


1. Introduction 


Multistage interconnection networks (MIN) are used to support 
communication among processing and storage modules in
multiprocessor architectures [2]. To use a packet routing MIN,
modules communicate by sending packets to each other. A packet
consists of an address and data. The address is used to forward 
the packet along a path through the network from the sender 
to the destination module. Multistage interconnection networks 
are usually constructed out of N x N crossbar switches, where 
the number of ports N is determined by performance require- 
ments and packaging constraints. A switch forwards an input 
packet to an output port according to the destination address 
of the packet. Each switch may also provide buffering storage 
for packets whose forwarding paths are temporarily blocked by 
other network traffic. The buffering storage is usually managed 
by a first-in first-out discipline. 

In this setting, the design of FIFO queues is an important 
consideration in the design of practical packet routing networks. 
We present an approach for designing FIFO queues in VLSI 
technology which allows tradeoffs between area, latency and 
throughput. At one end of the design spectrum, an area-efficient 
implementation with high throughput, but long latency, can be 
obtained. In this organization, the register stages are connected 
serially to allow data to ripple through; there is no global com- 
munication. At the other end of the spectrum is a queue with 
minimal latency, but somewhat lower throughput rate due to 
increased delay in the control operations and in the loading of 
global buses. On a VLSI chip, these buses are sets of wires car- 
rying input and output data which are connected to the input 
and output of all register stages. An optimal design somewhere 
in the range of these extremes can be chosen depending on the 
application. A queue of N stages can be partitioned such that 
only M register stages load the buses; given that the permissible 
latency is L stage delays, then M = N/L. Thus, this distributed 
organization also reduces the amount of global communication; 
this is particularly important for large queues, where the load- 
ing on control and data buses approaches the level existing in a 
typical memory array. 
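
The relation M = N/L fixes how many register stages load the global buses once the permissible latency is chosen. A small sketch in Python (the function name is hypothetical, and it assumes L divides N evenly):

```python
def bus_loading_stages(n_stages, latency):
    """Number M of register stages on the buses for latency L (M = N / L)."""
    if n_stages % latency:
        raise ValueError("this sketch assumes L divides N")
    return n_stages // latency

N = 64
for L in (1, 2, 4, 8, 16, 32, 64):
    print(f"latency L = {L:2d} stage delays -> M = {bus_loading_stages(N, L):2d}")
```

The two extremes of the design spectrum appear at the ends of the table: L = N gives a single serial queue with one stage on each bus, and L = 1 puts every stage on the buses.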

Our FIFO queue design makes use of distributed control 
structures and local communication. There are only a few types 
of modules in this design, with modules of each type replicated as 
necessary to construct complete FIFO queues. The distributed 
control structure allows the exploitation of concurrency. Con- 
current read/write supports a higher throughput rate. The 
FIFO queue is also completely data driven, hence no potential 
read/write conflict exists and there is no need for any arbiter. 

The distributed control organization of the FIFO lends itself 
naturally to a design using asynchronous, self-timed hardware 
circuits. Towards this end, a specification technique for
asynchronous control structures based on a graph model called Signal
Transition Graphs (STGs) has been proposed, and methods for
direct and efficient synthesis of self-timed hardware circuits from
such specifications have been developed. The STG model is
especially appropriate for control modules which exhibit a high
degree of asynchronous concurrency. A preliminary discussion
of the STG model and its expressiveness and implementability
is given in [5]. The use of STGs in the design and
implementation of a self-timed 2 x 2 packet router is reported in [3].

The paper is organized as follows. Section 2 describes the 
functional behavior and alternate organizations of the FIFO 
queue. Section 3 introduces the STG model and discusses its 
use in the specification of self-timed circuits. In Section 4, STG 
specifications of the building block control modules for realizing 
the various FIFO queue organizations and their implementation 
are presented. Finally, Section 5 presents the results and some 
further discussions on the STG model. 


2. Organization of the FIFO queue 


This section discusses the one- and two-dimensional
organizations for the FIFO queue (Figs. 1a and 1b). The R-module
is a control circuit designed to support pipelined operation of
the register stages. An R-module has an input link from a previous
stage, an output link to the next stage, and an output link
to a register module in the same stage to control data loading
into this register. A link is a pair of ready/acknowledge wires,
depicted as an arc with the arrow pointing in the direction of
the ready signal. When input data is available for loading into
a register, a ready signal is sent to its R-module controller on
the $I_r$ wire (Fig. 1a). The actual loading is performed when
the R-module controller sends a ready signal to the register on
the L wire. Once data have been loaded into a register, an
acknowledge signal is sent to its R-module on the D wire. The
R-module then returns an acknowledge signal on the $I_a$ wire of
the input link and forwards a ready signal on the $O_r$ wire of its
output link concurrently. The next data item will be loaded into
its register only after the R-module has received an acknowledge
signal on its $O_a$ wire and another ready signal on its $I_r$ wire.
Thus, the operation of the R-module is pipelined. Data from
one stage will be forwarded to the next unless the latter is full.
The throughput rate of this queue is determined by the delays of
the R-module and registers, whereas its latency is proportional
to the number of stages in the queue.

A two-dimensional, or ring, organization is shown in Fig. 1b.
This queue consists of M linear queues, each of L stages, and two
token rings for controlling input/output operation. The capacity
of the queue is M x L and the latency is proportional to L.
I-modules are connected together to form a token ring to control
the writing of data into the queue. The ring is initialized such
that only one I-module contains the token, marking the next
available empty register stage. Since the Write-request signal,
carried on wire $W_r$, is connected to all I-modules, the token
should not be passed on to the next module in the ring if the
Write-request signal is still active. This is an important timing
restriction. The Write-acknowledge on wire $W_a$ is the output of
an OR gate (shown as a heavy bar with a + sign) whose inputs
are acknowledge wires from all I-modules. Similarly, reading
from the FIFO is controlled by an Output token ring, formed
by connecting O-modules together. Data written into the linear
queues ripple to their output side, ready to be gated onto the
output bus. The Output ring is initialized such that only one
O-module contains the token. This module then controls the
timing and signaling for gating of data to the output bus. The
Read-request signal on wire $R_r$ is the output of an OR gate whose
inputs are request wires from all O-modules. Another timing
restriction exists for the Output token ring: since the
Read-acknowledge signal, carried on wire $R_a$, is broadcast to all
O-modules, the token should not be passed on to the next module
while Read-acknowledge is still active.


The ring buffer which we fabricated is one with minimal latency
($L = 1$), with each of the linear queues containing exactly
one stage. Registers in each stage have inputs connected to the
input data bus, and outputs connected to the output data bus.
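
The ring organization can be illustrated with a small behavioral model. This is only a software sketch under simplifying assumptions, not a description of the fabricated circuit: the class name `RingFifo` and its methods are hypothetical, and all handshake timing is abstracted away. It models M linear queues of L stages each, with a write token and a read token that each advance around their ring after one transfer.

```python
from collections import deque


class RingFifo:
    """Behavioral sketch of the ring-organized FIFO (not a circuit model)."""

    def __init__(self, m, l):
        self.lanes = [deque() for _ in range(m)]  # M linear queues
        self.depth = l                            # L stages per linear queue
        self.write_token = 0                      # I-module holding the token
        self.read_token = 0                       # O-module holding the token

    def write(self, item):
        """Load item into the lane marked by the input token; False if full."""
        lane = self.lanes[self.write_token]
        if len(lane) == self.depth:
            return False                          # the write request stays pending
        lane.append(item)
        self.write_token = (self.write_token + 1) % len(self.lanes)
        return True

    def read(self):
        """Gate the item at the output-token lane onto the bus; None if empty."""
        lane = self.lanes[self.read_token]
        if not lane:
            return None
        item = lane.popleft()
        self.read_token = (self.read_token + 1) % len(self.lanes)
        return item


fifo = RingFifo(m=4, l=2)                         # capacity M x L = 8
accepted = [fifo.write(k) for k in range(9)]      # the ninth write is refused
drained = [fifo.read() for _ in range(8)]         # items come out in order
```

Because both tokens visit the lanes in the same round-robin order, first-in first-out order is preserved, and an item passes through at most L stages regardless of M.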


3. Signal Transition Graphs 


In this section we introduce the STG model and its application
to the specification of pipeline controllers. A more complete
discussion of STGs can be found in [6]. In short, STGs are a form
of Petri nets restricted by a set of axioms, with their components
(such as transitions and places) assigned attributes related to
physical circuits. The result is a graph model which is much
more amenable to analysis due to the reduced complexity, and
still retains sufficient expressiveness for specifying most common
behaviors of control circuits, including concurrency, choices and
conflicts.

A hardware circuit consists of an interconnection of logic
elements, each having an output terminal and a number of input
terminals. Every input terminal is connected either to an input
terminal of the entire circuit, or to an output terminal of another
logic element in the circuit. The set of all terminals of a circuit
is called the set of signals, $M$. In order to describe the dynamics
of a circuit, a set of signal transitions $T = M \times \{+, -\}$ is used
to specify the rising and falling transitions of each signal in $M$.
For each $i \in M$, its associated transitions are denoted by $i+$ and
$i-$. It is often convenient to use the notations $t$ and $\bar{t}$ to denote
pairs of transitions, such that if $t = i+$ then $\bar{t} = i-$ and vice
versa.

For the purpose of this presentation, a STG is a directed
graph represented as a triple $(T, R, T_0)$, where $T$ is the set of
signal transitions (defined over a signal set $M$), $T_0$ the set of
transitions which are enabled in the initial state of the circuit, and
$R \subseteq T \times T$ an irreflexive, intransitive relation over the set of
transitions, called the causal relation. Graphically, $t_1 \,R\, t_2$ is shown
as an arc between two transitions: $t_1 \to t_2$. Let $R^* \subseteq T \times T$
denote the transitive closure of $R$; $t_1 \,R^*\, t_2$ means that there
exists a directed path from $t_1$ to $t_2$; this is shown graphically as
$t_1 \twoheadrightarrow t_2$. The semantics of STGs can be expressed in terms of
transition sequences and their compositions; this has been carried
out in [6] using trace theory [14]. Informally, $t_1 \,R\, t_2$ means
that the occurrence of transition $t_1$ causes that of transition $t_2$;
this implies that if the circuit is in some state $s$ in which
transition $t_1$ is enabled and eventually occurs, then the occurrence of
$t_1$ brings the circuit to another state, $s'$ say, in which $t_2$ is enabled
and hence will eventually occur. This last statement hints that
given an initial state of the circuit (in which the transitions in $T_0$
are enabled) and a STG expressing the causal relation between
its signal transitions, one can generate a state transition graph
from the STG. A circuit realization can then be obtained from
the state graph. Furthermore, a constraint $t_1 \,R\, t_2$ can be
implemented as a logic element with $t_1$ as one of the inputs and $t_2$ as
output. This important observation allows the decomposition
of a STG specification into smaller subgraphs, each of which
contains only transitions which are causally related. Thus, the STG
model allows a very efficient and direct implementation based on
this decomposition principle. This is a unique feature of STGs
compared to other approaches.

A multi-arc connects several tail transitions to one head tran- 
sition, or one tail transition to several head transitions. We call 
these arc configurations And Forks and Joins (there are also 
Or constructs for specifying choices or conflicts in the complete 
STG model), and their diagrammatic notations are shown in 
Figure 2a. An And-fork is used to describe a situation in which 
the occurrence of a tail transition causes the occurrences of all 
of the head transitions. An And-join describes a situation in 
which all tail transitions in the relation have to occur to cause 
the occurrence of the head transition. These And constructs are 
used to describe concurrent operations in circuits. 


Liveness and Persistency. A STG has a deadlock-free and 
hazard-free circuit realization only if it satisfies the properties 
of liveness and persistency. Since STGs are merely behavior 
specifications from which state graphs can be derived for im- 
plementation, these properties must ultimately be based on the 
latter type of graphs. However, there is a one-to-one
correspondence between them, so that liveness and persistency will appear
as syntactic constraints on STGs. We discuss briefly these prop- 
erties and their STG syntactic ramifications. 

The continual operation without deadlocking of control modules
is a property called liveness. A STG is live iff its underlying
state transition graph is strongly connected. The necessary
condition for a STG to be live consists of (i) the STG is a strongly
connected graph, and (ii) there is a simple cycle containing both
$t$ and $\bar{t}$ for every $t \in T$. Since live STGs are strongly connected,
concurrency and ordering have to be characterized differently:
two transitions can occur concurrently iff there is no simple
cycle containing both of them in the STG; equivalently, the
occurrences of two transitions are ordered iff there exists a simple
cycle containing both of them.
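
The necessary condition for liveness can be checked mechanically on small graphs. The sketch below is an illustration only and is not the analysis procedure of [6]: the function names are hypothetical, transitions are encoded as strings such as `r+` and `r-`, and the simple-cycle test uses brute-force path enumeration that is only practical for small STGs.

```python
def bar(t):
    """Complementary transition: 'x+' <-> 'x-'."""
    return t[:-1] + ("-" if t.endswith("+") else "+")


def successors(arcs):
    """Adjacency sets for a list of (tail, head) arcs."""
    succ = {}
    for t, u in arcs:
        succ.setdefault(t, set()).add(u)
        succ.setdefault(u, set())
    return succ


def reachable(succ, start):
    """All nodes reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in succ[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strongly_connected(succ):
    """Condition (i): every node reaches, and is reached from, a root."""
    if not succ:
        return False
    pred = {n: set() for n in succ}
    for t, heads in succ.items():
        for u in heads:
            pred[u].add(t)
    root = next(iter(succ))
    nodes = set(succ)
    return reachable(succ, root) == nodes and reachable(pred, root) == nodes


def simple_paths(succ, src, dst, banned):
    """Yield simple paths src -> dst whose interior avoids `banned`."""
    stack = [(src, (src,))]
    while stack:
        node, path = stack.pop()
        for nxt in succ[node]:
            if nxt == dst:
                yield path + (dst,)
            elif nxt not in path and nxt not in banned:
                stack.append((nxt, path + (nxt,)))


def on_common_simple_cycle(succ, x, y):
    """True if some simple cycle contains both x and y."""
    return any(
        next(simple_paths(succ, y, x, frozenset(p[1:-1])), None) is not None
        for p in simple_paths(succ, x, y, frozenset())
    )


def meets_liveness_condition(arcs):
    """Conditions (i) and (ii) of the necessary condition for liveness."""
    succ = successors(arcs)
    return strongly_connected(succ) and all(
        bar(t) in succ and on_common_simple_cycle(succ, t, bar(t))
        for t in succ
    )


# A single four-phase handshake cycle: r+ -> a+ -> r- -> a- -> r+
handshake = [("r+", "a+"), ("a+", "r-"), ("r-", "a-"), ("a-", "r+")]
print(meets_liveness_condition(handshake))        # True
print(meets_liveness_condition(handshake[:-1]))   # False: cycle is broken
```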

Due to the similarity to a special class of Petri nets called
marked graphs [7], it may appear that the form of STGs discussed
here is always persistent. However this is not the case, as
the underlying state graph may exhibit nonpersistency whenever
two transitions are enabled in the same state and the occurrence
of one removes the enabling condition of the other. A persistency
constraint is an ordering constraint between two transitions used
to eliminate nonpersistency, as illustrated in Fig. 2b. For a live
STG, the condition $t \,R^*\, \bar{t}$ always holds for every transition $t$.
If $t \,R\, u$ exists as shown and $u \,R^*\, \bar{t}$ (depicted as a heavy arc in
Fig. 2b) were not present, then transitions $\bar{t}$ and $u$ can occur
concurrently. Suppose the course of action $t \,R\, u$ is implemented
by a hardware element with $t$ as one of its inputs and $u$ as its
output. Concurrency between $\bar{t}$ and $u$ implies that while the
hardware element is reacting to $t$ to cause $u$, $\bar{t}$ may be occurring
simultaneously at the input of that hardware element. This is
commonly known as a race condition in hardware circuits and
can lead to malfunction. The approach to deal with this problem
is to impose a persistency constraint on STG specifications,
namely $u \,R^*\, \bar{t}$, to eliminate this nonpersistent behavior. Hence,
a STG specification is persistent if every transition $u$ caused by
a transition $t$ precedes $\bar{t}$, i.e. $\forall u \in T$, if $t \,R\, u$ then $u \,R^*\, \bar{t}$.


Specification of Pipelined Circuits. We can now develop
a STG specification for pipelined control operations such that
liveness and persistency are satisfied. Consider two cycles of
transitions as shown in Fig. 2c. The left cycle (with $a$ and
$\bar{a}$) represents the control sequence of the input portion of a
pipelined circuit; the right one (with $b$ and $\bar{b}$) represents the
control sequence of its output portion. The necessary condition
for liveness is satisfied by each cycle if every transition in a cycle
is paired with another in the same cycle. We want the two cycles
to operate in parallel as much as possible, with the left one
initiating control actions on the right one through arc $a \,R\, b$. In
order for the STG to be persistent, three additional arcs (in heavy
line) are required. Because of the existence of arc $a \,R\, b$, arc $b \,R\, \bar{a}$ is
required as a persistency constraint to prevent concurrent firing
of $b$ and $\bar{a}$. The introduction of $b \,R\, \bar{a}$ requires adding $\bar{a} \,R\, \bar{b}$ to
prevent concurrent firing of $\bar{a}$ and $\bar{b}$, and this in turn requires arc
$\bar{b} \,R\, a$. These four arcs allow the synchronization of two cycles of
transitions in pipelined fashion such that the resulting STG is
persistent. These constraints can be viewed from a different
perspective by "unfolding" the cycles (Fig. 2d) into partial orders,
in much the same fashion as occurrence nets [1]. In this type of
net, a node such as $a$ represents an instance of a transition $a$
in a cycle of operation. It can be seen that the persistency
constraints appear as $a \,R\, b \,R\, \bar{a} \,R\, \bar{b} \ldots$ and thus force these transitions
to occur in sequence. Otherwise, transitions belonging to other
branches of the cycles can occur concurrently.
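
The step-by-step argument above can be replayed in code. The following sketch is an illustration, not the paper's procedure: it encodes a transition and its complement as `a+` and `a-`, and it treats two transitions as ordered when some simple cycle contains both of them, which is the ordering criterion stated under Liveness. Reading the persistency constraint this way is an interpretation adopted here, and all function names are hypothetical.

```python
def bar(t):
    """Complementary transition: 'a+' <-> 'a-'."""
    return t[:-1] + ("-" if t.endswith("+") else "+")


def simple_paths(succ, src, dst, banned=frozenset()):
    """Yield simple paths src -> dst whose interior avoids `banned`."""
    stack = [(src, (src,))]
    while stack:
        node, path = stack.pop()
        for nxt in succ.get(node, ()):
            if nxt == dst:
                yield path + (dst,)
            elif nxt not in path and nxt not in banned:
                stack.append((nxt, path + (nxt,)))


def ordered(succ, x, y):
    """True if some simple cycle of the STG contains both x and y."""
    for path in simple_paths(succ, x, y):
        interior = frozenset(path[1:-1])
        if next(simple_paths(succ, y, x, interior), None) is not None:
            return True
    return False


def persistency_violations(arcs):
    """Arcs t -> u for which u is not ordered with respect to bar(t)."""
    succ = {}
    for t, u in arcs:
        succ.setdefault(t, []).append(u)
    return [(t, u) for t, u in arcs
            if u != bar(t) and not ordered(succ, u, bar(t))]


# The two cycles of the pipelined specification, then the arcs added one by one.
cycles = [("a+", "a-"), ("a-", "a+"), ("b+", "b-"), ("b-", "b+")]
step1 = cycles + [("a+", "b+")]   # left cycle initiates the right one
step2 = step1 + [("b+", "a-")]    # keeps b and the complement of a ordered
step3 = step2 + [("a-", "b-")]
step4 = step3 + [("b-", "a+")]

for arcs in (step1, step2, step3, step4):
    print(persistency_violations(arcs))
# [('a+', 'b+')]
# [('b+', 'a-')]
# [('a-', 'b-')]
# []
```

Each added arc repairs the violation reported at the previous step and exposes the next one, until the fourth arc closes the chain.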


4. STG specification of control modules 


In this section we apply the STG model to specify the I-module
used in the FIFO organizations in Section 2. We will illustrate
how to specify a STG such that it meets the liveness and persistency
conditions set forth in Section 3. Once a STG satisfying
these conditions is obtained, it can be translated directly into
a circuit module using the synthesis steps discussed in [6]. We
will give the hardware implementation obtained through this
synthesis procedure for each STG specification, and discuss the
procedure itself informally. The reader is referred to a complete
version of the paper [4] for the discussion of O- and R-modules.

For all the control modules we will specify, event occurrences
are signalled over control links, using the reset signaling
handshake protocol [12]. Usually, an occurrence of an event is
signalled by a positive transition on the ready wire of the control
link; its acknowledgment is signalled by a positive transition on 
the acknowledge wire of the control link. The signals on these 
links are then reset through negative transitions before the oc- 
currence of the next event can be signalled. 

While liveness and persistency are considered to be funda- 
mental properties of STG, there are other properties more re- 
lated to the implementation of control circuits according to a 
certain design methodology. Two such constraints pertinent to
the ensuing discussion, called R1 and R2, are described below.

R1: This constraint concerns the behavior in the initial state 
of a control circuit operating with the reset signaling protocol. 
Starting from the idle initial state, every control module used in 
the FIFO organizations alternates between an active phase con- 
sisting entirely of positive signal transitions, and a reset phase 
consisting entirely of negative signal transitions. In a circuit 
implementation, the signal state at each terminal is identified 
with a signal transition at that terminal: if the state of a signal
$u$ is 1 (0), it implies that $u+$ ($u-$) has occurred. If the initial
state of a circuit is all 0's, then negative transitions of the form
$u-$ must have occurred in each of these signals in the immediate
past. Thus, any positive transition in a STG, say $u+$, which is
preceded only by negative transitions of the form $t-$ will always
be activated in this initial state. When this is not desired, an
artificial constraint from some other positive transition $r+$ to $u+$
must be added. Hence, it is required that for every STG, the
subgraph induced by the set of positive transitions is connected.


R2: The second constraint results from the communication
discipline imposed on a control circuit. Control circuits operating
with the reset signaling protocol use pairs of ready/acknowledge
wires to communicate with the external world. A transition
on the acknowledge wire can only occur in response to a
transition on the ready wire and vice versa. For a pair of wires $\{I_r, I_a\}$
where $I_r$ is an input ready and $I_a$ an output acknowledge, this
communication interface to the external world is specified in a
STG by the pair of constraints $\{I_a- \,R\, I_r+,\ I_a+ \,R\, I_r-\}$.
Similarly, for a pair of wires $\{O_r, O_a\}$ where $O_r$ is an output ready and
$O_a$ an input acknowledge, the corresponding set of constraints is
$\{O_r+ \,R\, O_a+,\ O_r- \,R\, O_a-\}$. Thus, in a STG, every transition of
an input signal has exactly one transition which directly precedes
it, and this transition must be that of an output signal.
Transitions of input signals are underlined to distinguish them from
transitions of “non-input” ones.
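
As a small illustration of the reset signaling discipline and of constraint R2, the sketch below checks a transition trace on one link and lists the R2 arcs for a link. It assumes the usual four-phase order ready+, acknowledge+, ready-, acknowledge- starting from the idle state; the function names and the wire names `Ir`, `Ia`, `Or`, `Oa` (ready and acknowledge wires of an input and an output link) are hypothetical labels chosen for the example.

```python
def follows_reset_signaling(trace, ready="r", ack="a"):
    """Check that a link's trace cycles through ready+, ack+, ready-, ack-."""
    cycle = [ready + "+", ack + "+", ready + "-", ack + "-"]
    return all(t == cycle[i % 4] for i, t in enumerate(trace))


def r2_arcs(ready, ack, ready_is_input):
    """Arcs (tail, head) whose head is a transition of the link's input signal."""
    if ready_is_input:
        # Input link: each ready transition answers an acknowledge transition.
        return [(ack + "-", ready + "+"), (ack + "+", ready + "-")]
    # Output link: each acknowledge transition answers a ready transition.
    return [(ready + "+", ack + "+"), (ready + "-", ack + "-")]


print(follows_reset_signaling(["r+", "a+", "r-", "a-", "r+"]))  # True
print(follows_reset_signaling(["r+", "r-"]))                    # False
print(r2_arcs("Ir", "Ia", ready_is_input=True))
print(r2_arcs("Or", "Oa", ready_is_input=False))
```

In both cases every arc ends at a transition of the input signal and starts at a transition of the output signal on the same link, which is the property R2 requires.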


STG Specification of the I-module. An I-module controls
the loading of input data into a linear queue (Fig. 1b). When
its turn comes, a token is passed to the I-module from its
immediate predecessor in the token ring, through a control link
$P = \{P_r, P_a\}$. Upon receiving the token, the I-module responds
to the next input request on its $W = \{W_r, W_a\}$ control link by
sending a load request to the linear queue it controls through
its $I = \{I_r, I_a\}$ control link. After loading a new data item into
the linear queue, the I-module forwards the token it holds to the
next I-module in the token ring through its $N = \{N_r, N_a\}$ link.

The STG description of the I-module in Fig. 3 contains two main
cycles. The left one coordinates the reception of the ring token
with the reception of the next data item presented to the FIFO
queue. The right one manages the forwarding of the ring token to
a successor I-module. The *-arc ($W_r- \,R\, N_r+$) implements the
timing constraint discussed earlier, such that $N_r$ does not go
high (to pass a token to the next module) until after the input
Write-request $W_r$ has gone low. These two cycles and the *-arc
together provide the specification for proper event sequencing in
the I-module. Other arcs are to be added to satisfy persistency
and the other constraints discussed above. First, arc $D_1$ ensures that
all positive transitions form a connected subgraph according to
constraint R1. Since $D_1$ is a constraint from transition $I_a+$
to $N_r+$, the pairs of transitions $\{I_a+, I_a-\}$ and $\{N_r+, N_r-\}$
could be used for implementing the persistency constraints. This
set of arcs would include: $I_a+ \,R\, N_r+$, $N_r+ \,R\, I_a-$, $I_a- \,R\, N_r-$,
and $N_r- \,R\, I_a+$. However, since $I_a+$ and $I_a-$ are transitions of
an input signal, each can have no more than one incident arc
according to constraint R2. To enforce these constraints, we
change $N_r+ \,R\, I_a-$ to $N_r+ \,R\, I_r-$, and $N_r- \,R\, I_a+$ to $N_r- \,R\, I_r+$.
These final constraints are shown as the arcs labeled $D$ in Fig. 3.

Liveness and persistency are satisfied by this STG. The syn- 
thesis procedure produces a state graph, from which the re- 
alization in Fig. 3 is obtained. The logic equation for J, is 
I, = W,P,N, + I,(W, + P, + N,) and its implementation is a 
C-element with inputs W,, P, and N,. The logic equation for 
N, is N, = NiW,1I,+ N-(N+ J.) and its implementation is as 
shown in Fig. 3. The reader can readily verify that the circuit 
operates according to its STG specification. 
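
The first logic equation above has the standard form of a three-input Muller C-element: the output rises when all inputs are 1, falls when all inputs are 0, and otherwise holds its value. The sketch below illustrates that generic behavior only. The input names `x`, `y`, `z` are placeholders, and no claim is made here about which wires, or which polarities, the I-module circuit of Fig. 3 actually uses.

```python
def c_element(x, y, z, c):
    """Next output of a three-input C-element: c' = xyz + c(x + y + z)."""
    return int((x and y and z) or (c and (x or y or z)))


# Walk the element through one set/hold/reset cycle.
c = 0
history = []
for inputs in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)]:
    c = c_element(*inputs, c)
    history.append(c)
print(history)  # [0, 0, 1, 1, 1, 0]
```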


5. Result and conclusion 


A FIFO with 8 stages and a 9-bit wide data path was designed,
using a 4 micron nMOS technology. The chip size (including
pads) is 3.15 x 2.25 mm². Six chips were received from MOSIS;
they were tested and five were fully operational at a throughput
rate of approximately 4 MBytes/sec. An nMOS circuit diagram
for the chip is shown in Fig. 4; the lower portion is the control
circuitry, with R-modules on top, the I-ring in the middle and
the O-ring at the bottom. The control circuits take a relatively
large amount of area in this chip. However, in a 2-dimensional
organization with L > 2, the overhead due to I-modules and
O-modules can be reduced significantly.

In this paper, Signal Transition Graphs have been used as 
a specification tool for asynchronous control modules. A STG 
specification can be viewed as an interpreted Petri net in which 
each transition is identified with a signal transition in a hardware 
circuit. In the synthesis approach proposed, a state transition 
graph is generated from a STG and then used to derive logic 
equations and hardware structures for the signals. A STG spec- 
ification can thus also be viewed as a concise yet more abstract 
notation for specifying a class of state transition graphs. 

In our specification and design examples, it has been shown 
how introducing additional constraints in a STG allows us to use 
level-sensitive hardware circuits instead of transition-sensitive 
hardware circuits in its implementation. These constraints are 
justified only informally. A more formal theory based on trace
theory and state transition graphs is developed in [6].

The module descriptions used in this paper require only con- 
structs for specifying sequencing and concurrency. There are 
other behaviors which exhibit conflict and data-dependent sig- 
nal flow that would require additional STG constructs for their 
specification. These latter constructs are called OR-constructs, 
and the reader is referred to [5] for an introduction to their for- 
mulation and applications. 

In [9] Martin described a design approach using constructs 
for non-deterministic programming to specify hardware mod- 
ules whose behaviors exhibit only sequencing and arbitration 
requirements. This approach uses a subset of Dijkstra’s guarded 
command language to specify each process; concurrent cooper- 
ating processes are described using notations similar to Hoare’s 
CSP [10]. Heuristic procedures are used to “compile” a hardware
implementation from a module specification into an
interconnection of standard hardware templates such as And, Or,
C-elements, etc. During the compilation, the technique of
reordering events in a sequence is used to improve implementation
efficiency. The complete STG model allows the specification
of concurrency, sequencing and conflict in module behavior,
and our implementation approach is aimed at automating
the derivation of hardware structures from STG specifications.
Recently there has been work on the classification and synthesis of
delay-insensitive circuits based on trace theory [13,14]. The
relation of STGs to trace theory is analogous to that of the Petri net
model to its underlying sequence semantics. Thus, we believe
that STGs can serve as a high-level, more abstract specification
than that of an approach directly based on trace theory.

There is also related work on the verification of asynchronous
hardware structures based on temporal logic [8]. Such techniques
can be used fruitfully for correctness validation of self-timed
circuits. The design of suitable translation techniques
from high-level languages to STGs, in the same vein as those of
Martin and Rem [11], is another area for further exploration.


ABSTRACT


The use of optical techniques has been widely recog- 
nized as a solution for overcoming the fundamental problems 
in data communications. It is potentially feasible to build 
large optical crossbar networks that do not suffer from many 
of the limitations of their electronic counterparts. In this pa- 
per we discuss several optical matrix-vector implementations of 
crossbar networks and present a comparative study of these 
designs. These designs employ acousto-, electro-, or magneto- 
optic spatial light modulators for achieving the generalized 
crossbar functions. Some new optical systems suitable for im- 
plementation of crossbar networks are also described. They 
can also be used to construct multistage networks of larger 
size. 


1. INTRODUCTION 


Any multiprocessor system that uses several processors 
must be designed to allow efficient communication among pro- 
cessors and memory units, as this data communication contri- 
butes significantly to the overall performance of the system. 
An ideal interconnection network for this purpose is a high 
speed, high bandwidth crossbar network. A crossbar network 
will allow any processor to communicate with other processors 
or memories with simple control and small delay. Even 
though crossbars are ideal networks for multiprocessors, they 
are generally used in only relatively small systems, as the cost
of large crossbars is prohibitive with VLSI
technology. There have been several crossbar designs reported
for up to several hundred inputs and outputs [BURG 83, 
DENN 82, BROO 84]. 


For large crossbar implementations other types of tech- 
nology may offer a viable alternative. The use of optical tech- 
niques has been widely recognized as a solution to overcoming 
the fundamental problems in data communications [SAWC 
84]. The parallel nature of optics and free-space propagation, 
together with their relative freedom from interference make 
them ideal for parallel communications. There exist many
different optical devices which provide switching capability. It
was shown in [SAWC 85] that any implementation of matrix- 
vector multiplier in optics results in an optical crossbar net- 
work. 


Suppose that N serial data input lines (each one bit
wide) are to be connected to a set of N serial data output
lines (each one bit wide). We can think of the data bits on
each line flowing synchronously, although, for some of the
techniques we describe, this synchronism is not necessary. We
can represent the data on the input lines at some instant of
time by the column vector b of length N. The data on the
output lines of the interconnection network (which we refer to
as a switch) is represented by the column vector a of length
N, and both these vectors have only binary elements. The
state of the switch is described by the N × N matrix A whose
elements are also 0's and 1's. Matrix A is a generalized
permutation matrix; an entry of 1 in row i and column j of A
means that input j is connected to output line i.
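
The switch model can be made concrete with a short sketch. It is an illustration only: the function name `crossbar` is hypothetical, and the product is evaluated with logical OR of ANDs, which is an assumption that coincides with the arithmetic product whenever each row of A holds at most one 1.

```python
def crossbar(A, b):
    """Binary matrix-vector product a = A b (OR of ANDs over each row)."""
    return [int(any(A[i][j] and b[j] for j in range(len(b))))
            for i in range(len(A))]


# A 1 in row i, column j connects input j to output i.
permutation = [
    [0, 1, 0],   # output 0 <- input 1
    [0, 0, 1],   # output 1 <- input 2
    [1, 0, 0],   # output 2 <- input 0
]
broadcast = [
    [1, 0, 0],   # output 0 <- input 0
    [1, 0, 0],   # output 1 <- input 0 (same input sent to two outputs)
    [0, 0, 1],   # output 2 <- input 2
]

print(crossbar(permutation, [1, 0, 1]))  # [0, 1, 1]
print(crossbar(broadcast, [1, 0, 0]))    # [1, 1, 0]
```

The second matrix shows why A is called a generalized permutation matrix: a column may contain several 1's, so one input can be broadcast to more than one output.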


Once the physical connection described by the matrix 
A is made, the data transfer could be synchronous or asyn- 
chronous [SAWC 85]. This idea can be generalized for the in- 
terconnection of N parallel input lines (each M bits wide) to 
N parallel output lines (each M bits wide). Thus, systems
capable of performing matrix-vector or matrix-matrix multiplications
with binary entries may be suitable for use as a crossbar
interconnection network. In this paper, several different optical
implementations of crossbar networks are presented and the
tradeoffs in their performance are discussed.


2. OPTICAL MATRIX-VECTOR SYSTEMS 


There are many techniques available for matrix-vector 
or matrix-matrix multiplication using optics. In some of these 
systems, the physical switch may be analog and passive (a 
switchable reflective or transmissive element); thus bit syn- 
chronism is not required and the data bandwidth is limited 
only by the optical sources and detectors used (currently avail- 
able sources and detectors can operate at > 1 Gb/s). In other 
systems, the matrix-vector multiplication is implemented by 
active electro-optical or acousto-optical components, implying 
the detection and regeneration of optical signals. In such sys- 
tems the switch itself may limit the data bandwidth. In the 
remainder of this section we will discuss some of the optical 
systems we have studied for crossbar implementations. 


2.1 N²-Parallel Matrix-Vector Inner Product Processor


In this system, N input lines drive an array of N light
emitting diodes (LEDs) or laser diodes with a binary signal, so
that a binary 1 is represented by light of a fixed intensity, and
a binary 0 is represented by a lower (or zero) intensity. An
optical system to the right of the input vector spreads the
light from each input source into a vertical column that
illuminates the crossbar mask [SAWC 85]. Following the
crossbar mask, the next set of optics collects the light
transmitted by each row of the mask, and sums the mask output
onto a vertical array of N photodetectors corresponding
to the N output lines. Thus the system performs a parallel
matrix-vector multiplication. Since it is a passive interconnection,
once the mask is set, the data can flow through synchronously
or asynchronously, can be analog or digital, and has a
bandwidth limited only by the sources and detectors.


2.2 Systolic and Engagement Architectures


Systolic and engagement architectures have the advantage
of providing for rapid reconfiguration of the network, i.e.,
they can be reconfigured at the bit rate of the data (neglecting
any overhead in calculating the needed states of the crossbar
matrix elements). In these systems, though, the input data
must be configured appropriately for the matrix multiplication
to proceed; this will require electronics at the input that can
buffer the signals and send them into the switch in the
appropriate sequence and at appropriate times. In some cases,
only minimal buffering is required, e.g. when the data is
originally time-division multiplexed bit by bit. More details on
systolic architectures can be found in [SAWC 85].


An engagement architecture for matrix multiplication
is shown in Fig. 2.1. The matrix elements of A enter into a
2-D source array or a multichannel AO cell, skewed and
formatted as shown. The data, or elements of B, enter a second,
crossed multichannel AO cell from the top and propagate
downwards. A 2-D stationary integrating detector array then
accumulates the results and outputs them in parallel. Systolic
architectures are similar to engagement architectures but use
shift-and-add detectors instead of time-integrating detectors,
and data enters in a different way.


Fig. 2.1. Engagement architecture for matrix multiplication.


Compared to the inner product architecture above, systolic
architectures yield crossbars that are more light efficient,
have shorter reconfiguration times, but have lower bandwidth.
For an N-parallel system, with sufficiently fast input and output
devices, the AO cell will generally limit the bandwidth of
each individual line to ν₀/2N, where ν₀ ≈ 1 Gb/s. A number
of these 1-D elements in the N-parallel multiplier can be
placed in parallel, however, permitting each line to be M bits
wide, thereby increasing the effective bandwidth by a factor of
M (M ≈ 100 with current devices). If M = N, then the
system is an N²-parallel systolic matrix-vector multiplier.
Because a new matrix is essentially read in for each new bit or
word, the system can be reconfigured rapidly.


An N-parallel engagement matrix multiplier requires
N(3N-2) clock cycles to complete one matrix-matrix operation
or one arbitrary switching operation on N N-bit wide
lines, approximately the same as the (N-parallel) systolic case.
An N²-parallel engagement arrangement, such as the RUBIC
(Rapid Unbiased Bipolar Incoherent Calculator) cube [BOCK
84], takes (3N-2) clock cycles. The input data formatting
requirements are similar to the systolic case, but interleaved 0's
are not required. The N-parallel case uses time multiplexing
again. We can think of the width of each functional input line
as being equal to the word length so that an M-bit wide line
can transfer one word at an instant of time. In the N²-parallel
engagement system, there are N physical input lines,
and the switch operates on word slices: it is word serial and bit
parallel except for a unit time shift from one physical line to
the next. In other words, the first line has the first bit of each
word, sequentially in time, etc. In the systolic case above, the
physical lines are actually serial over diagonals of matrix B
rather than strictly bit-serial or word-serial. The detectors
here are 1-D or 2-D time integrating detector arrays. As with
systolic systems, reconfiguration is rapid but updating of the
state of the switches is needed even when the state does not
change from one multiply to the next.


2.3 N²-Parallel Outer Product Processor


An N²-parallel outer product processor such as that
shown in Fig. 2.2 performs C = AB by taking outer products
of column i of A and row i of B, and summing the results
over i. It can perform a matrix-matrix operation in N clock
cycles. The requirement of a 2-D detector array will limit the
bandwidth to much lower values; removing the data electronically
from the detector array slows the system down. The
data input is a row of B at a time, or bit parallel, word-serial.
The detector is a 2-D array of stationary, time-integrating
detectors. Since the A matrix is read in for each matrix
multiplication, reconfiguration is rapid, although the state of the
switches needs to be updated for each matrix multiply.


Fig. 2.2. N²-parallel outer product processor.
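
The outer product formulation of C = AB can be written out as a short numerical sketch. It is meant only to show the arithmetic, with one outer product accumulated per clock cycle as a time-integrating detector array would; the function name is hypothetical and nothing here models the optics.

```python
def outer_product_multiply(A, B):
    """Compute C = AB as a sum over i of (column i of A) x (row i of B)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]          # plays the role of the 2-D detector array
    for i in range(n):                       # one clock cycle per value of i
        for r in range(n):
            for c in range(n):
                C[r][c] += A[r][i] * B[i][c]  # accumulate the i-th outer product
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(outer_product_multiply(A, B))  # [[19, 22], [43, 50]]
```

With N = 2 the loop over `i` runs twice, matching the N clock cycles quoted for a matrix-matrix operation.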


2.4 N²-Parallel Inner Product AO Deflector


While the above systems may be useful in many applications,
they all share a common fault. Since the matrix A is
mostly 0's, most of the scalar multiplications performed by the
above systems are multiplies by 0. This translates into either
wasted time (thereby lowering the bandwidth) or wasted light
(lowered light efficiency). One of the new architectures we
have devised, the inner product AO deflector, eliminates these
problems. It is very similar to the system described
in Section 2.1 except for the deflector. It is also a passive
system and utilizes a simple 1-D detector array.


The input data to the crossbar enters through a 1-D 
array of sources. Each source is then transferred to the 
corresponding cell (or channel) of a multichannel AO device, 
used as a deflector. There is a separate channel for each
data-input line, or each column of A. Rather than reading the
elements of A into the AO device (N elements into each
channel), a single number is input into each channel, and
the signal is frequency-modulated according to the amount of
deflection desired for the corresponding input line. Broadcasting
can be achieved, to a limited extent, by superimposing
multiple frequencies onto the same channel of the AO cell, or
by time multiplexing the different frequencies in the cell. Each
channel of the AO cell in this system has just one signal on it; 
it is not divided into multiple (moving) resolution elements. 
Each channel then deflects the signal the appropriate amount 
in the y dimension, and the light is then focused down in the x 
dimension, yielding a one dimensional output array. 
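The contrast between multiplying by a mostly-zero matrix A and deflecting each input directly can be sketched in a few lines of Python. This is only a functional model of the routing, not of the optics; the function names and the 3-line example are hypothetical.

```python
from typing import List


def route_by_matrix(a: List[List[int]], x: List[int]) -> List[int]:
    """Inner-product crossbar: y[i] = sum_j a[i][j] * x[j].

    Performs N*N scalar multiplies, most of them by 0 for a crossbar setting.
    """
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def route_by_deflection(targets: List[List[int]], x: List[int], n_out: int) -> List[int]:
    """Deflector crossbar: input line j is steered to the outputs in targets[j].

    One deflection setting per input line (column of A); several targets on
    one line model the limited broadcast obtained with multiple frequencies.
    """
    y = [0] * n_out
    for j, outs in enumerate(targets):
        for i in outs:
            y[i] += x[j]
    return y


# Input 0 -> output 2, input 1 -> output 0, input 2 -> output 1.
a = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
targets = [[2], [0], [1]]
x = [1, 0, 1]

y_matrix = route_by_matrix(a, x)
y_deflect = route_by_deflection(targets, x, n_out=3)
print(y_matrix, y_deflect)
```

Both routines produce the same output vector, but the second touches only the nonzero connections, which is the saving the deflector architecture aims for.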


2.5 N²-Parallel Inner Product/Engagement Processor


Another new type of matrix-matrix processor architecture
that could be used as an optical crossbar is shown in Fig.
2.3. The system is a combination of an N²-parallel inner
product processor and an N²-parallel engagement architecture.
The b's to be multiplied in the system enter a parallel array of
AO cells as shown, although the data for each row enters
simultaneously instead of being staggered as in the N²-parallel
engagement processor. The input data also arrives in a
time-staggered form and controls the illumination of LED's or laser
diodes as shown. A set of optics similar to that in the
N²-parallel inner product processor spreads the light across the
multichannel Bragg cell, and the final result accumulates on a
time-integrating detector array. The results emerge in a
time-offset fashion from a row of the detector array as shown. This
system requires (3N-2) clock cycles to complete a matrix-matrix
multiplication and requires a digital synchronous data
format. Broadcast operation is possible with this system, and
the overall light efficiency can be as high as 1/N.


Fig. 2.3. N²-parallel inner product/engagement processor.


3. PERFORMANCE COMPARISON 


The optical systems introduced in the previous section 
differ widely among themselves in many characteristics which 
are of importance when used for interconnection. For exam- 
ple, some of the systems provide a bandwidth that is essential- 
ly independent of the size of the system, while for others, it is 
an inverse function of the system-size. In this section, we com- 
pare the different systems based on some parameters of in- 
terest in crossbar implementation. 


1. Number of Lines N: Optical systems can implement
moderately large crossbars (100×100 to 500×500) with
relative ease.

2. Bandwidth: For passive optical networks, the bandwidth of 
each line is limited by sources and detectors, which can easily 
operate at 1 Gb/s rates or higher. Sources (laser diodes) and 
detectors are available at 20 GHz and 40 ps, respectively. 1-D 
arrays are presently limited to lower values; current detector 
arrays with parallel outputs operate at approximately 50 
Mb/s. Higher bandwidths could be achieved by first coupling 
to fibers and routing the fibers to detectors; still higher 
bandwidths might be possible by avoiding conversion to elec-
tronics altogether. In contrast, electronic systems typically 
operate at 10 Mb/s for each line. 


3. Reconfiguration Time: Reconfiguration of optical networks
is limited to approximately 1 µs for moderate or large networks
with current or near-future technology; in fact, most 2-D
switch arrays have a frame time of 1 ms or greater.
Acousto-optic devices have faster response times, but the elec- 
tronic input (to each cell) is serial and limited to a few GHz; 
and the signal cannot be changed as it propagates down the 
cell. Most of the acousto-optic architectures permit fast 
reconfiguration at the expense of bandwidth. 


4. Broadcast Capability: In electronic implementations, the 
provision of broadcasting involves higher control complexity, a 
larger number of pins, and/or slower operation. In some opti- 
cal systems, the addition of broadcast capability involves only 
very minor increases in complexity. 


5. Data Format: The switching process as well as the transfer 
of data can be performed either synchronously or asynchro- 
nously. Synchronous operation requires strobing or capture of 
data at precise instants of time in the system. 


6. Type of detector: There are three major types of optical
detectors, namely, real-time, shift/add, and time-integrating
detectors. The inner-product processors need a vector array of
detectors with separate channels and a real-time output. The
data bandwidth of these systems is generally limited by the
response characteristics of the detector array. In some of these
systems (e.g., those with real-time detectors), the output may
be directly coupled to a light guide or fiber; in this case the
detectors, if any, are physically located some distance from the
optical switching array. The engagement and outer-product
systems with full N² parallelism require N×N time-integrating
detectors which can be operated synchronously
with the data. 2-D N×N detector arrays have either N output
lines or 1 output line. Thus the necessity of multiplexing
at least N detector signals onto each electronic output line
creates a bottleneck and limits the bandwidth of the overall
system. The systolic architectures require a shift/add detector
array to perform the summation. N×N shift/add arrays
necessarily have at most N output lines; this and the use of
electronics to perform the shift and add limit the detector
speed, in turn lowering the overall bandwidth of the system.


Table 3.1 summarizes the above considerations for the 
six basic optical matrix-vector crossbar architectures given in 
Section 2. 


4. CONCLUSIONS AND FUTURE WORK 


In this paper, we described several possible optical systems
for implementation of crossbar networks and studied the
tradeoffs involved. Advantages of these systems include a large
amount of inherent parallelism, high data bandwidth, small
size and power requirements, and freedom from mutual
interference of signals. We have found that moderately large
crossbars (64×64 to 512×512) may be feasible using current
or near-future optical technology. It is feasible to build optical
crossbars that have a higher bandwidth and more data lines
than electronic systems, although their reconfiguration speeds
are much slower. The main difficulties that need to be
surmounted are the slow reconfiguration, means for efficient
conversion of electronic signals to optical signals and vice
versa, and techniques for control.


Table 3.1. Comparison of optical crossbar capabilities.


An intriguing and powerful possibility is to use these 
crossbars as building blocks to make larger networks. In order 
to make use of this important possibility, means of cascading 
these optical crossbars need to be devised. This involves 
studying the input and output formats of the data in the 
different optical systems for compatibility. The light efficiency 
of the crossbar must also be factored in, as it determines how 
often detection and regeneration of the optical signals is need- 
ed. Methods of physically coupling the output of one system 
to the input of the next must be devised, and additionally the 
signals may need to be sent to multiple crossbars in different 
locations. Of course, one way of doing this would be to use 
electrical lines to connect up the optical crossbars; but it is 
preferable to keep as much of the overall system optical as 
possible, in order to maximize the advantages that the optics 
provides, such as high bandwidth and large number of lines. 
All-optical amplifiers have been demonstrated [KOBA 84], 
[DAGE 86a]. While there are practical problems to be solved 
to build 1-D arrays of optical amplifiers, it can, in principle, be 
done [DAGE 86b]. Optical cascading of crossbars can be done 
with fibers or holograms. It has been shown [JENK 84] that 
holograms can be quite powerful for interconnecting optical 
gates. Even the capability of connecting a small number (4 or 
5) of sizable crossbars together could yield a system that is 
substantially more powerful than a single crossbar. 


Abstract: Applicative systems are promising candidates for
achieving high performance computing through aggregation of
processors. This paper studies the fault recovery problems in
a class of applicative systems. The concept of functional
checkpointing is proposed as the nucleus of a distributed 
recovery mechanism. This entails incrementally building a 
resilient structure as the evaluation of an applicative program 
proceeds. A simple rollback algorithm is suggested to 
regenerate the corrupted structure by redoing the most effective 
functional checkpoints. Another algorithm, which attempts to 
recover intermediate results, is also presented. The parent of a 
faulty task reproduces a functional twin of the failed task. The 
regenerated task inherits all offspring of the faulty task so that 
partial results can be salvaged. 


Keywords: fault tolerance, error recovery, distributed 
systems, applicative systems, data flow architecture, 
functional language. 


1. Introduction 


An important feature of a multiprocessor system, including 
applicative multiprocessing systems, is the ability to sustain 
partial system failures. By an applicative system in this paper,
we mean a partitioned-memory system such as Rediflow [8, 9, 
18] which coherently executes applicative, or functional, 
programs. 


The evaluation of an applicative program generates an implicit 
call tree. The result of the root task is the answer of the 
program. Every task in the call tree represents a partial result 
which is used by its parent task to compute other partial 
results. Because the semantics of applicative languages has
no notion of destructive modification, a parent task is capable 
of regenerating all of its child tasks based upon the argument 
and function information. 


Many fault-tolerance techniques for general multiprocessor
systems have been proposed [1]. Some of these schemes can
be adapted to applicative systems. However, applicative
systems possess some interesting characteristics that merit
distinct fault recovery considerations.


In this paper, fault tolerance issues in a class of applicative 
systems are studied. We assume that any task can be
executed by any processor and that tasks are dynamically
assigned to execution processors at run time. A single
processor failure is also assumed. A processor is assumed to 
be either faulty or fault-free. A faulty processor must 
voluntarily declare itself faulty, or otherwise be identified as 
faulty by other processors. 


It is further assumed that if a processor fails, it will no longer 
transmit any valid messages. This assumption can be enforced 
by commanding a faulty node to keep silent and not to respond 
to any inquiry. Alternatively, a faulty node may answer an 
inquiry with an invalid message. Several techniques are 
available for a processor to determine node malfunctioning. 
Parity checking on the system bus or resident memory, illegal 
instruction trap, protection violation, or a subsystem 
breakdown may trigger the CPU to report a processor failure.
Duplication of processors within a node, called "passive node 
diagnosis" [12], is also a common technique for building self- 
checking nodes. 


It is assumed that a processor makes its best effort to 
communicate with a destination node. If the destination cannot 
be reached due to a network problem, the unreachable node is 
considered faulty. Problems with the interconnection network 
may be detected via coding or timeout mechanisms. 


Our approach exploits the determinacy property of applicative 
programs. A distributed checkpointing scheme, functional 
checkpointing, is proposed in the next section. As the 
evaluation of an applicative program proceeds, a distributed 
resilient evaluation structure is incrementally established across 
the network of processors. Any single processor breakdown 
is salvaged by the implicit redundant path of the robust 
structure. A simple rollback recovery algorithm, which 
basically discards all partial results, is discussed in section 3. 
In section 4, another recovery algorithm, splice recovery, is 
proposed to salvage as many intermediate results as possible. 
Tasks which are equivalent to those trapped inside the faulty 
processor are generated to replace the failed tasks. Partial 
results produced by the failed tasks are inherited by the 
recovery tasks. 


2. Functional Checkpoints 


Checkpointing is familiar in the fault-tolerant computing 
literature [1]. In a uniprocessor system, checkpointing is 
normally performed by storing machine state on nonvolatile 
devices periodically. Such a periodic checkpointing
technique has been extended to multiprocessor systems [3, 5, 
7, 15]. The basic idea is to virtually stop all computational 
operations while periodic global checkpointing takes place. 


Periodic global checkpointing may not serve the best interests 
of fault tolerant applicative systems. For example, nonvolatile 
storage for storing system states may not be necessary, if 
recovery of a faulty processor is accomplished outside the 
node. Checkpoint information may be stored on one or more 
peer processors. Furthermore, periodic global 
synchronization among a large number of processors is
potentially inefficient [2]. 


We propose a distributed checkpointing strategy for applicative 
systems. The approach attempts to exploit the determinacy 
property of applicative programs. 


By a functional checkpoint, we mean a recovery point for a 
function application in an applicative system. A partial state of 
the system is stored so that recovery of the function is 
possible. The partial system state used in a functional 
checkpoint is related to a single function only. Normally, a 
functional checkpoint does not have enough information to 
recover an entire node, not to mention recovering a system. 
The sole purpose of the partial state is just to back up a 
function application. 


The idea of functional checkpointing is to disseminate the 
responsibilities of recovering a faulty node to processors 
which have immediate relationship with the faulty node. 
Complete recovery is done by collective efforts from various 
associated processors to retrieve the corrupted tasks. 


2.1 Determinacy 


Determinacy, or referential transparency, is the characteristic 
of applicative programs which makes them attractive for 
distributed execution. A program is called determinate if an 
identical answer always results from any function invocation 
for given arguments. In other words, a functional program is 
free from side effects. 


Determinacy suggests that an appropriate time for a functional 
checkpoint is when a parent task spawns a child function. A 
task packet is formed for the new function and then waits for 
execution. The packet contains all necessary information,
either directly or indirectly accessible, to activate the child task.
Furthermore, determinacy ensures that different activations of
the same task packet will always yield the same result. Thus, 
even if a task is aborted during computation, a new invocation 
will not be contaminated by its predecessors. 


2.2 Checkpoint Properties 


Periodic checkpointing is a synchronous operation whereas 
functional checkpointing is asynchronous. Each processor 
holds the privilege and responsibility of checkpointing its 
offspring tasks. A processor may opt to arrange the 
checkpoints in a partial order such that more efficient recovery 
can be implemented (section 3). Checkpoint coordination 
between processors is not necessary. 


Functional checkpointing can be implemented implicitly. As 
a child task is spawned to a new node, the parent task may 
retain a copy of the task packet. This retained copy is all that 
the parent needs to regenerate the child task, should the node 
evaluating the child task fail. Therefore, functional 
checkpointing can be fully embedded in the evaluation 
process. 
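A minimal Python sketch of implicit functional checkpointing follows, assuming a task packet is simply a function together with its arguments. The class and method names are hypothetical and are not taken from Rediflow or any other system.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class TaskPacket:
    """Everything needed to activate a child task (function plus arguments)."""
    func: Callable[..., Any]
    args: Tuple[Any, ...]

    def activate(self) -> Any:
        # Determinacy: every activation of the same packet yields the same result.
        return self.func(*self.args)


class ParentTask:
    """Hypothetical parent that checkpoints a child by keeping its packet."""

    def __init__(self) -> None:
        self.checkpoints: Dict[str, TaskPacket] = {}

    def spawn(self, child_id: str, func: Callable[..., Any], *args: Any) -> TaskPacket:
        packet = TaskPacket(func, args)
        self.checkpoints[child_id] = packet  # the retained copy is the checkpoint
        return packet                        # this copy travels to another node

    def regenerate(self, child_id: str) -> TaskPacket:
        # Called when the node evaluating the child is reported faulty.
        return self.checkpoints[child_id]


parent = ParentTask()
sent = parent.spawn("B1", pow, 2, 10)
# Suppose the node holding `sent` fails before returning a result.
twin = parent.regenerate("B1")
result = twin.activate()
print(result)
```

No extra checkpointing step appears in the sketch: retaining the packet at spawn time is the checkpoint, which is the sense in which the mechanism is embedded in evaluation.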


3. Rollback Recovery


Using functional checkpointing as a framework, a simple 
rollback recovery mechanism can be devised. An applicative 
call tree is mapped onto a set of processors. Each processor 
may have an arbitrary number of tasks. When a processor 
fails, the call tree may be broken into pieces. However, the 
piece that contains the root task is always capable of 
regenerating all severed pieces. 


Suppose that an applicative program has been spawned into 
the call tree as shown in Figure 1. For ease of discussion, 
tasks Ai (i = 1, 2) are mapped onto processor A, tasks Bi are 
executed in processor B, etc. Suppose that processor B fails. 
Then tasks Bi are destroyed. The call tree is thus fragmented 
into three pieces: {A1,C1,C2,C3,D3}, {A2,D1,D2,C4}, and 
{D4,D5,A5}. 


Figure 1: A call tree mapped onto processors A, B, C, and D,
and corresponding distribution of checkpoints


Assume that the checkpoint of an application is kept on the
processor of its parent. Processor A contains the functional
checkpoint for B1, processor C contains checkpoints for B2,
B3 and B5, and processor D contains the checkpoint for B7. To
recover from the failure of B, the system needs to command 
processor A to respawn B1, and command processor C to 
regenerate B2 and B3. Task B2 will in turn generate new 
tasks which are equivalent to D4 and A2. Since an applicative 
program has no side effects, it does not require any undo 
operation, and hence there is no domino effect [13]. 


Note that task C4 holds the checkpointing data for B5.
Processor C may regenerate B5 when B fails. However, the 
recovery of B5 is not fruitful because antecedent task A2 
cannot report its result to B2. Reactivation of B5 only 
increases the system overhead. Therefore, an efficient way to 
salvage a group of genealogical dependents is to redo only the 
most ancient ancestor and ignore the rest. 


3.1 Level Stamps 


Genealogical dependencies among tasks can be monitored by a 
simple level numbering scheme. Assume that the root task
carries a null level number; a task at level one then bears a
unique one-digit identification. Tasks in subsequent levels are
stamped by appending one more digit to the number of their 
parents. The term "digit" is used here generically and is not 
limited to a specific radix representation. 


Since each task is associated with a unique level stamp, it is 
obvious that ancestor-descendant relationships can be 
observed by comparing stamps. Note that a level stamp is not 
a time stamp. Its uniqueness is guaranteed by the program 
structure. Stamping of tasks can be fully asynchronous. 
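The stamp comparison can be sketched in Python by representing a level stamp as a tuple of "digits", with the empty tuple for the root. The tuple representation and the function names are assumptions made for illustration.

```python
from typing import Tuple

Stamp = Tuple[int, ...]
ROOT: Stamp = ()  # the root task carries a null level number


def child_stamp(parent: Stamp, digit: int) -> Stamp:
    """Stamp a child by appending one more digit to its parent's stamp."""
    return parent + (digit,)


def is_ancestor(a: Stamp, b: Stamp) -> bool:
    """True if the task stamped `a` is a proper ancestor of the task stamped `b`."""
    return len(a) < len(b) and b[: len(a)] == a


t1 = child_stamp(ROOT, 1)    # level-one task
t12 = child_stamp(t1, 2)     # its second child
t2 = child_stamp(ROOT, 2)    # a sibling of t1

print(is_ancestor(t1, t12), is_ancestor(t2, t12), is_ancestor(ROOT, t12))
```

Because a stamp is derived only from the parent's stamp and a local digit, no coordination between processors is needed to assign it.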


3.2 Recovery Scheme 


Each processor maintains a table of linked lists. The Nth entry 
of the table contains all topmost checkpoints from the host 
processor to processor N. Referring to Figure 1, for example,
when processor C spawns task B2 to processor B, C 
compares the level stamp of B2 with all checkpoints in entry 
B. If B2 is a descendant of an existing functional checkpoint, 
C does nothing. Otherwise, processor C makes a checkpoint 
for B2 in entry B. 


When processor C identifies the failure of processor B, C 
simply reissues all the checkpointed tasks found in entry B of 
the table. By doing so, processor C fulfills its responsibility 
of recovering B. Other processors take similar actions to 
recover their descendant tasks being trapped in B. The 
complete recovery of a faulty processor is a collective effort 
from processors which have checkpointed applications on the 
failed processor. 
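The per-processor table can be sketched in Python as a dictionary keyed by destination processor. The level stamps assigned to B2, B3, and B5 below are invented for the example, the packets are stand-in strings, and the sketch assumes ancestors are always spawned before their descendants.

```python
from typing import Dict, List, Tuple

Stamp = Tuple[int, ...]


def is_ancestor_or_self(a: Stamp, b: Stamp) -> bool:
    """True if stamp `a` is a prefix of stamp `b`."""
    return b[: len(a)] == a


class CheckpointTable:
    """Hypothetical table of topmost checkpoints kept by one host processor."""

    def __init__(self) -> None:
        # destination processor -> {level stamp: retained task packet}
        self.entries: Dict[str, Dict[Stamp, str]] = {}

    def on_spawn(self, dest: str, stamp: Stamp, packet: str) -> bool:
        entry = self.entries.setdefault(dest, {})
        if any(is_ancestor_or_self(s, stamp) for s in entry):
            return False  # a descendant of an existing checkpoint: do nothing
        entry[stamp] = packet
        return True

    def on_failure(self, failed: str) -> List[str]:
        """Return the checkpointed tasks to reissue for a failed processor."""
        return list(self.entries.pop(failed, {}).values())


table_c = CheckpointTable()
table_c.on_spawn("B", (1, 1), "B2")
table_c.on_spawn("B", (1, 2), "B3")
kept_b5 = table_c.on_spawn("B", (1, 1, 2, 1), "B5")  # below B2, so skipped
reissued = table_c.on_failure("B")
print(kept_b5, reissued)
```

Skipping B5 at spawn time is what prevents the fruitless reactivation discussed earlier: only the most ancient ancestors are ever reissued.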


During task evaluations, a processor is required to abort a task 
if new arguments of the task cannot be obtained due to failures 
of other processors. A task is also aborted if the result of the 
task cannot be forwarded to the parent task. The aborted tasks 
and their descendants may be recollected during garbage 
collection operations. 


3.3 Dynamic Allocation and Recovery


The possibility of discarding intermediate results without 
extensive undo operations is a property of applicative 
programs. However, the ability to recover by simply 
reissuing checkpointed tasks depends on the availability of a
dynamic allocation strategy, such as the gradient model 
approach [10]. 


Recovering tasks in a static allocation environment requires 
manipulations of some linkage information. For example, 
tasks being allocated to a failed processor have to be 
reassigned to other processors. Descendants of the reassigned 
task have to modify their return addresses accordingly. 
Furthermore, the balanced state derived from the static 
allocation method may not be maintained easily after a 
processor fails. 
Dynamic allocation does not distinguish between tasks
generated for recovery and original tasks. All tasks are treated 
equally during load-balancing activities. The parent-child 
linking information is dynamically produced. Hence, there is 
no need to update these linkages when the task is reassigned. 


3.4 Orphan Tasks 


Rollback recovery inevitably leaves a few orphan tasks after 
some recovery has taken place, e.g., task D4 in Figure 1 
becomes an orphan when processor B fails. The problem is 
that a task might not know whether it is an orphan without 
expenditure of a considerable amount of system resources. 


Returns from orphan tasks are theoretically harmless since 
they are forwarded to a faulty processor and no side-effect can 
be induced. However, the partial results produced by orphan 
tasks are in fact correct answers of their associated functions. 
Failure of a node does not contaminate these incomplete 
answers; it just breaks the linkage among them. These partial 
results are usable if the regenerated parent task knows where 
to retrieve them, or if the orphan tasks know the new address 
to which to forward their answers. The desire to salvage 
partial results motivates the design of the following recovery 
scheme. 

4. Splice Recovery 

Applicative systems facilitate evaluation of a functional 
program by dynamically unfolding the underlying structure of 
the algorithm and disseminating parallel tasks to many 
processing nodes. At any instant, task distribution in a system 
represents a snapshot of the program structure. Generation of 
a task creates a new substructure and establishes a linkage 
between the parent and children. Return packets from a child 
task normally eliminate the children that are no longer needed. 


The simple rollback scheme cuts off the branch or branches 
originating from a faulty node and regrows new branches. 
The method basically abandons all intermediate results 
computed by the orphan tasks. This section suggests a 
different approach, splice recovery, which attempts to retrieve
all possible intermediate results. 


4.1 Resilient Evaluation Structure 


The splice approach is to continuously establish a resilient 
evaluation structure during program computations. A resilient 
structure is one containing redundant information which 
allows a system to rebuild the original structure after a failure 
has been identified. By rebuilding the structure, the system 
may salvage many partial results. 


We have seen that when a processor fails, an applicative call 
tree may break into several pieces. The idea of splice recovery 
is to provide necessary bridging information such that broken 
pieces can be put together again. When a parent discovers the 
failure of a child task, the parent task generates a twin task of 
the faulty child. This twin task inherits all offspring of the 
faulty task with the help of the grandparent pointer. 


A grandparent pointer of a task is a pointer from the task to its 
ancestor in the grandparent processor. For example, the 
grandparent pointer of task B3 in Figure 1 points to task A1,
and task D4 has a grandparent pointer to C1 (Figure 2). 


Figure 2: Grandparent pointers 


Assuming as before that processor B fails, processor C may 
start recouping the loss of B2 as soon as C realizes that node B 
is dead. A twin task of B2, say B2’, is created by the parent 
C1 to inherit tasks D4 and A2 (Figure 3). A full emulation of 
task B2 would require task B2' to possess physical binding 
information between B2 and D4, and between B2 and A2. 
Unfortunately, this information must be embedded not only 
inside the faulty node, but also within every descendant 
processor. Changing the return addresses of every descendant 
task at various sites could be very tedious. 


Figure 3: Task B2 is inherited by task B2' 


Instead of fully emulating a faulty task, we opt to make B2' 
inherit descendant tasks of B2. Suppose that when D4 tries to 
return the evaluated answer to parent B2, it detects that node B 
is dead. The algorithm commands D4 to forward the result to 
grandparent C1. Processor C receives these unexpected partial 
answers from grandchildren and asserts that the parent of these 
grandchildren is faulty. Then, processor C forms the recovery 
task B2' by duplicating the task packet of B2. 
If processor C has already reproduced B2' when the return
from D4 arrives, processor C simply forwards what it has received
to step-child B2'. The role of a grandparent node in this 
recovery scheme is two-fold: it reproduces the dead task and it 
transports the orphan results to their step-parent when these 
returns become available. Having the grandparent relay partial 
results eliminates the problem of updating return addresses in 
every orphan task. 
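The two-fold role of the grandparent can be sketched in Python as follows. The class is a hypothetical bookkeeping model only: packets are stand-in strings, and the twin is represented by a dictionary holding the duplicated packet and the results it has inherited.

```python
from typing import Any, Dict


class GrandparentNode:
    """Hypothetical grandparent-side bookkeeping for splice recovery."""

    def __init__(self, retained_packets: Dict[str, str]) -> None:
        self.retained = retained_packets  # functional checkpoints of children
        self.step_parents: Dict[str, Dict[str, Any]] = {}

    def ensure_twin(self, dead_child: str) -> Dict[str, Any]:
        # Role 1: reproduce the dead task, once, from the retained packet.
        if dead_child not in self.step_parents:
            self.step_parents[dead_child] = {
                "packet": self.retained[dead_child],
                "inherited": {},
            }
        return self.step_parents[dead_child]

    def on_orphan_result(self, dead_parent: str, orphan: str, value: Any) -> None:
        # Role 2: relay an orphan's result to its step-parent.
        twin = self.ensure_twin(dead_parent)
        twin["inherited"][orphan] = value


c1 = GrandparentNode({"B2": "packet(B2)"})
c1.on_orphan_result("B2", "D4", 7)    # first return triggers creation of B2'
c1.on_orphan_result("B2", "A2", 11)   # later returns are simply forwarded
print(c1.step_parents)
```

The orphans never learn the address of B2'; they only need the grandparent pointer they already carry.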


A recovered task, like any other, starts executing its function 
code as soon as it is committed to a physical processor. When 
it encounters a function call, it forms a task packet and spawns 
the child. However, offspring of a recovered task may or may 
not have been demanded by the preceding faulty task. Let P 
represent a faulty task, and C a child task of P. Let P’ be the 
recovery task for P. C' is generated by P' and is the 
equivalent of C, as suggested in Figure 4. 


Figure 4: Tasks in splice recovery model 


The relationship between child task C and its clone C’ has the 
following possibilities: (Figure 5) 


(1) C has never been invoked; 

(2) C will never complete; 

(3) C completes before P dies; 

(4) C completes after P dies, but before P' is invoked; 

(5) C completes after P’ is invoked, but before C’ is invoked; 
(6) C completes after C’ is invoked; 

(7) C completes after C' has completed; 

(8) C completes after P’ has completed. 


Figure 5: All possible orderings with respect to completion of C


Case 1 or 2: C has never been invoked or C will never 
complete. In either case, no result of C is produced. Task C 
is practically nonexistent and will be garbage collected. Only 
C' may produce an answer. 


Case 3: C completes before P dies. Task C may have already 
finished the computation and returned the answer back to 
parent P before P breaks down. The result of task C is stored 
inside the parent P. When P fails, the system loses all partial 
results which have been saved in P. The recovery task P' 
must recalculate C by activating task C’. 


Cases 4 and 5: an old result comes before the new
invocation. Task C finishes computation after the parent P 
dies. C sends its result to the grandparent task which transfers 
the result to the step-parent P’. 


The difference between cases 4 and 5 is that in case 4, the
grandparent has to reproduce P' first. When task P' reaches the
call for child C', P' will not spawn C' because the answer
is already there. There is still only one result of C in the system.


Case 6: C completes after C' is invoked. Suppose that P' 
has already spawned C' when the result from C arrives at P'.
Theoretically, the result from C and the would-be answer from 
C’ are identical. Therefore, parent P' takes the answer from C 
and proceeds with the execution. The addition of C' may 
produce a duplicate answer to P’. Since they are identical, the 
second copy is simply ignored. 


Case 7: C completes after C' has completed. This is the 
reciprocal situation of case 6. Note that due to the asynchronous
nature of task evaluations, late invocation of an identical task 
may yield a result faster than the earlier invocation. 


Case 8: old result arrives after everything is completed. The 
processor which contained P’ may no longer recognize the 
arrived answer. The result is discarded. 
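How the recovery task P' treats results in cases 4 through 8 can be sketched in Python. The class below is a hypothetical model in which P' keeps one slot per child; it is not the authors' implementation.

```python
from typing import Any, Dict


class RecoveryTask:
    """Hypothetical step-parent P' that accepts the first result per child."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}
        self.completed = False

    def needs_spawn(self, child: str) -> bool:
        # Cases 4 and 5: if the old result already arrived, do not spawn C'.
        return child not in self.results

    def deliver(self, child: str, value: Any) -> bool:
        """Accept a result from C or C'; return False if it is ignored."""
        if self.completed or child in self.results:
            return False  # cases 6 and 7 (duplicate) or case 8 (too late)
        self.results[child] = value
        return True


p_twin = RecoveryTask()
early = p_twin.deliver("C", 42)            # old result arrives first
spawn_needed = p_twin.needs_spawn("C")     # so C' need not be spawned
duplicate = p_twin.deliver("C", 42)        # identical second copy is ignored
p_twin.completed = True
late = p_twin.deliver("X", 1)              # arrives after P' has completed
print(early, spawn_needed, duplicate, late)
```

Ignoring the second copy is safe only because determinacy guarantees that C and C' produce identical answers.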


4.2 Protocol for Splice Recovery


The main idea of the splice recovery method is to build a
resilient structure along with a program evaluation. The
redundant information must be in place long before a recovery
is initiated. This section describes the usage of these
redundancies from the view of a single processor.


LOOP
  CASE received packet OF
    forward result:
      Interpret the level stamp.
      CASE level stamp OF
        child:      Place data at the location indicated by the
                    level stamp. If a task can be continued,
                    resume the task.
        grandchild: Create a step-parent for the grandchild
                    if there isn't one already.
                    Transfer the result to its step-parent.
        others:     Ignore the packet.
      ENDCASE

    fetch data:
      If the location has been evaluated, forward the data.
      Otherwise, DEMAND_IT.

    task packet:
      Execute the task.
      DO each instruction
        If an unevaluated function is encountered, DEMAND_IT.
        If cannot proceed, suspend the task.
      UNTIL completion.
      Send the result to the parent.
      If the parent is dead,
        notify the grandparent and
        send the result to the grandparent.

    error-detection:
      Find the topmost offspring of all branches and
      respawn all of these apply tasks.
      Establish transport mechanism for relaying
      partial results.
  ENDCASE
ENDLOOP.
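The "interpret the level stamp" step of the loop above can be sketched in Python. The sketch assumes that a returned result carries the sender's level stamp as a tuple and that the receiver decides the relationship purely by prefix and length; the paper does not spell out this test, so it is an illustrative assumption.

```python
from typing import Tuple

Stamp = Tuple[int, ...]


def classify_sender(receiver: Stamp, sender: Stamp) -> str:
    """Classify a returning task relative to the receiving task."""
    extra = len(sender) - len(receiver)
    if sender[: len(receiver)] != receiver or extra not in (1, 2):
        return "others"       # unrelated or too distant: packet is ignored
    return "child" if extra == 1 else "grandchild"


print(classify_sender((1,), (1, 2)))      # direct child
print(classify_sender((1,), (1, 2, 3)))   # grandchild whose parent has failed
print(classify_sender((1,), (2, 3)))      # not a descendant
```

A "grandchild" classification is what prompts the receiver to create a step-parent, since a grandchild only reports upward when its own parent is dead.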


The routine DEMAND_IT is the fundamental evaluating
process of an applicative program. We elaborate the algorithm
in the following form.


DEMAND_IT:
  Create a task packet.
  Level-stamp the task packet.
  Attach parent and grandparent identifications
    to the task.
  Queue the task packet to load balancing manager.
  Functional checkpoint the packet.
End DEMAND_IT.
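The steps of DEMAND_IT can be sketched in Python as below. The Node class, its queue, and the field names are hypothetical; the load balancing manager is modelled as a simple local queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, Dict, Optional, Tuple

Stamp = Tuple[int, ...]


@dataclass(frozen=True)
class TaskPacket:
    func: Callable[..., Any]
    args: Tuple[Any, ...]
    stamp: Stamp                     # level stamp of the new task
    parent_node: str                 # where the result normally returns
    grandparent_node: Optional[str]  # where it goes if the parent is dead


class Node:
    """Hypothetical processor executing the demanding (parent) task."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.load_balancer_queue: Deque[TaskPacket] = deque()
        self.checkpoints: Dict[Stamp, TaskPacket] = {}

    def demand_it(self, parent_stamp: Stamp, parent_of_parent_node: Optional[str],
                  digit: int, func: Callable[..., Any], *args: Any) -> TaskPacket:
        # Create, level-stamp, and tag the packet with parent and grandparent.
        packet = TaskPacket(func, args, parent_stamp + (digit,),
                            self.name, parent_of_parent_node)
        self.load_balancer_queue.append(packet)   # hand to load balancing
        self.checkpoints[packet.stamp] = packet   # functional checkpoint
        return packet


# Task C1 (stamp (1,), whose own parent lives on node A) demands a child.
node_c = Node("C")
packet = node_c.demand_it((1,), "A", 1, max, 3, 9)
print(packet.stamp, packet.parent_node, packet.grandparent_node)
```

The only field added beyond what ordinary evaluation needs is the grandparent identification, which matches the small overhead claimed for the protocol.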


As a rule of thumb, if a processor receives a packet and cannot
find a proper rule to handle it, the processor simply ignores the 
received message. Note that the overhead of splice recovery 
protocol is small. By using the level stamps as tags for a 
program structure, the apparent overhead for establishing a 
resilient structure is a physical identification of grandparent 
node which may be just an integer. 


4.3 Correctness of the Recovery Scheme 


The successful evaluation of an applicative program is signaled 
by the completion of the root task. A necessary condition for 
completing the root evaluation is to satisfactorily compute all 
immediate descendants of the root. This observation, when 
applied recursively, implies that all tasks must be evaluated 
correctly. 


In order to guarantee a task can be evaluated, the task has to be 
generated in the first place. This means that every task is 
flawlessly reproducible even if some processor may fail during 
the evaluation. Reproducibility of tasks is the main criterion 
for a resilient applicative system. 


4.3.1 Reproducibility 


If the failed processor contains the root of a task tree, the 
regeneration of the root does not come naturally with recovery 
schemes. The user must restart the program, or a 
preevaluation functional checkpoint needs to be implemented. 


One simple method to generate a preevaluation checkpoint is to 
create a super-root which acts as the parent processor of all 
user programs. When a user program is initiated, the super- 
root checkpoints the program so that a duplicate copy of the 
program can be found in the system should the root fail. 


With this modification, every task in an applicative program 
has a parent. A parent task is capable of generating and 
regenerating any immediate child task as long as the parent is 
informed by some error detecting mechanism. This satisfies 
the reproducibility requirement of a correct recovery algorithm.


4.3.2 Residue Effects 


Without loss of generality, evaluation of an applicative tree 
can be typified by scrutinizing the spawning process of a 
three-task sequence. Figure 6 shows the state transition 
diagram of spawning and reduction of task G. Task G 
spawns task P which subsequently spawns C. Note that states 
b and d are transient. The existence of transient states is a 
result of the dynamic load balancing method. 


[Figure 6: State transition diagram for evaluating G. The diagram runs
from "G unevaluated" through states (a) to (g) to "G evaluated"; the
first transition is "G spawns P".]


The pointers being produced and reduced among tasks of each 
state are depicted in Figure 7. The pointer from P to its 
grandparent and the pointers to P from its grandchildren are 
omitted for clarity. Assume that task P fails during the
evaluation of G. Residue effects may then affect any one of the
related tasks at any stage of the state transition. A residue-free
fault tolerant measure must assure that tasks G and C are not 
affected by the failure of P from state a through state g. 


The failure of P obviously has no effect in state a. In state b,
failure of the processor which absorbs P means that parent
task G will not receive a positive acknowledgment from task P.
As a result, processor G times out and reissues a new task P.
The system acts as if the first invocation of P did not take
place.


In state c, task G receives an acknowledgment from P and
establishes a parent-to-child pointer to P. The new pointer 
may provide additional fault detection capability, but the 
impact of the failure remains similar to that of state b. 


Residual effects, resulting from the failure of P, may happen at
state d and successive states. Parent task G is left in the same
situation as if it were in state b or c. However, there is a child
task C lingering around the system. Task C may be stranded
due to incomplete information from parent P, in which case C
commits suicide and the recovery from G is free from residual
effects. Alternatively, task C may complete the evaluation and try
to return the result: C sends the result to G after failing to
communicate with parent P. The case analysis in section 4.1 applies here.


State e has exactly the same recovery condition as state d, 
since the transient state d becomes state e as soon as task C 
finds an idle processor. The discussions on state d can be 
applied here and will not be repeated. State f is similar to state 
c as far as recovery is concerned. 


[Figure 7: Available pointers among tasks]


5. Discussion and Related Research 
5.1 Robust Storage Structures 


Using a resilient structure for fault-tolerant computing is not a 
new idea. Waldbaum [19], and Taylor, et al. [16, 17] 
proposed a robust storage structure to ensure data integrity in 
uniprocessor or shared memory machines. Item count, 
identifier field, and/or additional pointers are commonly added 
for error detection and recovery purposes. This paper extends 
the concept of resilient structure to a distributed applicative 
evaluation graph. 


Conceptually, an evaluation structure is quite different from a 
storage structure. A storage structure is an object manipulated 
by programs, while an evaluation structure is a program itself. 
Furthermore, most techniques developed for resilient storage 
structure seem to be impractical in distributed systems. For 
example, item count of a linear list is a convenient way for 
checking broken links in a shared memory machine. To 
maintain a correct item count and verify it regularly in a 
network of processors would require significant traversing 
overhead. 


5.2 Multiple Faults 


Both the rollback and splicing recoveries use functional 
checkpoints to tolerate hardware failures. Although single 
node failure is assumed throughout the discussion, it is 
obvious that rollback recovery is not limited to tolerating only
single node failure. The difference between multiple faults and
single fault in the rollback algorithm is the placement of the 
recovery border in the evaluation graph. 


Splicing recovery can handle some combinations of multiple 
faults gracefully. For example, multiple failures on different 
branches of a structure do not disturb the recovery algorithm at 
all. Separate recoveries take place at different parts of the 
program in parallel. However, if both the parent and 
grandparent processors of a task fail simultaneously, the 
orphan task would be stranded. It is noted that the resilient 
structure concept can be further extended to include pointers 
to the great grandparent and beyond to tolerate multiple failures 
on one branch of the graph. 
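
The extension to deeper ancestor pointers can be illustrated with a short Python sketch. It assumes, purely for illustration, that each task carries an ordered list of ancestor processor ids and that some failure detector supplies the set of live processors; neither data structure is specified in the paper.

```python
def first_live_ancestor(ancestors, alive):
    """Return the nearest ancestor that is still up, or None if the task is stranded.

    ancestors: processor ids ordered parent, grandparent, great grandparent, ...
    alive:     set of processor ids currently believed to be running
    """
    for node in ancestors:
        if node in alive:
            return node
    return None


# With parent and grandparent pointers only, losing both strands the orphan.
stranded = first_live_ancestor([5, 2], alive={9})
# One more pointer (great grandparent 1) lets the same orphan deliver its result.
rescued = first_live_ancestor([5, 2, 1], alive={1, 9})
```

Each additional pointer tolerates one more consecutive failure on a branch, at the cost of one more identifier per task packet.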


5.3 Hardware Redundancy 


In a hardware redundant fault tolerant system, several 
redundant machines execute an identical program on replicated 
data objects. An applicative system can emulate hardware 
redundancy by simply replicating the task packets. 
Eventually, a task is executed by several processors at random 
times. The results are sent back to the originating node 
asynchronously. The originating node compares these results 
and selects a majority consensus as the correct answer. 


A fundamental difference between applicative replicated task 
redundancy and pure hardware redundancy is that applicative 
systems execute redundant tasks asynchronously, while most 
hardware redundant systems employ synchronous operation. 
Asynchronous operations are subject to timing delays because 
a node has to wait for the return from slower processors. But 
a node does not have to wait for the slowest answer if it has 
received the identical results from the majority of replicated 
tasks. Replicating tasks provides a means of emulating 
hardware redundancy in applicative systems. The user may 
specify certain critical sections of a program for such a highly 
reliable operation. 
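
A minimal Python sketch of the majority selection described above follows. It assumes that replica results arrive one at a time through an iterator and that results can be compared for equality; the function name and interface are hypothetical.

```python
from collections import Counter


def majority_result(results, replicas):
    """Consume results as they arrive and stop as soon as one value
    has been returned by a strict majority of the replicas.
    Returns None if no value reaches a majority."""
    needed = replicas // 2 + 1
    counts = Counter()
    for value in results:
        counts[value] += 1
        if counts[value] >= needed:
            return value  # later (slower) answers are never waited for
    return None


# Three replicas: the third answer is not needed once two agree.
arrivals = iter([42, 42, 7])
answer = majority_result(arrivals, replicas=3)
```

Because the function returns on the second matching answer, the slowest replica in the example is never consumed, which mirrors the observation that a node need not wait for the slowest answer.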


5.4 Related Research 


Fault tolerant problems in data-driven systems have been 
studied [6, 7, 11, 14]. Misunas proposed a triple modular 
redundancy implementation of a dataflow machine [4, 11]. 
Three complete copies of the program are stored in the 
memory. Copies of each instruction are carefully distributed 
so that each copy is executed by a different processor and 
utilizes different communication paths. Thus, the failure of 
any single block affects at most one copy of the program. 


Hughes [7] described a variation of periodic checkpointing, 
where a host processor periodically stored the whole system 
state. Also discussed was a recovery technique, node-by-node 
correction, which used a control unit of the system as a 
monitoring device. Erroneous packets were recomputed and 
re-sent. 


Srini [14] suggested a node reassignment algorithm for error 
recovery purposes. The algorithm depends on a global system 
memory for collecting and communicating recovery messages. 
The checkpointed node state is stored in the global memory. 


Grit [6] proposed a structural recovery method where each 
node in the system is limited to spawning child tasks to its 
immediate neighbors. At system initialization time, a node 
receives a list of recovery sites for each of its immediate 
neighbors. When a node fails, a neighbor notifies the 
recovery site. The recovery node polls all possible parent and
child nodes of the failed processor and tries to reconstruct the
lost task.


6. Summary


This paper discusses the reliability aspect of applicative 
multiprocessor systems and suggests means for fail-soft 
treatment. The concept of functional checkpointing is 
proposed. Unlike conventional checkpoint schemes, 
functional checkpointing is concise, distributed and 
asynchronous. Two fault recovery techniques based on the 
notion of functional checkpointing are proposed. The thrust of 
these recovery models is to minimize the overhead while the 
system is in a normal, fault-free operation. 


The simple rollback recovery method attempts to reconstruct 
the faulty section of the program structure by redoing the 
functions from the most efficient parent task or tasks. In other 
words, the recovery starts from the most recent functional 
checkpoints. The scheme is simple and has very little 
overhead in a normal operation. But, if a fault happens at a 
later stage of the evaluation, the rollback recovery may be 
costly. 


The splice recovery scheme also uses the most recent 
functional checkpoints for error recovery as in the rollback 
method. In addition, the splicing scheme tries to salvage as 
much intermediate partial results as possible. The salvage is 
made possible by a backward grandparent linkage along with a 
program graph stamping mechanism. This approach enables 
the parent tasks of a faulty processor to regenerate the 
corrupted substructure of the program and splice the recovery 
results into the framework preceding the failure. 


ABSTRACT
Parallel machines in which the Processing 
Units (PUs) are connected with a nearest neigh- 
bor mesh connection structure are ideal for solv- 
ing partial differential equations. Here, a 
dynamical fault recovery scheme with such a 
structure is proposed. 


In the case of the detection of a faulty PU 
within the execution process, connections of 
PUs are changed; the model is re-discretized by 
using the calculus of finite differences; the exe- 
cution is continued from the faulty point. 


This scheme was implemented on PAX-32, 
a parallel computer consisting of 32 microcom- 
puters, and 2-dimensional Poisson equation was 
solved. As a result of simulation, the effective- 
ness of this scheme is demonstrated. 


1. INTRODUCTION 


Parallel computers with the nearest neighbor mesh 
(NNM) connection are ideal for solving partial differen- 
tial equations (PDEs) by using iterative methods. For 
instance, PAX[1,2] and NASA's Finite Element
Machine[3,4] have been proposed.


Through progress in VLSI technology, highly parallel 
computers with a great number (e.g., thousands or mil- 
lions) of microprocessors can be developed. In these sys- 
tems, the realization of fault tolerant computing is 
indispensable, and some inter-PU connection networks 
have been proposed. Most of these are a sort of multistage
network (e.g., OMEGA[5], DELTA[6]), or have
some redundancies (e.g., Chordal Ring[7], Lens[S5]). 


There are only a limited number of references on the 
dynamic fault recovery within the execution processes. 


When solving PDEs with an NNM structured
machine, the processes may be divided broadly into two 
categories as follows: 


1. Discretization Process: A PDE is converted with 
one of discretizing methods. 


2. Iteration Process: The converted equations are 
solved by using some iterative methods. 
These processes require a great number of computations, so
dynamic fault recovery during execution is of considerable importance.


In this paper, we discuss dynamic fault recovery only within
the latter process, which takes much more time than the
former. For the case of discretizing the 2-dimensional Poisson
equation by using the calculus of finite differences, the fault
recovery algorithm and simulation results with PAX-32 are
also presented.


2. Fault Recovery in NNM Structured Machine 


To perform fault recovery dynamically in an NNM structured
machine, one of the following actions may be taken:


1. Calculations for a faulty PU are executed by other
fault-free PUs (e.g., HOBONET[8]). If these
fault-free PUs are the surrounding PUs, the load on
the PUs may become unequal. On the other
hand, if one of the redundant PUs takes over the
execution for the faulty PU, irregular data
transfers may occur, or the fault recovery algorithms
may become more difficult. These schemes are
inefficient.


2. A faulty PU is logically disconnected, and the
processes belonging to this PU cease execution.
This scheme leaves holes in the results (they are
"worm-eaten"), but has the advantage of easy recovery.
In numerical methods, solutions are obtained only
on the mesh points. This means that the loss of data
on a few mesh points is tolerable if the resulting
accuracy loss is small.


[Fig. 2.1: Bypass Recovery (the faulty PU is disconnected)]

[Fig. 2.2: Links for Bypass Recovery]


We adopt the latter scheme. In our scheme, the damage
caused by faulty PUs is kept to a minimum, and execution
on the fault-free PUs continues. As shown in Fig. 2.1, when a
faulty PU is detected, the interconnections of the surrounding
PUs are changed and the faulty PU is disconnected. If
redundant links have been provided in the NNM in advance, as
shown in Fig. 2.2, this type of fault recovery can be executed
easily. We call this a Bypass Recovery.
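
The re-linking step of a bypass can be sketched in Python as follows. This is only an illustration under stated assumptions: PUs are addressed as (row, column) pairs on a torus, the 4 x 8 shape is chosen for the example, and `bypass_neighbor` is a hypothetical helper rather than a PAX-32 routine.

```python
def bypass_neighbor(pu, direction, faulty, rows, cols):
    """Next fault-free PU from `pu` along `direction` on a rows x cols torus.

    direction is a unit step such as (0, 1) for east or (1, 0) for south.
    Returns None if every other PU on that ring is faulty."""
    dr, dc = direction
    r, c = pu
    for _ in range(max(rows, cols)):
        r, c = (r + dr) % rows, (c + dc) % cols
        if (r, c) == pu:
            return None          # went all the way around the ring
        if (r, c) not in faulty:
            return (r, c)        # skipped over any faulty PUs in between
    return None


# PU (1, 3) has failed: the east neighbor of (1, 2) becomes (1, 4).
east = bypass_neighbor((1, 2), (0, 1), faulty={(1, 3)}, rows=4, cols=8)
```

With no faults the function returns the ordinary nearest neighbor, so the same lookup serves both normal operation and recovery.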


3. Discretization on irregular intervals 


We use the calculus of finite differences as the
discretizing method, because the bypass recovery can
then be applied efficiently.


Suppose that each PU is assigned to one node of a
PDE model. If the bypass recovery mentioned above is
performed, the model changes around the faulty PU
(Fig. 3.1). So in the process of the fault recovery, the
model should be re-discretized by using the finite
difference method on irregular intervals. In the model of
Fig. 3.1, the discrete equation in the x-direction is


$$\left.\frac{\partial^2 u}{\partial x^2}\right|_{i,j} = \frac{2u_{i-1,j} - 3u_{i,j} + u_{i+2,j}}{3h^2} + O(h)$$

where h is the interval between two neighboring nodes. The same
discretization can also be applied to the y-direction, to a model
with various physical constants, and, in short, to all differential
terms. The re-discretization changes the order of the error around
the faulty PU from $O(h^2)$ to $O(h)$. However, the accuracy of the
results obtained is only slightly affected quantitatively.
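
The stencil used around a disconnected node can be checked numerically. The sketch below uses the standard three-point formula for unequal spacings (a textbook result, not quoted from the paper) and specializes it to a left interval of h and a right interval of 2h, which is the situation when node i+1 is lost. The function names are chosen for this example.

```python
def second_derivative_irregular(u_left, u_mid, u_right, h_left, h_right):
    """Three-point approximation of u'' at the middle node for unequal spacings."""
    numerator = h_right * u_left - (h_left + h_right) * u_mid + h_left * u_right
    return 2.0 * numerator / (h_left * h_right * (h_left + h_right))


def bypass_stencil(u_im1, u_i, u_ip2, h):
    """Special case h_left = h, h_right = 2h: (2u[i-1] - 3u[i] + u[i+2]) / (3h^2)."""
    return (2.0 * u_im1 - 3.0 * u_i + u_ip2) / (3.0 * h * h)


h = 0.1
# u(x) = x^2 at x = 0.9, 1.0, 1.2: the stencil is exact for quadratics (u'' = 2).
quadratic = bypass_stencil(0.9 ** 2, 1.0, 1.2 ** 2, h)
# u(x) = x^3 at the same points: u'' = 6, and the error is h * u''' / 3 = 0.2.
cubic = bypass_stencil(0.9 ** 3, 1.0, 1.2 ** 3, h)
```

The cubic case shows the first-order error term directly: halving h would halve the error, whereas the regular three-point stencil would reduce it by a factor of four.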


4. Method of Assigning PU to Nodes of Model 


A PU assigning algorithm is needed when the
number of nodes of a model exceeds the number of PUs.
Generally, an algorithm such as that shown in Fig. 4.1(a) is
used. In this method, however, a group of contiguous nodes
belonging to a faulty PU is lost. On the other hand, in the
modular mapping shown in Fig. 4.1(b), the number of nodes
lost to one faulty PU is the same as in Fig. 4.1(a). This
method takes more communication time, since all the nodes
on a PU must communicate with all the nodes in the 4
adjacent PUs. The modular mapping, however, has the
advantage that the lost nodes are not contiguous but
scattered. Thus the damage that a faulty PU does to the
results of the calculation can be kept lower with this
method. For this reason, the modular mapping is adopted as
the PU assigning algorithm.
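
The difference between the two mappings can be made concrete with a small Python sketch. It assumes that Fig. 4.1(a) depicts a block mapping in which each PU owns one contiguous tile, that the grid dimensions are exact multiples of the PU array dimensions, and that the PU array is 8 x 4; these are assumptions of the example, as are the function names.

```python
def block_owner(x, y, nx, ny, px, py):
    """Block mapping: each PU owns one contiguous (nx // px) x (ny // py) tile."""
    return (x // (nx // px), y // (ny // py))


def modular_owner(x, y, px, py):
    """Modular mapping: node (x, y) is assigned to PU (x mod px, y mod py)."""
    return (x % px, y % py)


def lost_nodes(owner, nx, ny, faulty):
    """All mesh nodes whose owning PU is the faulty one."""
    return [(x, y) for x in range(nx) for y in range(ny) if owner(x, y) == faulty]


# A 16 x 8 mesh on an 8 x 4 PU array, with PU (0, 0) faulty.
lost_block = lost_nodes(lambda x, y: block_owner(x, y, 16, 8, 8, 4), 16, 8, (0, 0))
lost_modular = lost_nodes(lambda x, y: modular_owner(x, y, 8, 4), 16, 8, (0, 0))
```

Both mappings lose four nodes, but the block mapping loses a contiguous 2 x 2 patch while the modular mapping loses four isolated nodes, each still surrounded by valid neighbors.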


5. Implementation of Poisson Equation on PAX-32


The fault recovery algorithms mentioned above are
implemented on the PAX-32 computer. PAX-32 is a
parallel machine consisting of 32 microprocessors combined
into a nearest neighbor connected torus array, as
shown in Fig. 5.1. To simulate a fault, one of the 32
PUs is selected at random during the iteration process.
This machine does not have the redundant links mentioned in
Section 2; the corresponding data transfers are supported with
software routines. The target model is a 2-dimensional
Poisson equation,

$$\Delta u = -1.0$$


A region of the model, boundary conditions, and physical 
constants are shown in Fig. 5.2. 


[Fig. 3.1: Finite Difference Method on Irregular Intervals]

[Fig. 5.1: PAX-32 System Configuration]


Fig. 5.3 illustrates the calculation time versus the
number of nodes per PU. This figure shows that the
calculation time is roughly proportional to the square of the
number of nodes. The calculation time depends upon the
iteration count and the number of nodes per PU.
The time for one iteration is directly proportional to the
number of nodes. So, if the iteration count is also
proportional to the number of nodes, this quadratic growth
agrees with the theoretical analysis.


Fig. 5.4 illustrates the iteration count versus the total 
number of nodes. This count K is proportional to the 
number of nodes. The K can be calculated from the 
theoretical study, and is expressed as follows: 


$$K = \frac{-4 + \log_{10}(1 - \lambda)}{\log_{10}\lambda}$$

where $\lambda$ is the maximum eigenvalue of the coefficient
matrix of the model. Some values of the theoretical K are
illustrated in Fig. 5.5. In this figure, the slight difference
between the theoretical K and the actual K is due to the
approximation in the theoretical equation. As a result,
both the calculation time and the iteration count agree
with those of the theoretical analysis.

[Figure panel 1: Region and Boundary Conditions]


TABLE 5.1 shows such error rates as defined below: 
[Figure panel 3: Number of Nodes within a Region]
1. 32 ( 8* 4)
2. 128 (16* 8)
3. 800 (40*20)
4. 2304 (48*48)

Note that the lost results belonging to a faulty PU are left
out of account. These are the results of simulations with 10
different random number seeds. This table
shows that the effects of a faulty PU are almost negligible.


6. Concluding Remarks 


In this paper, we proposed a new scheme called
Bypass Recovery for disconnecting faulty PUs, which is
useful with an NNM connection structure. This is a
dynamic fault recovery scheme, and efficient recovery
can be performed by using the calculus of finite
differences on irregular intervals as the discretization.


The simulation of this scheme on PAX-32 indicates that
the proposed scheme achieves high efficiency and
reliability.

It is desirable that this scheme also be applicable to the
finite element method (FEM). Basically, the coefficient
matrix of the model can be re-created at the faulty point, so
the application should be possible with a small modification.


ABSTRACT


The Traveling Salesman Problem (TSP) is computationally expensive
to evaluate. It can, however, be readily decomposed into subproblems
that can be computed in parallel. Developing a distributed program that
takes advantage of such a decomposition nevertheless remains a difficult
problem.


We developed such a distributed program to compute the TSP solutions,
using a new set of distributed program performance tools to better
understand our TSP program. These tools allowed us to discover the
performance bottlenecks in our program and to revise the program to
significantly improve its execution speed. This paper is a case study of the
use of these performance tools.


1. Introduction 


When a programmer implements a program that uses parallel
computation, the goal is typically to increase the speed of the computation.
Unfortunately, this goal is difficult to attain. One reason is that
communication between processes can be slow, and the cost becomes greater
when processes on different machines communicate. To attain the goal of
increased speed, the programmer must develop algorithms that minimize
the effect of this communication cost.


It is essential for the programmer to be able to measure the perfor- 
mance of his or her program. The programmer needs to debug the pro- 
gram, and to know where the program is spending its time so that it can be 
modified or redesigned to increase its execution speed. The problems of 
debugging and performance monitoring become more difficult in a distri- 
buted environment due to a general lack of tools to aid the programmer. 
Tools for measuring distributed program performance have been imple- 
mented for Berkeley UNIX [4]. These tools allowed the development of 
the distributed program described here. 


To examine the process of program development, we chose a problem
that lends itself to this examination, the Traveling Salesman Problem.
This problem is computationally intensive, but has well known solutions.
We developed a program to find solutions to the problem without using
parallel computation and, using that program as a basis, developed
programs that would solve the problem using parallel processing. We
debugged and measured our programs' execution using tools designed to
passively measure parallel performance [5]. Using the information gained
from these measurements, we improved our programs and measured how
much these improvements enhanced performance. The TSP program
development is a case study of the use of these measurement tools and of
the program development process.


2. The Traveling Salesman Problem 


2.1. Description of the Problem 


The Traveling Salesman Problem (TSP) is a popular problem among 
both operations research experts and computer scientists. In general, the 
problem can be stated as follows: given N cities and the costs associated 
with going from each city to each of the other cities, find a minimum cost 
circuit that visits each of the N cities once and only once. Formally we 
have a problem graph G = (V, E) where V is the set of vertices (cities) and 
E is the set of edges, each representing the cost of going from each city to 
another. In a problem of size N, we have N vertices and m = N(N - 1) 
edges. The problem can also be represented as an N x N matrix, with each
element $c_{ij}$ being the cost of going from city i to city j. This matrix is
called the representative matrix of the problem.


We address the problem in its most general form. In particular, we 
address the case of the asymmetric non-euclidean TSP. The problem is 
asymmetric because, in general, $c_{ij}$ does not equal $c_{ji}$. The problem is
non-euclidean because the costs are arbitrary nonnegative numbers,
unconstrained by the triangle inequality. There are many possible algorithms
that can be used to efficiently solve the TSP. We have decided to follow
the algorithm presented in [8].


2.2. General Algorithm 


We use a branch-and-bound algorithm to solve the TSP. This algo- 
rithm is based upon the development of a state tree in which we record the 
paths that the TSP program has examined and the costs associated with 
those paths. Each node in the state tree corresponds to a collection of 
edges in the problem graph that make up a partial path through the prob- 
lem. Each node of the state tree also contains a lower bound cost. The 
lower bound cost is the least cost that can be obtained by including that 
partial path in a full circuit. It is calculated by taking the representative
matrix, and for each edge $c_{ij}$ in the partial path setting all elements in
row i and column j of the matrix to 0, and then reducing the matrix. The
constants used to reduce the matrix are then added to the sum of the costs
of the edges in the partial path; the result is the lower bound cost.
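
The matrix reduction step can be sketched in Python as follows. The sketch assumes the usual meaning of "reducing" in branch-and-bound TSP algorithms: subtract each row's minimum from the row, then each column's minimum from the column, and sum the subtracted constants. Marking unusable entries (here the diagonal) as infinite is a convention of this example, and the 3-city cost matrix is hypothetical; it is not the matrix of Figure 2.1.

```python
import math

INF = math.inf


def reduce_matrix(matrix):
    """Subtract every row minimum, then every column minimum.

    Returns (reduced copy, sum of the constants that were subtracted)."""
    reduced = [row[:] for row in matrix]
    n = len(reduced)
    total = 0
    for row in reduced:
        m = min(row)
        if 0 < m < INF:
            total += m
            for j in range(n):
                row[j] -= m          # INF - m stays INF
    for j in range(n):
        m = min(reduced[i][j] for i in range(n))
        if 0 < m < INF:
            total += m
            for i in range(n):
                reduced[i][j] -= m
    return reduced, total


# Hypothetical asymmetric 3-city problem (rows: from, columns: to).
costs = [
    [INF, 10, 15],
    [5, INF, 9],
    [6, 13, INF],
]
reduced, bound = reduce_matrix(costs)
```

For this matrix the row minima sum to 21 and the third column contributes a further 4, so no circuit can cost less than 25; the cheapest circuit (first city to second, second to third, third back to first) costs exactly 10 + 9 + 6 = 25.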


By examining the lower bound costs associated with the nodes of the
state tree, the program can intelligently choose the best path to examine.
This allows us a potentially large reduction in computation time with 
respect to that needed by solutions which compute the cost of every possi- 
ble circuit through the problem. A decision to examine a particular path is 
called a branching decision, and it is based upon an examination of the 
lower bounds of all of the leaf nodes of the state tree [1]. Every time a 
branching decision is made, new nodes are created in the state tree that 
correspond to the possible extensions of the selected partial path. 


Figure 2.1 shows an example of how a TSP problem is translated 
from a problem graph to a representative matrix to a state tree. In this 
example we have a small (three vertices, six edges) problem. The first 
representation is the problem graph, which shows the vertices and the 
directed edges of the problem. The representative matrix shows the cost of 
going from each of the three vertices to each of the other reachable nodes. 
Finally, the figure shows the development of the state tree. Vertex A of 
the problem graph is arbitrarily selected as the root node of the state tree, 
and, by reducing the representative matrix, we find that the problem’s 
lower bound is 17. We next set up the state tree nodes that correspond to 
the possible edges that include A, namely, AB and AC. After setting row
A and column B in the representative matrix to 0, we reduce the matrix
with constants that sum to 10. Adding these constants to the cost of edge
AB, which is 7, we find that the lower bound for a solution containing the
edge AB is 17. Similarly we find that the lower bound for AC is 22.
Examining all of the leaf nodes of the state tree, we decide to pursue the
possibilities of a path containing AB, and set up nodes in the state tree for
all of the problem graph edges that may be used after including AB in a
solution. There is only one such problem graph edge, BC, and a state tree
node is set up for this edge, and its lower bound is computed to be 17.
Once again, we examine the lower bound costs of the leaf nodes of the
state tree, BC and AC, and choose BC because it has the lowest lower
bound. Only one more edge exists beyond BC: CA, which completes the
circuit and solves the problem with a cost of 17, which is lower than the
cost of any other possible circuit through the problem.


3. Computational Environment 


The program was developed on a VAX 11/750 with 2 megabytes of
memory. Occasionally, other VAXs of the same configuration were
used for data collection. These VAX computers are connected by a 3
megabit Ethernet. The operating system running on these VAXs is
Berkeley UNIX 4.2BSD. This version of Berkeley UNIX supports virtual
memory and processes with address spaces up to 16 megabytes. Most
importantly, it supports interprocess communication [3]. We used the
interprocess communication facilities built into 4.2BSD UNIX. The
primary object of communication under this facility is the socket. A socket is
an endpoint of communication. These sockets can be connected to form
bidirectional communication streams.


4. Measurement Tools 


The DPM measurement tools were used for this performance study.
These tools are described in [4] and [5]. The raw data produced by the
measurement tools is analyzed by a program that builds a graph of all
interactions between the processes. The program uses this graph to
compute the amount of parallelism achieved by the program, given various
assumptions about message transmission costs. It can also project
performance in the case of multiple CPUs for various assignments of processes
to processors. The parallelism analysis is described in detail in [6].


The measurement procedure is an iterative one. First, we execute 
the distributed program, collecting the performance trace data. We may 
repeat this step, varying (for the TSP program) the problem size (number
of cities) and the number of processes that are used to compute the solution.
Second, we analyze the performance data, computing the actual amount of 
parallelism, and computing the projected amount of parallelism for other 
assignments of processes to processors. Third, we use this data to modify 
our program. These three steps can be repeated until a satisfactory solu- 
tion is reached. 


The data analyses are being used to predict program performance. 
The question arises: How accurate are these predictions? While we do not 
test our predictions in the TSP study, previous case studies have shown the 
accuracy of these predictions to be within 5% of the actual program
performance [7].


5. Evolution of a Program 


The unit of measurement of performance that we use in this paper is
the parallelism factor, P. This is a measure of speed-up in the
computation, or how much of our computation is done in parallel. In the case
where there is no parallel execution, P is 1; in the case where there are n
processes in a computation and each process is concurrently performing
1/n of the work, P is equal to n. For this study, improving the
performance of the computation means, for a given problem size (number of
cities), increasing the value of P. In section 5.6 we comment on the effect
that speed-up has on machine utilization.
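
One simple reading of the parallelism factor that is consistent with the two limiting cases above is total processing time divided by elapsed time. The Python sketch below uses that reading; it is an assumption made for illustration, since the analysis described in [6] works from a graph of process interactions rather than from this formula.

```python
def parallelism_factor(busy_times, elapsed):
    """Total work done by all processes divided by elapsed wall-clock time."""
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return sum(busy_times) / elapsed


# One process doing all 12 seconds of work: no parallelism.
serial = parallelism_factor([12.0], elapsed=12.0)
# Four processes each doing a quarter of the work at the same time.
ideal = parallelism_factor([3.0, 3.0, 3.0, 3.0], elapsed=3.0)
# Same work, but waiting on communication stretches the run to 4.8 seconds.
delayed = parallelism_factor([3.0, 3.0, 3.0, 3.0], elapsed=4.8)
```

Under this reading, communication delay and CPU contention lower P by lengthening the elapsed time while the amount of work stays fixed.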


In the measurements of the TSP programs, we use two forms of the 
factor P. The first form of P assumes that communication delays between 
processes are zero (i.e., communication is instantaneous), and that no pro- 
cess competes for the CPU with any other process. This gives an upper 
bound on the amount of parallelism that could be achieved by the meas- 
ured execution of the program. The second form of P incorporates com- 
munication delays and contention caused by multiple processes executing 
on the same machine. The parallelism analysis can project the value of P 
for different assignments of processes to machines. 


During the course of this research project, we developed three pri- 
mary versions of the TSP program. These are referred to (in order) as the 
Non-Blocking Server, the Low-Node Caching and the Overlapping 
Requests versions. Note that each of these versions represents successive 
improvements to its predecessors, and not complete re-implementations. 


Due to time considerations, we were not able to measure the perfor- 
mance of all of our implementations over a large variety of TSP problem 
sizes. For comparison purposes, all the implementations were measured 
for problems of size sixteen, with four, eight, twelve and sixteen processes. 


5.1. Single Process Solution 


We began by writing a single process implementation of our algo- 
rithm. This familiarized us with the TSP while ignoring the complexities 
of interprocess communication. 


The basic algorithm of the program is as follows: First a matrix
representing the problem is created. Second, the root node of the state tree
is created. Each node of the state tree is identified with an edge's entry
from the representative matrix, and represents a partial path through that
matrix. This partial path can be obtained by traversing up the state tree to
the root node. At each step of this traversal, the edge that the state tree
node represents is a part of the partial path. The node also holds the lower
bound cost associated with the partial path of which it is a part, and the
lower bound cost of the solution that could be obtained by following that
partial path. A node of the problem is arbitrarily selected as the starting
point of the solution.


In each iteration of the main loop of the program, a node is selected 
as having the best possible chance of being on the optimal path. This is 
done by traversing the entire state tree and choosing the leaf node with the 
lowest lower bound value. This node represents the path most likely to 
yield an optimal circuit. 


Child nodes are now set up for the chosen node. For each node that 
could be reached from the chosen node (that is, all of the reachable cities 
that are not already in the path), a child node is created, and lower bound 
costs for the child node are computed. 


This loop is repeated until a complete path is found. The first com- 
plete path found will be an optimal path because the cost associated with it 
will be less than or equal to the cost associated with any other path. 
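
The single-process algorithm described above can be sketched as follows. This is a minimal illustration and not the authors' program: the lower-bound function (cost so far plus the cheapest edge out of every city the tour still has to leave), the names `lower_bound` and `solve_tsp`, and the flat list of leaves standing in for the state tree are all assumptions made for the example.

```python
def lower_bound(dist, path, cost):
    """Assumed bound: cost so far plus the cheapest edge out of each city
    the tour still has to leave (the current city and all unvisited ones)."""
    n = len(dist)
    to_leave = [c for c in range(n) if c not in path] + [path[-1]]
    return cost + sum(min(dist[c][d] for d in range(n) if d != c)
                      for c in to_leave)


def solve_tsp(dist, start=0):
    n = len(dist)
    # Each leaf is (lower bound, cost so far, partial path).
    leaves = [(lower_bound(dist, (start,), 0), 0, (start,))]
    while leaves:
        # Scan every leaf and choose the one with the lowest lower bound.
        best = min(leaves, key=lambda leaf: leaf[0])
        leaves.remove(best)
        bound, cost, path = best
        if len(path) == n + 1:
            # The first complete circuit to be chosen is optimal.
            return cost, path
        if len(path) == n:
            # All cities visited: close the circuit; its bound is its cost.
            total = cost + dist[path[-1]][start]
            leaves.append((total, total, path + (start,)))
            continue
        # Create one child per city that is not yet on the partial path.
        for city in range(n):
            if city not in path:
                new_cost = cost + dist[path[-1]][city]
                new_path = path + (city,)
                leaves.append(
                    (lower_bound(dist, new_path, new_cost), new_cost, new_path))
    return None


dist = [[0, 10, 15, 20],
        [10, 0, 35, 25],
        [15, 35, 0, 30],
        [20, 25, 30, 0]]
cost, path = solve_tsp(dist)   # cost is 80 for this four-city matrix
```

Because the assumed bound never overestimates the cost of completing a partial path, a complete circuit is chosen only when no other leaf could lead to a cheaper one.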


5.2. Multiple Process Solution 


There are many possible approaches to converting this computation 
into a parallel computation. We introduce parallelism by creating child 
processes that do some of the calculations needed by the parent process for 
each iteration. In other words, each child performs the part of the calcula- 
tion that it is given, and all of the decisions are made by the parent process. 


Most of the computation time is spent when new nodes are added to
the state tree and the lower bound cost for each new node must be
calculated. Our implementation of this program calculates these lower
bound costs for each new node in parallel. When the program starts, a
number of child processes are created. Instead of having the single parent
calculate the lower bounds, these computations are handed to the child
processes in the form of requests. Later, the results are returned to the
parent, which stores the values in the appropriate node. There are other
possible partitionings of this problem, but for this case study we chose a
single design and attempted to improve its performance.


In the first multiple process version, the parent sends requests to 
each of the children and then awaits responses from all of the children 
before sending out the next batch of requests. Although this version of the 
program ran and produced accurate answers, we did not start the actual 
performance measurements on our program until we had solved the simple 
problem described in the next section. 


5.3. Non-blocking Server 


After examining the program and its performance, it became 
apparent that most of the benefits of parallel execution were being lost 
because the parent process waited until all children sent back answers 
before sending out the next batch of computation requests. This was 
solved by having the parent hand out new requests to a child as soon as it 
returned an answer. We considered this version to be our first successful 
implementation, and the following measurements show its performance. 
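
The difference between the blocking and the non-blocking parent can be sketched as below. This is an illustration under stated assumptions, not the measured program: threads from Python's `concurrent.futures` stand in for the child processes, and `serve_non_blocking`, `compute` and `n_children` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def serve_non_blocking(requests, compute, n_children):
    """Keep every child busy: as soon as one answer comes back,
    hand out the next request instead of waiting for the whole batch."""
    results = {}
    pending = {}                       # future -> index of its request
    todo = iter(enumerate(requests))
    with ThreadPoolExecutor(max_workers=n_children) as pool:
        # Start by giving each child one request.
        for _ in range(n_children):
            item = next(todo, None)
            if item is None:
                break
            index, request = item
            pending[pool.submit(compute, request)] = index
        while pending:
            # Wake up when ANY child has answered, not when all have.
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                results[pending.pop(future)] = future.result()
                item = next(todo, None)
                if item is not None:
                    index, request = item
                    pending[pool.submit(compute, request)] = index
    return [results[i] for i in range(len(results))]


bounds = serve_non_blocking(range(10), lambda x: x * x, n_children=3)
```

A blocking parent would instead wait for all outstanding answers before submitting the next batch, so every child would idle until the slowest one finished.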


Figures 5.1, 5.2 and 5.3 show the projection of how this implementa- 
tion of the program performed. Figure 5.1 shows the achievable degree of 
parallelism for problems with various matrix sizes and numbers of 
processes. This graph assumes that communication costs are zero, and that 
the processes do not compete with each other for the CPU. Figure 5.2 
presents the performance of the program when each process is run on its 
own machine, over varying problem sizes and numbers of processes. Fig- 
ure 5.3 illustrates the amount of parallelism achieved by our program when 
two processes are allocated to each processor under a variety of problem 
sizes and number of processes. 


The results show that the amount of parallelism achieved increases 
with the size of the problem. As the problem size increases, a larger per- 
centage of the program’s time is spent computing the lower bounds. Since 
we do the computation of the lower bounds in parallel, it is natural that the 
degree of parallelism should increase with the problem size. 


If we compare the corresponding curves from graphs 5.1 and 5.2 for 
a single problem size, we can see the cost (loss of parallelism) from com- 
munications delays. Comparing graphs 5.2 and 5.3, we see the loss of 
parallelism when two processes contend for a single CPU. 


5.4. Low-Node Caching 


The process status of the running program showed that it was using
large amounts of memory. This memory was being used for the
state tree. As a result, we were spending a large percentage of
program time searching through the state tree for the lowest node. During
this time the child processes were sitting idle, and parallelism was being
lost.


To eliminate this waiting time, we decided to save 100 of the previ- 
ously calculated nodes with the lowest values in an array. For each child 
node that is created, if the array is not full or if the value of the child node 
is less than the cost of the last (highest value) element in the array, this 
value is inserted, in order, into the array. The result is an ordered array of 
the leaf nodes from the state tree with the smallest values. Our branching 
decision is made by selecting the leaf node of the state tree that has the 
smallest value associated with it, in this case the first node in the array. 
When this node is selected, it is eliminated from the array. 
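
The ordered array can be sketched as follows. The class name `LowNodeCache` and its methods are hypothetical. The text does not say what happens to an entry that is pushed out of a full array, so this sketch simply drops it from the cache, on the assumption that the node still exists in the state tree.

```python
import bisect


class LowNodeCache:
    """Ordered array of the lowest-valued leaf nodes seen so far."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.entries = []              # sorted list of (value, node_id)

    def offer(self, value, node_id):
        """Insert a new child node if there is room or if it beats the
        last (highest value) element. Returns True when it was cached."""
        if len(self.entries) < self.capacity:
            bisect.insort(self.entries, (value, node_id))
            return True
        if value < self.entries[-1][0]:
            self.entries.pop()         # assumed: drop the highest entry
            bisect.insort(self.entries, (value, node_id))
            return True
        return False

    def take_lowest(self):
        """Branching decision: remove and return the first element."""
        return self.entries.pop(0) if self.entries else None


cache = LowNodeCache(capacity=3)
for value, node_id in [(5, 1), (3, 2), (9, 3), (7, 4), (10, 5)]:
    cache.offer(value, node_id)
lowest = cache.take_lowest()           # (3, 2)
```

Selecting the next node is now a constant-size operation on the array rather than a traversal of the whole state tree.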


Performance measurement of this Low-Node Caching version of our
program shows a significant improvement in the parallelism achieved by
the program. Our measurements were made for problems of size 16 with
4, 8, 12 and 16 processes. Figure 5.4 compares this Low-Node Caching
version with the Non-Blocking Server version. The figure shows that
the Low-Node Caching version performed significantly better than the
Non-Blocking Server implementation. We also find that this improvement
in performance is more pronounced as the number of processes increases.


This improvement is the result of the parent process doing less com- 
putation relative to the computations it gives to its children processes. 
That is, less time is spent in the parent process in this version, while the 
children are still doing the same amount of computation as in the Non- 
Blocking Server version. 


5.5. Overlapping Requests 


Experimental measurements show that communications costs are 
actually very high, taking approximately 8 ms for communications 
between processes on a single machine and 20 ms for processes on dif- 
ferent machines [2]. This means that a child process remains idle while 
information is transferred back to its parent and while waiting for a new 
request to arrive from the parent. 


To eliminate the idle time in the case where the number of requests 
exceeds the number of servers, we overlapped the requests to the children 
by initially sending out two requests to each child. Children no longer had 
to wait for the parent to receive the result before beginning work on a new 
request, since another request was usually waiting. This is an important 
innovation, and is aimed solely at trying to cut down the effect of the com- 
munications cost. As the following results show, this minor change 
enhanced the program’s performance. 


Figure 5.5 shows the performance of the Overlapping Requests
implementation, as well as the performance of the previous two versions,
for a problem of size 16 with 4, 8, 12 and 16 processes.


Note that the Overlapping Requests version performs better than the 
Low-Node Caching version when the number of processes is small, but 
that the difference in performance between the two diminishes as the 
number of processes approaches the size of the problem. If there are as 
many processes as the size of the problem, then all of the computations are 
done by having each child do only one computation, thus there is no 
opportunity to use overlapping to reduce communication delay effects. 
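
The effect of overlapping can be illustrated with a small timing model of one child. Everything here is an assumption made for the sketch: the function `child_finish_time`, the even split of the round trip into two one-way delays, a parent that replies instantly, and the 30 ms compute time. Only the 20 ms figure for communication between machines comes from the measurements quoted above.

```python
def child_finish_time(n_requests, compute_ms, round_trip_ms, outstanding):
    """Time at which the parent holds the last answer from one child.

    outstanding = 1 models one request in flight per child;
    outstanding = 2 models the Overlapping Requests scheme.
    """
    one_way = round_trip_ms / 2
    done = []                          # completion time of each request
    for i in range(n_requests):
        if i < outstanding:
            arrive = one_way           # sent by the parent at time zero
        else:
            # Sent when the answer to request i - outstanding reached
            # the parent, so it arrives one round trip after that answer.
            arrive = done[i - outstanding] + round_trip_ms
        start = max(arrive, done[-1]) if done else arrive
        done.append(start + compute_ms)
    return done[-1] + one_way


single = child_finish_time(4, 30, 20, outstanding=1)    # 200.0 ms
overlap = child_finish_time(4, 30, 20, outstanding=2)   # 140.0 ms
one_each = child_finish_time(1, 30, 20, outstanding=2)  # 50.0 ms
```

In this model the child with two outstanding requests never idles once work has arrived, so the communication delay is paid only once. With a single request per child the two schemes give the same time, which matches the observation about the case where the number of processes equals the problem size.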


5.6. Is Faster Always Better? 


In this study, we have based our evaluation of program performance 
on the amount of parallelism or speed-up achieved in the different versions 
of the TSP program. By using P as our only evaluation criterion, we have 
stated: faster is better. This view of performance dictates that even a small 
increase of parallelism is worth the addition of more hardware. 


It is also important to know how well we are using our computing 
resources. This can be stated as: how much of our available computing 
resources are we using? For the TSP programs, we ask: how much of the 
machines (CPU’s) involved in the computation are we utilizing? We can 
compute CPU utilization from the parallelism factor. We define
Utilization = P/N, where N is the number of machines used in the
computation. Figure 5.6 graphs the values of the utilization (corresponding to the
graphs in Figure 5.5) for the three versions of the TSP program. 


Our first observation from the graph in Figure 5.6 is that when two 
processes run on each machine, we have a higher utilization (and from Fig- 
ure 5.5, a lesser amount of parallelism) than when one process runs on 
each machine. This means that while running one process per machine is a 
less efficient use of resources, it results in a faster execution. 
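
The trade-off can be made concrete with the definition Utilization = P/N. The parallelism values below are hypothetical numbers chosen only to illustrate the relationship; they are not read from Figures 5.5 or 5.6.

```python
def utilization(parallelism, n_machines):
    """Utilization = P / N, with N the number of machines used."""
    return parallelism / n_machines


# Hypothetical example: eight processes placed two different ways.
one_per_machine = utilization(parallelism=4.0, n_machines=8)   # 0.5
two_per_machine = utilization(parallelism=3.0, n_machines=4)   # 0.75
```

With these assumed values, the one-process-per-machine placement runs faster (P = 4.0 against 3.0) but uses each machine less fully (0.5 against 0.75).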


Our second observation from Figure 5.6 is that each successive
version of the TSP program made better (more efficient) use of the CPU’s.
We know that the successive versions are better (and not just requiring 
more computation time) since the amount of CPU time needed for each 
successive version also decreased. The fact that the curves for the Over- 
lapping Requests version and the Low-Node Caching version meet when 
16 processes are used reflects the fact that the problem size is the same as 
the number of processes. 


A person who uses a distributed program, such as the TSP program, 
wants his/her program to execute as quickly as possible. The parallelism 
factor, P, gives a measure of how much we can increase the speed of
execution. It is also important to know how efficient a program is. We need
metrics such as CPU utilization to evaluate efficiency. 


6. Conclusion 


The purpose of this study was to evaluate a collection of measure- 
ment tools for distributed programs through the development of a simple 
program. The question was: could these tools help us better understand 
our program, and help us improve its performance? 


While the main intent of this study was to evaluate the performance 
monitoring abilities of the measurement tools, these tools also proved use- 
ful for debugging. When the first version of the TSP program was being 
tested, its execution speed seemed slower than expected. A visual exami- 
nation of the trace records produced by the measurement tools quickly 
(that is, in less than one minute) showed that the master process in the 
computation was only creating a single child process. This was caused by 
a simple typing error in the program. After this error was corrected, the 
performance seemed unchanged. A second inspection of the traces 
showed that, while the correct number of child processes was being
created, the master was sending requests only to a single child — again a 
typing error. After this, the program worked correctly. The discovery of 
these errors is an indication that the measurement tools are providing some 
of the additional information needed for developing distributed programs. 


The main direction of this study was the investigation of the perfor- 
mance of the TSP program. In particular, we wanted to maximize the 
amount of parallel activity in the program. The distributed program meas- 
urement tools, in conjunction with other existing tools, provided the infor- 
mation necessary to measure the program’s performance, identify 
bottlenecks, correct the problem, and evaluate the corrections. 


Traditional performance tools provide information as to how much 
time a program spends in its various procedures and modules. Our meas- 
urement tools in conjunction with the analysis of parallelism also provide 
an insight into when a program spends its time in these modules. 


Figure 5.6. Machine Utilizations for the Three Implementations
(horizontal axis: Number of Processes).

Figure 5.3. Parallelism vs. Matrix Size, Two Processes per Machine.


Abstract 


This paper presents a new program transformation
algorithm for functional programs named the mixed-mode
fixpoint evaluation method. This method realizes optimized
demand-driven computation on a dataflow machine.
This evaluation scheme takes advantage of both
data-driven and demand-driven evaluations in that it
offers sufficient parallelism and safeness for nonstrict
sequential functions. This can be achieved by eliminating
demand propagation predictable during compilation time,
and by replacing the lengthy recipe creation and evaluation
required in demand-driven function applications with
the call-by-value method. These optimizations are based
on an inter-function global dataflow analysis called the
Extended Dependency Property Set.


1. Introduction 


Functional programming languages have various at- 
tractive features facilitating the writing of short and 
clear programs, as well as verifying and transforming 
programs automatically due to their formal mathemat- 
ical semantics. For the efficient execution of functional 
programs, several language-oriented machines based on 
dataflow and reduction models have been proposed and 
implemented.!!51 

Although dataflow machines inherently have con- 
formed excellently to a special subset of functional 
programming languages!®!, several drawbacks existl#] in 
dataflow models compared to reduction models. One of 
these is the difficulty in implementing the fixpoint compu- 
tation rulel®] for nonstrictl1® functions. In order to over- 
come this drawback, several studies have been conducted 
in an attempt to achieve demand-driven computation on 
dataflow models. These studies have focused on demand- 
controlled stream creation,!17] the lazy cons operator?! 
and the semi-result concept.!!7] These approaches extend 
the computation domain to nonstrict data structures such 
as infinite sequences. However, they have only limited ca- 
pability for handling nonstrictness of sequentiall!®] func- 
tions. 

In addition, several inefficiencies commonly arise dur- 
ing demand-driven evaluation. One is the cost of demand 
propagation from the outermost expression to the inner- 
most expressions. An algorithm based on simplified global 
dataflow analysis has been proposed.!!4] Such a simple 


0190-3918/86/0000/0421 $01.00 © 1986 IEEE 


method, which first makes the assumption that all non- 
primitives are nonstrict (or strict), however, has been 
shown to have only a restricted analysis capability.!4] Another
inefficiency is the cost required for creating and evaluating
recipes.|§] The recipe-based demand-driven method
(call-by-recipe) requires much more complicated computation
than does the simple call-by-value method adopted
in the data-driven evaluation. 

This paper proposes a new method for realizing 
an optimized fixpoint computation rule on dataflow 
machines, named the mixed-mode fixpoint evaluation 
method. The name is so chosen because the method real- 
izes the fixpoint computation rule for the class of sequen- 
tial functions, and because both data-driven and demand- 
driven evaluations are scattered throughout the computa- 
tion process. 

The method consists of two principal algorithms: the 
demand propagation algorithm and the recipe creation 
elimination algorithm. The demand propagation algo- 
rithm predicts successive demand propagation sequences 
by analyzing the dependency between functions and val- 
ues. Then, it transforms dataflow graphs such that each 
demand is propagated directly toward the innermost ex- 
pressions as long as the safeness property is preserved. 
The recipe creation elimination algorithm analyzes the 
lifetime of recipes and detects all cases where a call-by- 
recipe can be safely replaced with a call-by-value. 


To ensure the safeness of the above optimizations, 
these algorithms make use of an inter-function global 
dataflow analysis named the Parameter Dependency 
Property Set (PDPS).I%-1°] In this paper, the concept of 
the PDPS is generalized to the Extended PDPS (EDPS) 
in order to process the recipe lifetime property. In or- 
der to reduce the amount of computation required for the 
optimization, a hierarchical approach is adopted. As the 
first step, a program is analyzed in the inter-function level. 
Then, each function is analyzed and optimized internally 
based on the collected global data. 

The algorithms proposed in this paper make 
use of a hardware-implemented lazy synchronization 
mechanism,!?] such as that used in the Structure 
Memory!!)5] for storing structured data. In combination 
with this kind of intelligent operation memory, the pro- 
posed method realizes the fixpoint computation rule for 
the class of sequential functions on nonstrict data struc- 
tures. 


2. Background 
2.1 Global Dataflow Analysis 


Mixed-mode fixpoint evaluation requires that eval- 
uation of a function application be initiated only if 
the result contributes essentially to the final computa- 
tion result. In contrast, the hardware-supported redex 
(reducible expression) detection mechanism in dataflow 
machines is so restricted that only local information about 
value availability is used as a criterion for initiating com- 
putation. This gap inevitably leads us to a global dataflow 
analysis which gathers the property data of program frag- 
ments (function, function application, etc.) that can be 
made apparent only by analyzing the entire program text. 

For reducing the computation required for such 
global analysis, the concept of autonomousness is intro- 
duced. Suppose that an inter-function analysis computes 
the property P for each function in a program. Property 
P is said to be autonomous if the property P of a function 
f is computable using only the property data of functions 
referred to by the function f (i.e., computable without 
using the definitions of referred functions). 

The result of the conventional strictness analyses,[13]
i.e., the set of requisite parameters, is not autonomous. For
example, consider the program


    f(x,y) = g(x, x+y, x−y);
    g(x,y,z) = if x > 0 then y else z fi.


Function f refers to g, and the requisite parameter 
of g(x,y, z) is x. The requisite parameters for f(x,y) are 
x and y. However, this fact cannot be deduced from the 
requisite parameter information concerning g. 
The autonomousness of property P is desirable, since 
computation required for the inter-function analysis can 
be reduced using the caller-callee relation of the func- 
tions in a program. In addition, autonomousness means 
that the property contains enough information to enable 
computation of the property of the elements referring to 
it. Therefore, inter-function property data can be propa- 
gated to clarify the property of each element in a function 
without referring to the entire program text. 


2.2 Parameter Dependency Property Set (PDPS) 


The PDPS is the generalization of requisite parame- 
ter in such a way that it satisfies autonomousness. The 
PDPS for a function f is a set of Minimally Sufficient 
Parameter Sets (MSPSs) for f that is defined below. 


[Definition 1] Minimally Sufficient Parameter Set
Given an (m+n)-ary function f(x_1, x_2, ..., x_m, y_1, ..., y_n),
a set of parameters {x_1, ..., x_m} is said to be minimally
sufficient if and only if there exist values a_1, ..., a_m such that

(a) f(a_1, ..., a_m, y_1, ..., y_n) gives a defined value even if
    y_1, ..., y_n are undefined.

(b) Suppose y_1 = ω, ..., y_n = ω, where ω represents an
    undefined value. Then f(a_1, ..., a_m, y_1, ..., y_n) becomes
    undefined if at least one of the values a_1, ..., a_m is
    replaced with ω.


The PDPS computation algorithm for mono- 
tonic recursive function systems has been previously 
provided.!919] The PDPS of an expression, defined simi- 
larly regarding free variables of the expression as param- 
eters, can be computed using the algorithm called PDPS 
reduction.!®! It is apparent that the intersection of all 
MSPSs in the PDPS of function f becomes the set of 
requisite parameters of f. 

For example, consider the function g(x,y,z) defined
in Subsection 2.1. D(g(x,y,z)) = {{x,y},{x,z}}, where
D(e) stands for the PDPS of e. The PDPSs of sub-expressions
x + y and x − y are both {{x,y}}. The PDPS
of the function f(x,y) becomes {{x,y}}. This means that
the requisite parameters of f(x,y) are x and y.
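
The example can be reproduced with a small sketch that derives the PDPS of f(x,y) = g(x, x+y, x−y) from the PDPS of g alone, without looking at the body of g, which is the point of autonomousness. The substitution rule used here (replace each formal parameter in an MSPS of the callee by an MSPS of the corresponding actual argument and take the union) is a naive assumption made for illustration. It is not the PDPS reduction algorithm of the cited references, and the names `substitute` and `requisite` are hypothetical.

```python
from itertools import product


def substitute(callee_pdps, arg_pdps):
    """callee_pdps: set of frozensets of formal parameter names.
    arg_pdps: maps each formal name to the PDPS of its actual argument."""
    result = set()
    for msps in callee_pdps:
        formals = sorted(msps)
        # Choose one MSPS for each actual argument that this MSPS needs.
        for choice in product(*(arg_pdps[p] for p in formals)):
            result.add(frozenset().union(*choice))
    return result


def requisite(pdps):
    """Requisite parameters: the intersection of all MSPSs."""
    return frozenset.intersection(*pdps)


# D(g(x,y,z)) = {{x,y},{x,z}}
D_g = {frozenset("xy"), frozenset("xz")}
# Actual arguments of g inside f: x, x+y and x-y.
args = {"x": {frozenset("x")},
        "y": {frozenset("xy")},
        "z": {frozenset("xy")}}

D_f = substitute(D_g, args)     # {{x, y}}
req_g = requisite(D_g)          # {x}
req_f = requisite(D_f)          # {x, y}
```

Note that `req_f` could not have been obtained from `req_g` alone, since the requisite parameter set of g does not record that one of y or z is always needed.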


3. Realization of Mixed-mode Fixpoint Evaluation 

3.1 Value Property Classification 


Parameters and free variables (hereafter, simply referred
to as values) of a function can be discussed in terms
of two properties in combination with each other:
requisiteness and recipe lifetime. Requisiteness refers to the
causal relation between values, whereas recipe lifetime
means the form of value references in the function body.

Regarding requisiteness, a value is classified as either 
requisite, suspended or irrelevant. A requisite value is a 
value such that if it becomes undefined, the result always 
becomes undefined. The intersection of all MSPSs in the 
PDPS of the function gives the set of requisite values. If 
there exists a case such that the result becomes undefined 
when the value becomes undefined, but the value is not 
requisite, it is said to be suspended. S(f), the set of 
suspended values for a function f, can be computed as 

    S(f) = { union of all MSPSs of f } − { requisite values of f },

where “−” is the set difference operator. A value is said to
be irrelevant if the result is computed independently of 
the value. 
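
The three requisiteness classes follow directly from the PDPS, as the sketch below shows. The function name `classify` is hypothetical, and the second PDPS is an invented example included only to show an irrelevant parameter.

```python
def classify(pdps, parameters):
    """Label each parameter as requisite, suspended or irrelevant."""
    union = frozenset().union(*pdps)
    req = frozenset.intersection(*pdps)      # in every MSPS
    suspended = union - req                  # S(f) from the formula above
    labels = {}
    for p in parameters:
        if p in req:
            labels[p] = "requisite"
        elif p in suspended:
            labels[p] = "suspended"
        else:
            labels[p] = "irrelevant"
    return labels


# PDPS {{x,y},{x,z}}, as for g(x,y,z) = if x > 0 then y else z fi.
labels_g = classify({frozenset("xy"), frozenset("xz")}, "xyz")
# {'x': 'requisite', 'y': 'suspended', 'z': 'suspended'}

# Invented PDPS {{x,y}} for a three-parameter function k(x,y,w).
labels_k = classify({frozenset("xy")}, "xyw")
# {'x': 'requisite', 'y': 'requisite', 'w': 'irrelevant'}
```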

From the viewpoint of recipe lifetime, a value is classified
as either scalar or nonscalar. A value is said to be
scalar if the value always requires evaluation when it is
referenced. In contrast, a value is said to be nonscalar if a
case exists such that the result of a function application 
is computable using the value only in its recipe form. In 
other words, if the demand to the value does not neces- 
sarily mean the evaluation of the expression that gives 
the value, the parameter is said to be nonscalar. To give 
an example, numerical values in arithmetic expressions 
are scalar, while parameters of a lazy cons operator are 
nonscalar. 


3.2 Mixed-mode Fixpoint Evaluation 


The principle of mixed-mode fixpoint evaluation is 
that actual parameter passing is deferred until the param- 
eter becomes requisite for computing the resulting value, 
and that the parameter is passed on in an evaluated form 
whenever evaluation of the actual parameter is safe. 

To achieve the above principle, two primary function
application criteria are adopted. First, if a parameter is
requisite, the actual parameter is unconditionally passed 
on to the function body. If a parameter is suspended, it 
is passed on only if the expanded function sends a de- 
mand for it. If a parameter is irrelevant, no code for it is 
generated. 

Second, if a parameter is scalar and it is requisite or 
demanded, the actual parameter is evaluated, and the re- 
sulting value is passed on to the expanded function. Thus, 
if the actual parameter is a recipe, it is forced at first, and 
then passed on. If a parameter is nonscalar and it is req- 
uisite or demanded, a recipe for the expression that gives 
the actual parameter is created (if not already created), 
and a pointer to the recipe is passed on to the expanded 
function. If the recipe is forced in the body of the ex- 
panded function, the evaluation of the recipe is initiated 
with the resulting value taking the place of the recipe as 
in call-by-need computation.!®! In this paper, the recipe 
creation and evaluation mentioned above are assumed to 
be executed in an intelligent operation memory such as 
a Structure Memory. For details of the implementation, 
please refer to [2]. 


The function application method of the mixed-mode 
fixpoint evaluation is an optimized version of a joint data- 
driven, demand-driven scheme in terms of three principal 
points. First, evaluated values are passed on instead of 
recipes to an expanded function if scalar parameters are 
requisite or demanded. Second, recipe creation for a sus- 
pended nonscalar parameter is delayed until it is actu- 
ally demanded. Third, the method completely eliminates 
computation of irrelevant parameters. 
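
The two criteria combine into a small decision table, sketched here. The function name `passing_action` and the wording of the returned strings are illustrative assumptions; the cases themselves restate the rules given above.

```python
def passing_action(requisiteness, lifetime):
    """Return when and how an actual parameter is passed.

    requisiteness: "requisite", "suspended" or "irrelevant"
    lifetime:      "scalar" or "nonscalar"
    """
    if requisiteness not in ("requisite", "suspended", "irrelevant"):
        raise ValueError("unknown requisiteness: " + requisiteness)
    if lifetime not in ("scalar", "nonscalar"):
        raise ValueError("unknown lifetime: " + lifetime)
    if requisiteness == "irrelevant":
        return "generate no code"
    # First criterion: WHEN the parameter is passed.
    when = "at application" if requisiteness == "requisite" else "on demand"
    # Second criterion: IN WHAT FORM it is passed.
    if lifetime == "scalar":
        what = "evaluate (forcing any recipe) and pass the value"
    else:
        what = "create the recipe if needed and pass a pointer to it"
    return when + ": " + what


for r in ("requisite", "suspended", "irrelevant"):
    for l in ("scalar", "nonscalar"):
        print(r, l, "->", passing_action(r, l))
```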


3.3 Function Expansion Strategy 


The function expansion strategy controls the function 
expansion time during function application. One strategy 
is that a function is expanded when all requisite parame- 
ters are available. This is a natural extension of the data- 
driven computation for strict functions, and is called the 
strict expansion strategy. Functions may be expanded at 
the time the first requisite parameter becomes available. 
This is called the hasty expansion strategy. In this pa- 
per, the strict expansion strategy is assumed. If the hasty 
expansion strategy is adopted, the algorithm in Section 4 
should be slightly modified. 

Figure 1(a) is a graphic representation of function
applications. Primitives and strict functions are computed
in a data-driven manner. For each suspended parameter 
of a nonstrict function, a demand arc (dashed arrow) ex- 
tends from the node. When a token is put on this arc, 
the corresponding parameter is demanded. Conditional 
function “if_then_else” is also represented as one of the 
nonstrict functions, although it is a macro-node expanded 
in-line during program compilation. 

In demand-driven computation, the case exists in 
which function expansions should be controlled by de- 
mands. This is because a function should be applied if 
and only if the resulting value of the function application 
is requisite. For this purpose, we introduce a new notation
for the dataflow graph as shown in Fig. 1(b). The horizontal
dashed arrow is a demand arc for function expansion.
This notation is referred to as demand expansion.
In the strict expansion context, a function is expanded 
if and only if all requisite parameters are available and 
a demand token exists on the demand arc. In contrast, 
conventional notation is referred to as self expansion. 


4. Demand Propagation Algorithm


During the function application of the mixed-mode
fixpoint evaluation, the necessity for a suspended parameter
is signalled by a demand token from the expanded
function body. For a scalar parameter, this demand is
propagated to evaluate the corresponding actual parameter.
The proposed method optimizes the propagation
using the result of inter-function analysis. This section
describes the outline of this optimizing algorithm.


4.1 Principle of Demand Propagation 


In the evaluation process of a function application, 
value tokens are generated as a result of sub-expression 
evaluation. Such tokens are named Partial Resulting 
Values (PRVs). In the following, two important concepts 
regarding PRVs and demands are introduced. 


[Definition 2] Total Requisite Value Set (TRVS) 

For a PRV v, a set of PRVs and/or parameters (here- 
after referred to as values) exists such that the values in 
the set are requisite to the evaluation of v. This set is 
named the Total Requisite Value Set of v. The PRV v 
itself is also included in T(v), where T(v) stands for the 
TRVS of v. 


[Definition 3] Demand-generated Requisite Value Set (DRVS)
Suppose that the function definition contains a function
application of a nonstrict function, and that dv is a
demand arc for the suspended parameter v of this function
application. A set of values would then exist such
that the values in the set become requisite when a demand
is generated on the demand arc dv. This set is named
the Demand-generated Requisite Value Set (DRVS) of dv,
and is indicated by R(dv).

Using the above concepts, the optimization principle 
is described as follows. 
(1) Suppose that PRV v is a result of some function
    application, and that PRV r is an actual parameter
    corresponding to the requisite parameter of the applied
    function. If PRV v becomes requisite, the PRVs in
    T(r) are evaluated in a data-driven manner.
(2) Suppose that PRV s is an actual parameter corresponding
    to the suspended parameter of some function
    application. If this parameter becomes requisite
    in the applied function body, it is signalled to
    the caller of the function through the demand arc
    ds. With this demand signal, the PRVs in R(ds) are
    evaluated in a data-driven manner.


An outline of the computation algorithm for the 
TRVS and the DRVS will be discussed in Subsection 4.3. 


4.2 Example of Demand Propagation 


The optimized demand propagation described in the 
previous subsection is accomplished through the transfor- 
mation of dataflow graphs. This transformation process 
and the operation of the transformed program are intu- 
itively explained using an example. 

Consider the following function:

    p(x,y,z) = { c = f_2(z); i = f_3(y);
                 e = h_2(g_1(y,z), c, g_2(c,i));
                 j = h_1(f_1(x), e, i); return j },

where f_k (k = 1,2,3) are unary strict functions and g_k
(k = 1,2) are binary strict functions. In addition, the
PDPS of the nonstrict function h_k(x,y,z) (k = 1,2) is
assumed to be {{x,y},{x,z}}. All values are assumed to
be scalar.

Figure 2 shows a data-driven version of the function
codes. For each PRV in this figure, a unique name is
attached. All parameters for p should be evaluated prior to
the function application. At the beginning of the
transformation, a demand arc is created for each requisite
parameter. Figure 3 shows this partially transformed graph. The
PDPS of p(x,y,z) is computed to be {{x,y},{x,y,z}}.
Therefore, x and y are requisite parameters, and z is a
suspended parameter. In this figure, the demand arc for
z is represented as dz, a naming convention which will be
used throughout this paper. Nonstrict functions h_1 and
h_2 should also be transformed in the same way as function
p. Strict functions require such transformation only
if they contain nonstrict functions.

The completely transformed program is shown in Fig. 
4. In the figure, a new primitive node, “Demands Merge, 
Value Distribution” (DMVD), is introduced. This node 
corresponds to the combination of a d-union operator!!? 
and a distribution node. The DMVD operator accepts 
multiple (demand input, value output) arc pairs as well
as one (demand output, value input) arc pair. The operator
receives at most one demand token from each demand
input arc. After receiving the first demand token, the de- 
mand token is forwarded through its demand output arc. 
When a value comes to its value input arc, it is distributed 
to all value output arcs that have demanded the value 
through associated demand arcs. If the value is available 
when a demand comes, it is immediately forwarded to the 
associated value arc. Another new primitive is the gate
(represented as G in the figure). The gate primitive
is defined as a strict function, gate(x,y) = x. The arc
corresponding to parameter y of gate(x,y) is called the
control arc (described as a horizontal arc in the figure).
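
The behaviour of the two new primitives can be modelled in a few lines. This is a software sketch of the token rules stated above, not the hardware node: the class name `DMVD` mirrors the text, but its methods, the `delivered` lists standing in for value output arcs, and the callback `send_demand_upstream` are assumptions of the example.

```python
class DMVD:
    """Demands Merge, Value Distribution node."""

    def __init__(self, n_consumers, send_demand_upstream):
        self.demanded = [False] * n_consumers
        self.delivered = [[] for _ in range(n_consumers)]
        self.send_demand_upstream = send_demand_upstream
        self.demand_forwarded = False
        self.has_value = False
        self.value = None

    def demand(self, i):
        """A demand token arrives on demand input arc i."""
        if self.demanded[i]:
            raise ValueError("at most one demand token per demand input arc")
        self.demanded[i] = True
        if self.has_value:
            # Value already available: forward it immediately.
            self.delivered[i].append(self.value)
        elif not self.demand_forwarded:
            # Only the first demand is forwarded on the demand output arc.
            self.demand_forwarded = True
            self.send_demand_upstream()

    def receive_value(self, value):
        """A value arrives on the value input arc."""
        self.has_value = True
        self.value = value
        for i, wanted in enumerate(self.demanded):
            if wanted:
                self.delivered[i].append(value)


def gate(x, y):
    """gate(x,y) = x. Python evaluates both arguments before the call,
    so the function is strict in x and in the control input y."""
    return x


upstream = []
node = DMVD(3, lambda: upstream.append("demand"))
node.demand(0)
node.demand(2)            # merged: no second demand goes upstream
node.receive_value(7)     # distributed to consumers 0 and 2 only
node.demand(1)            # late demand is answered at once
```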

When function p is applied, PRV j, the final PRV to
be returned, is requisite. In addition, T(j) = {x,y,a,j}.
Therefore, parameters x and y are passed on in evaluated
forms, and the PRVs a and j are computed in a data-driven
manner. When the second parameter of function
h_1 becomes requisite, demand de is generated. Since de is
connected to dz, the suspended parameter z is evaluated
and passed on. The PRVs b, c and e are then evaluated
in a data-driven manner because R(de) = {b,c,e,z}.


4.3 Outline of Demand Propagation Algorithm 


Since the demand propagation algorithm explana- 
tions are considerably lengthy and complicated, the al- 
gorithm is explained only intuitively in this paper. For 
details, please refer to [11]. 

The demand propagation algorithm consists of four 
phases. In Phase 1, the PDPS is computed for every func- 
tion defined in a program. This phase is an inter-function 
analysis. In Phase 2, each function is analyzed using 
PDPSs. As a result, the Value Dependency Property Set 
(VDPS) is computed. As will be described below, the 
VDPS is the generalization of the TRVS in the same way 
as requisite parameters are generalized into the PDPS. 
This phase and the succeeding phases are intra-function 
analyses. Phase 3 analyzes each suspended parameter of 
the function application in a function definition, and com- 
putes the DRVS for every demand arc. In Phase 4, de- 
mand arcs are connected to proper points such that each 
demand arc triggers evaluation of values belonging to its 
DRVS. 

In the following, Phases 2-4 are explained using the
function example p(x,y,z) introduced in the previous subsection.
Here, we assume that the PDPSs of all functions
referred to in p have already been computed as a result of
Phase 1.


[Phase 2] The final value returned by function p is PRV
j. In Fig. 2, parameters x, y, z and PRVs a,...,e, i
are also included. However, not all values are required to
compute PRV j, since functions hk(x,y,z) (k = 1,2) are
nonstrict. For example, if function h1 demands the third
parameter, only values x, y, a and i are required to compute
PRV j. Such a set of values is named the Minimally
Sufficient Value Set (MSVS) of PRV j. PRV j itself is included
in each MSVS of j. The set of all MSVSs of j
is called the VDPS of j, and is indicated by V(j). For
example, V(j) in Fig. 2 is computed as

V(j) = {{a,b,c,d,e,i,j,x,y,z},
        {a,b,c,e,j,x,y,z}, {a,i,j,x,y}} .

The intersection of all MSVSs in V(j) gives T(j). As a
result of Phase 2, the VDPS of the final value is computed
for each function in a program. Since the dataflow graph
of a function is acyclic, the VDPS computation algorithm
is similar to the PDPS computation algorithm for a block
expression.


[Phase 3] In Phase 3, the DRVS is computed for each
demand arc. In general, when a demand dv is generated
for value v, some values have already been demanded or
evaluated. The set of these values is named the Pre-demand
Requisite Value Set of dv, and is indicated by Pre(dv).
Similarly, the Post-demand Requisite Value Set is defined
as the set of values demanded / evaluated after demand dv,
and is indicated by Post(dv). The DRVS of dv, namely
R(dv), is computed as

R(dv) = Post(dv) - Pre(dv) .

For example, consider the demand di in Fig. 3. In order
to generate demand di, evaluation of PRV a as well as
parameter x must be finished. In addition, PRV j must be
demanded. Furthermore, parameter y is requisite to compute
PRV j. Therefore, Pre(di) = {a,j,x,y}. Similarly,
Post(di) becomes {a,i,j,x,y}. Therefore, R(di) = {i}.
Similarly, R(de) = {b,c,e,z}, R(dc) = {} and R(dd) = {d,i}.


Let us now discuss the algorithm for Pre(di) and
Post(di) using Fig. 3. The demand di is a demand for
PRV i, and the PRV i is used to compute the PRV j. At
the time the demand di is generated, PRV j has already
been demanded. After this demand, PRV i also becomes
demanded. Assume Vp to be the VDPS of the final PRV
of function p. In this case, Vp is equal to V(j). Then,
each MSVS in Vp represents a possible combination of the
PRVs demanded / evaluated during the function application
of p. Thus, Pre(di) is computed as the intersection of
all MSVSs in Vp that contain PRV j. Similarly, Post(di)
is the intersection of all MSVSs in Vp that contain both
PRVs i and j.
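
The Pre / Post / DRVS computation can be checked mechanically with a few lines of Python. The sketch below is illustrative only. It assumes that V(j) of the running example consists of the three MSVSs {a,b,c,d,e,i,j,x,y,z}, {a,b,c,e,j,x,y,z} and {a,i,j,x,y}, and that the demands dd and dc are issued while PRV e is being computed; the function names are hypothetical.

```python
from functools import reduce

# VDPS of j as read from the running example (assumed to be these three MSVSs).
V_J = [
    frozenset("abcdeijxyz"),
    frozenset("abcejxyz"),
    frozenset("aijxy"),
]

def intersect_containing(vdps, required):
    """Intersect every MSVS that contains all values in `required`."""
    required = frozenset(required)
    chosen = [msvs for msvs in vdps if required <= msvs]
    if not chosen:
        raise ValueError("no MSVS contains %r" % sorted(required))
    return reduce(frozenset.intersection, chosen)

def drvs(vdps, consumer, demanded):
    """Return (Pre, Post, R) for a demand on `demanded` issued while computing `consumer`."""
    pre = intersect_containing(vdps, {consumer})
    post = intersect_containing(vdps, {consumer, demanded})
    return pre, post, post - pre

# T(j): the intersection of all MSVSs of j.
T_J = intersect_containing(V_J, {"j"})

pre_di, post_di, r_di = drvs(V_J, "j", "i")   # demand di
_, _, r_de = drvs(V_J, "j", "e")              # demand de
_, _, r_dd = drvs(V_J, "e", "d")              # demand dd (assumed consumer: e)
_, _, r_dc = drvs(V_J, "e", "c")              # demand dc (assumed consumer: e)
```

Under these assumptions the sketch yields the sets quoted in the text for T(j), Pre(di), Post(di), R(di), R(de), R(dd) and R(dc).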


[Phase 4] Each demand arc is used as a trigger for the
evaluation of values belonging to its DRVS. In this phase,
the values in the DRVS of each demand are classified by
the availability of requisite values. This algorithm is explained
below using the example presented in Figs. 3 and 4.

Consider the demand de whose DRVS is {b,c,e,z}.
The parameter z is a suspended parameter of function p,
and demand arc dz is associated with z. PRVs b, c and
e cannot be evaluated without parameter z. In addition,
if z is evaluated, the requisite parameters for these PRVs
become available. Therefore, demand arc de is connected
to dz.

Next, consider the demand dd where R(dd) = {d,i}.
PRV d cannot be evaluated without PRV i. In contrast,
the requisite value for PRV i, namely parameter y, is
passed on in an evaluated form. Thus, the function application
f3(y) is demand expanded using demand arc dd.

For the demand dc, the DRVS is empty. This
means that although dc is a demand for PRV c, the value
is already requisite even before it is demanded. Therefore,
a gate node G is inserted as shown in Fig. 4. The
parameter passing of PRV c is deferred by this node until
demand dc is generated.

If two or more demand arcs are connected to one
node, a DMVD node is inserted for merging demands and
distributing the resulting value. For example, in Fig. 3,
function application f3(y) should be demand expanded
by both demands dd and di. Therefore, a DMVD node is
inserted in Fig. 4.


5. Recipe Creation Elimination Algorithm 


5.1 Extended Dependency Property Set 


Parameters for primitives for arithmetic, logical and 
relational operations such as “+”, “and”, “>” are all 
requisite and scalar. By contrast, nonstrict data construc- 
tors such as lazy cons require all parameters to be requi- 
site and nonscalar (i.e., recipes should always be passed). 
In this section, we describe the computation algorithm 
for these properties for the class of monotonic recursive 
functions. 

For non-recursive functions, the recipe lifetime anal- 
ysis is rather self-evident. For example, consider the func- 
tion f such that 


f(x,y,z) = if x > 0 then x+y else cons(z, NIL) fi.


In function f, parameter x is requisite, and param- 
eters y and z are suspended. Furthermore, x and y are 
scalar, and z is nonscalar. 

An Extended Dependency Property Set (EDPS) no- 
tation is introduced in order to describe both requisiteness 
and recipe lifetime properties in one data structure. The 
EDPS is defined as a set of Extended MSPSs (EMSPSs). 
The EMSPS is a set of ordered pairs (value name, recipe 
lifetime property). For example, E(f(x,y,z)), the EDPS
of the above function f(x,y,z), is described as

E(f(x,y,z)) = {{(x,S),(y,S)}, {(x,S),(z,N)}},

where (x,S) and (z,N) mean that value x is scalar and
value z is nonscalar, respectively.


5.2 EDPS Computation for Recursive Functions 


Consider the composite function f(x) = f1(f2(x)).
Parameter x is scalar only if the parameters of both f1
and f2 are scalar. For example, parameter x of f(x) =
cons(2 * x, NIL) is nonscalar, whereas the x of f(x) = (2 * x) + 3
is scalar.

For general composite functions, the EDPSs can be 
computed using the above rule and the general PDPS 
reduction rule.!] At this time, the same parameter can 


be used in both scalar and nonscalar form. An example
of this conflict is shown below as

f(x,y,z) = if x > 0 then cons(x,y) else y + z fi.

The EDPS of f(x,y,z) becomes

{{(x,S),(x,N),(y,N)}, {(x,S),(y,S),(z,S)}}

if (x,S) and (x,N) are distinguished.

If the same parameter is classified as scalar and nonscalar
in a single EMSPS, it can be safely classified as
scalar. This is because if the computation of x diverges,
the final value of the function application always becomes
undefined. Therefore,

E(f(x,y,z)) = {{(x,S),(y,N)}, {(x,S),(y,S),(z,S)}}.

If the same parameter is classified as scalar in one EMSPS,
and nonscalar in another EMSPS, the parameter is
classified as nonscalar. This is because a computation may
exist which requires the parameter only in recipe
form. For example, y in the above example is classified as
nonscalar in the "then" part, and as scalar in the "else"
part of the conditional expression. The first EMSPS corresponds
to the "then" part, and the second to the "else"
part. Thus, the y of f(x,y,z) is classified as nonscalar.
As a result, the parameters of f(x,y,z) are classified as

Requisite : {x,y},   Suspended : {z},
Scalar    : {x,z},   Nonscalar : {y}.
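
The two classification rules can be written down compactly. The following Python fragment is a minimal sketch, assuming an EDPS is represented as a list of EMSPSs and each EMSPS as a list of (name, property) pairs; this representation and the function names are illustrative assumptions, not the paper's data structures.

```python
S, N = "S", "N"

def normalize_emsps(pairs):
    """Rule 1: inside one EMSPS, a value marked both S and N is treated as scalar."""
    merged = {}
    for name, prop in pairs:
        merged[name] = S if S in (prop, merged.get(name)) else N
    return merged

def classify(edps, params):
    """Derive requisite/suspended and scalar/nonscalar parameter sets from an EDPS."""
    emspss = [normalize_emsps(e) for e in edps]
    # A parameter is requisite when it appears in every EMSPS.
    requisite = {p for p in params if all(p in e for e in emspss)}
    # Rule 2: nonscalar in any EMSPS means nonscalar overall.
    nonscalar = {p for p in params if any(e.get(p) == N for e in emspss)}
    return {
        "requisite": requisite,
        "suspended": set(params) - requisite,
        "scalar": set(params) - nonscalar,
        "nonscalar": nonscalar,
    }

# f(x,y,z) = if x > 0 then cons(x,y) else y + z fi
edps_f = [
    [("x", S), ("x", N), ("y", N)],   # "then" part
    [("x", S), ("y", S), ("z", S)],   # "else" part
]
result = classify(edps_f, ["x", "y", "z"])
```

For the example function, `result` reproduces the classification listed above.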

The EDPS of recursive functions can be computed 
similarly to the PDPS computation in another work.!®! 
The only difference is the reduction domain, which is not 
PDPS but EDPS. To show an example, consider the fol- 
lowing functions which create an infinite sequence of fac- 
torial values: 


seqf(n) = str_cons(fact(n), seqf(n + 1)) ;
fact(n) = if n < 1 then 1 else n * fact(n - 1) fi,

where str_cons is a stream-oriented asymmetrical cons
operator whose EDPS is {{(x,S), (y,N)}}. The EDPSs
of these functions are found to be

E(seqf(n)) = {{(n,S)}} and E(fact(n)) = {{(n,S)}}.


5.3 Recipe Creation Elimination based on EDPSs 


As described in Subsection 3.2, scalar parameters are 
passed on in an evaluated form. For example, a graph for 
functions seqf and fact described in Subsection 5.2 is
shown in Fig. 5. The parameter n of function fact is req- 
uisite and scalar. Therefore, in recipe A in Fig. 5, the 
expression n + 1 is evaluated before it is passed on to the 
function body. In the figure, the “create recipe” node is 
a macro-node which accepts the pointer of the machine 
code for an expression as well as the environment for the 
expression. It then creates the recipe, and returns its 
pointer. The “replace recipe with value” node is also a 
macro-node which replaces the recipe with the evaluated 
value. After replacement, the value is immediately avail- 
able for any expression that refers to the recipe without 
repetitive evaluation. An example of a dataflow machine
implementation of these macro-nodes as well as the lazy_cons
/ str_cons functions is described elsewhere.[12]


(a) Function seqf 


5.4 Autonomousness of the EDPS 


When only the function application method is concerned,
the EDPS can be said to be redundant for the scalar /
nonscalar distinction. For example, for function f(x,y,z)
in Subsection 5.2, the EDPS

{{(x,S),(y,N)}, {(x,S),(y,S),(z,S)}}

can be reduced to {(x,S),(y,N),(z,S)}.

However, such a reduced property is problematic in
that it is not autonomous. As an example, consider the
functions f1(x,y,z) and f2(x,y,z), whose reduced properties
are both {(x,S),(y,N),(z,S)}:


fi(z,y,z) = ifc>O then x+y else 

if z > 0 then cons(z, z) else cons(y, z) fi fi; 
fe(z,y,z) = if c > 0 then cons(z,y) else 

if z>0O then ++2 else y+ fi fi. 


For the function application f1(a,b,b), actual parameter
b is scalar, since b is used as scalar in both the
"then" part and the "else" part of the outermost conditional
expression. However, b is nonscalar in f2(a,b,b).
Such a distinction cannot be drawn when the reduced
property is used. In contrast, when the EDPS is used, these
cases are correctly distinguished. The EDPSs of f1 and
f2 are

E(f1(x,y,z)) = {{(x,S),(y,S)}, {(x,S),(y,N),(z,S)}}
E(f2(x,y,z)) = {{(x,S),(y,N)}, {(x,S),(y,S),(z,S)}} .

Therefore, the EDPS of f1(a,b,b) becomes {{(a,S),(b,S)}},
indicating that b is scalar. In contrast, the EDPS
of f2(a,b,b) becomes {{(a,S),(b,N)}, {(a,S),(b,S)}},
confirming that b is nonscalar.
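
The difference between the EDPS and the reduced property can be reproduced with a short substitution routine. This is a sketch under assumptions: EMSPSs are modelled as Python sets of (name, property) pairs, and `apply_actuals` and `is_nonscalar` are hypothetical helper names introduced only for illustration.

```python
def apply_actuals(edps, binding):
    """Rename formal parameters to actual ones; inside an EMSPS, S overrides N."""
    result = set()
    for emsps in edps:
        merged = {}
        for formal, prop in emsps:
            actual = binding[formal]
            merged[actual] = "S" if "S" in (prop, merged.get(actual)) else "N"
        result.add(frozenset(merged.items()))
    return result

def is_nonscalar(edps, name):
    """A value is nonscalar if some EMSPS needs it in recipe form."""
    return any((name, "N") in emsps for emsps in edps)

E_f1 = [{("x", "S"), ("y", "S")}, {("x", "S"), ("y", "N"), ("z", "S")}]
E_f2 = [{("x", "S"), ("y", "N")}, {("x", "S"), ("y", "S"), ("z", "S")}]
reduced = [{("x", "S"), ("y", "N"), ("z", "S")}]   # shared reduced property

binding = {"x": "a", "y": "b", "z": "b"}           # the applications f1(a,b,b), f2(a,b,b)
app_f1 = apply_actuals(E_f1, binding)
app_f2 = apply_actuals(E_f2, binding)
app_reduced = apply_actuals(reduced, binding)
```

Here `app_f1` collapses to a single EMSPS in which b is scalar, `app_f2` keeps an EMSPS with (b, N), and `app_reduced` gives the same answer for both functions, which is why the reduced property cannot separate the two cases.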

The inter-function level properties used in this paper,
namely requisiteness and recipe lifetime, can be described
using only the EDPS. Since the EDPS is autonomous for
both requisiteness and recipe lifetime properties, the computation
required for inter-function optimization can be
significantly reduced.

(b) Function fact
Fig. 5 Program Example of Infinite Sequence


6. Conclusion 


This paper proposed a new program transformation 
algorithm for functional programs, named the mixed- 
mode fixpoint evaluation method. This method realizes 
optimized demand-driven computation on a dataflow ma- 
chine. The evaluation scheme takes advantage of both 
data-driven and demand-driven evaluations in that it of- 
fers sufficient parallelism and safeness for nonstrict se- 
quential functions. This can be achieved by eliminating 
demand propagation predictable during compilation time, 
and by replacing lengthy recipe creation and evaluation 
incurred in demand-driven evaluation with the call-by- 
value method. 

The optimization algorithm makes use of two kinds 
of properties, namely, requisiteness and recipe lifetime. 
These properties are described using an Extended De- 
pendency Property Set (EDPS). Since the EDPS is au- 
tonomous, the computation required for the optimization 
can be significantly reduced. 

The algorithm proposed in this paper assumes a lazy 
synchronization mechanism, such as that used in a Struc- 
ture Memory for storing structured data. In combination 
with this kind of intelligent operation memories, the pro- 
posed algorithm realizes the fixpoint computation rule for 
the class of sequential functions on nonstrict data struc- 
tures. 

This paper intentionally leaves the machine architecture
abstract. Further research will be carried out on
the construction of an effective and high-speed dataflow
machine that can execute all operations described in this
paper.
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Optimizing Matrix Operations on a Parallel Multiprocessor with a Hierarchical Memory System 
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ABSTRACT - Memory organizations of supercomputers 
(CRAY 2, CEDAR) tend to become more and more complex, 
and correspondingly data management in these memories 
becomes a crucial factor for achieving high performance. We 
study here an architecture combining vector and parallel 
capabilities on a two-level shared memory structure. For this 
class of architecture, we analyze and optimize matrix multi- 
plication algorithms so as to obtain high efficiency kernels 
which can be used for many numerical algorithms such as LU 
and Cholesky factorizations, as well as Gram-Schmidt and 
Householder orthogonal factorization schemes. The performance
of such kernels on the Alliant FX/8 multiprocessor is
described in detail.


1. INTRODUCTION 


One of the main issues in supercomputer architecture is 
the design of a memory system that is able to provide the 
arithmetic operators with operands at a sufficient rate. 
Several solutions have been proposed; first fully parallel 
memories (implemented either in a global way: BSP or in a 
distributed one: ILLIAC IV or more recently Hypercube 
machines [15]), followed by highly interleaved memories 
(CRAY, CDC CYBER 205, FUJITSU) and now complex 
hierarchical memory systems which consist of two levels - a 
first level of small fast memory and a second level of slower 
larger memory (which is, in fact, generally interleaved itself). 
Implementations of these multilevel memory organizations for 
parallel processors have been studied [5][17] and are currently 
being used in several supercomputers; the ST-100 provides a 
fully programmable cache, the CONVEX C1 and ALLIANT
FX/8 a hardware managed cache, and the CRAY 2 a pro- 
grammable local memory for each processor. 


While the use of global parallel or interleaved memories
only slightly affects the design of numerical algorithms (the
main constraint is to avoid accessing vectors with increments
that are a power of two, in order to avoid memory bank
conflicts [11]), the use of distributed or hierarchical memories
requires a careful design of algorithms; more precisely, data
management becomes a key factor in realizing high performance.
Significant progress has been achieved in developing
efficient algorithms for the distributed memory case and in
understanding the impact of the communication medium on
the algorithm design [7][13][14]. On the other hand, much
work remains to be done in the case of a hierarchical memory
system, especially when it is combined with both vector and
parallel capabilities (e.g. ALLIANT FX/8). For those architectures,
algorithms must be designed not only to be suitable
for vector and parallel processing but also to provide good
data locality. These requirements may be contradictory. For
example, increasing the vector length may destroy data locality
and so give poor performance. In this paper, our goal
is to study the tradeoffs involved in designing high-performance
algorithms for such architectures.


To reduce the complexity of the problem, the decomposition
of algorithms into high level modules (e.g.
vector/vector, matrix/vector and matrix/matrix operations)
seems to be a promising approach. This technique has been
successfully used on vector machines, e.g. BLAS [3], and on
MIMD machines (extended BLAS). In our case, the use of
BLAS (vector/vector operations) or even extended BLAS
(matrix/vector operations) may not be efficient since they
mainly contain primitives involving an amount of data of the
same order as the number of floating point operations; for
example DAXPY (one of the BLAS) operating on vectors of
length N will manipulate 3N+1 data elements for executing
only 2N operations. This may often result in an inefficient
use of the cache, since less than one floating point operation
per data element accessed is performed. However, multiplying
two square matrices of order N involves $3N^2$ data elements
for $(2N-1)N^2$ operations, so data elements fetched from the
memory can be used several times before they are stored back
again. We study this primitive in depth since many algorithms
may be expressed in terms of matrix/matrix operations,
e.g. the Modified Gram-Schmidt algorithm [10] or the
Householder reduction [2][8]. Using the methods described
here instead of the conventional BLAS within these two
algorithms improved their performance on the ALLIANT
FX/8 by a factor of two to three.
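
The contrast between the two primitives is a matter of simple arithmetic, sketched below in Python. The counts are an assumption of this sketch: DAXPY is taken to touch the two input vectors, the result vector and one scalar, and the matrix product is taken to touch each of the three matrices once. The function names are hypothetical.

```python
def daxpy_counts(n):
    """Return (data elements, floating point operations) for y = y + a*x on length-n vectors."""
    return 3 * n + 1, 2 * n

def matmul_counts(n):
    """Return (data elements, floating point operations) for a product of two n x n matrices."""
    return 3 * n * n, (2 * n - 1) * n * n

def flops_per_element(counts):
    data, flops = counts
    return flops / data

for n in (32, 128, 512):
    print(n, flops_per_element(daxpy_counts(n)), flops_per_element(matmul_counts(n)))
```

The DAXPY ratio stays below one for every N, while the matrix product ratio grows as (2N-1)/3, which is the reuse that a cache can exploit.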


In the first section, we describe our target architecture. 
In the following sections we develop and analyze algorithms 
for multiplying large matrices. Finally, we present some com- 
putational results on an Alliant FX/8 that support our 
theoretical results in the preceding sections. 


2. THE TARGET ARCHITECTURE 


In this section, we describe briefly the main features of
the architecture we plan to study. Our target machine consists
of p vector processors (CE = computational element)
sharing a memory which is organized in two levels: first a fast 
small one (cache), second a slower bigger one (main memory). 
The processors are connected to the cache through a network 
allowing each of them to read or write any of the cache loca- 
tions. The set of processors is provided with a synchroniza- 
tion mechanism enabling it to distribute the computations of 
a single program unit among the p CE’s. 


A good example of the architecture described above is
the ALLIANT FX/8. This machine consists of p=8 pipelined
CE's, each of them having a register-oriented architecture
(vector register length: 32 elements). The CE's share the


physical memory as well as a 16K-word cache. The 
bandwidth between main memory and CE’s is half of that 
between the cache and the CE’s. However, due to managing 
cache misses, accessing a vector from main memory may 
require a time larger than twice the time required for access- 
ing the same vector if it is present in cache. 


3. DESCRIPTION OF THE MATRIX MULTIPLI- 
CATION ALGORITHM 


For the sake of simplicity, we consider here only the
matrix operation $C = C + A*B$. The similar cases
$C = A*B$ and $D = C + A*B$ can be easily derived.


In order to evaluate the matrix multiplication algo- 
rithms it is necessary to model the conditions under which we 
assume they will be executed. We suppose that the replace- 
ment policy of the cache blocks is optimal (similar to Belady’s 
MIN algorithm [1]); this assumption yields a standard of com- 
parison on which the design of several algorithms can easily 
be based. 


Previous studies [12][16] have shown that data locality 
can be improved efficiently by partitioning matrices into 
square submatrices. This method was primarily used for 
reducing the number of page faults on a sequential computer 
using a virtual memory system (first level: transistor memory, 
second level: disk memory). Our case presents two major 
differences: first the two different kinds of parallelism avail- 
able in our machine and second the management of data 
motion between the two memory levels. In a virtual memory 
system, the mapping of data arrays into pages is crucial and 
leads to use square submatrices each of which can be con- 
tained in one page. We do not restrict ourselves to square 
submatrices and we will even see later that the best perfor- 
mance is obtained by using "very" rectangular blocks. 


Let the $n_1 \times n_2$ matrix $A$ be partitioned into $m_1 \times m_2$
blocks $A_{ij}$, the $n_2 \times n_3$ matrix $B$ into $m_2 \times m_3$ blocks $B_{ij}$, and
the $n_1 \times n_3$ matrix $C$ into $m_1 \times m_3$ blocks $C_{ij}$.

The matrix multiplication is performed as follows:

do i=1,k1
   do j=1,k3
      do k=1,k2
         C_ij = C_ij + A_ik * B_kj
      end do
   end do
end do

where $n_1 = k_1 m_1$, $n_2 = k_2 m_2$ and $n_3 = k_3 m_3$ with $k_1$, $k_2$ and $k_3$
being integers greater than 1.


Fig. 1. Partitioning of A, B and C

The block operations $C_{ij} = C_{ij} + A_{ik} * B_{kj}$ contain a
reasonable amount of potential parallelism, so our algorithms
proceed by first partitioning the matrices and then executing
the block operations one after another, with parallelism and
vectorization exploited inside the multiplication of two submatrices.
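
As an illustration of the partitioned scheme, the following pure-Python sketch performs $C = C + A*B$ block by block with the k-loop innermost. It is a sequential model only: the comments mark where concurrency and vectorization would be applied on a real machine, and the function name, the list-of-lists storage and the loop order inside a block are assumptions of this sketch rather than the authors' kernel.

```python
def blocked_matmul_update(A, B, C, m1, m2, m3):
    """Compute C = C + A*B using m1 x m2, m2 x m3 and m1 x m3 blocks (in place)."""
    n1, n2, n3 = len(A), len(B), len(B[0])
    if n1 % m1 or n2 % m2 or n3 % m3:
        raise ValueError("block sizes must divide the matrix dimensions")
    for i in range(0, n1, m1):              # block row of A and C
        for j in range(0, n3, m3):          # block column of B and C
            for k in range(0, n2, m2):      # inner loop: C_ij stays resident
                # Block operation C_ij = C_ij + A_ik * B_kj.
                for r in range(j, j + m3):          # one column of B_kj per processor
                    for t in range(k, k + m2):      # linear combination of m2 columns
                        b = B[t][r]
                        for s in range(i, i + m1):  # vector operation of length m1
                            C[s][r] += A[s][t] * b
    return C
```

Calling `blocked_matmul_update(A, B, C, 2, 3, 2)` on a 4 x 6 matrix A and a 6 x 4 matrix B gives the same result as the unblocked triple loop; only the order in which the data are touched changes.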


Several points with respect to the above algorithm need
to be addressed. These are:

(1) the order in which the submatrix computations are performed,

(2) the algorithm used for multiplying two submatrices of A
and B, and

(3) the sizes of the submatrices.


Concerning the first point, the three loops may be inter- 
changed leading to a different sweep of the matrices. We will 
only consider the defined case where the inner loop is the k- 
loop. The cases where the i-loop or the j-loop are the inner 
loops can be derived accordingly. 


The second and the third points are strongly related. 
The choice of submatrix sizes is obviously vital since poten- 
tial parallelism is a direct function of these sizes. On the 
other hand, a bad design of the algorithm (second point) may 
prevent using the submatrix sizes which provide a good data 
locality. 


The total time required to perform the matrix multiplication
can be expressed as:

$$T = l \, T_l + n_s T_s$$

where $l$ denotes the total number of loads executed directly
from the memory, $T_l$ the time of one load, $n_s$ the number of
submatrix multiplications that have to be performed, and $T_s$
the time for one submatrix multiplication with the three submatrices
kept in the cache. The first term represents the total
time spent in fetching data from the memory (pure transfer
time) while the second represents mainly the time spent in
computations (pure computation). Each of them is a complex
function of the three parameters $m_1$, $m_2$ and $m_3$, and trying
to minimize their sum is difficult. To overcome this
difficulty, we decouple the problem into two independent subproblems:
minimization of the transfer time and minimization
of the computation time. For each, we determine a region in
the parameter space where the value of the cost function is
close to the minimum. By choosing a set of parameters within
the intersection of the two regions, near-optimal performance
can be achieved in most cases.


4. COMPUTATION TIME OPTIMIZATION 
Let us describe the algorithm for multiplying $A_{ik}$ and
$B_{kj}$:

do r=1,m3
   do s=1,m1
      do t=1,m2
         c_sr = c_sr + a_st * b_tr
      end do
   end do
end do

where $c_{sr}$, $a_{st}$ and $b_{tr}$ denote the elements of $C_{ij}$, $A_{ik}$
and $B_{kj}$, respectively.

We introduce parallelism by performing the r-loop concurrently
on all p processors, using vectorization for the s-loop
in each of them. Each processor computes an $m_2$-adic
operation, the product of the submatrix $A_{ik}$ with a column of
$B_{kj}$.

Let us analyze the effect of varying the three parameters
on the performance of this kernel, in order to determine general
guidelines for the best choice of $m_1$, $m_2$ and $m_3$:

(1) Since the operations on the columns of the submatrices
of A are vector operations of length $m_1$, $m_1$ should be
chosen as large as possible. For CRAY-like computers,
however, it is preferable to choose $m_1$ as a multiple of
the length of a vector register.

(2) Since each processor computes a linear combination of
$m_2$ columns of $A_{ik}$, we observe here the effect of the
$m_2$-adic operation. The performance is an increasing
function of $m_2$, and as shown in [3][4] performance close
to the peak is obtained for small values (less than 50) of
$m_2$; for example, an optimal value of 16 has been
observed for CRAY, and 32 for the ALLIANT FX/8.

(3) Since we operate simultaneously on the columns of the
submatrices of B, the peak performance is reached when
$m_3$ is a multiple of the number of processors, i.e. the
computational load is perfectly distributed among the
processors.


Another possibility would be to introduce concurrency
as well as vectorization for one loop, e.g. the s-loop. The
disadvantage of this method is that a large value of $m_1$ may
be required to achieve good performance in the submatrix
operations. This tends to conflict with the data locality
requirements.

We will not consider the problem of how to obtain a
minimal $T_s$ any further here as it is too computer dependent.
A practical approach for the ALLIANT FX/8, however, is
given in [10]. A more general approach might be to use a more
precise parametrization of vector and parallel processors as in
[9]; however, it would only slightly modify our main results.


5. MINIMIZATION OF THE TRANSFER TIME 


In this section, we determine the optimal values for submatrix
sizes in order to minimize the number of data loads
from main memory. To do this, it is necessary to express
the number of data loads as a function of the submatrix sizes.
We will consider first the case of large $n_1$, $n_2$ and $n_3$, since
for small $n_1$, $n_2$ and $n_3$ all three matrices can be kept in the
cache.


In order to determine the total number of loads into the
cache by using the given partitioning scheme we assume the
following:

(1) Multiplication of the single blocks $A_{ik}$ and $B_{kj}$ is performed
according to the method described in section 3.

(2) Each block $C_{ij}$ is loaded into the cache only once and
kept there for the duration of the k-loop.

(3) The blocks $A_{ik}$ and $B_{kj}$ have to be loaded each time
they are involved in a submatrix multiplication.

Assumption (2) results from the order of the loops that
we chose: $C_{ij}$ is independent of the inner loop. If we reorder
the loops and choose e.g. the j-loop as the inner loop, $A_{ik}$
could be kept in the cache.

Now the number of loads for computing each submatrix
$C_{ij}$ is $m_1 n_2 + n_2 m_3 + m_1 m_3$, and therefore the total number of
loads for computing $C$ is

$$l = n_1 n_2 n_3 \, \frac{m_1 + m_3}{m_1 m_3} + n_1 n_3 . \qquad (5.1)$$

Note that $l$ is of the same order of magnitude as the number
of floating point operations for this matrix multiplication:
$fl = 2 n_1 n_2 n_3$. This is not the case if the three matrices A, B
and C are small enough to be kept in the cache during the
computation of C, where the number of loads equals the total
number of data elements: $n_1 n_2 + n_2 n_3 + n_1 n_3$.
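
Formula (5.1) can be cross-checked by simply counting block loads in the loop nest under assumptions (2) and (3). The following Python sketch does this; the function names are hypothetical, and the matrix order 768 is chosen only because it is divisible by the block sizes used later for the ALLIANT FX/8.

```python
def loads_by_counting(n1, n2, n3, m1, m2, m3):
    """Count element loads: C_ij once per (i, j), A_ik and B_kj on every use."""
    k1, k2, k3 = n1 // m1, n2 // m2, n3 // m3
    loads = 0
    for _i in range(k1):
        for _j in range(k3):
            loads += m1 * m3                  # C_ij, kept during the k-loop
            for _k in range(k2):
                loads += m1 * m2 + m2 * m3    # A_ik and B_kj
    return loads

def loads_formula(n1, n2, n3, m1, m3):
    """Closed form (5.1); note that m2 does not appear."""
    return n1 * n2 * n3 * (m1 + m3) / (m1 * m3) + n1 * n3

counted = loads_by_counting(768, 768, 768, 96, 32, 128)
closed = loads_formula(768, 768, 768, 96, 128)
```

Both expressions give 8,847,360 loads for this case, against $2 n_1 n_2 n_3 \approx 9.06 \times 10^8$ floating point operations.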


Since $n_1$, $n_2$ and $n_3$ are given, the problem of minimizing
$l$ is reduced to minimizing

$$\rho = \frac{m_1 + m_3}{m_1 m_3} = \frac{1}{m_1} + \frac{1}{m_3} \qquad (5.2)$$

under the constraint

$$m_1 m_2 + m_1 m_3 \le CS \qquad (5.3)$$

where $CS$ denotes the cache size. Inequality (5.3) is justified
by regarding hypotheses (2) and (3), and by assuming that
the submatrix multiplication is performed so that $A_{ik}$ is
swept columnwise. In this case only p row elements of $B_{kj}$ are
used at the same time and are not needed afterwards; hence,
they are neglected in (5.3). If m,—p, A;, can also be 
neglected, but since a small m, will lead to a bad data local- 
ity, as can easily be verified, we concentrate here on the case 
M2>p. 


The solution of (5.2) that can be obtained by assuming
equality in (5.3) gives us the following estimate of the optimal
parameters as a function of $m_2$ and the cache size $CS$:

$$m_1 \approx \frac{CS}{m_2 + \sqrt{CS}}, \qquad m_3 \approx \sqrt{CS} . \qquad (5.4)$$

The optimal $\rho$ is then given by

$$\rho_{opt} \approx \frac{2}{\sqrt{CS}} + \frac{m_2}{CS} . \qquad (5.5)$$

(5.4) requires $m_1$ and $m_3$ to be large, which fits well the constraints
for the computation time. (5.5) indicates that the
effect of $m_2$ is not dominant, allowing us to choose $m_2$ large
enough to get an efficient $m_2$-adic operation.

Note also that enlarging the cache by a factor of 4 leads
to reducing $\rho_{opt}$ only by a factor of 2.
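
The estimates (5.4) and (5.5) are easy to evaluate numerically. The sketch below is illustrative: it assumes that the 16K-word cache of the ALLIANT FX/8 means CS = 16384 words, takes $m_2 = 32$, and uses hypothetical function names.

```python
import math

def optimal_blocks(cs, m2):
    """Evaluate (5.4) and (5.5): return (m1, m3, rho_opt) for cache size cs."""
    root = math.sqrt(cs)
    m1 = cs / (m2 + root)
    m3 = root
    rho_opt = 2.0 / root + m2 / cs
    return m1, m3, rho_opt

def rho(m1, m3):
    """Cost function (5.2)."""
    return 1.0 / m1 + 1.0 / m3

m1, m3, rho_opt = optimal_blocks(16384, 32)   # assumed: 16K words = 16384
print(m1, m3, rho_opt)                        # real-valued optimum
print(rho(96, 128))                           # m1 rounded down to a multiple of 32
print(96 * 32 + 96 * 128 <= 16384)            # constraint (5.3) for the rounded choice
```

Under these assumptions the continuous optimum is $m_1 = 102.4$, $m_3 = 128$; rounding $m_1$ down to the nearest multiple of the vector register length gives the block sizes 96 and 128 with $\rho \approx 0.01823$. Evaluating `optimal_blocks(65536, 32)` shows that a cache four times larger lowers $\rho_{opt}$ by a factor of about 2.1.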


Let us consider the case where one (or two) of the three
parameters $n_1$, $n_2$ and $n_3$ is (are) small. In these cases at
least one of the matrices to be multiplied is either tall and
narrow or short and wide, i.e. "very" rectangular. The idea
here is to choose the small dimensions as submatrix sizes: i.e.
when $n_i$ is small, set $m_i = n_i$, $i$ = 1, 2 or 3. The optimal
parameters $m_1$, $m_2$, $m_3$ and $l$ for these cases can be derived
according to the standard case. It can easily be verified that in
these cases $l$ is close to the lowest achievable bound (i.e. each
matrix has to be loaded at least once). For the cases where
one parameter is small, the optimal parameters and the
corresponding $l$ are given in Table 1.


Table 1. Parameters for rectangular matrices 
(one matrix size small) 


6. COMPUTATIONAL RESULTS 


In this section, we apply the theoretical results obtained
above to the ALLIANT FX/8. We deal only with the case
where $n_1$, $n_2$ and $n_3$ are large.

Let us determine here our optimal partitioning. A good
choice seems to be $m_1 = 96$, $m_2 = 32$ and $m_3 = 128$. Note that
$m_1$ is a multiple of the length of a vector register, $m_2$ the
optimal value for the $m_2$-adic operation and $m_3$ a multiple of
the number of CE's. Fig. 2 shows the experimental results for
our optimal partitioning and for two other partitionings
achieving the same kernel performance as the optimal one but
involving a larger number of loads. We can observe, as
predicted by the theory, a good correspondence between the
decrease of the value $\rho$ ($\rho = 0.15625$ for $m_1 = 32$ and $m_3 = 8$;
$\rho = 0.03125$ for $m_1 = m_3 = 32$; and $\rho = 0.01823$ for $m_1 = 96$ and
$m_3 = 128$) and the improvement in performance. Also note
that our optimal partitioning achieves a constant high performance
for a large range of matrix sizes.


Fig. 2. Performance of the multiplication of two square
matrices of order n (vertical axis: MFLOPS; horizontal axis: n, from 96 to 1024; curve label: ρ = 0.01823)
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Multiprocessor Jacobi Algorithms for Dense Symmetric Eigenvalue and Singular Value Decompositions 


Michael Berry and Ahmed Sameh 
Center for Supercomputing Research and Development 
University of Illinois, Urbana, IL 61801 


Abstract - We present two parallel algorithms
based on Jacobi's method for real symmetric matrices
to determine the complete eigensystem of a dense real
symmetric matrix and the singular value
decomposition of rectangular matrices on a
multiprocessor. Our intent is to study the advantages
of using Jacobi and Jacobi-like schemes over new and
existing EISPACK and LINPACK routines on an
Alliant FX/8 computer system. For the dense
symmetric eigenvalue problem, we show promising
results for small-order matrices. A "one-sided"
Jacobi-like algorithm which produces the singular
value decomposition of a rectangular matrix is shown
to provide superior performance for rectangular
matrices in which the number of rows is much larger
than the number of columns.


1. Introduction 


Consider the standard eigenvalue problem

$$A x = \lambda x \qquad (1.1)$$

where $A$ is a real $n \times n$ dense symmetric matrix. One
of the best known methods for determining all the
eigenvalues and eigenvectors of (1.1) was developed by
the nineteenth century mathematician, Jacobi. We
recall that Jacobi's sequential method reduces the
matrix $A$ to diagonal form by an infinite sequence of
plane rotations

$$A_{k+1} = U_k A_k U_k^T, \quad k = 1, 2, \ldots,$$

where $A_1 = A$ and $U_k = U_k(i, j, \theta_{ij}^k)$ is an orthogonal
plane rotation matrix which deviates from the identity
matrix only in the elements

$$u_{ii}^k = u_{jj}^k = c_k = \cos(\theta_{ij}^k) \quad\text{and}\quad u_{ij}^k = -u_{ji}^k = s_k = \sin(\theta_{ij}^k).$$

The angle $\theta_{ij}^k$ is determined so that $a_{ij}^{k+1} = a_{ji}^{k+1} = 0$. It
can be shown that for $i < j$

$$a_{ij}^{k+1} = a_{ij}^k \cos 2\theta_{ij}^k - \tfrac{1}{2}\,(a_{ii}^k - a_{jj}^k) \sin 2\theta_{ij}^k,$$

and hence the annihilation of $a_{ij}^{k+1}$ requires

$$\tan 2\theta_{ij}^k = \frac{2 a_{ij}^k}{a_{ii}^k - a_{jj}^k},$$

where $\theta_{ij}^k$ is chosen such that $|\theta_{ij}^k| \le \pi/4$.
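For numerical stability, we determine the plane
rotation by

$$c_k = \frac{1}{\sqrt{1 + t_k^2}} \quad\text{and}\quad s_k = c_k t_k,$$

where $t_k$ is chosen as the smaller root (in magnitude)
of the quadratic equation

$$t^2 + 2\alpha_k t - 1 = 0, \qquad \alpha_k = \cot 2\theta_{ij}^k. \qquad (1.2)$$

Hence, $t_k$ may be written as

$$t_k = \frac{\operatorname{sign}(\alpha_k)}{|\alpha_k| + \sqrt{1 + \alpha_k^2}}.$$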


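With each $A_{k+1}$ remaining symmetric and differing
from $A_k$ only in rows and columns $i$ and $j$, the
modified elements are

$$a_{ii}^{k+1} = a_{ii}^k + t_k a_{ij}^k, \qquad a_{jj}^{k+1} = a_{jj}^k - t_k a_{ij}^k, \qquad (1.3)$$

and

$$a_{ir}^{k+1} = c_k a_{ir}^k + s_k a_{jr}^k, \qquad (1.4)$$

$$a_{jr}^{k+1} = -s_k a_{ir}^k + c_k a_{jr}^k, \qquad (1.5)$$

where $r \ne i, j$. If we represent $A_k$ by

$$A_k = D_k + E_k + E_k^T,$$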


where $D_k$ is diagonal and $E_k$ is strictly upper
triangular, and

$$\omega_k = \|E_k\|_F,$$

where $\|\cdot\|_F$ denotes the Frobenius norm, then from
(1.4) and (1.5) it follows that

$$\omega_{k+1}^2 = \omega_k^2 - (a_{ij}^k)^2$$

and

$$\|D_{k+1}\|_F^2 = \|D_k\|_F^2 + 2\,(a_{ij}^k)^2.$$

If at each step $k$ we annihilate an element $a_{ij}^k$ which is
at least of average magnitude, i.e.,

$$(a_{ij}^k)^2 \ge \frac{2\,\omega_k^2}{n(n-1)},$$

then

$$\omega_{k+1}^2 \le \left(1 - \frac{2}{n(n-1)}\right) \omega_k^2$$

and $A_k$ approaches the diagonal matrix
$\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$. Similarly, $(U_k \cdots U_2 U_1)^T$
approaches a matrix whose $j$-th column is the
eigenvector corresponding to $\lambda_j$.

Several schemes are possible for selecting the
sequence of elements $a_{ij}^k$ to eliminate via the plane
rotations $U_k$. Unfortunately, Jacobi's original scheme,
which consisted of sequentially searching for the
largest off-diagonal element, is too time consuming for
implementation on a multiprocessor. Instead, a
simpler scheme in which the off-diagonal elements
$(i,j)$ are annihilated in the cyclic fashion
$(1,2), (1,3), \ldots, (1,n), (2,3), \ldots, (2,n), (3,4), \ldots, (n-1,n)$ is
certainly more suitable. We refer to each sequence of
$n(n-1)/2$ rotations as a sweep. Quadratic convergence for
this sequential cyclic Jacobi scheme has been well
documented, see [13] and [15]. Convergence usually
occurs within 6 to 10 sweeps, i.e. from $3n^2$ to $5n^2$
Jacobi rotations.
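
The sequential cyclic scheme just described can be sketched in a few lines. The following is only an illustrative sketch in plain Python, not the FORTRAN code used on the Alliant FX/8; the function names `rotation` and `cyclic_jacobi`, the default tolerance and the sweep limit are assumptions made for this example. It computes $t_k$ as the smaller root of the quadratic, applies the row updates (1.4) and (1.5) followed by the analogous column updates, and stops when the off-diagonal norm is small relative to the diagonal norm.

```python
import math


def rotation(a_ii, a_jj, a_ij):
    """Return (c, s) for the plane rotation that annihilates a_ij."""
    if a_ij == 0.0:
        return 1.0, 0.0                      # nothing to annihilate
    alpha = (a_ii - a_jj) / (2.0 * a_ij)     # alpha = cot(2*theta)
    sign = 1.0 if alpha >= 0.0 else -1.0
    # smaller root (in magnitude) of t^2 + 2*alpha*t - 1 = 0
    t = sign / (abs(alpha) + math.sqrt(1.0 + alpha * alpha))
    c = 1.0 / math.sqrt(1.0 + t * t)
    return c, c * t


def cyclic_jacobi(matrix, tol=1e-12, max_sweeps=30):
    """Sequential cyclic Jacobi; returns (sorted eigenvalues, sweeps used)."""
    a = [list(map(float, row)) for row in matrix]   # work on a full copy
    n = len(a)
    sweeps = 0
    for sweeps in range(1, max_sweeps + 1):
        for i in range(n - 1):                      # pairs (1,2), (1,3), ...
            for j in range(i + 1, n):
                c, s = rotation(a[i][i], a[j][j], a[i][j])
                for r in range(n):                  # row updates, (1.4)-(1.5)
                    x, y = a[i][r], a[j][r]
                    a[i][r], a[j][r] = c * x + s * y, -s * x + c * y
                for r in range(n):                  # analogous column updates
                    x, y = a[r][i], a[r][j]
                    a[r][i], a[r][j] = c * x + s * y, -s * x + c * y
        off = math.sqrt(sum(a[i][j] ** 2
                            for i in range(n) for j in range(i + 1, n)))
        diag = math.sqrt(sum(a[i][i] ** 2 for i in range(n)))
        if off <= tol * diag:                       # relative convergence test
            break
    return sorted(a[i][i] for i in range(n)), sweeps
```

For the 2 x 2 matrix with rows (2, 1) and (1, 2), a single rotation with $t_k = 1$ already yields the eigenvalues 1 and 3.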

A parallel version of this cyclic Jacobi algorithm 
is obtained by the simultaneous annihilation of several 
off-diagonal elements by a given $U_k$ rather than only
one as is done in the serial version. For example, let
$A$ be of order 8 and consider the orthogonal matrix $U_k$
as the direct sum of 4 independent plane rotations in
Figure 1, where the $c_i$'s and $s_i$'s for $i = 1,2,3,4$ are
simultaneously determined.


Figure 1. Sample $U_k$ of Multiprocessor
Jacobi Algorithm for n = 8.


Then,

$$U_k = R_k(1,3) \oplus R_k(2,4) \oplus R_k(5,7) \oplus R_k(6,8),$$

where $R_k(i,j)$ is that rotation which annihilates the
$(i,j)$ off-diagonal element. If we consider one sweep to
be a collection of orthogonal similarity
transformations that annihilate the element in each of
the $n(n-1)/2$ off-diagonal positions (above the main
diagonal) only once, then for a matrix of order 8 the
first sweep will consist of 8 successive $U_k$'s with each
one annihilating 4 elements simultaneously. For the
remaining sweeps, the structure of each subsequent
transformation $U_k$, $k > 8$, is chosen to be the same as
that of $U_j$ where $j = 1 + (k-1) \bmod 8$. In general,
the most efficient annihilation scheme consists of
$(2r-1)$ similarity transformations per sweep, where
$r = \lfloor (n+1)/2 \rfloor$, in which each transformation
annihilates different $\lfloor n/2 \rfloor$ off-diagonal elements, see
[10]. Although several annihilation schemes are
possible, the Multiprocessor Jacobi Algorithm we
present in the next section utilizes an annihilation
scheme which requires a minimal amount of indexing
for computer implementation. Whereas Brent and
Luk [2] have demonstrated the implementation of
similar Jacobi schemes using systolic arrays on a
multiprocessor, the success of using block-matrix
Jacobi schemes on a linear array of processors has also
been shown (see [14]). However, the algorithms
presented in this paper have yet to be implemented
with blocking techniques suitable for the Alliant FX/8
computer.


2. A Multiprocessor Jacobi Algorithm 


The algorithm we present requires $n$ similarity
transformations per sweep for a dense real symmetric
matrix of order $n$ ($n$ may be even or odd). Each $U_k$ is
the direct sum of either $\lfloor n/2 \rfloor$ or $\lfloor (n-1)/2 \rfloor$ plane
rotations, depending on whether $k$ is odd or even,
respectively.


Algorithm 1.


Step 1 (Apply orthogonal similarity transformations
        via $U_k$ for current sweep)


  1.a For $k = 1, 2, 3, \ldots, n-1$ (serial loop)
      Simultaneously annihilate elements in
      positions $(i,j)$, where

          $i = 1, 2, 3, \ldots, \lceil (n-k)/2 \rceil$
          $j = (n-k+2) - i$

      and for $k > 2$

          $i = (n-k+2), (n-k+3), \ldots, n - \lfloor k/2 \rfloor$
          $j = (2n-k+2) - i$


  1.b For $k = n$
      Simultaneously annihilate elements in
      positions $(i,j)$, where

          $i = 2, 3, \ldots, \lceil n/2 \rceil$
          $j = (n+2) - i$


Step 2 (Convergence test)


  2.a Compute $\|D_k\|_F$ and $\|E_k\|_F$ as in (1.6).


  2.b If $\|E_k\|_F / \|D_k\|_F <$ (tolerance), then stop.
      Otherwise, go to Step 1 to begin next
      sweep.
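
The index ranges of Step 1 can be checked mechanically. The sketch below is an illustration only: the function name `annihilation_pairs` is hypothetical, and the ceiling brackets on the first range and in Step 1.b are an assumption read from the damaged typesetting (they are the choice for which every off-diagonal position is visited exactly once). It lists the 1-based pairs $(i,j)$ handled by each $U_k$.

```python
def annihilation_pairs(n, k):
    """1-based pairs (i, j) annihilated simultaneously by U_k, 1 <= k <= n."""
    pairs = []
    if k < n:                                        # Step 1.a
        for i in range(1, (n - k + 1) // 2 + 1):     # i = 1 .. ceil((n-k)/2)
            pairs.append((i, (n - k + 2) - i))
        if k > 2:
            for i in range(n - k + 2, n - k // 2 + 1):   # up to n - floor(k/2)
                pairs.append((i, (2 * n - k + 2) - i))
    else:                                            # Step 1.b, k = n
        for i in range(2, (n + 1) // 2 + 1):         # i = 2 .. ceil(n/2)
            pairs.append((i, (n + 2) - i))
    return pairs


def sweep(n):
    """All n groups of one sweep, in the serial order k = 1, ..., n."""
    return [annihilation_pairs(n, k) for k in range(1, n + 1)]
```

For n = 8, `annihilation_pairs(8, 1)` gives (1,8), (2,7), (3,6), (4,5), the same four pairs quoted for the first transformation in Section 3. The groups alternate between 4 and 3 rotations, 28 in total, and no row or column index appears twice within a group, which is what lets each processor update its own pair without synchronization.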


The annihilation pattern for n = 8 is shown in Figure
2, where the integer k denotes an element annihilated
via $U_k$.


Figure 2. Annihilation Scheme for
Multiprocessor Jacobi Algorithm.


In the annihilation of a particular $(i,j)$ element in
Step 1 above, we update the off-diagonal elements in
rows and columns $i$ and $j$ as specified by (1.4) and
(1.5) in Section 1. With regard to storage
requirements, it would be advantageous to modify
only those row or column entries above the main
diagonal and utilize the guaranteed symmetry of $A_k$.
However, to take advantage of the vectorization
supported by the Alliant FX/8 computer system, we
disregard the symmetry of $A_k$ and operate with full
vectors on the entirety of rows and columns $i$ and $j$ in
(1.4) and (1.5), i.e., we are using a full matrix scheme.
To avoid the necessity of synchronization on the
Alliant FX/8, all row changes specified by the $\lfloor n/2 \rfloor$
or $\lfloor (n-1)/2 \rfloor$ plane rotations for a given $U_k$ are
performed concurrently with one processor updating a
unique pair of rows. After all row changes are
completed, we perform the analogous column changes
in the same manner. The product of the $U_k$'s, which
eventually yields the eigenvectors for $A$, is
accumulated in a separate two-dimensional array by
applying (1.4) and (1.5) to the $n \times n$ identity matrix.

In Step 2, we monitor the convergence of the 
algorithm by using the ratio of the computed norms to
measure the systematic decrease in the relative 
magnitudes of the off-diagonal elements with respect 
to the relative magnitudes of the diagonal elements. 
For double precision accuracy in the eigenvalues and 
eigenvectors, a tolerance of order 10°° will suffice for 
Step 2.b. If we assume convergence, this 
multiprocessor algorithm can be shown to converge 
quadratically by following the work of Henrici [5], and 
Wilkinson [15]. In the next section, we discuss the 
adaptation of a "one-sided" Jacobi algorithm to 
compute the singular value decomposition on a 
multiprocessor. 


3. Singular Value Decomposition


Suppose $A$ is a real $m \times n$ matrix with $m \gg n$.
The singular value decomposition of $A$ can be defined
as

$$A = U \Sigma V^T \qquad (3.1)$$

where $U^T U = V^T V = I_n$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$.
The orthogonal matrices $U$ and $V$ define the
orthonormalized eigenvectors associated with the $n$
eigenvalues of $AA^T$ and $A^T A$, respectively. The
singular values of $A$ are defined as the diagonal
elements of $\Sigma$, which are the nonnegative square roots
of the $n$ eigenvalues of $AA^T$.

As indicated in [11] for a ring multiprocessor, 
using a method based on the one-sided iterative 
orthogonalization method of Hestenes (see also [7] and 
[9]) is an efficient way to compute (3.1). Luk 
recommended this scheme for the singular value 
decomposition on the Illiac IV in [8], and systolic
algorithms associated with this scheme have been 
presented in [2] and [1]. We now consider a few 
modifications to the scheme discussed in [11] for the 
determination of (3.1) on the Alliant FX/8. 

Our main goal is to determine an orthogonal
matrix $V$ as a product of plane rotations so that

$$AV = Q = (q_1, q_2, q_3, \ldots, q_n) \qquad (3.2)$$

and

$$q_i^T q_j = \sigma_i^2 \delta_{ij},$$

where the columns of $Q$, $q_i$, are orthogonal, and $\delta_{ij}$ is
the Kronecker delta. We then may write $Q$ as

$$Q = U \Sigma \quad\text{with}\quad U^T U = I_n,$$

and hence

$$A = U \Sigma V^T.$$

Thus, we can obtain (3.1) if we can determine the
matrix $V$ in (3.2).

We construct the matrix $V$ via the $(i,j)$ plane
rotation

$$(a_i, a_j) \begin{pmatrix} c & -s \\ s & c \end{pmatrix} = (\tilde a_i, \tilde a_j), \quad i < j,$$

so that

$$\tilde a_i^T \tilde a_j = 0 \quad\text{and}\quad \|\tilde a_i\| > \|\tilde a_j\|, \qquad (3.3)$$

where $a_i$ designates the $i$-th column of matrix $A$. This
is accomplished by choosing

$$c = \left(\frac{\beta + \gamma}{2\gamma}\right)^{1/2} \quad\text{and}\quad s = \frac{\alpha}{2\gamma c} \quad\text{if } \beta > 0, \qquad (3.4)$$

or

$$s = \left(\frac{\gamma - \beta}{2\gamma}\right)^{1/2} \quad\text{and}\quad c = \frac{\alpha}{2\gamma s} \quad\text{if } \beta < 0, \qquad (3.5)$$

where $\alpha = 2 a_i^T a_j$, $\beta = \|a_i\|^2 - \|a_j\|^2$, and $\gamma = (\alpha^2 + \beta^2)^{1/2}$.
Note that (3.3) requires the columns of $Q$ to decrease
in norm from left to right, and hence the resulting $\sigma_i$
to be in monotonic non-increasing order. To
orthogonalize the columns of $A$ there are certainly
several schemes that can be used to select the order of
the $(i,j)$ plane rotations. Following the annihilation
pattern of the off-diagonal elements in the sequential
Jacobi algorithm mentioned in Section 1, we could
certainly orthogonalize the columns in the same cyclic
fashion and thus perform the one-sided
orthogonalization serially. This process is iterative
since orthogonality between columns established in
one rotation may be destroyed in subsequent
rotations, and convergence is governed by the column
norm ordering in (3.3). Each sweep consists of the
$n(n-1)/2$ plane rotations selected in cyclic fashion.
Due to the similarity with the sequential cyclic Jacobi
algorithm for real symmetric matrices and the
postmultiplication of matrix $A$ only, we refer to this
procedure as a "one-sided" Jacobi method.

By implementing the annihilation scheme of the 
Multiprocessor Jacobi Algorithm presented in Section 
2 as the orthogonalization scheme for the one-sided 
Jacobi method discussed above, we obtain a parallel 
algorithm for computing the singular value 
decomposition on a multiprocessor. For example, let 
n=8 and m >>n so that in each sweep of our One- 
Sided Multiprocessor Jacobi Algorithm we 
simultaneously orthogonalize pairs of columns of A 
according to the diagram in Figure 2. In other words, 
for n =8 we can orthogonalize the pairs (1,8), (2,7), 
(3,6), (4,5) simultaneously with the postmultiplication 
of matrix V, which consists of the direct sum of 4 
plane rotations. In general, each $V_k$ will have the
same form as $U_k$ in Figure 1 so that at the end of any
particular sweep $s_i$ we have

$$V_{s_i} = V_1 V_2 \cdots V_n$$

and hence

$$V = V_{s_1} V_{s_2} \cdots V_{s_t} \qquad (3.6)$$

where $t$ is the number of sweeps required for
convergence. We present a formal description of this
algorithm in the next section.


4. A One-Sided Multiprocessor Jacobi Algorithm


This algorithm is an adaptation of the
Multiprocessor Jacobi Algorithm from Section 2 for
the singular value decomposition of real $m \times n$
matrices, $A$, where $m \gg n$. Each $V_k$ is the direct
sum of either $\lfloor n/2 \rfloor$ or $\lfloor (n-1)/2 \rfloor$ plane rotations,
depending on whether $k$ is odd or even, respectively.


Algorithm 2.


Step 1 (Postmultiply matrix $A$ by orthogonal
        matrix $V_k$ for current sweep)


  1.a Initialize the convergence counter, istop,
      to zero.


  1.b For $k = 1, 2, 3, \ldots, n-1$ (serial loop)
      Simultaneously orthogonalize the column
      pairs $(i,j)$, where $i$ and $j$ are given
      by 1.a in Step 1 of Algorithm 1,
      provided that for each $(i,j)$ we have

      $$\frac{(a_i^T a_j)^2}{(a_i^T a_i)(a_j^T a_j)} > \text{(tolerance)}. \qquad (4.1)$$

      Note: if (4.1) is not satisfied for any
      particular pair $(i,j)$, then istop is
      incremented by 1 and that rotation is not
      performed.


  1.c For $k = n$
      Simultaneously orthogonalize the column
      pairs $(i,j)$, where $i$ and $j$ are given
      by 1.b in Step 1 of Algorithm 1.


Step 2 (Convergence test)


  If istop = $n(n-1)/2$, then compute

      $$\sigma_i = \sqrt{(A^T A)_{ii}}, \quad i = 1, 2, \ldots, n,$$

  and stop.

  Otherwise, go to beginning of Step 1 to start
  next sweep.


In the orthogonalization of columns in Step 1 we
are implementing the plane rotations specified by (3.4)
and (3.5), and hence guaranteeing the ordering of
column norms and singular values upon termination.
Whereas the Jacobi algorithms of Sections 1 and 2
must update rows and columns following each
similarity transformation, this one-sided scheme
performs only postmultiplication of $A$ by each $V_k$ and
hence the plane rotation $(i,j)$ changes only the
elements in columns $i$ and $j$ of matrix $A$. The
changed elements can be represented by

$$\tilde a_i = c\,a_i + s\,a_j \qquad (4.2)$$

$$\tilde a_j = -s\,a_i + c\,a_j \qquad (4.3)$$

where $a_i$ denotes the $i$-th column of matrix $A$, and $c$,
$s$ are determined by either (3.4) or (3.5). Since no row
accesses are required and no columns are
interchanged, one would expect good performance for
this method on a machine such as the Alliant FX/8
which can apply vector operations to compute (4.2)
and (4.3). As with the Multiprocessor Jacobi
Algorithm in Section 2, no synchronization among
processors is required for the implementation on
multiprocessors. Each processor is assigned one
rotation and hence orthogonalizes one pair of the $n$
columns of matrix $A$.
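
A compact way to see how (3.4), (3.5), (4.1), (4.2) and (4.3) fit together is the following sketch. It is an illustration under stated assumptions, not the OMJAC code: the names `one_sided_jacobi` and `dot`, the default tolerance and the sweep limit are invented for the example, the column pairs are visited in the plain serial cyclic order rather than in the parallel ordering of Algorithm 1, and the case $\beta = 0$ is handled with (3.4).

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def one_sided_jacobi(a, tol=1e-24, max_sweeps=30):
    """Return (singular values, columns of Q = A V) for an m x n matrix a."""
    m, n = len(a), len(a[0])
    cols = [[float(a[r][j]) for r in range(m)] for j in range(n)]
    total_pairs = n * (n - 1) // 2
    for _ in range(max_sweeps):
        istop = 0                                   # convergence counter
        for i in range(n - 1):
            for j in range(i + 1, n):
                a_ij = dot(cols[i], cols[j])
                a_ii = dot(cols[i], cols[i])
                a_jj = dot(cols[j], cols[j])
                # test (4.1): skip the rotation if the pair is orthogonal enough
                if a_ij * a_ij <= tol * a_ii * a_jj:
                    istop += 1
                    continue
                alpha = 2.0 * a_ij
                beta = a_ii - a_jj
                gamma = math.hypot(alpha, beta)
                if beta >= 0.0:                     # (3.4)
                    c = math.sqrt((beta + gamma) / (2.0 * gamma))
                    s = alpha / (2.0 * gamma * c)
                else:                               # (3.5)
                    s = math.sqrt((gamma - beta) / (2.0 * gamma))
                    c = alpha / (2.0 * gamma * s)
                ci, cj = cols[i], cols[j]
                cols[i] = [c * x + s * y for x, y in zip(ci, cj)]    # (4.2)
                cols[j] = [-s * x + c * y for x, y in zip(ci, cj)]   # (4.3)
        if istop == total_pairs:                    # Step 2
            break
    sigma = [math.sqrt(dot(q, q)) for q in cols]
    return sigma, cols
```

Only whole columns are read and written, so each rotation is a pair of vector operations of length $m$; this is the access pattern the text expects to suit the vector hardware.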

Following the convergence test used in [9], in
Step 2 we count the number of times the quantity

$$\frac{(a_i^T a_j)^2}{(a_i^T a_i)(a_j^T a_j)} \qquad (4.4)$$

falls, in any sweep, below a given tolerance. The
algorithm terminates when the counter istop reaches
$n(n-1)/2$, the total number of column pairs, after any
sweep. Upon termination, the matrix $A$ has been
overwritten by the matrix $Q$ from (3.2) and hence the
singular values $\sigma_i$ can be obtained via the $n$ square
roots of the diagonal entries of $A^T A$. The matrix $U$ in
(3.1), which contains the left singular vectors of the
original matrix $A$, is readily obtained by scaling the
resulting matrix $A$ (now overwritten by $Q = U\Sigma$) by
the singular values $\sigma_i$, and the matrix $V$, which
contains the right singular vectors of the original
matrix $A$, is obtained as in (3.6) as the product of the
orthogonal $V_k$'s. This product is accumulated in a
separate two-dimensional array by applying the
rotations specified by (4.2) and (4.3) to the $n \times n$
identity matrix. It is important to note that the use
of the ratio in (4.4) is preferable over the use of $a_i^T a_j$,
since this dot product can be necessarily small for
relatively small singular values. On the Alliant FX/8
computer, the number of sweeps required for
convergence using (4.4) ranged from 3 to 8 for a
tolerance of order 10.

Although we have restricted our discussion of the
One-Sided Multiprocessor Jacobi Algorithm to the
singular value decomposition, this method is certainly
applicable for solving the eigenvalue problem (1.1) for
real nonsingular symmetric matrices. If $m = n$, $A$ is a
positive definite matrix, and $Q$ is given by (3.2), it is
not difficult to show that

$$\lambda_i^2 = \sigma_i^2 = q_i^T q_i, \qquad x_i = \frac{q_i}{\lambda_i},$$

where $\lambda_i$ denotes the $i$-th eigenvalue of $A$, $x_i$ the
corresponding normalized eigenvector, and $q_i$ the $i$-th
column of matrix $Q$.

Two advantages of this one-sided Jacobi scheme 
over the "two-sided" Jacobi method discussed in 
Section 2 are that no row accesses are needed and that 
the matrix V need not be accumulated. Before 
discussing the performance of the Jacobi algorithms on 
the Alliant FX/8 computer, we present an overview of 
the architectural and software characteristics of the 
Alliant FX/8 computer in Section 5. 


5. The Alliant FX/8 


Both multiprocessor Jacobi algorithms described 
thus far have been implemented along with several 
new and existing EISPACK and LINPACK routines 
on an Alliant FX/8 computer at the Center for 
Supercomputing Research and Development, 
University of Illinois at Champaign-Urbana. On the 
FX/8 computer, 8 computational elements (CEs) 
deliver 94.4 MFLOPS (millions of floating point 
operations per second) peak single precision (32-bit) 
vector, 46 MFLOPS peak double precision (64-bit) 
vector, and 35 MIPS (millions of instructions per 
second) scalar performance. Each CE is a 
microprogrammed general purpose computer with a 5 
stage pipelined instruction processor; a pipelined 
vector and floating-point unit; eight 64-bit, 32 element 
vector registers; a 16 KB (kilobyte) instruction cache 
and concurrency control hardware. Parallel processing 
is achieved by the application of all processors in the 
computational complex to the execution of a single 
program. 

The memory of the Alliant FX/8 consists of a 
physical memory which is expandable to 64 MB 
(megabyte) in 8 MB modules, and 2 expandable cache 
systems that support the CEs. Four-way interleaving 
on each physical memory module permits any module 
to supply the full memory bus bandwidth of 188 MB- 
per-second on sequential read access (and 150 MB- 
per-second on sequential write access). The cache 
systems use a write-back architecture to reduce 
memory bus traffic, and the computational processor 
cache expands to a four-way interleaved, 128 KB 
physical memory buffer that allows up to eight 
simultaneous accesses in each 170 nanosecond cycle. A 
crossbar interconnect links the computational 
processor cache to the CEs and dynamically connects 
any CE to any cache port at 376 MB-per-second 
sustained bandwidth. 


6. Performance On The Alliant FX/8 


In this section we evaluate the performance of our 
two multiprocessor Jacobi algorithms on the Alliant 
FX/8. For comparison purposes, we refer to the 


Multiprocessor Jacobi Algorithm from Section 2 as 
MUJAC, and the One-Sided Multiprocessor Jacobi 
Algorithm from Section 4 as OMJAC. All experiments 
on the Alliant FX/8 were performed using 64-bit 
double precision floating-point numbers. For Figures 
3 and 4 we solve the dense symmetric eigenvalue 
problem 


$$AX = X\Lambda, \qquad (6.1)$$


where $A = [a_{ij}]$ is $n \times n$ and $a_{ij} = \mathrm{dfloat}[\max(i,j)]$.


In Figure 3 we plot the number of MFLOPS 
achieved by the Alliant FX/8 when MUJAC and 
OMJAC are used to compute eigenvalues and 
eigenvectors for matrices of order n. While MUJAC
and OMJAC are implemented completely in
FORTRAN, MUJAC(A) and OMJAC(A) use
ASSEMBLER subroutines to compute the row and
column updates as well as column inner products.
The decrease in performance for MUJAC and OMJAC
for matrix orders greater than 100 can be largely
attributed to the memory limitations of the
computational processor cache described in Section 5,
i.e., double precision two-dimensional arrays of order
100 or greater cannot be stored contiguously in the
128 KB cache. Although peak performance of 12 and
17 MFLOPS for MUJAC and MUJAC(A) for n = 50 
were much larger than the 11 and 14 MFLOPS for 
OMJAC and OMJAC(A), the variation from the peak 
performance for all n was certainly much smaller for 
the latter pair. Since the number of multiplications 
required by MUJAC to determine and perform a 
rotation (on a full matrix) is twice that of OMJAC, we 
would expect that OMJAC would require only one- 
half the computational time that MUJAC does. This 
was usually the case when both algorithms were used 
to compute eigenvalues and eigenvectors for dense 
symmetric matrices. Although MUJAC is generally 
50% slower than OMJAC, the potential for 
vectorization per rotation is much greater for MUJAC 
and thus we obtain the disparity in MFLOPS between 
the algorithms for n < 200. 
Figure 3. Performance of Jacobi Schemes
on Alliant FX/8.
For Figure 4, let $t_8$, $t_1$, and $t_g$ represent the execution
time on 8 processors with vectorization, 1 processor
with vectorization, and 1 processor with only global
optimization, respectively, and define

$$\sigma_{vc} = \frac{t_g}{t_8} \quad\text{and}\quad \sigma_c = \frac{t_1}{t_8},$$

where $\sigma_{vc}$ represents speed-up for vectorization and
concurrency, and $\sigma_c$ represents speed-up for
concurrency only. From the graphs in Figure 4, we
observe that MUJAC can maintain a larger $\sigma_{vc}$ than
OMJAC for n < 400, and yet $\sigma_c$ for MUJAC
decreases immediately from its peak of 5 at n = 50.
Vectorization is the major reason for this result.
Comparing the values of $\sigma_{vc}$ and $\sigma_c$ for MUJAC and
MUJAC(A), we notice that $\sigma_{vc} \approx 2 \times \sigma_c$, and so we can
increase the speed-up of MUJAC and MUJAC(A) by a
factor of 2 if we use vectorization. However, for
n > 100 we observe the memory limitation of the
computational processor cache by the sharp declines in
$\sigma_{vc}$ and $\sigma_c$ for both MUJAC and MUJAC(A). These
declines are due not only to the cache memory
limitations incurred by large n, but also to the row
accessing that must be done by MUJAC. OMJAC and
OMJAC(A), on the other hand, do not show such
sharp declines in $\sigma_{vc}$ and $\sigma_c$, and so the cache effect is
not nearly as great when only column accesses are
required. For OMJAC and OMJAC(A) we note that
the graphs of $\sigma_{vc}$ and $\sigma_c$ are very similar, and so the
vectorization effect is not as substantial as it was for
MUJAC. However, whereas in the case of MUJAC and
MUJAC(A) we can expect the speed-ups $\sigma_{vc}$ and $\sigma_c$ to
drop significantly for large n, OMJAC and
OMJAC(A) show little variation from their peak
performance for large n. In fact, OMJAC(A) is nearly
ideal in that it maintains a speed-up close to 8, the
number of processors, for all n.
Figure 4. Speed-ups for concurrency with
and without vectorization on Alliant FX/8. 


7. Comparisons With EISPACK and LINPACK 


We now compare the performance in speed and
accuracy of MUJAC and OMJAC with that of new 
and existing EISPACK and LINPACK routines on the 
Alliant FX/8. For the dense symmetric eigenvalue 
problem we first compare MUJAC, OMJAC, and 
TRED2+TQL2 from EISPACK [12]. In order to 
compare a set of highly efficient subroutines that are 
optimized for vector as well as parallel processing, we 
compare MUJAC(A) and OMJAC(A) which use 
ASSEMBLER routines for applying rotations and 
computing dotproducts, with TQL2 and the new 
matrix-vector implementation of TRED2, TRED2V, 
developed by Dongarra et al. in [4]. On the Alliant 
FX/8, MUJAC(A) and OMJAC(A) are 30% faster 
than MUJAC and OMJAC, and TRED2V is on 
average 40% faster than TRED2. In Figure 5, we 
compare the timing of these three algorithms on 8 CEs 
with full optimization (global and vector) for 

$$AX = X\Lambda,$$

where $A = [a_{ij}]$ is $n \times n$ and $a_{ij} = \mathrm{dfloat}[(i+j-1)/n]$.


The largest n for which MUJAC(A) executed
faster than TRED2+TQL2 was 90, while OMJAC(A)
consistently out-performed the other two algorithms
and required only one-half the execution time of the
EISPACK routines for each n. For n > 100 we
exceed the Alliant FX/8's cache memory capability
and consequently obtain sharper increases in time,
especially for MUJAC(A). Convergence for both
MUJAC(A) and OMJAC(A) occurred after 3 to 4
sweeps, and for each n the order of accuracy in the
eigenvalues and eigenvectors computed by these
Jacobi algorithms was identical to that obtained by
TRED2V+TQL2.
Figure 5. Execution Times for Highly
Optimized Routines.

For any symmetric matrix of order greater than 100,
the advantage of using MUJAC (and OMJAC) rather
than TRED2+TQL2 will primarily depend on whether
or not convergence can be obtained within 2 to 3
sweeps. This rate of convergence is more prevalent
not only in the more diagonally dominant matrices
but also in symmetric matrices with a large number of
multiple eigenvalues (see [16]).

For the singular value decomposition of a real
$m \times n$ matrix $A$ ($m \gg n$)

$$A = U \Sigma V^T,$$

where $U^T U = V^T V = I_n$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$,
we compare the speed and accuracy of OMJAC and
OMJAC(A) with that of the appropriate routines from
EISPACK and LINPACK, SVD and DSVDC. Recall
that both routines SVD and DSVDC reduce the
matrix $A$ to bidiagonal form via Householder
transformations and then diagonalize this reduced
form using plane rotations. We will also compare our
results for OMJAC and OMJAC(A) with the new
matrix-vector implementation of SVD, SVDV,
developed by Dongarra et al. in [4], which has been
demonstrated to achieve 50% speed-up in execution
time over SVD on machines such as the CRAY-1.

Suppose that the elements of the $m \times 32$ matrix
$A = [a_{ij}]$ are given by

$$a_{ij} = \mathrm{dfloat}[(i+j-1)/n]$$

for $m \gg 32$, and that we wish to determine all the
singular values and singular vectors of the matrix $A$.
In Figure 6, we present the speed-ups for OMJAC(A)
over each routine used to compute the singular values
and singular vectors of the matrix $A$ on the Alliant
FX/8 with 8 CEs and vectorization for each m. With
regard to accuracy, OMJAC(A) was somewhat less
accurate than SVDV and DSVDC for the smaller
values of m, but very competitive for m > 512.

Figure 6. Speed-ups for OMJAC(A) on Alliant FX/8
in Double Precision.


8. Conclusions 


For solving the dense symmetric eigenvalue 
problem on multiprocessors such as the Alliant FX/8, 
we have presented a parallel algorithm (based on 
Jacobi’s method for symmetric matrices), MUJAC, 
that can produce accurate eigenvalues and 
eigenvectors significantly faster than the popular 
EISPACK routines, TRED2+TQL2, for matrices of 
order less than 100. We have also presented a Jacobi- 
like one-sided multiprocessor algorithm, OMJAC, that 
can be used to solve not only the symmetric eigenvalue 
problem but also the singular value decomposition of 
mXn matrices with m >>n. When used to 
determine eigenvalues and eigenvectors (though 
somewhat less accurate) on the Alliant FX/8, a highly 
efficient implementation of OMJAC using FORTRAN 
and ASSEMBLER, OMJAC(A), executes not only 50% 
faster than MUJAC but also substantially faster than 
the most efficient EISPACK routines, 
TRED2V+TQL2, for all matrix orders considered. 
When compared with the new and existing EISPACK 
and LINPACK routines for computing the singular 
value decomposition, OMJAC(A) is at least 2 to 3 
times faster.

Future work with the Jacobi algorithms 
presented in this paper will involve the use of blocking 
schemes to optimize MUJAC and OMJAC for better 
cache management of the Alliant FX/8 as the order of 
the matrix gets large. 
1. Introduction


The problem of solving a linear system with a real,
symmetric and positive definite matrix of size n on
vector computers of type Cray-1 and Cyber-205 is
considered. Three parallel algorithms based on
simple Gaussian elimination (hereafter referred to
as the GE algorithm), Gaussian elimination with scaled
partial pivoting (PGE algorithm) and the Cholesky
decomposition (LDLT algorithm) are analyzed and
compared in terms of their speedup and efficiency.
Actual performance of these three algorithms on
Cray-1 and Cyber-205 is also included.

A study based on these algorithms for MIMD
(multiple-instruction multiple-data) type machines
is presented by Kumar and Kowalik [4]. Details
about Cray-1 and Cyber-205 architecture can be
found in Hockney and Jesshope [2], Kascic [3] and
Gentzsch [1].


2. The algorithms

Six algorithms to solve a randomly generated
symmetric positive definite matrix (diagonally
dominant) on the vector computers Cray-1 and Cyber-205
are analyzed. These 6 programs are the scalar
and vectorized versions of the GE, PGE and LDLT
algorithms. The algorithms are arranged in a way so
that the vectors are contiguous in the memory.
Thus, 'identical' codes are compared on the two
machines.

The first algorithm is the Gaussian
elimination method (GE) without pivoting. The
factorization part of the algorithm is as follows:

    For k = 1,...,n-1 do;
        For i = k+1,...,n do;
            a(i,k) := a(i,k)/a(k,k)
        For j = k+1,...,n do;
            For i = k+1,...,n do;
                a(i,j) := a(i,j) - a(i,k)*a(k,j)

Note that the i loops of the above algorithm
are the vectorized loops in both machines because
of parallel computations and contiguous memory
locations involved. The backward and forward
substitution in this code are vectorized according
to the "column sweep" method. It is noticed in the
results that whether the substitution is vectorized
or not does not affect the CPU time
much. This is because the $O(n^3)$ operation count of
the factorization dominates the $O(n^2)$ operation
count of the substitutions.

The second algorithm is essentially the same
as the GE algorithm except that we introduce the "scaled
partial pivoting" strategy. In general, the
pivoting should give us a more accurate solution.
Thus, it is useful to investigate the price of a
pivoting strategy on the vector machines. In this
"scaled partial pivoting" strategy, the bookkeeping
is stored in a vector and used later on in the
forward substitution. The pivoting is a scalar
process, therefore in the PGE program only the
factorization and the backward substitution can be
vectorized.

The final algorithm is the Cholesky method
(LDLT). In this decomposition, $A = LDL^T$ where L is a
lower triangular matrix with unit diagonal elements and D is a
diagonal matrix with positive elements. The D matrix is
introduced to avoid the square root computation.
The algorithm is as follows:

    d(1) = a(1,1)
    For i = 2,...,n do;
        L(i,1) = a(i,1)/d(1)
    For k = 2,...,n-1 do;
        d(k) = a(k,k) - sum_{p=1}^{k-1} d(p)*L(k,p)**2
        For i = k+1,...,n do;
            L(i,k) = (a(i,k) - sum_{p=1}^{k-1} L(i,p)*d(p)*L(k,p))/d(k)
    d(n) = a(n,n) - sum_{p=1}^{n-1} L(n,p)**2*d(p)

The i loops are again the vectorized loops for both
computers.

The operation count for sequential GE is $n^3/3$
while it is roughly $n^3/6$ for the Cholesky method.
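
Both factorizations can be written down directly from the loops above. The following plain Python sketch is illustrative only (it is scalar code, not the vectorized Cray-1 or Cyber-205 programs); the function names `ge_factor` and `ldlt_factor` are assumptions made for this example, and indices are 0-based.

```python
def ge_factor(matrix):
    """Gaussian elimination without pivoting.

    Returns a copy holding the multipliers below the diagonal and the
    upper triangular factor on and above it.
    """
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):            # multipliers (vectorizable i loop)
            a[i][k] /= a[k][k]
        for j in range(k + 1, n):
            for i in range(k + 1, n):        # column update (vectorizable i loop)
                a[i][j] -= a[i][k] * a[k][j]
    return a


def ldlt_factor(matrix):
    """Square-root-free Cholesky: returns (L, d) with A = L diag(d) L^T."""
    n = len(matrix)
    d = [0.0] * n
    low = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        d[k] = matrix[k][k] - sum(d[p] * low[k][p] ** 2 for p in range(k))
        for i in range(k + 1, n):
            acc = sum(low[i][p] * d[p] * low[k][p] for p in range(k))
            low[i][k] = (matrix[i][k] - acc) / d[k]
    return low, d
```

For a symmetric positive definite matrix the pivots left on the diagonal by `ge_factor` coincide with the entries of `d`, which is one quick consistency check between the two codes.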


3. Computational Results 


The speed up of an algorithm is defined here
as its sequential CPU time divided by its
vectorized CPU time. The normalized CPU time is
defined as the CPU time divided by the order of the
operation count (here $N^3$). All programs were
tested on a randomly generated positive definite
matrix. Table 1 gives the CPU time (in seconds)
for the GE, LDLT and PGE programs on the Cyber-205
and the Cray-1 for different values of N. From
Table 1 we can see there is not much difference in
the CPU time for the scalar PGE program and the
scalar GE program (within 10% difference). The
speed up values on the Cray-1 never exceed 10 while
the speed up values on the Cyber-205 can be greater
than 20 (e.g. the GE algorithm for N = 200).
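
As a small worked example of these two definitions, the sketch below recomputes one speed up value and one normalized CPU time from the Table 1 timings of the GE algorithm for N = 200. The function names are hypothetical and chosen only for this illustration.

```python
def speed_up(scalar_seconds, vector_seconds):
    """Sequential CPU time divided by vectorized CPU time."""
    return scalar_seconds / vector_seconds


def normalized_cpu_time(seconds, n):
    """CPU time divided by the order of the operation count, N**3."""
    return seconds / float(n) ** 3


# GE algorithm, N = 200 (timings taken from Table 1)
cyber_ge = speed_up(1.853, 0.083)     # Cyber-205
cray_ge = speed_up(1.278, 0.178)      # Cray-1
```

Rounded to one decimal place these give 22.3 and 7.2, the S entries reported for GE at N = 200.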


The speed up factors as functions of the
matrix size N for all three algorithms on both
vector machines are shown in Figure 1. The GE
algorithm has a larger speed up compared to the other
two algorithms on both computers. This may imply
that the GE algorithm has more parallelism in the
computations. The GE algorithm on the Cyber-205
especially has the largest speed up values (e.g.
maximum speed up of 22.3 for N = 200) except when N
is very small. On the Cyber-205, the LDLT has
better speed up than the PGE algorithm due to the
fact that a certain percentage of the code in PGE
cannot be vectorized. The speed up of the LDLT
algorithm is about the same as the speed up of the
PGE algorithm on the Cray-1 (it ranges from 2 to 5
from small N to large N). This is probably because
the Cray-1 has better scalar processing to handle the
pivoting. In general, the Cyber-205 has better
speed up than the Cray-1, especially for large N.
This means there is a substantial difference in
program performance as one changes from the scalar
environment to the vectorized environment on the
Cyber-205. The results are reasonable since the Cyber-205
favors long vector lengths for peak performance
and its scalar codes do not run as fast as their
counterparts on the Cray-1.


Table 1. The CPU time (in seconds) for the vectorized 
and scalar versions of the GE, PGE and LDLT algorithms 
on both vector computers. The S columns are the speed up 
values for the particular algorithm. 


CYBER-205 

              PGE                     LDLT                     GE 
  N   SCALAR  VECTOR    S     SCALAR  VECTOR    S     SCALAR  VECTOR    S 

  50   0.033   0.009   3.7     0.023   0.006   3.8     0.029   0.004   6.7 
 100   0.236   0.037   6.4     0.170   0.025   7.0     0.221   0.018  12.2 
 150   0.765   0.086   8.9     0.56    0.057   9.8     0.73    0.043  16.9 
 200   1.99    0.168  11.8     1.335   0.109  12.2     1.853   0.083  22.3 

CRAY-1 

              PGE                     LDLT                     GE 
  N   SCALAR  VECTOR    S     SCALAR  VECTOR    S     SCALAR  VECTOR    S 

  50   0.025   0.007   3.4     0.015   0.005   3.4     0.022   0.005   4.9 
 100   0.175   0.036   4.9     0.106   0.020   5.2     0.161   0.027   6.1 
 150   0.594   0.092   6.5     0.322   0.053   6.0     0.532   0.074   7.2 
 200   1.381   0.215   6.4     0.74    0.11    6.7     1.278   0.178   7.2 


Figure 1. The speed up values as a function of 
the matrix size N for the GE, PGE and LDLT 
algorithms on both vector computers. The solid 
lines are the results from the Cray-1. The dashed 
lines are the results from the Cyber-205. 


Figure 2 illustrates the normalized CPU time as 
a function of the matrix size N. The upper four 
curves are the results of the scalar codes on both 
computers. The PGE results are so close to the GE 
results that they are not shown here. The scalar 
LDLT and GE programs run faster on the Cray-1 than 
on the Cyber-205. The scalar LDLT program runs 
faster than the scalar GE program on both machines. 
These results are due to the better scalar 
processing on the Cray-1 and the fewer operations 
involved in the scalar LDLT program. We also 
observe from these scalar curves that the Cyber-205 
curves reach their asymptotic performance at 
smaller N than the Cray-1 curves. This can be 
interpreted as the Cyber-205 reaching its optimal 
scalar efficiency sooner than the Cray-1. 

The bottom six curves in Figure 2 are the 
results of the normalized CPU time for the 
vectorized codes on both computers. None of the 
tested codes reached its asymptotic performance 
for the experimental values of N. The vectorized 
LDLT program runs faster than the vectorized GE 
program on the Cray-1. On the other hand, the 
vectorized GE program performs better than the 
vectorized LDLT program on the Cyber-205. In fact, 
the vectorized GE program performs best on the 
Cyber-205 for N greater than 100. The PGE program 
performs better on the Cyber-205 than on the 
Cray-1 only when N is somewhat greater than 100. 
A larger matrix A provides longer vectors, and 
therefore the performance increases on the 
Cyber-205. 


In all cases, the computed error norms do not 
show a significant difference from algorithm to 
algorithm or from machine to machine. If we are 
to solve a system with a positive definite matrix 
that is not diagonally dominant, accuracy becomes 
a big concern, and we should use the vectorized 
PGE program. The pivoting strategy is cheap in the 
scalar environment (as shown in Table 1), but is as 
expensive as the factorization part of the program 
for larger N in the vectorized environments. The 
PGE program is not recommended for solving a 
diagonally dominant system on the vector machines. 
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Figure 2. The normalized CPU time (in seconds) as 


a function of matrix size N on both 
vector computers. The solid curves are 
from the Cray-1 while the dashed curves 
are from the Cyber-205. The upper 4 
curves are scalar results while the 
lower 6 curves are vector results. 


4. Conclusions 


A comparative performance analysis of three 
different algorithms for solving a positive 
definite system of linear equations on two vector 
computers is presented. Our experimental results 
show that the vectorized Gaussian elimination 
performs best of the three methods on the 
Cyber-205, whereas the vectorized Cholesky method 
is best suited to the Cray-1. All the 
corresponding scalar codes ran faster on the 
Cray-1 than on the Cyber-205. On the other hand, 
the scalar codes reach their optimal asymptotic 
efficiency at smaller matrix sizes N on the 
Cyber-205. In general, the Cyber-205 has a better 
speed up than the Cray-1. The difference in CPU 
time between the scalar PGE and GE programs was 
negligible on both machines. The price of the 
pivoting in the PGE program can be as high as that 
of the factorization part of the GE program in the 
vectorized environment. 
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Abstract 


Several combinatorial optimization problems are 
known to be NP-complete. As a consequence, fast parallel 
algorithms for finding optimal solutions for such problems 
using a polynomial number of processors are unlikely. An 
important practical approach to solving such problems on 
parallel machines is to seek approximate algorithms for 
them. In this paper, we present an approximate algorithm 
for the 0-1 knapsack problem on the parallel random 
access machine model. Our algorithm takes an $\epsilon$, 
$0 < \epsilon < 1$, as an input parameter and finds a 
solution whose relative deviation from the optimal 
solution is at most $\epsilon$. Our algorithm requires 
$O(\log^3 n + \log^2 n \log \frac{1}{\epsilon})$ time and 
uses at most $\frac{n^{2.5}}{\epsilon^{1.5}}$ processors. 
In contrast to a naive algorithm that requires 
significantly more processors, our algorithm exploits the 
relationship among the sets of solutions to the problem 
generated in the intermediate steps to reduce the number 
of solutions considered. This in turn results in a 
reduction in the processor requirements. 


1. Introduction 


There is a growing interest in the design of parallel 
algorithms for SIMD machines. In particular, the design of 
SIMD computer algorithms for problems such as sorting, 
matrix computations, and graph and network computa- 
tions has received widespread attention. In contrast, very 
little attention has been paid to designing SIMD computer 
algorithms for such combinatorially hard problems as 
knapsack problems, vertex covering, and set covering. In 
this paper we will describe an approach to the design of 
parallel algorithms for one such problem, namely, the 0-1 
knapsack problem (see Section 2 for definition). This prob- 
lem is known to be NP-complete [GJ79] and as a conse- 
quence, a polynomial time parallel algorithm that finds an 
optimal solution using a polynomial number of processors 
is unlikely. An important practical approach to solving 
such problems has been to design fast algorithms that find 
approximate instead of optimal solutions. Here we present 
parallel algorithms for finding approximate solutions to the 
knapsack problem. 
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Several approximate algorithms on _ sequential 
machines for various NP-complete problems have been 
reported (see [GJ76] for a bibliography of some early 
work). Many of these algorithms use a greedy strategy 
[HS78, Chapter 4]. It has been shown by Anderson and 
Mayr [AM84] that greedy algorithms for several graph 
problems are inherently sequential and hence, fast parallel 
implementations of these sequential algorithms are 
unlikely. Several approximate algorithms currently known 
appear to be highly sequential and hence, there is a need to 
develop new techniques for finding approximate solutions 
to some of these NP-complete problems on parallel com- 
puters. 


The model of parallel computation we use here is a 
parallel random access machine (PRAM) [FW78]. We 
allow several processors to read from the same location 
simultaneously. However, simultaneous writing into the 
same memory location is disallowed. Such a model is 
usually known as a concurrent read exclusive write (CREW) 
PRAM. An important goal of research on algorithms for 
this model has been to design poly-logarithmic time (that 
is, $O(\log^k n)$ for some $k$) algorithms. The algorithm 
reported in this paper finds an approximate solution to the 
knapsack problem in poly-logarithmic time using a 
polynomial number of processors. The solution found by our 
algorithm will be worse than the optimal solution by at 
most a factor of $\epsilon$, for any given $\epsilon$, 
$0 < \epsilon < 1$. Our algorithm requires 
$O(\log^3 n + \log^2 n \log \frac{1}{\epsilon})$ time and 
at most $\frac{n^{2.5}}{\epsilon^{1.5}}$ processors. A 
naive algorithm would generate several partial solutions 
and merge them to obtain the final solution. As explained 
in Section 3.3, our algorithm exploits the relationship 
among these partial solutions in order to reduce the 
number of such solutions to be generated. This leads to a 
reduction in the processor requirements of our algorithm. 
The processor bound is established using an interesting 
combinatorial result. 


The only known parallel algorithm for this problem is 
by Peters and Rudolph [PR84]. Their algorithms achieve 
speedups using a limited number of processors by exploit- 
ing the explicit parallelism in the inner loop of the sequen- 
tial algorithm of Lawler [LE79]. The divide-and-conquer 
algorithm presented there uses a simple merging procedure 
and hence requires substantially more processors compared 
to our algorithm. Our attempt has been to explore new 
properties of the problem that can be exploited in the 
parallel algorithm. 


The rest of the paper is organized as follows. In the 
next section we introduce the knapsack problem. In Sec- 
tion 3, we present a parallel approximate algorithm for the 
0-1 knapsack problem. The combinatorial result that is 
made use of in proving the processor bound is also 
presented in that section. Some concluding remarks are 
given in Section 4. Several implementation details as well 
as proofs of some theorems are omitted in this preliminary 
report for the sake of brevity. Proofs of other theorems 
are sketched briefly. Complete proofs and other 
implementation details will appear in a forthcoming paper 
[GRK86]. 


2. The 0-1 Knapsack Problem 


The 0-1 knapsack problem is defined as follows. We 
are given $n$ elements having positive integer valued 
profits $p_1, p_2, \ldots, p_n$ and positive integer valued 
weights $w_1, w_2, \ldots, w_n$. We are also given a 
positive integer valued knapsack capacity $K$, and the 
optimization problem is to find a set 
$S \subseteq \{1, 2, \ldots, n\}$ that maximizes 
$\sum_{i \in S} p_i$ subject to the constraint 
$\sum_{i \in S} w_i \le K$. 

We can assume that all the weights are at most equal 
to the capacity $K$. Any solution to the knapsack problem 
is a set of integers which is a subset of the set 
$\{1, 2, \ldots, n\}$. We will denote the profit of a 
solution $S$ by $P(S)$ and its accumulated weight by 
$W(S)$. Thus, $P(S) = \sum_{i \in S} p_i$ and 
$W(S) = \sum_{i \in S} w_i$. $S$ is a feasible solution if 
$W(S) \le K$. A feasible solution $S^*$ is an optimal 
solution if $P(S^*) \ge P(S)$ for any other feasible 
solution $S$. We denote the profit of an optimal solution 
by $P^*$. In the rest of the paper the term solution will 
stand for a feasible solution unless explicitly stated 
otherwise. 
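
As a concrete reading of these definitions, the following Python sketch evaluates P(S), W(S) and feasibility and finds an optimal solution by exhaustive enumeration. The function names and the three-element instance are illustrative assumptions, elements are indexed from 0 rather than 1, and the enumeration is exponential in n, so it is only meant for tiny instances.

```python
from itertools import combinations


def profit(S, p):
    """P(S): total profit of the elements in S (0-based indices)."""
    return sum(p[i] for i in S)


def weight(S, w):
    """W(S): accumulated weight of the elements in S."""
    return sum(w[i] for i in S)


def is_feasible(S, w, K):
    """S is a feasible solution if W(S) <= K."""
    return weight(S, w) <= K


def optimal_by_enumeration(p, w, K):
    """Return (P*, S*) by checking all 2**n subsets."""
    best_profit, best_set = 0, ()
    n = len(p)
    for size in range(n + 1):
        for S in combinations(range(n), size):
            if is_feasible(S, w, K) and profit(S, p) > best_profit:
                best_profit, best_set = profit(S, p), S
    return best_profit, best_set


p = [6, 10, 12]   # hypothetical profits
w = [1, 2, 3]     # hypothetical weights
print(optimal_by_enumeration(p, w, K=5))
```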


An approximate algorithm for the 0-1 knapsack problem 
takes as its input a real number $\epsilon$, 
$0 < \epsilon < 1$, and finds a solution with profit $P$ 
such that $P^* - P \le \epsilon P^*$. Several such 
algorithms on sequential machines have been reported 
in the past [IK75, LE79, MO81, SS75]. Our aim in this 
paper is to design such an approximate algorithm on the 
PRAM model. 


3. Parallel Approximate Algorithm 


In this section we first present a simple parallel algo- 
rithm that finds an optimal solution to the problem. How- 
ever, this algorithm would require an exponential number 
of processors to find an optimal solution in polynomial 
time. We then outline a scheme for obtaining an 
approximate solution in poly-logarithmic time using a 
polynomial number of processors. This approximate 
algorithm is 
further refined to obtain a more efficient algorithm with 
reduced processor requirements. 


We will use the following notation in the description 
of the algorithms below. $R_i$, $S_i$, $T_i$, and $U_i$ 
(i = 1, 2, ...) will denote solutions to the knapsack 
problem and $G_i$, $H_i$ (i 
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= 1, 2, ...) will denote sets of solutions. We will also 
assume, for the sake of simplicity that n is a power of 2. 


All our algorithms use the following dominance 
relation between solutions in order to discard partial 
solutions. 

Definition 
Let $S_1$ and $S_2$ be two solutions. Then 
$S_1 \prec S_2$ ($S_2$ dominates $S_1$) if 
$P(S_1) \le P(S_2)$ and $W(S_1) \ge W(S_2)$. 


Observe that if $S_i \prec S_j$ and $S_k$ is a solution 
that is disjoint from both $S_i$ and $S_j$, then 
$S_i \cup S_k$ will be dominated by $S_j \cup S_k$. Thus, 
for every feasible solution that can be obtained from 
$S_i$ by adding more elements, there is another feasible 
solution that can be obtained from $S_j$ having possibly 
greater profit. Hence, given any set of feasible 
solutions, we can remove all solutions from this set that 
are dominated by other solutions in the same set. 


Note that the dominance relation is transitive. Using 
this property, it is easy to show the following. 


Lemma 3.1: Let $I$ be a subset of $\{1, 2, \ldots, n\}$ 
and let $G_I$ be the set of all feasible solutions that 
are subsets of $I$ and are not dominated by any other 
subset of $I$. Let $S_j$ be any feasible solution that is 
also a subset of $I$. Then, either $S_j \in G_I$ or there 
is a solution $S_k \in G_I$ such that $S_j \prec S_k$. 

Let $F(i,j)$ denote the set of all feasible solutions 
that are subsets of the set $\{i, i+1, \ldots, j\}$ and 
are not dominated by any other feasible solution in the 
same set. 
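
The dominance relation and the removal of dominated solutions can be sketched sequentially as follows. This is an illustration only: solutions are reduced to (profit, weight) pairs, the function names are our own, and the sort-based sweep stands in for the parallel minimum-finding step whose details the paper omits.

```python
def dominates(b, a):
    """True if b dominates a, i.e. P(a) <= P(b) and W(a) >= W(b).

    Solutions are represented as (profit, weight) pairs.
    """
    return a[0] <= b[0] and a[1] >= b[1]


def remove_dominated(solutions):
    """Keep one representative for every non-dominated (profit, weight) pair."""
    kept = []
    best_profit = -1
    # Visit solutions by increasing weight (ties: higher profit first).
    # A solution survives only if it beats every profit seen so far,
    # because everything seen so far weighs no more than it does.
    for prof, wt in sorted(set(solutions), key=lambda s: (s[1], -s[0])):
        if prof > best_profit:
            kept.append((prof, wt))
            best_profit = prof
    return kept


candidates = [(0, 0), (6, 1), (10, 2), (5, 2), (16, 3), (12, 3), (6, 4)]
print(remove_dominated(candidates))
```

In the surviving set both profit and weight increase strictly, which is why at most one solution remains for each profit value and for each weight value.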


3.1. A Simple Exact Parallel Algorithm 


Algorithm 1 below is a simple parallel algorithm for 
finding an exact solution to the 0-1 knapsack problem. 


Algorithm 1 

[1] Call FEASIBLE(1,n) (given below) to find the set 
    $F(1,n)$. 

[2] Find the solution with the maximum profit from 
    $F(1,n)$ in parallel. 

procedure FEASIBLE(i,j) 
(* This procedure returns the set of solutions F(i,j) *) 

[3] If i = j then return the set containing the solutions 
    $\emptyset$ and $\{i\}$ with 
    $P(\emptyset) = W(\emptyset) = 0$, and $P(\{i\}) = p_i$, 
    $W(\{i\}) = w_i$. 
    otherwise 

[4] Compute $G_1 = F(i,(i+j)/2)$ using FEASIBLE(i,(i+j)/2) 
    and $G_2 = F((i+j)/2 + 1, j)$ using 
    FEASIBLE((i+j)/2 + 1, j) in parallel. 

[5] For each $S_i \in G_1$ do in parallel - 
      For each $S_j \in G_2$ do in parallel - 
        Compute $S_i \cup S_j$. Since this is a disjoint 
        union, the profit of $S_i \cup S_j$ is 
        $P(S_i) + P(S_j)$ and its weight is 
        $W(S_i) + W(S_j)$. If $W(S_i \cup S_j) > K$ then 
        discard this infeasible solution, else add it to 
        the set of solutions $G_3$. 

[6] Remove all dominated solutions from $G_3$ and return 
    the set of solutions thus obtained. 
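
A sequential simulation of Algorithm 1 may help in following the steps. The sketch below is not the PRAM implementation: the two recursive calls and the pairwise unions run one after another, the helper names are our own, and the instance is hypothetical. It assumes, as in the text, that every weight is at most K; it does not need n to be a power of 2.

```python
def remove_dominated(solutions):
    """Keep only non-dominated (profit, weight, items) triples."""
    kept, best_profit = [], -1
    for sol in sorted(solutions, key=lambda s: (s[1], -s[0])):
        if sol[0] > best_profit:
            kept.append(sol)
            best_profit = sol[0]
    return kept


def feasible(i, j, p, w, K):
    """Return F(i, j) for the 1-based element range i..j."""
    if i == j:
        # Step [3]: the empty solution and the singleton {i}.
        return [(0, 0, frozenset()), (p[i - 1], w[i - 1], frozenset([i]))]
    mid = (i + j) // 2
    g1 = feasible(i, mid, p, w, K)        # step [4], first half
    g2 = feasible(mid + 1, j, p, w, K)    # step [4], second half
    # Step [5]: all disjoint unions that still fit in the knapsack.
    g3 = [(a[0] + b[0], a[1] + b[1], a[2] | b[2])
          for a in g1 for b in g2 if a[1] + b[1] <= K]
    return remove_dominated(g3)           # step [6]


def algorithm_1(p, w, K):
    """Steps [1] and [2]: build F(1, n), then pick the maximum profit."""
    best = max(feasible(1, len(p), p, w, K), key=lambda s: s[0])
    return best[0], sorted(best[2])


print(algorithm_1([6, 10, 12], [1, 2, 3], 5))
```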
Analysis 


The correctness of the procedure FEASIBLE(i,j) is 
easy to establish using induction on $j-i$. Clearly, 
$F(i,j)$ is correctly set up when $j-i = 0$. Assume now 
that FEASIBLE(i,j) returns the set $F(i,j)$ for all values 
of $j-i < n-1$. Now, let $j-i = n-1$. Consider any 
solution $S$ that is a subset of $\{i, i+1, \ldots, j\}$. 
Let the intersection of $S$ with 
$\{i, \ldots, (i+j)/2\}$ be $S_1$ and its intersection with 
$\{(i+j)/2 + 1, \ldots, j\}$ be $S_2$. Since 
$F(i,(i+j)/2)$ and $F((i+j)/2 + 1, j)$ are assumed to be 
correctly computed by the recursive calls, by Lemma 3.1, 
$S_1$ is either present in $F(i,(i+j)/2)$ or there is some 
solution $S_3$ in that set that dominates $S_1$. In the 
latter case $S$ will be dominated by $S_3 \cup S_2$. A 
similar argument can be applied to $S_2$. We can thus show 
that a feasible solution is in the set returned by 
FEASIBLE(1,n) if and only if it is not dominated by any 
other solution. 


Observe that if two solutions have the same profit 
value, one of them will be discarded because it will be 
dominated by the other. Similarly, if two solutions have 
the same weight, one will be discarded. Thus, there will be 
at most one solution for each integer value of the profit 
and for each value of the weight in the set of solutions 
returned by procedure FEASIBLE. Therefore, there will 
be at most $c$ solutions in these sets, where 
$c = \min\{P^*, K\}$. 

The recursive procedure FEASIBLE goes through log 
n stages of recursion. In each stage, step [5] requires 
O(1) time and $c^2$ processors as we require one processor 
for each pair of solutions $S_i$, $S_j$ and there are at 
most $c^2$ such pairs. Step [6] can be executed in 
$O(\log c)$ time using $c^2$ processors as removing the 
dominated solutions involves finding the minimum among at 
most $c^2$ values (details are omitted here). Since the two 
recursive calls in procedure FEASIBLE are executed in 
parallel, at most O(n) sets of solutions will be 
constructed simultaneously. Hence, the overall processor 
requirement for generating the set $F(1,n)$ is $nc^2$ and 
the time required is $O(\log n \log c)$. Finding the 
maximum profit solution from $F(1,n)$ will require 
$O(\log c)$ time and $c$ processors and hence the time 
complexity of Algorithm 1 is $O(\log n \log c)$ and the 
processor requirement is $nc^2$. 


3.2. Approximation Scheme 


Since $c$ can be arbitrarily large, the time complexity 
of Algorithm 1 is not a true logarithmic function of $n$ 
and the processor requirement is not a true polynomial in 
$n$. However, we can scale down the profits [IK75, LE79, 
SS75] to obtain a bound on $c$ that is a polynomial in $n$. 
We divide all the profit values by a scale factor 
$\lambda$. For the problem with these scaled down profits, 
the optimal profit 
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-% 
cannot be more than $P^*/\lambda$. The scale factor 
$\lambda$ is chosen to obtain an upper bound on 
$P^*/\lambda$ that is a polynomial in $n$. The scale 
factor is derived as follows. 


Let the profits and weights be arranged such that 
$\frac{p_1}{w_1} \ge \frac{p_2}{w_2} \ge \cdots \ge \frac{p_n}{w_n}$. 
Now, choose $j$ such that 

$(w_1 + w_2 + \cdots + w_j) \le K < (w_1 + w_2 + \cdots + w_j + w_{j+1})$. 

Let 

$P_0 = \max\{p_1 + p_2 + \cdots + p_j,\ p_m\}$, where $p_m = \max_i \{p_i\}$. 

Let $\epsilon$ be the allowed deviation from the optimal 
solution, $0 < \epsilon < 1$. That is, we are required to 
find a solution with profit $P$ such that 
$P^* - P \le \epsilon P^*$. 


Theorem 3.1: If the profits are scaled down using a scale 
factor $\lambda = \epsilon P_0 / n$ then an optimal 
solution to the problem with the scaled profits will be an 
approximate solution to the original problem satisfying 
the desired bounds. Moreover, the profit of any solution 
to the scaled problem will not exceed $2n/\epsilon$. 


Proof: See [LE79]. 


A parallel algorithm based on the above derivation is 
given below. 


Algorithm 2 

[1] Sort the weights and profits into non-increasing order 
    of the ratio $p_i/w_i$. 

[2] Compute $P_0$ and the scale factor $\lambda$ using the 
    expressions given above. 

[3] Scale all profits $p_i$ to obtain 
    $r_i = \lfloor p_i/\lambda \rfloor$. 

[4] Use Algorithm 1 on the modified problem with profits 
    $(r_1, r_2, \ldots, r_n)$, weights $(w_1, \ldots, w_n)$ 
    and capacity $K$. Let $P_1$ be the profit of the 
    solution to the modified problem found by Algorithm 1. 

[5] The approximate solution to the original problem is 
    the solution found above and its profit $P$ is 
    $\lambda P_1$. 
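
The scaling steps of Algorithm 2 can be sketched sequentially as below. This is only an illustration under stated assumptions: `scale_profits`, `exact_max_profit_set` and `algorithm_2` are names of our own, the exact solver is a simple dominance-pruned sweep standing in for Algorithm 1, the instance is hypothetical, and the returned profit is measured with the original (unscaled) profits of the chosen elements.

```python
from math import floor


def scale_profits(p, w, K, eps):
    """Steps [1]-[3]: return (lam, r) with r_i = floor(p_i / lam)."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i] / w[i], reverse=True)
    prefix_profit, prefix_weight = 0, 0
    for i in order:                      # longest ratio-ordered prefix within K
        if prefix_weight + w[i] > K:
            break
        prefix_weight += w[i]
        prefix_profit += p[i]
    p0 = max(prefix_profit, max(p))      # P_0
    lam = eps * p0 / n                   # scale factor lambda
    return lam, [floor(pi / lam) for pi in p]


def exact_max_profit_set(r, w, K):
    """Stand-in for Algorithm 1: exact optimum via dominance pruning."""
    front = [(0, 0, frozenset())]        # (scaled profit, weight, items)
    for i, (ri, wi) in enumerate(zip(r, w)):
        merged = front + [(a + ri, b + wi, s | {i})
                          for a, b, s in front if b + wi <= K]
        front, best = [], -1
        for sol in sorted(merged, key=lambda t: (t[1], -t[0])):
            if sol[0] > best:
                front.append(sol)
                best = sol[0]
    return max(front, key=lambda t: t[0])[2]


def algorithm_2(p, w, K, eps):
    """Steps [4]-[5]: solve the scaled problem, report the original profit."""
    lam, r = scale_profits(p, w, K, eps)
    chosen = exact_max_profit_set(r, w, K)
    return sum(p[i] for i in chosen), sorted(chosen)


print(algorithm_2([60, 100, 120], [10, 20, 30], K=50, eps=0.5))
```

On this small instance the scaled profits stay well below the 2n/ε bound of Theorem 3.1, which is what keeps the solution sets polynomial in size.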


Complexity 


The sorting step [1] in the above algorithm will 
require O(log n) time and n processors using the algorithm 
in [AK83]. To compute $P_0$, we need to find the integer 
$j$ such that 
$w_1 + w_2 + \cdots + w_j \le K < w_1 + w_2 + \cdots + w_j + w_{j+1}$. 
Such a $j$ can be found in O(log n) time with n processors. 
Step [3] requires O(1) time and n processors. Thus the 
time and processor requirements of Algorithm 2 are 
dominated by the requirements of step [4]. From the 
analysis of the complexity of Algorithm 1, it follows that 
Algorithm 2 requires $O(\log n \log c)$ time and $nc^2$ 
processors. Since $c$ is $O(n/\epsilon)$ for the 
approximate algorithm, the time complexity is 
$O(\log^2 n + \log n \log \frac{1}{\epsilon})$ and the 
number of processors required is 
$\frac{n^3}{\epsilon^2}$ †. 


3.3. An Improved Algorithm 


We can obtain a more efficient parallel algorithm (in 
terms of the processor-time product) by modifying the 
merging step used in procedure FEASIBLE of Algorithm 1. 
This modification, as we shall see later, will increase the 
time by another factor of (log n) and decrease the worst 
case processor requirement to 
$\frac{n^{2.5}}{\epsilon^{1.5}}$. The processor-time 
product of the resulting algorithm is significantly 
smaller. This reduction in the number of processors is 
significant since the value of $\epsilon$ is usually very 
small. 


Notice that the merging step of procedure FEASIBLE 
is highly wasteful since, out of the $c^2$ solutions that 
are generated, at most $c$ are retained. We need not 
generate several of these solutions if we make effective 
use of the information from previous stages of the 
procedure FEASIBLE. Recall that in procedure 
FEASIBLE(i,j), $F(i,j)$ is computed by first computing 
$F(i,(i+j)/2)$ and $F((i+j)/2 + 1, j)$ and then merging 
them. Let $G_1$ denote $F(i,(i+j)/2)$ and let $G_2$ denote 
$F((i+j)/2 + 1, j)$. Instead of merging $G_1$ and $G_2$ 
directly by taking the union of every solution from $G_1$ 
with every solution from $G_2$, we first merge $G_1$ with 
the two sets that were merged to obtain $G_2$, namely, 
$F((i+j)/2 + 1, 3(i+j)/4)$ and $F(3(i+j)/4 + 1, j)$. Let 
$G_3$ and $G_4$ denote $F((i+j)/2 + 1, 3(i+j)/4)$ and 
$F(3(i+j)/4 + 1, j)$ respectively. Let $H_1$ and $H_2$ 
denote the result of merging $G_1$ with $G_3$ and $G_4$, 
respectively. That is, $H_1$ is the set of all feasible 
solutions that are subsets of the set 
$\{i, i+1, \ldots, (i+j)/2\} \cup \{(i+j)/2 + 1, \ldots, 3(i+j)/4\}$ 
and are not dominated by any other feasible solution in 
the same set. Similarly, $H_2$ is the set of all 
non-dominated feasible solutions that are subsets of the 
set $\{i, i+1, \ldots, (i+j)/2\} \cup \{3(i+j)/4 + 1, \ldots, j\}$. 
Figure 3.1 shows the relationship among these sets. The 
reduction in the number of solutions follows from the fact 
that when $G_1$ is merged with $G_3$ and all dominated 
solutions are removed, any solution that is discarded will 
not form part of a non-dominated solution in the set 
obtained by merging $G_1$ and $G_2$. The same is the case 
with the solutions that are discarded when merging $G_1$ 
and $G_4$. This result is proved below. 

Each non-empty solution $S_i$ in $G_2$ is either a 
solution $U_i$ in $G_3$, or a solution $V_j$ from $G_4$, or 
the solution $U_i \cup V_j$ (where $U_i$ and $V_j$ are 
non-empty). If $S_i$ is $U_i$, its union with solutions in 
$G_1$ would already have been determined while computing 
$H_1$. Similarly, if $S_i$ is $V_j$, its union with the 
solutions in $G_1$ will be present in $H_2$. So, we need 
now 

†Here we use the property that any algorithm that can be executed in 
time T(n) using O(p) processors can be executed in O(T(n)) time using p 
processors. 
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Figure 3.1. Broken lines indicate merging. 


consider only the solutions $U_i \cup V_j$ in $G_2$. 


Let $f(U_i)$ be the set of solutions from $G_1$ whose 
union with $U_i$ occurs in $H_1$. That is, 

$f(U_i) = \{ R_k \mid R_k \in G_1 \text{ and } U_i \cup R_k \in H_1 \}$    (3.7) 

Similarly, let $f(V_j)$ be the set of all solutions from 
$G_1$ whose union with $V_j$ occurs in $H_2$. That is, 

$f(V_j) = \{ R_k \mid R_k \in G_1 \text{ and } V_j \cup R_k \in H_2 \}$    (3.8) 

The following lemma is the key to the reduction in the 
number of solutions generated. 

Lemma 3.2: Let $S_i$ be $U_i \cup V_j$ and let $R_k$ be a 
solution in $G_1$ that is not in $f(U_i) \cap f(V_j)$. Then 
one of the following is true - (1) 
$S_i \cup R_k \prec U_i \cup T_1$ for some $T_1$ in $H_2$, 
or (2) $S_i \cup R_k \prec V_j \cup T_2$ for some $T_2$ in 
$H_1$. 

Proof: Let $R_k$ be in $f(U_i)$ and not in $f(V_j)$. This 
implies that $R_k \cup V_j$ does not occur in $H_2$. By 
Lemma 3.1, there is a solution $T_1$ in $H_2$ such that 
$R_k \cup V_j \prec T_1$. Now, 
$S_i \cup R_k = U_i \cup V_j \cup R_k \prec U_i \cup T_1$, 
since $T_1$ and $U_i$ are disjoint. 

If $S_i \cup R_k$ is a feasible solution then 
$U_i \cup T_1$ is also a feasible solution. Similarly, if 
$R_k$ is in $f(V_j)$ and not in $f(U_i)$ then there is a 
solution $T_2$ in $H_1$ such that $U_i \cup R_k \prec T_2$. 
This implies that $S_i \cup R_k \prec V_j \cup T_2$. If 
$R_k$ belongs to neither $f(U_i)$ nor $f(V_j)$, then both 
conditions (1) and (2) hold. 


Lemma 3.2 shows that the union of certain pairs of 
solutions from $G_1$ and $G_2$ need not be generated. In 
particular, for every solution $U_i \cup V_j$ in $G_2$, we 
need generate only its union with every solution in 
$f(U_i) \cap f(V_j)$. This is implemented by replacing the 
merging step in procedure FEASIBLE of Algorithm 1 (step 
[4]) with the following procedure: 


procedure MERGE($G_1$,$G_2$,i,j) 
(* Here $G_2$ is F(i,j) *) 

[1] If i=j then merge $G_1$ and $G_2$ by taking the union 
    of each element in $G_2$ with every element of $G_1$ in 
    parallel. Compute the union of the resulting set with 
    $G_1$ and $G_2$ and remove all dominated solutions. 
    Return the set thus obtained. (* There will be only two 
    solutions in $G_2$ if i=j and hence, this step can be 
    executed in O(1) time using c processors. *) 

    otherwise 

[2] Let $G_3 = F(i,(i+j)/2)$ and $G_4 = F((i+j)/2 + 1, j)$. 
    (* These are available from the previous stage of 
    procedure FEASIBLE. *) 
    Call MERGE($G_1$,$G_3$,i,(i+j)/2) to obtain $H_1$ 
    and in parallel call MERGE($G_1$,$G_4$,(i+j)/2 + 1, j) 
    to obtain $H_2$. 

[3] For each non-empty solution $S_i \in G_2$ that is of 
    the form $U_i \cup V_j$, where $U_i \in G_3$ and 
    $V_j \in G_4$ and $U_i$, $V_j$ are non-empty, do in 
    parallel - 
    Identify the set of solutions $f(U_i) \cap f(V_j)$. 
    Generate the solution $S_i \cup R_k$ for each $R_k$ in 
    $f(U_i) \cap f(V_j)$ in parallel. If 
    $W(S_i \cup R_k) > K$ then discard this solution, else 
    add it to the set of solutions $H_3$. 

[4] Compute the union of $H_1$, $H_2$, and $H_3$ and remove 
    all dominated solutions to obtain $H_4$. 

[5] Compute the union of $H_4$, $G_2$, and $G_1$ and remove 
    all dominated solutions. Return the resulting set. 


We prove the correctness of the MERGE procedure 
below. 


Theorem 3.1: Let $G_1$ be $F(k,l)$ and $G_2$ be $F(i,j)$, 
where $k \le l < i \le j$. Let $H_5$ be the set of 
solutions generated in step [5] of the procedure call 
MERGE($G_1$,$G_2$,i,j). Then $H_5$ is the set of all 
feasible solutions that are subsets of the set 
$\{k, \ldots, l\} \cup \{i, \ldots, j\}$ and are not 
dominated by any other solution in the same set. 


Sketch of Proof: We prove the result using induction on 
(j-i). The base case is correctly handled by step [1]. Let 
us assume that the recursive calls in step [2] return sets 
$H_1$ and $H_2$ which contain all the feasible 
non-dominated solutions that are subsets of the set 
$\{k, \ldots, l\} \cup \{i, \ldots, (i+j)/2\}$ 
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and the set {k, ..., }U{(i+j)/2 + 1, ..., j} respectively. To 
show that $H_5$ is the set of all non-dominated solutions 
that are subsets of $\{k, \ldots, l\} \cup \{i, \ldots, j\}$, 
we have to show that for any feasible solution 
$S \subseteq \{k, \ldots, l\} \cup \{i, \ldots, j\}$, $S$ 
is in $H_5$ if and only if it is not dominated by some 
solution $S_1$ in $H_5$. 


Since $G_1$ is $F(k,l)$ and $G_2$ is $F(i,j)$, we need 
concern ourselves only with solutions $S$ of the form 
$R_1 \cup T_1$, where $R_1$ is in $G_1$ and $T_1$ is in 
$G_2$. 

(I) Sufficiency: We need to show that if $S \notin H_5$ 
then $S \prec S_1$ for some $S_1$ in $H_5$. 

If $S \notin H_5$ then either $S$ was generated in step 
[3] and eliminated in steps [4] or [5], or $S$ was not 
generated in step [3]. If $S$ was generated and eliminated 
then obviously $S \prec S_1$ for some $S_1 \in H_5$. 
Assume now that $S$ was not generated at all. If $S$ is of 
the form $R_1 \cup T_1$ and $T_1$ is either equal to $U_1$ 
for some $U_1 \in G_3$, or equal to $V_1$ for some 
$V_1 \in G_4$, then by the induction hypothesis and Lemma 
3.1, either $S$ is present in $H_1$ or $H_2$, or 
$S \prec S_1$ for some $S_1$ in $H_1$ or $H_2$. Since 
$H_1$ and $H_2$ are used in step [4], there must be some 
solution in $H_5$ that dominates $S$. Now, if $S$ is of the 
form $T_1 \cup R_1$, and $T_1 = U_1 \cup V_1$ for some 
$U_1 \in G_3$, $V_1 \in G_4$, then we can show, using 
Lemma 3.2, that $S$ would be dominated by some solution 
that is already in $G_2$, $G_1$, $H_1$, or $H_2$, or is 
generated in step [3]. 

(II) Necessity: We want to show that if $S \prec S_1$ for 
some $S_1$ in $H_5$ then $S \notin H_5$. 

This is trivial since all dominated solutions from the 
solutions that are generated are eliminated in steps [4] 
and [5]. 

Thus, the set $H_5$ generated in step [5] is the set of 
all feasible solutions that are subsets of 
$\{k, \ldots, l\} \cup \{i, \ldots, j\}$ and are not 
dominated by any other solution in the same set. 


From Theorem 3.1 it is clear that if procedure 
MERGE is called with $G_1 = F(1,n/2)$ and 
$G_2 = F(n/2 + 1, n)$, then the set of solutions returned 
would be $F(1,n)$. 

Complexity 


Any call to procedure MERGE will go through (log n) 
stages at most. Step [1], where the recursion halts, will 
require O(1) time and c processors. At any instant, at 
most n/2 simultaneous executions of step [1] may be going 
on. Thus, this step contributes O(nc) to the overall 
processor requirement. We implement step [3] as follows. 

After the sets $H_1$ and $H_2$ have been generated we 
construct, for each $U_i \in G_3$, the set $f(U_i)$ defined 
by equation (3.7) using c processors in O(log c) time. 
Also, the total number of solutions in this set is stored 
along with each $U_i$. A similar computation is done for 
each $V_j$ in $G_4$. Now, for each $S_i$ in $G_2$ we assign 
one processor. If $S_i$ is of the form $U_i \cup V_j$ then 
the processor checks the number of solutions in the sets 
$f(U_i)$ and $f(V_j)$. The minimum of these two numbers is 
found and that many processors are assigned to $S_i$. We 
can use these processors to identify the solutions that 
are common to $f(U_i)$ and $f(V_j)$ in O(1) time. The 
solutions $S_i \cup R_k$ are generated for each solution 
$R_k$ that is in this common set in O(1) time. 

The number of processors required for the computation 
of the solutions $S_i \cup R_k$ is thus 

$p = \sum_{U_i \cup V_j \in G_2} \min\{\text{number of solutions in } f(U_i),\ \text{number of solutions in } f(V_j)\}$. 

Taking into account the number of processors needed for 
the initial computations on $H_1$ and $H_2$, the total 
number of processors required for step [3] is max(p,c) 
and the time required is O(log c). Since the two recursive 
calls in step [2] are executed in parallel, the overall 
processor requirement of procedure MERGE would be 
n(max{p,c}) and the time required would be 
$O(\log n \log c)$. 


Now, Algorithm 1 is modified by replacing the merg- 
ing step in procedure FEASIBLE with a call to procedure 
MERGE. We can now use this modified algorithm to find 
a solution to the scaled problem and get an approximate 
solution. The time required by the modified algorithm is 
given by Theorem 3.2 below. 


Theorem 3.2: An approximate solution having relative 
error at most $\epsilon$ can be found using the modified 
parallel algorithm in time 
$O(\log^3 n + \log^2 n \log \frac{1}{\epsilon})$. 


Sketch of Proof: The procedure FEASIBLE goes through 
(log n) recursive stages. In each stage, the MERGE 
procedure requires $O(\log n \log c)$ time. Thus, the time 
required by procedure FEASIBLE is $O(\log^2 n \log c)$ and 
this is greater than the time required for the rest of the 
steps in Algorithm 2. Since $c$ is $O(n/\epsilon)$ for the 
approximate algorithm, the time required is 
$O(\log^2 n \log \frac{n}{\epsilon})$ which is 
$O(\log^3 n + \log^2 n \log \frac{1}{\epsilon})$. 
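
The last step of the proof sketch can be written out as a short worked bound. It uses only $c = O(n/\epsilon)$ and the identity $\log(n/\epsilon) = \log n + \log(1/\epsilon)$; the additive $O(1)$ is our own bookkeeping for the constant hidden in $c = O(n/\epsilon)$.

```latex
\begin{align*}
\log^2 n \cdot \log c
  &\le \log^2 n \left( \log\frac{n}{\epsilon} + O(1) \right)
   && \text{since } c = O(n/\epsilon) \\
  &= \log^2 n \left( \log n + \log\frac{1}{\epsilon} + O(1) \right) \\
  &= O\!\left( \log^3 n + \log^2 n \log\frac{1}{\epsilon} \right).
\end{align*}
```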


The worst case processor bound for the modified algo- 
rithm is obtained using the following combinatorial result. 


Let G be an undirected bipartite graph (A,B,E) where 
A and B are the sets of vertices and E is the set of 
undirected edges. Let a; be a non-negative integer weight 
associated with each vertex i in A and b; be a non-negative 
integer weight associated with each vertex j in B. The 
edges and weights satisfy the following constraints - 


|E| <c (3.9) 
8 Sc (3.10) 
Pe =e (3.11) 


Theorem 3.3: For any bipartite graph G and set of
weights satisfying constraints (3.9), (3.10), and (3.11), the
sum $M = \sum_{(i,j) \in E} \min\{a_i, b_j\}$ is at most $c\sqrt{c}$.

Sketch of Proof: The result is proved by showing that,
given any bipartite graph and set of weights satisfying
constraints (3.9) - (3.11), we can transform the graph into a
standard graph by applying a series of transformations,
each of which increases the value M or, at worst, leaves M
unchanged. This standard structure is an almost
complete bipartite graph having close to $\sqrt{c}$ vertices on
both sides and nearly equal weights on the vertices. The
value of M cannot be increased above the value obtained
for this structure, and the value of M for this graph can be
shown to be at most $c\sqrt{c}$.
□
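To make the bound of Theorem 3.3 concrete, the following Python sketch evaluates M for the balanced structure described in the proof sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and c is taken to be a perfect square so that exactly $\sqrt{c}$ vertices of weight $\sqrt{c}$ fit on each side.

```python
import math


def weighted_min_sum(edges, a, b):
    """Return M, the sum over edges (i, j) of min(a[i], b[j])."""
    return sum(min(a[i], b[j]) for i, j in edges)


def satisfies_constraints(edges, a, b, c):
    """Check constraints (3.9)-(3.11): |E| <= c, sum(a) <= c, sum(b) <= c."""
    return len(edges) <= c and sum(a) <= c and sum(b) <= c


def balanced_example(c):
    """Complete bipartite graph with sqrt(c) vertices per side, each of
    weight sqrt(c). Assumes c is a perfect square (illustrative only)."""
    k = math.isqrt(c)
    if k * k != c:
        raise ValueError("this sketch assumes c is a perfect square")
    edges = [(i, j) for i in range(k) for j in range(k)]
    return edges, [k] * k, [k] * k


def star_example(c):
    """One vertex of weight c joined to c vertices of weight 1."""
    edges = [(0, j) for j in range(c)]
    return edges, [c], [1] * c


c = 16
balanced = weighted_min_sum(*balanced_example(c))  # 16 edges, each contributing 4
star = weighted_min_sum(*star_example(c))          # 16 edges, each contributing 1
```

For c = 16 the balanced graph reaches the bound exactly, while the unbalanced star graph stays well below it, which agrees with the direction of the transformations used in the proof sketch.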


Theorem 3.4: Algorithm 2 using the modified merging
procedure requires at most $n^{2.5}/\epsilon^{1.5}$ processors.


Proof: From the complexity analysis of procedure
MERGE, we see that this procedure requires $n \cdot \max\{p,c\}$
processors, where

$p = \sum_{U_i \cup V_j \in G_0} \min\{|f(U_i)|, |f(V_j)|\}$,

$|f(U_i)|$ and $|f(V_j)|$ denote the number of solutions in $f(U_i)$
and $f(V_j)$, and c is $O(n/\epsilon)$.

The processor requirement of Algorithm 2 is dominated by the
requirement of procedure MERGE. Theorem 3.3 can be
used to determine p. We define a bipartite graph
associated with the set $G_0$. The vertices of this graph
correspond to each $U_i$ and each $V_j$ in $G_0$. An edge
between vertices $U_i$ and $V_j$ is present in this graph if the
solution $U_i \cup V_j$ is present in $G_0$. The weight associated
with $U_i$ is the number of solutions in $f(U_i)$, where $f(U_i)$ is
defined by equation (3.7). Similarly, the weight associated
with $V_j$ is the number of solutions in $f(V_j)$, where $f(V_j)$ is
defined by equation (3.8) (see Fig 3.2). There are at most c
elements in $G_0$ and hence the number of edges in this
graph is at most c. This satisfies constraint (3.9). The
sum of weights on all the $U_i$ is at most equal to the
number of solutions in the set H, which is less than or
equal to c. This shows that constraint (3.10) is satisfied.
The sum of weights on all the $V_j$ is at most equal to the
number of solutions in H, which is at most c, so constraint
(3.11) is satisfied. Thus all three constraints are satisfied.
Applying Theorem 3.3, we see that p is at most $c\sqrt{c}$.


Thus, the total processor requirement is at most
$n c \sqrt{c}$. Since c is at most $2n/\epsilon$ for the approximate
algorithm, the total processor requirement is $O(n^{2.5}/\epsilon^{1.5})$.
Using the property that any algorithm that runs in time t using
O(p) processors can be executed in O(t) time using p
processors, we see that the processor requirement for
Algorithm 2 is at most $n^{2.5}/\epsilon^{1.5}$.


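[Figure: bipartite graph with vertices $U_i$ (weights $a_i = |f(U_i)|$)
and vertices $V_j$ (weights $b_j = |f(V_j)|$); an edge $(U_i, V_j)$ is
present if and only if $U_i \cup V_j$ occurs in $G_0$]

Fig 3.2. Construction for proof of processor bound.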


We have thus shown that using a parallel algorithm we
can find an approximate solution to the given knapsack
problem whose profit P is related to the profit $P^*$ of the
optimal solution by the inequality $P^* - P \le \epsilon P^*$. The time
required by the parallel algorithm is
$O(\log^3 n + \log^2 n \log\frac{1}{\epsilon})$ and the number of processors
required is at most $n^{2.5}/\epsilon^{1.5}$.


4. Conclusions 


In this paper, we have presented an approximate
algorithm for the 0-1 knapsack problem on the CREW PRAM
model. Our algorithm takes an $\epsilon$, $0 < \epsilon < 1$, as an input
parameter and finds a solution whose relative deviation
from the optimal solution is at most $\epsilon$. Our algorithm
requires $O(\log^3 n + \log^2 n \log\frac{1}{\epsilon})$ time and uses at most
$n^{2.5}/\epsilon^{1.5}$ processors. In contrast to a naive algorithm that
requires significantly more processors, our algorithm
exploits the relationship among the partial solutions to
reduce the number of such solutions to be generated. This
in turn results in reductions in the processor requirements.
(Notice that the divide-and-conquer algorithm presented in
[PR84] has, in fact, the same processor and time bounds as
the algorithm described in section 3.2 and hence is less
efficient than the algorithm presented in section
3.3.)


Although the worst case processor requirement of our
algorithm is $n^{2.5}/\epsilon^{1.5}$, we strongly conjecture that on
average our algorithm would require only $O(n^2/\epsilon)$
processors. It will be interesting to establish an average case
processor bound analytically.


The approach we have adopted in our algorithm is
general enough to solve several other problems that can be
formulated along similar lines as the knapsack problem
(details will appear in a forthcoming paper [GRK86]).


ABSTRACT


Parallel architectures, based on the two dimensional shuffle exchange
(2DSE) network, to solve a class of quadtree problems are presented.
The quadtree problems are those that can be solved by using a quadtree
as the data structure. The 2DSE approach enables us to have a spectrum
of functionally equivalent configurations, which use different numbers
of processors and have different time complexities. The various
configurations give us more choices to best fit the constraints of
real world applications.


1. INTRODUCTION 


The quadtree is a well known data structure in the areas of
computer vision and image processing. It has been used in
algorithms for computing geometric properties such as areas and
moments [1], perimeter [2], Euler number [3], and distance transform
[4]. In addition, it has been used in digital image processing
applications such as edge enhancement [5], image segmentation [6],
smoothing [7], and shape approximation [8]. For a tutorial on
quadtrees, see [9]. However, we adopt the quadtree as a way of
organizing the manipulation of data rather than as a data compression
scheme for image storage. Hardware architectures for solving similar
problems can be found in [10,11,12,13]. The distinctions between our
approach and the previous ones are as follows.


(1) Versatility of the architecture: The architectures used
to solve the quadtree problems in this paper can also be applied
to a class of two dimensional problems such as the 2DFFT.
For more details see [14]. The versatility of the architecture
gives us valuable tools to investigate more complicated vision
functions and to construct more advanced vision systems.


(2) Various configurations for real world applications: The 2DSE
approach enables us to have a spectrum of functionally
equivalent configurations, which use different numbers of
processors and have different time complexities. The various
configurations give us more choices to best fit the constraints
of real world applications.


(3) Common communication lines: All of the algorithms in this
paper, plus a class of two dimensional algorithms, utilize the
same set of lines among processors. This makes the 2DSE
approach more economically feasible.


2. 1DSE AND 2DSE NETWORK 


A one-dimensional shuffle exchange (1DSE) network can perform
the shuffle, unshuffle, exchange, and broadcast operations. A shuffle
operation moves the data at the address x to the output address
$S_1(x)$, where $S_1(x)$ is defined as follows:


$S_1(x) = (2x \bmod N) + 1$ if $x \ge N/2$
$S_1(x) = 2x \bmod N$ if $0 \le x < N/2$


An unshuffle operation $U_1$ performs the inverse of the
shuffle operation, where $U_1(x)$ is defined as follows:


$U_1(x) = x/2$ if x is an even number
$U_1(x) = (x+N-1)/2$ if x is an odd number


The exchange operation is defined by $E_1(x)$ as follows, where
$x_{n-1} \ldots x_1 x_0$ is the binary representation of x and
$\bar{x}_0$ is the complement of $x_0$:


$E_1(x) = x_{n-1}2^{n-1} + \ldots + x_1 2 + \bar{x}_0$


The broadcast operation is defined by $B_1(x)$ as a mapping from x to
two positions


$x_{n-1}2^{n-1} + \ldots + x_1 2 + x_0$ and $x_{n-1}2^{n-1} + \ldots + x_1 2 + \bar{x}_0$

A two-dimensional shuffle exchange network can perform the
two dimensional shuffle, unshuffle, exchange, and broadcast
operations. A two-dimensional (2D) shuffle operation $S_2$ and unshuffle
operation $U_2$ move the data at the address (x,y) to the output
addresses $S_2(x,y)$ and $U_2(x,y)$, which are defined as follows:


$S_2(x,y) = (S_1(x), S_1(y))$ (2.1)


$U_2(x,y) = (U_1(x), U_1(y))$ (2.2)


A 2D exchange element is a switch with four inputs and four
outputs that can perform any interchange of its four inputs
(i.e. it is equivalent to a 4x4 crossbar network).


The 2D broadcast operation is defined by $B_2(x,y)$ as a mapping
from (x,y) to four positions, which are the same as the inputs of the
2D exchange element.

3. MATHEMATICAL DERIVATION

Due to space limitations, we use two examples, area and
centroid computation, to demonstrate the applicability of 2DSE
architectures to the quadtree problems.


3.1. Area Calculation 


The area of an image is defined as


$A = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)$ (3.1)


where $f(x,y) = 1$ if the pixel at (x,y) is black and $f(x,y) = 0$
otherwise, for $0 \le x, y \le N-1$. Write the coordinates in binary as


$x = x_{m-1}2^{m-1} + x_{m-2}2^{m-2} + \ldots + x_1 2 + x_0$


$y = y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \ldots + y_1 2 + y_0$


Equation (3.1) can then be written in the following binary
representation:


$A = \sum_{x_{m-1}} \sum_{y_{m-1}} \cdots \sum_{x_0} \sum_{y_0} f(x_{m-1}, y_{m-1}, \ldots, x_0, y_0)$ (3.2)


The area can be computed in m steps, where m is equal to $\log N$. The
area computation results of the first stage can be calculated by
equation (3.3) and saved in buffer $B_1$. $B_i$, which saves the
computation results of the i-th stage, is a two dimensional array of
registers denoted by indices x and y. $B_0$ saves the digitized image
at the initial state.


$B_1(x_{m-1}, y_{m-1}, x_{m-2}, y_{m-2}, \ldots, x_1, y_1, 0, 0) = \sum_{x_0} \sum_{y_0} f(x_{m-1}, y_{m-1}, \ldots, x_0, y_0)$ (3.3)


Similarly, the area in the s-th stage is computed by equation (3.4)
and saved in buffer $B_s$.


$B_s(x_{m-1}, y_{m-1}, \ldots, x_s, y_s, 0, 0, \ldots, 0, 0) = \sum_{x_{s-1}} \sum_{y_{s-1}} B_{s-1}(x_{m-1}, y_{m-1}, \ldots, x_{s-1}, y_{s-1}, 0, 0, \ldots, 0, 0)$ (3.4)


for s = 2, 3, ..., m.


The m-th stage area computation, which represents the total area of
the image, is saved in the buffer $B_m(0,0)$.


3.2. Centroid Computation 
The centroid of an image is defined as


$(x_{center}\,\hat{i} + y_{center}\,\hat{j}) = \frac{1}{A} \sum_x \sum_y (x f(x,y)\,\hat{i} + y f(x,y)\,\hat{j})$ (3.5)


where A is equal to the area of the image.


The results in the s-th stage are computed by equation (3.6)
and saved in buffers $A_s$, $B_{x,s}$, and $B_{y,s}$.


$B_{x,s}(x_{m-1}, y_{m-1}, \ldots, x_s, y_s, 0, 0, \ldots, 0, 0) = \sum_{x_{s-1}} \sum_{y_{s-1}} B_{x,s-1}(x_{m-1}, y_{m-1}, \ldots, x_{s-1}, y_{s-1}, 0, 0, \ldots, 0, 0) + \sum_{x_{s-1}} \sum_{y_{s-1}} x_{s-1} 2^{s-1} A_{s-1}(x_{m-1}, y_{m-1}, \ldots, x_{s-1}, y_{s-1}, 0, 0, \ldots, 0, 0)$ (3.6)


where the $A_s$ are functionally the same as the $B_s$ defined in
equations (3.3) and (3.4). $B_{y,s}$ can be obtained by changing the
variable x to y in equation (3.6).
The centroid of an image can be obtained by equations (3.7) and
(3.8).


$x_{center} = B_{x,m}(0,0) / A_m(0,0)$ (3.7)


$y_{center} = B_{y,m}(0,0) / A_m(0,0)$ (3.8)
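The stage recurrences (3.4) and (3.6) can be checked with a short sequential Python sketch. It is a simulation for illustration, not the hardware algorithm: the function name is hypothetical, the image is a list of rows indexed as `image[y][x]`, and the sketch assumes the moment buffers start at zero with the area buffer holding the image. It also stores each stage in a half-size array instead of leaving results at the addresses whose low-order bits are zero.

```python
def quadtree_area_centroid(image):
    """Return (area, (x_center, y_center)) of an N x N binary image,
    N a power of two, by merging 2 x 2 groups of blocks stage by stage."""
    n = len(image)
    if n == 0 or n & (n - 1) or any(len(row) != n for row in image):
        raise ValueError("image must be N x N with N a power of two")
    area = [[int(v) for v in row] for row in image]  # stage 0: the image
    bx = [[0] * n for _ in range(n)]                 # x-moment per block
    by = [[0] * n for _ in range(n)]                 # y-moment per block
    size = 1                                         # 2^(s-1): side of merged blocks
    while len(area) > 1:
        half = len(area) // 2
        new_area = [[0] * half for _ in range(half)]
        new_bx = [[0] * half for _ in range(half)]
        new_by = [[0] * half for _ in range(half)]
        for row in range(half):
            for col in range(half):
                for dy in (0, 1):        # dy plays the role of bit y_{s-1}
                    for dx in (0, 1):    # dx plays the role of bit x_{s-1}
                        a = area[2 * row + dy][2 * col + dx]
                        new_area[row][col] += a
                        # moment of the sub-block plus offset * sub-block area
                        new_bx[row][col] += bx[2 * row + dy][2 * col + dx] + dx * size * a
                        new_by[row][col] += by[2 * row + dy][2 * col + dx] + dy * size * a
        area, bx, by = new_area, new_bx, new_by
        size *= 2
    total = area[0][0]
    if total == 0:
        return 0, None                   # centroid undefined for an empty image
    return total, (bx[0][0] / total, by[0][0] / total)
```

For a 4 x 4 image with black pixels at (1,0) and (3,2), the sketch returns area 2 and centroid (2.0, 1.0), which matches a direct evaluation of equations (3.1) and (3.5).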


4. ARCHITECTURES AND ALGORITHMS 


[Figure: (a) architecture with network; (b) architecture without
network; (c) pipelined iterative architecture; (d) parallel iterative
architecture. Legend: exchange element (processor), storage element
(register), unshuffle network.]


Figure 1: 2D Architectures


Figure 1 illustrates the architectures that can solve the quadtree
problems with maximal parallelism based on the 2DSE network. The
black nodes in Figure 1 represent storage elements. Using centroid
computation as an example, these nodes stand for the places where the
values of $B_x$, $B_y$, and A are stored. The dotted nodes represent
exchange elements (i.e. processors), where all the computations are
executed. In the case of centroid computation, these nodes execute
the operations shown in equation (3.6).


Parts (c) and (d) of Figure 1 show the two different ways,
iteratively pipelined and iteratively parallel, to integrate
processors into a working system.


The architectures in parts (a) and (b) are functionally equivalent.
The only difference is that in part (a) the window size of the
exchange elements is constant, whereas in part (b) it doubles as the
stage advances. Here the window size of an exchange element is defined
as the index distance between the data items input to that exchange
element. The role of the 2DSE network is to unshuffle the data so
that the window size of the exchange elements can remain the same. Two
dimensional input and output is the common characteristic and a
necessary requirement of the 2D architectures shown in Figure 1.


The summation mechanisms $SUM_{2^i}$ are exchange elements
defined to execute the operations in equation (3.3) with window
size $2^i$, where i is the stage number. This function, which sums the
values from the four subdivided quadrants, is used in both the area
calculation and the centroid computation.


The centroid computation mechanisms $CENX_{2^i}$ and $CENY_{2^i}$ are
exchange elements defined to execute the operations shown in
equation (3.6) with window size $2^i$. Thus, the centroid of the image
can be computed with time complexity $O(\log N)$ by the architecture
shown in Figure (1.b), where N is equal to $2^m$, the number of pixels
in the x or y direction.


We have the following algorithm:

Procedure CENTROID-b (A, B_x, B_y)
// centroid computation in architecture (1.b) //
for i = 1 to log N do
begin
    for n1, n2 = 0 to N-1
    cobegin
        A[n1,n2] = SUM_{2^i}(A[n1,n2]);
            // do summation with window size 2^i //
        B_x[n1,n2] = CENX_{2^i}(B_x[n1,n2]);
            // do center-x with window size 2^i //
        B_y[n1,n2] = CENY_{2^i}(B_y[n1,n2]);
            // do center-y with window size 2^i //
    coend
end.


The 2DSE network (i.e. unshuffle) can be used to access
data whose window size would otherwise double as the stage advances.
Thus the centroid computation can be mapped onto a 2DSE architecture
(i.e. Figure (1.a)) with time complexity $O(\log N)$. The mechanisms
$SUM_2$, $CENX_2$, and $CENY_2$ are exchange elements with window size
fixed at two.

The following procedure CENTROID-a is a parallel algorithm
based on the architecture in Figure (1.a).

Procedure CENTROID-a (A, B_x, B_y)
// centroid computation in architecture (1.a) //
for i = 1 to log N do
begin
    for n1, n2 = 0 to N-1
    cobegin
        A[n1,n2] = U_2(SUM_2(A[n1,n2]));
            // do summation and unshuffle the results //
        B_x[n1,n2] = U_2(CENX_2(B_x[n1,n2]));
            // do center-x and unshuffle the results //
        B_y[n1,n2] = U_2(CENY_2(B_y[n1,n2]));
            // do center-y and unshuffle the results //
    coend
end.
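The effect of unshuffling after each stage can be seen in a one-dimensional simplification, sketched below in Python. This is an illustrative sequential simulation under stated assumptions, not the procedure above: the function names are hypothetical, only summation is shown, and each window-2 exchange element is assumed to write its result to the even address of its pair.

```python
def unshuffle(x, n):
    """U_1: inverse perfect shuffle of address x among n addresses."""
    return x // 2 if x % 2 == 0 else (x + n - 1) // 2


def sum_with_unshuffle(values):
    """Sum n = 2^m values in m stages using only window-2 additions,
    unshuffling the buffer after every stage."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    buf = list(values)
    for _ in range(n.bit_length() - 1):          # m = log2(n) stages
        summed = [0] * n
        for k in range(0, n, 2):                 # exchange element on pair (k, k+1)
            summed[k] = buf[k] + buf[k + 1]
        nxt = [0] * n
        for x in range(n):                       # unshuffle the stage results
            nxt[unshuffle(x, n)] = summed[x]
        buf = nxt
    return buf[0]
```

After each unshuffle the partial sums again occupy adjacent addresses, so the next stage can reuse the same window-2 elements. This is the role the text assigns to the 2DSE network in Figure (1.a).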


In 1983, Liu proposed a way to emulate the 2DSE network so
that the 2DFFT computation could be implemented in VLSI [15]. His
scheme takes advantage of the separability property of the 2D shuffle
operation formulated in equation (2.1). The delay units in his
paper are used to emulate the x direction of the shuffle operation.
By modifying his scheme, we obtain the emulated 2D architectures
shown in Figure 2.


[Figure: (a) emulated unshuffle operation demonstration; (b) 1D
emulated iteratively pipelined architecture; (c) 1D emulated
iteratively parallel architecture. Legend: switch, delay unit
(register), exchange element (processor), i: stage number.]


Figure 2: 1D emulated Architectures


Part (a) of Figure 2 illustrates the architecture that emulates
the 2D unshuffle operation. Parts (b) and (c) show the iteratively
pipelined and iteratively parallel forms of the 1D emulated
architecture.


Table 1 summarizes the results for the quadtree problems we
solved by using the 2DSE pipelined architecture (i.e. part (c) in
Figure 1) and the 1D emulated pipelined architecture (i.e. part (b)
in Figure 2).

The detailed algorithms for all these quadtree problems can be
found in [16].


                        ARCHITECTURE 1            ARCHITECTURE 2
                        2DSE PIPELINE             1D PIPELINE
PROBLEM                 Processors  Time          Processors     Time
Area                    O(N^2)      O(log N)      O(N log N)     O(N)
Centroid                O(N^2)      O(log N)      O(N log N)     O(N)
Projection              O(N^2)      O(log N)      O(N log N)     O(N)
Signatures              O(N^2)      O(log N)      O(N log N)     O(N)
Eccentricity            O(N^2)      O(log N)      O(N log N)     O(N)
Contour Following       O(N^2)      O(log N)      O(N log N)     O(N)
Scroll Operations       O(N^2)      O(log N)      O(N log N)     O(N)
Perimeter Computation   O(N^2)      O(log N)      O(N log N)     O(N)
Component Labeling      O(N^2)      O(log N)      O(N log^2 N)   O(N)
Set Complementation     O(N^2)      O(1)          O(N)           O(N)
Set Union               O(N^2)      O(1)          O(N)           O(N)
Set Intersection        O(N^2)      O(1)          O(N)           O(N)


Table 1 Summarized Results


The various functionally equivalent architectures enable us to 
best-fit to the requirements of real world constraints. 


5. CONCLUSIONS 


In this paper, parallel processing for quadtree problems based on
the 2DSE network is discussed. Various architectures, such as 2D
iteratively pipelined, iteratively parallel, and 1D pipelined
architectures, can be used to solve the quadtree problems under
different performance constraints (i.e. processor numbers and time
complexity). The results of the 2DSE approach to the quadtree
problems are listed in Table 1, which includes many primitive
operations of computer vision. Besides the quadtree problems, the
2DSE network has been applied to efficiently solve a class of two
dimensional problems such as the 2DFFT, the 2D Walsh/Hadamard
transform, and 2D sorting [14]. Thus, the same architectures can be
used for a broad range of applications. This makes the 2DSE approach
more economically feasible.


Abstract


This paper discusses the design and preliminary
performance evaluation of the Circulating Context
Multiprocessor (CCMP) -- a novel form of tightly coupled
multiprocessor that combines significant positive
features of modern computer structures and organizations
while avoiding many of the negative ones.
The CCMP consists of banks of special-purpose processors
joined by full interconnection nets with
queue-buffering. Each bank contains processors
which perform one portion of the traditional Von
Neumann cycle. Processes are represented by packets
of information circulating among these banks
and being modified by them. Many processes are active
at once, being executed in an instruction-interleaved
fashion. The structure inherently supports
load balancing, high degrees of pipelining,
efficient context switching, and modular
reconfiguration. It appears to the software designer,
however, to be simply a parallel computer with many
Von Neumann processors sharing a memory.


Keywords: MIMD computer architecture, parallel comput- 
ers, interleaved instruction streams. 


I. Introduction 


The circulating context machine organization was first
proposed in 1984 [1]. Since that time a great deal of
feasibility study, simulation, and further development of the
architecture has occurred. This paper reports on our current
circulating context multiprocessor (CCMP) architecture, its
basis, strengths, weaknesses, and simulated performance.
Additional material on the CCMP architecture can be found
in [2].


The origins of CCMP are based on a collection of existing
modern computer structures. The CCMP organization pulls
together and adapts known successful concepts and structures
from other designs while carefully avoiding most of
the observed weaknesses of these concepts. Overall, it is an
adaptation of ideas that make sense together. The list
below capsulizes the design concepts and their origins.

• Packets of executable context --- from the
dataflow (activation by availability) model [3]. The
use of such packets facilitates streaming-type high
throughput ... high-efficiency use of processing
resources. Other problematical areas of dataflow,
such as the need for a different model of computation,
are avoided.


• Von Neumann model --- By utilizing this well-known
computational model we do not require major
investments for development in languages,
compilers, linkers, debuggers and the like. Non-Von
Neumann systems (such as data flow) will certainly
require major development in these areas.


• Queues with multiple server processors --- from
transaction systems (e.g. Pluribus [4]). The use of
first-in, first-out queues as front-end staging for
multiple identical servers (processing units)
decouples hard synchronization, provides
smoothed packet flow, and facilitates higher system
availability.


• Interleaved instruction streams --- from modern
pipelined systems [5,6,7]. By circulating process
packets for many independent processes simultaneously,
we can ensure (given sufficient workload)
that processing units are kept busy, thus
achieving very high utilization.


• Redundancy management and reconfiguration ---
from fault-tolerant systems [8,9]. The presence of
multiple copies of identical processing units within
a system with full interconnection implemented
among them allows immediate replacement of failing
units --- giving a very robust and highly available
organization.


The architectural concepts above are brought together 
to form the circulating context multiprocessor. The system 
makes a deliberate trade-off of increased instruction execu- 
tion time for each individual process in exchange for overall 
system throughput and flexibility in adapting the machine 
to many differing types of parallel processing. 


The motivations for the latency-for-throughput trade-off
are several. In order to implement Von Neumann style
computing, pipelining, and transaction-style dataflow in the
same machine, a ring of processing stages is used. When
parallel units are added, along with queueing to decouple
and smooth packet flow, the trade-off becomes clear. The
trade-off is necessary in order to get a clean, unidirectional
processing loop structure with scalability, high availability
and other positive properties. The structure and connectivity
of the resulting system inherently support "hot sparing",
a type of redundancy that is among the quickest and easiest
to reconfigure. The CCMP does not really treat "spares" as
extra units --- instead such units are actively used to
increase throughput. When failures occur and some units
start to stand in for downed companions, the system
gracefully degrades in performance.


By using queues to stage work packets for each set of
processor servers, high utilization of pipelines is possible.
The use of instruction stream interleaving guarantees that
no dependencies will exist within pipelines and hence that
no pipeline breaks will occur. Full pipelines imply
maximized throughput. In a heavily pipelined system, throughput
is directly linked to system clock rate. With the
latency-for-throughput trade-off in effect, we will expand (within
reason) the number of pipeline stages that are used to
accomplish a given bit of processing if it helps in increasing the
main clock rate.


The programming of CCMP is not vastly different from 
the programming of a conventional uni-processor. The 
CCMP offers flexible parallelism providing a high-throughput 
easily-sharable computing resource. For appropriately 
decomposable problems, the CCMP can take advantage of a 
significant degree of parallelism, particularly if the problem 
can be subdivided into co-operating but largely independent 
processes. 


The machine does not have hard architectural size limits 
as are typical of SIMD machines with fixed vector lengths. 
The CCMP is first and foremost a throughput engine. The 
more it is loaded, the more it will deliver. Section 4 
presents detailed simulation results which show linear 
throughput per process until roughly 80% efficiency is 
achieved. The CCMP can achieve streaming mode execution 
as is the goal of activation-by-availability (dataflow). Howev- 
er, the CCMP requires no changes to the model of computa- 
tion. 


On the negative side, the trade-off of per-process latency
for system throughput implies that single stream performance
will never be much better than that of a microcomputer.
Thus, tasks with little parallelism will run relatively
slowly. This roughly corresponds to non-vectorized segments
of programs on SIMD machines. One important
difference in the CCMP case, however, is that the unused
portion of the machine (corresponding to the wasted vector
register lengths in an SIMD machine) is available for use by
other simultaneously-executing processes. The CCMP
architecture allows full sharing of all resources so that users with
low parallelism tasks get fixed, but relatively low,
performance while highly parallel ones can simultaneously get a
significant percentage of system resources.


Previous Work 


As is apparent from the introduction, the ideas upon 
which the CCMP is based are not new. What is new is the 
combination and adaptation of those ideas in the present 
form ... culminating in a feasible, scalable, flexible multipro- 
cessing system. There are several descriptions in the litera- 
ture that document both proposed and constructed systems 
using some of the same ideas. In the area of interleaved in- 
struction streams, the work of Miller [10] and of Kaminsky & 
Davidson [6] must be acknowledged. Other important relat- 
ed work is [5]. The positive effect of queued decoupling 
between various parts of a processor was discussed in [11]. 


The closest system to the CCMP is unquestionably the
heterogeneous element processor (HEP) [7] [16]. This
machine bears a resemblance to the CCMP although some of
the CCMP's most important features are not present in the
HEP. CCMP processes may visit any execution processor, a
property that aids in systemwide load balancing.
Interconnection networks in the CCMP are global and buffered,
making the organization more homogeneous and thus both more
flexible and more available. The CCMP memory is variably
interleaved (as seen by the home module). This allows
programs to create special memory layouts so that they can
efficiently fit special meshes or algorithms to get added
parallelism.


II. Current CCMP Design 


Here we discuss a refined CCMP design based on the con- 
cepts already discussed. 


Stage Division 


The CCMP divides the Von Neumann cycle into portions, 
with a stage of processors handling each portion. The exact 
division used is a design parameter. Our current design 
uses three stages. Each of them is discussed below. Figure 
1 provides an overview of the structure of the design, with 
the three stages linked by several interconnection networks 
through which packets of process context flow from stage to 
stage. 


[Figure: execution modules, memory modules, and home modules
linked by interconnection networks]


Figure 1: Overview of the CCMP


The first stage is the "home" stage which performs pro- 
cess specific functions such as updating the PC, delivering 
interrupts to a process, performing register fetch/store 
operations, and using segmentation registers to translate 
virtual addresses to physical addresses. The home proces- 
sors in this stage are so called because a given process is al- 
ways served by the same home processor -- it "returns 
home" to the same point in this stage. Note that we have 
departed from the pure CCMP concept here in that a good 
deal of the process’ context remains in the home stage, in- 
stead of being transmitted about via the process packet. 
This was deemed necessary to avoid an unacceptable over- 
head in transmission costs, especially since most of this 
data is used only occasionally. The current instruction, the 
data on which we are immediately working, and a unique 
process ID number are the only contents of the process 
packet in our current design. Given the decision to hold 
some of the process context in a fixed stage, the necessity 
for that stage to have the "return home" property is obvi- 
ous, since replicating each process’ context in each proces- 
sor of the stage would restrict the degree of parallelism pos- 
sible, and result in data concurrency problems among the 
processors in the stage. 


The next stage is the memory stage, "smart memory 
banks" capable of performing variable length instruction 
fetches, executing synchronization primitives like test-and- 
set and possibly maintaining private caches. Note that the 
memory stage to memory stage interconnect shown in 


figure 1 allows processes to visit several memory processors 
in sequence to fetch or store data without going to any oth- 
er stage in the interim. 


The memory stage constitutes a shared global memory 
accessible to all processes, but has no inherent interleaving 
policy. Memory segmentation is performed in the home 
stage, since performing it in the memory stage would re- 
quire replication of segmentation information for each pro- 
cess in every memory processor. A few extra bits in the 
home processor segmentation registers can allow variable 
specification of interleave on a per-segment basis without 
any action on the part of the memory processors. The 
memory processors will use a standard physical address, 
the bits of which can be derived in any manner from the vir- 
tual address by the home processor to cause a segment to 
be low, high or even mid-order interleaved. This allows for 
more flexibility in load balancing of the memories and is 
essential to the high availability schemes discussed later. 
Variable interleaving can cause complex overlap problems 
between segments in physical memory -- it is assumed that 
any operating system for the CCMP will either plan for and 
prevent such overlaps (unless they are desired), or simply 
use a standard interleave policy for all segments. 
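The difference between low-order and high-order interleaving can be illustrated with a small Python sketch of the address translation a home processor might perform. This is a hypothetical simplification, not the CCMP design: the function name and parameters are assumptions, segment base addresses are ignored, and mid-order interleaving is omitted.

```python
def physical_location(address, num_modules, module_size, interleave):
    """Map an address within a segment to (memory module, offset).

    'low'  : consecutive addresses fall in consecutive modules.
    'high' : consecutive addresses stay in the same module.
    """
    if not 0 <= address < num_modules * module_size:
        raise ValueError("address outside the segment")
    if interleave == "low":
        return address % num_modules, address // num_modules
    if interleave == "high":
        return address // module_size, address % module_size
    raise ValueError("interleave must be 'low' or 'high'")


# Four consecutive addresses under each policy (4 modules of 4 words).
low = [physical_location(a, 4, 4, "low")[0] for a in range(4)]
high = [physical_location(a, 4, 4, "high")[0] for a in range(4)]
```

Under the low-order policy the four addresses land in modules 0, 1, 2, 3, spreading a sequential access pattern across the memory stage. Under the high-order policy they all land in module 0, which keeps a segment confined to one memory processor.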


The final stage in the design is the execution stage,
which performs all arithmetic or logical operations demanded
by the instructions. Each processor in the stage is
identical -- we chose not to specialize them to avoid problems
with tasks that make skewed use of operations and might
load only a subset of the processors. Any process, then,
may visit any execution processor to have an operation
performed on data that it has fetched. Processes are sent to
non-overloaded processors in a round robin fashion.
Overloaded processors receive no new processes until they
become free; then they join the round-robin.
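The round-robin policy just described can be sketched in Python as follows. The class and its names are hypothetical, and the sketch assumes that a processor counts as overloaded when its input queue has reached a fixed limit; the text does not specify how overload is detected.

```python
class RoundRobinDispatcher:
    """Send packets to execution processors in round-robin order,
    skipping any processor whose queue is full (assumed overload test)."""

    def __init__(self, num_processors, queue_limit):
        self.queue_lengths = [0] * num_processors
        self.queue_limit = queue_limit
        self.next_index = 0

    def dispatch(self):
        """Return the index of the processor that receives the next
        packet, or None if every processor is overloaded."""
        n = len(self.queue_lengths)
        for step in range(n):
            candidate = (self.next_index + step) % n
            if self.queue_lengths[candidate] < self.queue_limit:
                self.queue_lengths[candidate] += 1
                self.next_index = (candidate + 1) % n
                return candidate
        return None

    def complete(self, index):
        """Record that a processor finished one packet, so that it can
        rejoin the round-robin."""
        if self.queue_lengths[index] > 0:
            self.queue_lengths[index] -= 1
```

With three processors and a queue limit of one, three dispatches fill processors 0, 1, and 2, and a fourth returns None. Once processor 1 completes its packet, the next dispatch goes to processor 1.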


As an example to clarify the above, we trace a process 
packet through a typical instruction. The instruction begins 
with the packet at its home processor, where its PC is 
fetched and translated to a physical address that is placed 
in the process packet. The packet then goes to the appropri- 
ate memory processor where the contents of the instruction 
are fetched. It returns to the home processor, where any re- 
gister fetches required by the instruction are performed. If 
memory data fetches are required, the home processor 
determines the physical addresses involved and the process 
packet returns to the memory stage, travelling to the 
memory(ies) where the data to be fetched resides. Data 
stores can be done at this point also. For instance, a 
memory-to-memory data transfer can be done without re- 
turning to the home. If arithmetic or logical operations are 
required, the process travels from the memory stage to the 
execution stage. If memory stores are required after an ex- 
ecution processor completes the operation, the process re- 
turns to the memory stage, and then finally returns home 
where any register stores needed are performed and the 
next instruction begins. 


Interconnection Networks 


We have said nothing in detail thus far about the inter- 
connection networks. There are a lot of them as the reader 
can see from Figure i. This set of networks was derived 
from an examination of the likely stage-to-stage transfers in 
a typical instruction set. As one would expect, the design of 
these interconnects is critical to the success of the CCMP. 
For large scale parallelism --- more than around 40 proces- 
sors per stage --- we currently intend to use a single stage 
cube connection (see [14]). We feel that the cube connec- 
tion is superior to the shuffle exchange or one of its rela- 
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tives for this application. The cube connection allows for 
fewer transfers between nodes on the average. This is gen- 
erally offset by the cost of a larger number of interconnec- 
tions per node. In our case, however, we are designing to 
optimize the use of the hardware, coming as close as feasi- 
ble to saturating the network with a continuous flow of data, 
rather than attempting to guarantee some short delivery 
time to occasional individual messages. For this discussion, 
let us assume that the clock speeds of sending and receiving 
modules are the same as that of the interconnection lines, 
and that bandwidth is measured in bits per clock. Assuming 
full utilization of bandwidth (with m the bandwidth per incoming 
line and p the degree of parallelism), our choice is 
between shuffle nodes with internode bandwidths of m log_2 p 
on each of two internode lines and cube nodes with internode 
bandwidths of m on each of log_2 p internode lines. We 
obtain a 50% reduction in internode bandwidth with the cube 
interconnect, paying only the price of dividing the internode 
lines more finely. This comparison is discussed in more de- 
tail in [12]. 

Figure 2: Modified Crossbar Interconnection Network 


Let us assume that the transmission time of an individu- 
al packet is not at issue, and concern ourselves only with 
meeting the overall bandwidth requirements of the stage-to-stage 
interconnection with a minimum of physical interconnect. 
Assuming a packet has m bits and that a fully utilized 
module delivers one packet per clock, we see that the 
actual bandwidth required is only mp and that both of the 
above networks use a factor of log_2 p too much bandwidth. 
We can remove this factor by using a modified crossbar in- 
terconnect as shown in Figure 2. Each sending node has an 
outgoing bandwidth divided evenly among all receiving 
nodes (simultaneous transmission is assumed). The same 
bandwidth can be divided more finely for higher degrees of 
parallelism, serializing the transmission of individual 
processes to accommodate the lower individual bandwidths. 
The cost of such division is that each packet is delivered 
more slowly, and somewhat more queueing is required to 
buffer packets prior to delivery. More complex switching is 
also required in the sending and receiving units. The 
tradeoffs between this approach and the cube interconnect 
are shown in the complexity analyses of table 1. 
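
The per-node bandwidth figures discussed above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names are ours, m is the per-line bandwidth in bits per clock, p is the degree of parallelism (assumed here to be a power of two), and all constant factors are taken at face value from the text.

```python
import math

def shuffle_node_bandwidth(m, p):
    """Two internode lines, each carrying m * log2(p) bits per clock."""
    return 2 * m * math.log2(p)

def cube_node_bandwidth(m, p):
    """log2(p) internode lines, each carrying m bits per clock."""
    return math.log2(p) * m

def crossbar_node_bandwidth(m, p):
    """p outgoing lines, each carrying m / p bits per clock."""
    return p * (m / p)

def compare(m, p):
    """Return (shuffle, cube, crossbar) per-node bandwidth for one setting."""
    return (shuffle_node_bandwidth(m, p),
            cube_node_bandwidth(m, p),
            crossbar_node_bandwidth(m, p))
```

For example, `compare(32, 16)` gives 256 bits per clock for the shuffle node, 128 for the cube node (the 50% reduction), and 32 for the modified crossbar, which removes the remaining log_2 p factor.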


Two entries that may require elaboration are the O(p^2) 
and O(1/p) figures for number of processes and per-process 


throughput in the bipartite (crossbar) case. Although the 
overall throughput of the modified crossbar is equal to that 
of the cube interconnect, the speed of transmission through 
the interconnect drops for an individual process in inverse 


Table 1: Complexity Analysis of Candidate ICNs for the CCMP 


Resource                       Modified Crossbar   Cube 
                               Interconnect        Interconnect 
Throughput                     O(p)                O(p) 
Interconnect Bandwidth         O(p)                O(p log_2 p) 
Number of Processes            O(p^2)              O(p log_2 p) 
Per Process Throughput         O(1/p)              O(1/log_2 p) 
Number of Queues               O(p^2)              O(p log_2 p) 
Interconnect Node Complexity   O(p)                O(log_2 p) 


proportion to the degree of parallelism. This is due to the 
increased serialization needed to maintain the same 
bandwidth as the number of processors increases. O(p) 
processes per sending processor will be tied up in transmission 
across the net. Thus, O(p^2) processes will be needed to 
maintain an O(p) increase in throughput. The constant factors 
in the p^2 terms for number of processes and queue 
sizes are expected to be small enough to make the modified 
crossbar connection preferred for low orders of parallelism. 
Note that number of processes, or more precisely, degree of 
parallelism is viewed as a resource here, and the modified 
crossbar scheme requires a higher order of parallelism to 
deliver the same performance with all other factors held 
constant. A number of changes in parameters for the CCMP 
will result in higher consumption of parallelism but no 
significant drop in throughput. This is true, for instance, 
when the number of pipeline stages in the various proces- 
sors is increased. 
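
To make the scaling argument concrete, the sketch below tabulates the order-of-growth estimates for the two candidate networks. It is an illustration under a strong assumption: every constant factor is set to 1, so only the trend is meaningful, not the absolute numbers. The function names are ours.

```python
import math

def processes_needed(p, network):
    """Order-of-growth estimate of the processes needed to keep p modules busy.

    All constant factors are set to 1 (an assumption made for illustration).
    """
    if network == "crossbar":
        return p * p              # O(p) processes in flight per sender, p senders
    if network == "cube":
        return p * math.log2(p)   # O(log2 p) hops per transfer, p senders
    raise ValueError(f"unknown network: {network}")

def per_process_throughput(p, network):
    """Total throughput (taken as p) divided evenly among the processes."""
    return p / processes_needed(p, network)
```

With p = 16 this gives 256 processes at 1/16 relative speed for the modified crossbar, against 64 processes at 1/4 relative speed for the cube.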


Software and System Details 


The current synchronization primitive in the CCMP simu- 
lator is a standard test and set instruction. We expect to 
change this, however, to an atomic swap operation. This re- 
quires about the same extra logic in the memory processors 
and is more powerful. In particular, it can be made to emu- 
late the full/empty flags that were attached to registers by 
the HEP [7] designers and which work so elegantly in that 
architecture. All that is required is a designated "empty" 
value. The two pieces of code in table 2 demonstrate one 
use of this. A group of processes running the sending segment 
can synchronously pass one value at a time through 
memory location x to a group of processes running the 
receiving segment. 


Table 2: Synchronization via Atomic Swap 


Sending Segment                     Receiving Segment 

while data to send {                while data to be received { 
    place data in d;                    place "empty" value in d; 
    while d ≠ "empty" value {           while d equals "empty" { 
        atomic swap d and x;                atomic swap d and x; 
    }                                   } 
}                                   } 
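
The two segments of Table 2 can be sketched in executable form. This is an illustration only, not the CCMP implementation: Python has no user-level atomic swap instruction, so a lock stands in for the atomicity the memory processor would provide, and the names `Cell`, `send`, `receive`, and `run_demo` are ours.

```python
import threading

EMPTY = object()  # the designated "empty" value; never used as data

class Cell:
    """Shared location x with an atomic swap (emulated with a lock)."""
    def __init__(self):
        self._value = EMPTY
        self._lock = threading.Lock()

    def swap(self, new):
        """Store `new` in the cell and return what was there before."""
        with self._lock:
            old, self._value = self._value, new
            return old

def send(cell, items):
    for d in items:
        # Swap until we are left holding "empty", i.e. a value was deposited.
        while d is not EMPTY:
            d = cell.swap(d)

def receive(cell, count, out):
    for _ in range(count):
        d = EMPTY
        # Swap until we pull out something other than "empty".
        while d is EMPTY:
            d = cell.swap(d)
        out.append(d)

def run_demo():
    cell, out = Cell(), []
    rx = threading.Thread(target=receive, args=(cell, 5, out))
    tx = threading.Thread(target=send, args=(cell, range(5)))
    rx.start()
    tx.start()
    tx.join()
    rx.join()
    return out
```

Every value sent is received exactly once, but a sender that finds the cell full exchanges its value for the one already there, so arrival order is not guaranteed.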


Control of process creation and segmentation is through 
special instructions. There is a set of (privileged) instruc- 
tions to allow one process to set another’s segmentation re- 
gisters. There is also a "create" instruction that allows one 
process to create another by instructing the new process’ 
home processor to start a packet circulating for that pro- 
cess and to set its PC to a designated value. Home processors 
keep track of whether a process is currently active 
(has a packet circulating) or not. If a create is done on an 
already circulating process, the create becomes an interrupt, 
which is implemented by altering the affected process’ 
PC so that it will go elsewhere next time it comes around. 
This same mechanism allows an I/O device to interrupt a 
given process. 


The I/O system is memory mapped, so connections 
between it and the CCMP hardware are via the memory processors. 
Low bandwidth devices are linked to memory processors 
in a conventional manner, with devices evenly 
spread across processors to prevent any one processor from 
being overloaded by process packets attempting to fetch 
data from devices. Disk or other mass storage devices are 
somewhat more problematic, since they are required to 
deliver block transfers in various interleaved fashions to the 
memory processors. To accommodate this, we place another 
interconnection network between disks and memories, with 
buffering at the disks. This allows each disk to deliver data 
to and from each memory and permits a block load, even to 
a low-order interleaved segment, to be delivered from a sin- 
gle disk. 


A simple approach to delivering interrupts to a process 
when an I/O operation is complete would be to run a bus 
through the home processors to allow devices to signal for 
interrupts. This might become a bottleneck in a highly 
parallel system, however. As mentioned above, internal in- 
terrupts may be delivered to a process by means of a create 
instruction. A more general approach to I/O interrupts, 
then, would be to allow an I/O device to signal a memory 
processor to create a dummy packet with a create instruction 
in it, destined for the desired process. The dummy 
packet would be given a special process ID number that 
would cause it to vanish after completing its task. In this 
way, the capacity to deliver interrupts would increase pro- 
portionally to the parallelism. 


We are currently looking into the design of a distributed 
microcode system, in which each type of processor would 
have its own set of micro-operations, and would fetch mi- 
croinstructions from a private microinstruction memory 
which would tell it what to do for a packet with a given in- 
struction at a given point in its completion (instruction and 
state of completion are part of the process’ packet). This 
would allow a more flexible instruction set and simpler 
hardware, as with other microcoded machines. Dynamic 
downloading of microcode would, of course, also be a possi- 
bility. 


One somewhat tricky problem with the CCMP is the pos- 
sibility of what we refer to as "ring deadlock", a situation in 
which a ring of modules and interconnect lines in the archi- 
tecture have become mutually deadlocked. This situation 
will not abate by itself, and will quickly deadlock the remain- 
ing processes as they attempt to enter modules on the 
deadlocked ring. The solution to this problem is fairly 
straightforward. As figure 1 shows, the topology of the inter- 
connections is such that the memory processors have a 
number of choices of where to send packets, and any 
deadlock loop must include a memory processor. Any time 
a memory processor becomes blocked, it sends the 
offending packet to another memory processor with an indi- 
cation that the packet has been deferred. The receiving 
processor will try to send the process to its proper destina- 
tion, or failing that will defer it once again. To obtain ring 
deadlock in this case it would be necessary to fill all the 
memory-to-memory queues. These queues are sized so that their 
combined capacity exceeds the number of process packets that can 
exist, so deadlock becomes impossible. The existence of ring 
deadlock was predicted and then confirmed by simulation. The solution 
described has also been tested in simulation and found to be 
highly effective -- keeping the machine running smoothly 
when it would quickly deadlock without the avoidance 
mechanism. 


III. Simulation Results 


We have developed a detailed simulation written in C 
and running on a SUN work-station (or VAX) under UNIX 4.2. 
The simulation is at the register-transfer level, accounting 
for single-clock events. We believe that our simulations are 
conservative and very realistic, modeling virtually all 
significant aspects of the CCMP. 


Several configurations of a particular CCMP machine 
have been modeled. Because of the detailed nature of our 
simulator, however, we have not extensively studied large 
configurations. We have instead concentrated on investigat- 
ing relatively small systems, particularly studying the 
behavior under heavy process and memory loads. The sys- 
tem loads well (according to our studies so far) and we have 
reason to expect that because of its load-balanced, 
bottleneck-free structure, the CCMP will scale up to large 
configurations well. 


Our simulations assume representative levels of pipelin- 
ing such as would be achievable and typical in conventional 
NMOS or CMOS VLSI (refer to table 3). We have assumed 18-25 MHz 
clock rates† (depending on stage type). Instruction 
processing may involve as many as several hundred single- 
clock pipeline stages including queue waiting times. This 
still would provide per-process instruction speeds of around 
100,000 operations per second. 
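
The per-process figure follows from dividing the clock rate by the number of single-clock stages an instruction passes through. The sketch below shows the arithmetic; the function name is ours, and 20 MHz with 200 stages is one illustrative point inside the ranges quoted above, not a measured configuration.

```python
def per_process_rate(clock_hz, stages_per_instruction):
    """Instructions per second seen by a single process.

    A process packet must traverse every single-clock stage (including
    queue waits) before its next instruction can begin.
    """
    return clock_hz / stages_per_instruction

# Illustrative values only: 20 MHz clock, 200 stages per instruction.
example_rate = per_process_rate(20e6, 200)
```

Here `example_rate` is 100,000 instructions per second, in line with the estimate in the text.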


Table 3: Pipeline Stages per Module Type 
(queue lengths excluded) 


Module Type       Simulated Stages 
Home module       16 
Memory module     8 
Execution unit    32 


In the simulation results that follow, we have modeled a 
4/4/2 CCMP system, i.e. one with 4 home modules, 4 
memories, and 2 execution units. Note that because of our 
extensive use of queues these modules need not be syn- 
chronized, nor must they execute at similar clock rates. If 
the clock rates of different banks of resources are not 
roughly equal, then it will be necessary to balance the 
overall system by selecting the number of units in each 
resource bank so that their average throughputs match. 


Matrix Multiply (Saturation Load) — MMSAT 


The first test program written for our simulator was a 
matrix multiply. In this program, we use one process for 
each entry in the output matrix. Since the matrix size 
chosen was 15x15, we used 225 processes. In this problem 
the parallelism is immediate, predictable, and constant. 
This is shown by the processes versus time plot (figure 3); 
the corresponding instructions per second versus process 
count plot (figure 4) shows that linear throughput is 
delivered as a function of the number of concurrent 
processes. Instead of terminating execution upon completion 
of the matrix multiply, we simply continue computing 
the elements over and over so that the program can be used 
as a type of system load. 


† Note that we may freely trade additional pipeline stages 
in exchange for higher clock rates since an overall per-process 
latency trade-off has already been made. 


Figure 4: MMSAT — Throughput vs. Offered Load. 


Recursive Quick Sort — QSORT 


The quicksort algorithm [13] has also been coded and 
run on our simulator. Three distinct experiments were per- 
formed with qsort. In the first one, the pivot element was 
chosen arbitrarily (the first element in the array) as 
specified in Knuth’s description. Figure 5 gives the parallel 
behavior for a sort of 10,000 integer elements in memory. 
Note that the maximum parallelism achieved is in the neigh- 
borhood of 30 processes. The corresponding throughput 
graph (figure 6) gives the observed MIPS rate, again very 
linear with respect to process count. 


The second experiment with qsort was an attempt to in- 
telligently choose better pivot points, ones that tend to 
divide the remaining elements into more or less equal 
groups. To do this, a simple pivot-choosing loop was added. 
The resulting behavior is shown in Figures 7 and 8. In this 
case parallelism of over 200 processes was observed, but at 
the cost of a long-running initial pivot choice. The algorithm 
used for choosing the pivot point samples every 32nd data 
element throughout the entire array. For the full 10,000 
element array (the first pivot point chosen), this requires 
the inspection of over 300 items. Clearly, we could adjust 
this number and get vastly improved behavior with a much 
smaller number of probes. 
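
A pivot chooser of the kind described can be sketched as follows. The stride of 32 comes from the text; taking the median of the samples is our assumption, since the rule used to pick among the samples is not stated, and the function names are ours.

```python
def choose_pivot(a, lo, hi, stride=32):
    """Pick a pivot for a[lo:hi] from every `stride`-th element.

    The median of the samples is used (an assumed rule, for illustration).
    """
    samples = sorted(a[lo:hi:stride])
    return samples[len(samples) // 2]

def probes(n, stride=32):
    """Number of elements inspected when sampling an n-element range."""
    return len(range(0, n, stride))
```

For the full 10,000-element array, `probes(10000)` is 313, matching the "over 300 items" noted above; a larger stride would cut the cost of the first, serial pivot choice proportionally.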


The final experiment with qsort involved running the 
previous case in the presence of the matrix multiply satura- 
tion load. This was done to see how the two parallel pro- 
grams interacted (i.e. whether they interfered with one 
another) on the machine. As expected, the CCMP allowed 
smooth sharing of resources. Neither program significantly 
impacted the resource utilization of the other. Figures 9 
and 10 show the observed behavior from the mmsat + qsort 
experiment. 


Figure 7: QSORT w/ Pivot Selection — Processes vs. Time 
Figure 8: QSORT w/ Pivot Selection — Throughput vs. Load 
Figure 9: QSORT w/ MMSAT — Processes vs. Time 
Figure 10: QSORT w/ MMSAT — Throughput vs. Offered Load 


Relaxation Algorithm — RL 


This program implements a simple 2-dimensional relaxa- 
tion algorithm. We used a fast process generation tech- 
nique, with each process creating four others, to rapidly 
build parallelism. Even though the program begins with 
only a single process, it is clear from figure 11 that after a 
segment set-up time, included for ease of coding, the full 
parallelism is achieved almost instantaneously. The simula- 
tion uses 256 cells (each with its own process) without syn- 
chronization. Each cell accesses the data elements associ- 
ated with its neighboring cells, i.e. neighboring cells’ data 
areas overlap. Like the matrix multiply (saturation) test, 
this program never terminates. It is the fully-loaded 
behavior of the CCMP that we wish to study. Figures 11 and 
12 present the observed performance. The throughput 
graph appears jagged because of transient effects during 
the rapid increase in parallelism. The final dip in perfor- 
mance is caused by transient contentions for the same in- 
structions by many newly-created processes. As can be 
seen from the two figures, this cluster of processes rapidly 
spreads out to permit 77% utilization of the machine (maximum 
performance achievable in this configuration is 36.36 
MIPS). The relaxation algorithm provides so much parallel- 
ism that very little time is spent (hence, relatively few sam- 
ple points) with less than the full number of processes. 


Figure 11: RL — Processes vs. Time 
Figure 12: RL — Throughput vs. Offered Load 


Figure 13: Grouped Module CCMP Configuration 
(not yet simulated) 


IV. Ongoing Work 


We are currently engaged in simulation of larger models 
of the CCMP. Indications have so far shown no serious prob- 
lems with expanding the degree of parallelism. One 
difficulty is that small, highly iterative loops in a low-order 
interleaved segment should be unrolled to span the entire 
set of memory processors to avoid overloading only a few 
processors with instruction fetches. 


We are currently planning a major re-structuring of the 
simulator to allow us to test the effects of different instruc- 
tion sets, varying numbers of registers, various queue 
designs, and other CCMP configurations, especially that 
shown in figure 13, where the various stages have been 
stacked one atop the other and share a common, higher- 
bandwidth interconnection network. 


Implementation of high availability for the CCMP is 
another area for further research. Coding can be added to 
the process packets to allow detection of faults in networks 
or processors. Once a faulty processor has been detected it 
can be set offline while the others continue functioning. In 
the case of the execution units this is trivial since they are 
all identical. A faulty home can be offlined by not using the 
process IDs that report to it. Offlining a memory processor 
can be accomplished by using the flexible interleaving capa- 
bility described earlier to interleave segments across sub- 
sets of the memory processors, avoiding the faulty one(s). 
Offlining a network link in the case of the crossbar arrangement 
can be accomplished by offlining the processors at ei- 
ther end of it or by using the deferral mechanism already 
installed for deadlock avoidance. All stages can send to the 
memory stage, and a bad link can be circumvented by send- 
ing the process, with an indication that it has been deferred, 
to an accessible memory processor which can then send the 
packet to the desired location. 


In the case of a cube interconnect, there are well- 
established means for avoiding a bad link. A bad node, how- 
ever, would require the offlining of the two processors to 
which it is connected. 


V. Conclusions 


We have presented a novel multiprocessor architecture 
and the results of a recent study of its feasibility and perfor- 
mance. The CCMP machine is a general MIMD computer that 
trades individual process performance for high overall sys- 
tem throughput. For maximum speed on a given (parallel) 
problem, the goal is to express the overall computation in 
terms of many (perhaps thousands of) co-operating, but 
largely independent, processes. 


We have carefully modeled a particular instruction set at 
essentially the register-transfer level and studied the per- 
formance on several important and typical problems. The 
system can deliver 80% efficiency quite routinely (i.e. 
without significant optimization) and can get near-100% util- 
ization on selected problems. 


Since the system is pipelined and uses instruction 
stream interleaving to preclude pipeline breaks, pipelines 
can be longer than typical. This can allow higher than typi- 
cal clock rates, since large-delay circuitry can be broken 
down into multiple simpler (i.e. faster) pipeline stages. The 
CCMP makes sense for VLSI, particularly as technology 
moves toward wafer-level interconnect [15] and multi-chip 
hybrids. 


Another strength of the architecture (which has not 
been stressed in this report) is its excellent support for 
error-detection, reconfiguration, and re-try. With adjust- 
ments made to the home modules (primarily in the number 
and type of registers) and with the addition of error detec- 
tion (at either the register-transfer, single module, or single 
instruction loop level), the CCMP could be made to support 
fault-tolerant applications. 


Shared Memory Versus Message-Passing 
in a Tightly-Coupled Multiprocessor: 
A Case Study 


Thomas J. LeBlanc 
Computer Science Department 
University of Rochester 


Abstract - The BBN Butterfly Parallel Processor can support a 
user model of computation based on either shared memory or 
message-passing. In this paper we describe the results of our 
experiments with the message-passing model. The goal of the 
experiments was to analyze the tradeoffs between the shared 
memory model, as exemplified by the BBN Uniform System 
package, and a simple message-passing model. We compare the two 
models with respect to performance, scalability, and ease of 
programming. We conclude that the particular model of 
computation used is less important than how well it is matched to 
the application. 


Introduction 


The hardware architecture of the BBN Butterfly™ Parallel 
Processor supports message-passing between processors using an 
FFT network of 4x4 switch elements. The memory architecture, 
implemented by the operating system in conjunction with a micro- 
coded co-processor, provides the illusion of shared memory. As yet, 
there are no additional software levels that distort this illusion, so 
the user-level view of the architecture is that of a shared memory 
multiprocessor. This paper describes a series of experiments 
designed to explore the ramifications of a user-level view of the 
Butterfly architecture based on message-passing. 


Both shared memory and message-passing models have been 
advocated for parallel computation. Shared memory is usually 
found in tightly-coupled architectures that support remote memory 
operations in hardware or firmware. Message-passing is typically 
employed in loosely-coupled systems (e.g., local-area networks). 
Early work on the Butterfly, the Voice Funnel application in 
particular, used message-passing. Recently, Butterfly applications, 
including finite element analysis and computer vision algorithms, 
have assumed the shared memory model of computation. The only 
programming environment available on the Butterfly that masks the 
low-level details from the programmer is the Uniform System 
package from BBN, which implements a shared memory model. 
Programmers find it easier to use the Uniform System than to build 
the program from scratch, regardless of how well the application fits 
the shared memory model. 


Before proceeding with our implementation of an environment 
supporting message-passing [3], we conducted a series of 
experiments using Gaussian elimination as a sample application. 
The goal of the experiments was to explore the tradeoffs between 
the shared memory model, as exemplified by the Uniform System 
package, and a simple message-passing model based on 
asynchronous send and receive. Important factors to be considered 
were ease of programming, performance, and scalability. In this 
paper, we present the results of our experiments. 


The BBN Butterfly Parallel Processor 


The Butterfly multiprocessor at the University of Rochester 
consists of 128 processing nodes connected by a switching network. 
Each processor is an 8 MHz MC68000 with 24 bit virtual addresses. 
A 2901-based bit-slice co-processor interprets every memory 
reference issued by the 68000 and is used to communicate with 
other nodes across the switching network. All of the memory in the 
system resides on individual nodes, but any processor can address 
any memory through the switch. A remote memory reference (read) 
takes about 3.75 µs, roughly 6 times as long as a local reference. 


Each switch node in the switching network is a 4-input, 4- 
output crossbar switch with a bandwidth of 32 megabits/sec. An N 
processor system uses (N log_4 N)/4 switches arranged in log_4 N 
columns. The 128-node Butterfly contains 256 switch nodes; the 
extra switch capacity is used to provide an alternate communication 
path between all processors for reliability and improved efficiency. 
The alternate paths can be enabled or disabled by the user, making 
it possible to indirectly measure the effect of switch contention. 
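
The switch-count formula can be evaluated directly. The sketch below reads the logarithm as base 4, which matches the 4-input, 4-output switch elements; that reading, the restriction to powers of 4, and the function name are our assumptions.

```python
def butterfly_switches(n):
    """Return (columns, switch_nodes) for an n-processor switching network.

    Uses columns = log_4(n) and switch_nodes = n * columns / 4.
    n is assumed to be a power of 4.
    """
    columns, size = 0, 1
    while size < n:
        size *= 4
        columns += 1
    if size != n:
        raise ValueError("n must be a power of 4")
    return columns, (n * columns) // 4
```

A network sized for 256 processors gives `butterfly_switches(256) == (4, 256)`, which agrees with the 256 switch nodes quoted for the 128-node machine once the extra capacity is counted; a 16-processor network needs 2 columns and 8 switches.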


Case Study: Gaussian Elimination 


The sample application chosen for the experiments was the 
solution of a set of linear equations using Gaussian elimination 
(without pivoting). The reason for this choice was that several 
experiments using shared memory to implement Gaussian 
elimination had been performed at BBN Laboratories [2,5]. Our 
experiments with message-passing were designed for comparison 
with their results. In addition, Gaussian elimination is 
representative of a large class of problems in finite element analysis. 
While it is fairly easy to develop parallel algorithms for Gaussian 
elimination, the problem does have important synchronization 
constraints. 


In solving a set of linear equations using Gaussian elimination, 
the coefficient matrix M is diagonalized, producing a modified vector 
of unknowns, and the unknowns are determined using 
backsubstitution. (Since backsubstitution is a small percentage of 
the total time required to solve the equations, it is not performed in 
any of the experiments.) To eliminate an entry M[i,j], we replace 
row M[i] with M[i] - (M[j] * M[i,j]/M[j,j]), where M[j] is known as 
the pivot row. However, this operation cannot be performed until 
row M[j] has stabilized, i.e., M[j,k] = 0 for all k < j. In addition, all 
previous entries in row i must already be eliminated, i.e., M[i,k] = 
0 for all k < j. These two synchronization constraints limit the amount 
of parallelism that we can expect to achieve. 
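
The row operation and its two preconditions can be written out as a short sequential sketch. This is not the parallel program used in the experiments: it only shows one elimination step with the synchronization constraints expressed as assertions. Exact rational arithmetic is used so the result can be checked, and the function names are ours.

```python
from fractions import Fraction

def eliminate(M, i, j):
    """Replace row M[i] with M[i] - M[j] * M[i][j] / M[j][j] (pivot row j)."""
    # Constraint 1: the pivot row has stabilized.
    assert all(M[j][k] == 0 for k in range(j)), "pivot row not stabilized"
    # Constraint 2: earlier entries of row i are already eliminated.
    assert all(M[i][k] == 0 for k in range(j)), "row i not ready"
    factor = M[i][j] / M[j][j]
    M[i] = [a - b * factor for a, b in zip(M[i], M[j])]

def forward_eliminate(M):
    """Apply the steps in an order that always satisfies both constraints."""
    n = len(M)
    for j in range(n):
        for i in range(j + 1, n):
            eliminate(M, i, j)
    return M
```

For a fixed pivot row j, the steps for different rows i touch different rows and are independent of one another, which is where the available parallelism comes from.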


Floating point arithmetic is required to solve a set of linear 
equations. Unfortunately, our Butterfly does not have floating point 
hardware. All floating point arithmetic is performed by costly 
subroutines. This makes it difficult to analyze the communication 
costs of an application because the execution time is dominated by 
floating point software. To alleviate this problem, our experiments 
were performed with simulated floating point arithmetic, using the 
same approach as that used in the shared memory implementation 
[2]. All floating point variables were replaced with integer variables, 
and addition and subtraction were used in place of multiplication 
and division. Such a computation does not give the correct answers, 
but does accurately simulate the performance of the computation on 
a Butterfly with floating point hardware. Each experiment was 
performed using software floating point to ensure the correctness of 
the results, but all reported performance figures refer to the results 
of experiments using simulated floating point. By using simulated 
floating point, we not only can make predictions about the 
performance of the Butterfly with floating point hardware, we also 
reduce the execution time of the computation and thereby increase 
the significance of communication costs. 
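
The idea of simulated floating point can be illustrated with a single elimination step. The sketch keeps the same memory references and loop structure as the real row operation but works on integers, with subtraction standing in for division and addition for multiplication. Exactly which operations were substituted in the original programs is not specified here, so this particular mapping and the function name are assumptions.

```python
def eliminate_simulated(M, i, j):
    """One elimination step with simulated floating point.

    The values produced are numerically meaningless; only the data movement
    and the operation count mirror the real computation.
    """
    factor = M[i][j] - M[j][j]                              # stands in for a division
    M[i] = [a - (b + factor) for a, b in zip(M[i], M[j])]   # + stands in for *
```

Because each expensive software floating point call is replaced by a cheap integer operation, the time spent moving rows between memories becomes a larger share of the total run time.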


Shared Memory Implementation 


The shared memory version of Gaussian elimination was 
implemented at BBN Laboratories using the Uniform System (US) 
package [1]. US provides a set of calls to create a global shared 
memory, accessible to all US processes. Within the shared memory, 
pointers may be shared between processors. US also creates a 
manager process on each processor that is responsible for allocating 
the processor to a series of tasks, light-weight processes that operate 
on the shared memory. Usually, a task is some small procedure to 
be applied to a subset of the shared memory. A task, therefore, can 
be represented as simply an index, or a range of indices, into the 
shared memory and an operation to be performed on that memory. 
Atomic operations in micro-code are used to efficiently allocate tasks 
to processors. 


The strength of the Uniform System is that it supports a user- 
level view of the architecture consisting of light-weight tasks and 
shared memory. To reduce memory contention, US encourages a 
globally-shared memory and the scattering of data uniformly 
throughout the machine. (This may even include putting data in 
memory associated with a processor that is not involved in the 
computation.) To reduce process management overhead, only one 
real process is allocated per node; all tasks are light-weight. 


Several experiments were performed at BBN Laboratories to 
see how well applications that use US, including Gaussian 
elimination, perform on large Butterfly configurations [2,5]. In these 
experiments, the problem matrix was uniformly distributed 
throughout available memory. A task was created to eliminate a 
single entry in the matrix, M[i,j]. Both row M[i] and row M[j] were 
transferred to local memory before performing the computation, 
rather than perform a remote reference for each individual entry in 
a row. (Block transfers amortize overhead associated with remote 
references and can significantly improve performance.) Since M[i] is 
modified by the operation, an additional transfer of the modified 
row back to the shared memory location for M[i] was also necessary. 


The main result of these experiments was to show that the 
128-node Butterfly could achieve nearly linear speedup when 
performing Gaussian elimination using US. Both switch contention 
and memory contention were shown to be insignificant (2% and 3%, 
respectively). However, those results used an early version of the 
program which had not been tuned. An optimized version was later 
developed, which greatly reduced the time spent in the inner loop of 
the program. Tuning the inner loop not only lowers the total time 
required for the computation, but increases both memory and switch 
contention during the computation. Therefore, we performed 
several experiments to examine the effects of contention on the 
optimized shared memory implementation. 


The first experiment we performed was designed to examine 
the effect of memory contention on the performance of the 
optimized shared memory implementation. In the US 
implementation, all memories in the system are used to store the 
problem matrix, even when only a few processors are involved in 
the computation. This has the effect of improving the overall 
performance by decreasing memory contention. Since the message- 
passing implementation can not take advantage of memories in 
unused processors, it is important to quantify this effect before 
comparing the two implementations. 


In order to determine the effect extra memories have on the 
performance of a computation (and to indirectly measure memory 
contention), the optimized shared memory implementation of 
Gaussian elimination was executed on various numbers of processors 
and memories. The test results show that the extra memories can 
have a substantial impact on performance, which in turn suggests 
that memory contention can be a significant factor. On a problem 
matrix of size 400x400, a Butterfly configuration with 4 columns of 
switches, 16 processors, and 96 memories performed 15% better than 
16 processors and 16 memories. On a problem matrix of size 
200x200, a Butterfly configuration with 2 columns of switches, 4 
processors, and 16 memories performed 30% better than 4 processors 
and 4 memories. The greatest effect occurs when roughly 1/4 to 1/2 
of the total number of processors are in use. When a larger fraction 
of processors is performing computation, most of the memory is 
already in use, i.e., there is no extra memory. When too few 
processors are used, they are insufficient to generate enough load on 
the memory to make memory contention significant. 


The second experiment we performed was designed to 
examine the effect of switch contention on the performance of the 
optimized shared memory implementation. The 128-node machine 
has a switch configuration capable of supporting 256 nodes. The 
excess switch capacity increases the total amount of potential 
bandwidth in the switch, which will improve the overall 
performance of a computation whenever there are enough 
processors involved to cause switch contention. 


The results of this experiment show that the effect of alternate 
switch paths can be dramatic (as a percentage of total execution 
time) when a large number of processors are in use. A 35% 
improvement in execution speed was attained using alternate paths 
for 94 processors on a problem of size 400x400; 64 processors 
achieved a speedup of more than 20% on the same problem using 
alternate paths. Since the shared memory implementation does not 
exploit locality of data (two problem rows must be copied into local 
memory before an operation can proceed), it is not surprising that 
the shared memory implementation introduces significant switch 
contention once the inner loop has been tuned to the point where 
the program is communication intensive.


Message-Passing Implementation 


In order to explore fully the potential of message-passing in
the context of the case study, it was important to provide a general- 
purpose message-passing system. Communication primitives too 
closely associated with the application would demonstrate little 
about the message-passing model. Therefore, the following 
primitives were provided: 


Send(destination, buffer address, byte count);
Broadcast(buffer address, byte count);
Receive(source) : buffer address;


Communication is asynchronous between a sender and receiver. 
Send is nonblocking; the sender may continue to execute before the 
destination receives the message. However, the implementation may
block the sending process for a short period of time necessary to
buffer the message. Receive will block until a message from the 
source arrives. A wildcard value is available so that messages from 
any source may be received. Broadcast is included because it can be 
implemented more efficiently by the system than if it is simulated 
using point-to-point communication. These primitives are very 
general since minor variants have been used as the basis for 
communication in distributed operating systems. 
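
The semantics of the three primitives can be illustrated with a small model. The sketch below is not the Butterfly implementation: it uses Python threads in place of processors, and the names MessageSystem and ANY_SOURCE, the explicit source argument to send, and the (source, data) return value of receive are all assumptions made for illustration. Broadcast is simulated here with point-to-point sends, which is exactly the less efficient variant the text says a system-level broadcast avoids.

```python
import threading

ANY_SOURCE = -1  # hypothetical wildcard value: accept a message from any sender


class MessageSystem:
    """Toy model of Send/Broadcast/Receive among n processes (threads)."""

    def __init__(self, n):
        self.n = n
        # One inbox per process; each entry is a (source, data) pair.
        self.inbox = [[] for _ in range(n)]
        self.cond = [threading.Condition() for _ in range(n)]

    def send(self, source, destination, buffer):
        # Copy the message so the sender can continue and reuse its buffer.
        data = bytes(buffer)
        with self.cond[destination]:
            self.inbox[destination].append((source, data))
            self.cond[destination].notify_all()

    def broadcast(self, source, buffer):
        # Simulated with point-to-point sends (an assumption of this sketch).
        for destination in range(self.n):
            if destination != source:
                self.send(source, destination, buffer)

    def receive(self, me, source=ANY_SOURCE):
        # Block until a message from the requested source is available.
        with self.cond[me]:
            while True:
                for i, (src, data) in enumerate(self.inbox[me]):
                    if source == ANY_SOURCE or src == source:
                        del self.inbox[me][i]
                        return src, data
                self.cond[me].wait()
```

A receiver that names a source skips over earlier messages from other senders, while a wildcard receive takes the oldest message available.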


Several versions of the message-passing system were 
implemented in an attempt to explore efficient implementation 
techniques. In the first, and simplest, implementation, an object was 
created for each message, using the object management facilities of 
Chrysalis, the Butterfly operating system [4]. Each time a broadcast 
took place, the message system would create a remote object for 
each recipient and copy the message into the object. This approach 
yielded a very clean implementation, but the overhead of using the 
object management system for each message was too great. In 
addition, remote object creation requires the cooperation of a remote 
daemon process, which introduces contention on the remote 
processor. Later versions used preallocated objects for buffers and 
placed responsibility for copying a message with the receiver rather 
than the sender. The final implementation has the following 
properties: (1) buffers are preallocated on each processor and are 
eventually reused, (2) each sender copies the message from local 
memory into a local buffer object and updates the local ring buffer
pointers, (3) each broadcast recipient copies a message from a
remote buffer into local memory when a corresponding Receive is 
issued, and decrements a use count in the remote buffer using an 
atomic operation, and (4) synchronization for access to buffers uses 
primitive atomic operations implemented in micro-code. 
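
The buffer discipline of the final implementation can be sketched as follows. This is an illustrative model only: the class names BroadcastBuffer and Sender are hypothetical, a Python lock stands in for the micro-coded atomic operations, and raising an error when the next ring slot is still in use is an assumption (the original system may wait instead).

```python
import threading


class BroadcastBuffer:
    """One preallocated buffer with a use count (illustrative only)."""

    def __init__(self, size):
        self.data = bytearray(size)      # allocated once, reused afterwards
        self.length = 0
        self.use_count = 0               # recipients that still have to copy
        self._lock = threading.Lock()    # stands in for an atomic operation

    def is_free(self):
        with self._lock:
            return self.use_count == 0

    def fill(self, message, recipients):
        """Sender side: copy the message from local memory into the buffer."""
        if len(message) > len(self.data):
            raise ValueError("message larger than preallocated buffer")
        if not self.is_free():
            raise RuntimeError("buffer still in use by a previous broadcast")
        self.data[:len(message)] = message
        self.length = len(message)
        with self._lock:
            self.use_count = recipients

    def copy_out(self):
        """Recipient side: copy the message out, then decrement the count."""
        local = bytes(self.data[:self.length])
        with self._lock:
            self.use_count -= 1
        return local


class Sender:
    """Ring of preallocated buffers owned by one sending processor."""

    def __init__(self, slots, size):
        self.ring = [BroadcastBuffer(size) for _ in range(slots)]
        self.head = 0                    # local ring buffer pointer

    def broadcast(self, message, recipients):
        slot = self.ring[self.head]
        slot.fill(message, recipients)
        self.head = (self.head + 1) % len(self.ring)
        return slot
```

A buffer becomes reusable only after every recipient has copied the message and decremented the use count, which is why the copy is performed by the receiver rather than the sender.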


In the implementation of Gaussian elimination, a coordinator 
process is responsible for creating worker processes on each 
processor. All workers initialize the local message handler, which 
requires global synchronization, and then create a local partition of 
the problem. When N processors are used, each processor P is 
assigned the task of diagonalizing all rows R for which R mod N = 
P. To do so, each processor must request each row of the problem 
matrix in sequence and use it to eliminate entries in the 
corresponding column of the local partition. A fraction of these
requests, 1/N, will be satisfied locally and, therefore, do not require
a message. Local rows are broadcast to the other processors when 
they have been diagonalized. When all local rows are completely 
diagonalized, the worker process signals the coordinator, which is 
responsible for termination of the computation. 
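
The row assignment and the broadcast pattern described above can be simulated sequentially. The sketch below is not the original program: it performs plain forward elimination without pivoting (an assumption), models each worker as a dictionary of its own rows, and counts one broadcast per pivot row. The function names owner and eliminate are hypothetical.

```python
def owner(row, n_procs):
    """Processor that owns a row under the R mod N = P assignment."""
    return row % n_procs


def eliminate(matrix, n_procs):
    """Forward elimination with rows partitioned cyclically among workers.

    Returns the upper-triangular result and the number of broadcasts.
    Assumes every pivot is non-zero (no pivoting is performed).
    """
    n = len(matrix)
    # Each worker holds only the rows it owns.
    local = [
        {r: list(matrix[r]) for r in range(n) if owner(r, n_procs) == p}
        for p in range(n_procs)
    ]
    broadcasts = 0
    for k in range(n):
        # The owner of row k has finished updating it; copy it as a message.
        pivot = list(local[owner(k, n_procs)][k])
        if n_procs > 1:
            broadcasts += 1
        # Every worker eliminates column k in its own later rows.
        for p in range(n_procs):
            for r, row in local[p].items():
                if r > k:
                    factor = row[k] / pivot[k]
                    for j in range(k, len(row)):
                        row[j] -= factor * pivot[j]
    result = [local[owner(r, n_procs)][r] for r in range(n)]
    return result, broadcasts
```

Because only local rows are ever modified, the only data that crosses processor boundaries in this model is the pivot row, once per elimination step.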


Implementation Comparison 


In this section, we compare the two implementations of 
Gaussian elimination with respect to performance, scalability, and
ease of programming. Where possible, quirks in the implementation 
of a model are ignored, in order to concentrate on the methodology 
suggested by the model. 


Figure 1 shows the performance of the two implementations 
on a problem matrix of size 800x800. With four processors, the
message-passing implementation is about 30% faster than the shared
memory implementation. The relative performance is consistent
across different problem sizes and, in each case, the message-passing
implementation is more efficient. The disparity decreases as
additional processors are used until the “knee” in the performance 
curve of the message-passing system is reached. 


The improved performance of the message-passing 
implementation can be directly attributed to data locality. Nearly 
every data access in the shared memory implementation requires a
remote reference. Even when the data physically resides in local
memory, all data conceptually resides in shared memory and must 
be copied into the local workspace. Very few remote references are 
needed in the message-passing implementation (none, if only one 
processor is in use), and a copy takes place only when a remote 
reference is made. 


Exploiting locality of data also minimizes switch contention. 
Unlike the shared memory implementation, the message-passing 
implementation for Gaussian elimination showed no significant 
switch contention. A problem of size 800x800 was able to execute 
on 32 processors without alternate paths at roughly the same speed 
as with the alternate paths. Over the entire range of experiments, 
the effect of alternate communication paths improved the execution 
time of the message-passing implementation by at most 3%. 


Another measure of an application’s performance is how well 
the implementation is able to use additional processors. Ideally, if a 
problem of size NxN using P processors requires time T, the same 
problem can be solved on 2P processors in time T/2. The extent to
which the ideal is realized depends, in part, on the number of 
sequential operations in the implementation that depend on P. 


Figure 2 illustrates the scalability of the two implementations. 
The computation time for a program designed for a single processor 
is divided by the time required by the parallel version to yield the 
number of effective processors. With linear speedup, P actual 
processors yields P effective processors. As Figure 2 demonstrates, 
the message-passing implementation has nearly linear speedup with 
16 processors, and a slight deviation from linear speedup with 32 
processors. Speedup reaches a turning-point at 64 processors (31 
effective processors) and additional processors only serve to increase 
the execution time of the computation. 
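
The "effective processors" measure is simply the ratio of the single-processor time to the parallel time. The short sketch below states that definition and derives a per-processor efficiency from it; the function names are hypothetical, and the sample pairs in the usage note are the (actual, effective) processor counts quoted in this comparison.

```python
def effective_processors(t_serial, t_parallel):
    """Single-processor time divided by the parallel execution time."""
    return t_serial / t_parallel


def efficiency(effective, actual):
    """Fraction of each actual processor that is effectively used."""
    return effective / actual
```

For example, efficiency(31, 64) is a little under one half for the message-passing implementation at its turning point, and efficiency(2.8, 4) is 0.7 for the shared memory implementation on 4 processors.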


The shared memory implementation, on the other hand, has a
speedup factor that is less than linear, even for a small number of
processors. For example, 4 real processors show a speedup of 2.8
and 16 processors show a speedup of 10.6. Speedup reaches a
plateau when 64 processors are in use (30 effective processors).
Additional processors do not have much impact and, at best, were
able to yield 34 effective processors with 116 actual processors.


In order to explain the “knee” in the performance curve for
message-passing, we have to look at the implementation for 
operations that depend on P, the number of processors involved. 
Each broadcast operation requires time O(P). Since all data is local 
to some process, each process is responsible for modifying its local 
rows and broadcasting the results. Each processor that is to receive 
a broadcast must copy the message into local memory, so a 
broadcast requires P-1 processors to sequentially access a single 
remote memory. A total of N messages are broadcast to P-1
processors. Since each message is buffered locally before it is 
copied, the total number of copies is P*N. As P increases and N 
remains fixed, we reach a point of diminishing returns. Eventually, 
the time to copy an additional row in the presence of broadcast 
contention is not justified by the gain in parallelism. 


The shared memory implementation does not exhibit a “knee” 
because, after initialization, there are no operations in the shared 
memory implementation that depend on P. The Uniform System 
encourages the programmer to avoid memory contention by 
distributing the data uniformly throughout the available memory. 
Tasks are allocated to processors independently of the data to be 
accessed. This means that while contention for a particular memory 
is reduced, overall switch contention is increased because practically
all data references are remote. N² remote data copies, involving the
non-zero portion of a whole row, are required, so the total number
of copies is independent of the number of processors. Thus, if P is 
significantly smaller than N, as it should be to achieve high 
processor efficiency, the message-passing implementation will make 
many fewer copies than the shared memory implementation, 
although each copy (message) will take longer. Fewer copies will 
reduce both switch and memory contention, and improve 
performance, until the “knee” in the curve is reached. Beyond that 
point, additional copy operations cannot be justified by a significant 
gain in parallelism. 
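
The copy counts in this argument can be compared directly. The sketch below assumes the counts as stated in the text (P*N copies for message-passing, on the order of N squared for shared memory); the function names are hypothetical, and the counts ignore the fact that a message-passing copy moves a whole row and therefore takes longer.

```python
def message_passing_copies(n, p):
    """Total row copies when N rows are each buffered and sent to P-1 peers."""
    return p * n


def shared_memory_copies(n):
    """Remote row copies in the shared memory version, independent of P."""
    return n * n


def copy_ratio(n, p):
    """How many times more copies the shared memory version makes."""
    return shared_memory_copies(n) / message_passing_copies(n, p)
```

For the 800x800 problem on 32 processors the ratio is N/P = 25, which shrinks toward 1 as P approaches N.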


Our third criterion for comparison is ease of programming. 
Programming models are employed so that each application can be 
written without the knowledge of too many underlying details. 
Thus, an important criterion of any model is that it facilitate 
program construction. One concrete measure is the amount of user 
code that needs to be written. For Gaussian elimination, the 
amount of user code is comparable in both implementations. The 
Uniform System implementation contains 426 lines of user code, 
including comments, compared with 368 lines of user code in the 
message-passing implementation. 


A related measure is the amount of user code unrelated to the 
specific application. That is, to what extent does the user code have 
to deal with the details of communication and synchronization? In 
the message-passing implementation, there is one call to Broadcast 
and another to Receive. There is no explicit synchronization in the 
user code. The shared memory implementation uses primitive 
atomic operations for synchronization and several calls to transfer 
blocks of memory. There are also spin locks that are sensitive to the 
amount of time delay between attempts to set the lock. 


Conclusions 


The original intention of this work was to show that message- 
passing could play an important role in the design of a programming 
environment for a tightly-coupled multiprocessor. This exercise was 
designed to provide some empirical data that suggests how best to 
structure communication between processes in a tightly-coupled
multiprocessor. We offer the following conclusions. 


The particular model of computation in use is less important
than how well it is matched to the application. Gaussian elimination
is an appropriate application for a multiprocessor; however, this does
not mean that the shared memory model is the best fit for this
application. Since Gaussian elimination is essentially value-oriented
(no addresses are communicated and rows are not used until they
have stabilized), message-passing is a better model of the
communication that takes place. There are other applications for
which message-passing would not be appropriate. Any
programming environment that offers a single model of
communication will not be well-matched to a large class of
applications.


A high-level interface, efficiently implemented, leaves the 
programmer fewer opportunities to introduce inefficiency. Of course 
this assumes that the high-level interface provides a useful 
abstraction. The Uniform System offers no high-level
synchronization primitives; the two primitives provided are a busy 
wait LOCK and an UNLOCK primitive. Each programmer must 
use these to implement the appropriate synchronization. Spin locks 
are commonly used, but the resultant program can be especially 
sensitive to the amount of time spent between attempts to set the 
lock [5]. The only process synchronization in the message-passing 
implementation occurs during access to message buffers, which is 
implemented very efficiently using micro-coded atomic operations. 
The user does not need to be concerned with low-level 
synchronization; it is implicit in the message-passing primitives. 
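
The kind of synchronization a programmer has to build from a busy-wait lock can be sketched as below. This is not the Uniform System interface: the class SpinLock and its delay parameter are hypothetical, and a non-blocking acquire on a Python lock stands in for an atomic test-and-set. The delay between attempts is the tuning parameter the text warns about: too short and waiting processors generate contention, too long and they idle after the lock is released.

```python
import threading
import time


class SpinLock:
    """Busy-wait lock with a fixed delay between attempts (illustrative)."""

    def __init__(self, delay=0.0001):
        self._flag = threading.Lock()   # stands in for atomic test-and-set
        self.delay = delay              # seconds to wait between attempts

    def lock(self):
        # Keep trying to set the lock, pausing between failed attempts.
        while not self._flag.acquire(blocking=False):
            time.sleep(self.delay)

    def unlock(self):
        self._flag.release()


def add_under_lock(spin, totals, repeats):
    """Increment a shared counter, taking the spin lock for each update."""
    for _ in range(repeats):
        spin.lock()
        try:
            totals["count"] += 1
        finally:
            spin.unlock()
```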


The performance of an application depends not only on the 
efficiency of communication, but also on the extent to which the 
underlying model of computation encourages or discourages 
communication. The Uniform System model does not encourage the 
programmer to exploit locality. Data is assumed to reside in one 
large, uniform address space and the boundaries between processors 
are ignored. Thus, in Gaussian elimination, each task must copy 
two rows into local memory to eliminate a single entry and then 
copy the result back. (Pivot rows were cached to avoid one of these 
copies.) This is not the case with the message-passing model, since 
all data is local to some process in the computation. Locality makes 
it possible to avoid copying any row to be modified (since only local 
rows are ever modified) and also to avoid copying any pivot row 
that happens to be local. Thus, an implementation based on very 
efficient communication (e.g., shared memory) may perform worse 
than one based on a less efficient mechanism (e.g., message-passing), 
if such efficiency encourages too much communication.


ABSTRACT


Experimental hierarchical multiprocessor systems, such
as the Cm* [1], [2], the EGPA [3] machine, and the Cedar [4], show
that hierarchical systems encompass a range of processor
configurations and interprocessor communication links. Cost
and performance models are developed for the components of
a hierarchical system where all processors share a global
memory and processors are grouped into clusters that share a
cluster memory. With these models, it is shown that a unique
combination of processor speeds and numbers best achieves a
given performance goal. Given the distribution of interprocessor
communication, the cluster size is chosen to minimize average
communication delay. When the patterns of communication
correspond directly to some parallel architecture, the
analysis indicates the effectiveness of the hierarchical system
in emulating that architecture. Our primary objective is to
compare various processor configurations in large systems.
Thus, the models we have developed ignore factors that tend
to have a similar effect on a wide range of configurations.


1. INTRODUCTION 


The shared memory model, in which all p processors
access a common memory in constant time, has been widely
used in parallel processing research. Crossbar switches are too
expensive to construct for large p. Multi-stage interconnec- 
tion networks have order lg p stages and a direct realization 
of the shared memory model results in an order lg p increase 
in delay as p increases. Local memory for each processor can 
decrease memory latency to some extent because a large frac- 
tion of all references are to variables that are private to each 
processor. An extension of this approach is to organize proces- 
sors in groups (clusters) with a local cluster memory shared 
by each group of processors. Communication between proces- 
sors in the same cluster is done through the cluster memory. 
Access to global memories is required primarily to communi- 
cate among processors in different clusters. Such a two-level 
hierarchical system may be extended to several levels. In this 
paper, we focus on two-level hierarchies. 


As in uniprocessor systems, a large proportion of all
references are to a small set of storage locations. The development
of methods to ensure that the highest level in the hierarchy
is populated mainly by this set has led to the success of
cache-based systems. Furthermore, for many applications, it
is possible to assign tasks so that interprocessor communication
occurs mainly within small groups of processors. It is for
these applications that providing a faster set of communication
links within groups of processors results in improved
performance. The models that we use here are similar in some
respects to those in Welch [5] and in Vrsalovic et al. [6].


We briefly describe three hierarchical systems. The 
Cm* is a two-level hierarchical multiprocessor system [1]. The 
Cm* hardware is made up of 50 processor-memory pairs called 
Computer Modules or Cm’s [2]. These are grouped to form 
clusters containing up to 14 Cm’s. Communication within 
the cluster is via a parallel bus controlled by a Kmap, which 
is an address mapping processor. There are five clusters and 
these communicate via intercluster buses. The Cm* is extensible,
either by adding processors to each cluster (up to 14) or
by increasing the number of clusters. The Cedar [4] uses a
crossbar interconnection between the processors within a cluster
and the cluster memory they share, and a multistage interconnection
network between all processors and a global
memory shared among all clusters. At present there are up to
8 processors per cluster and the number of clusters is extensible.


The Erlangen general purpose array, EGPA [3], is exten- 
sible only in the number of levels. Each processor except for 
the ones at the lowest level is connected to the memories of 
four subordinate processors. Data transfer is done on the 
EGPA via the common memories. Processors and memories in 
the same level are connected by additional links. Although 
hierarchical and tree are often used interchangeably, we 
assume that a hierarchical machine has processors at the leaf 
nodes and communicating nodes or memories at the internal 
nodes, whereas a tree machine has processors at internal as 
well as leaf nodes. We thus classify the EGPA as a tree 
machine. The models we develop are directly applicable only 
to hierarchical systems. 


Several design decisions involve specifying gross system 
parameters. For instance, a mesh architecture can be specified 
by the size of the mesh, the speed of the individual processors, 
and the communication bandwidth and delay of the links. In 
a hierarchical (cluster) system, parameters include the number 
of processors and their speed, the number of levels in the
hierarchy, the type of interconnection at each level, and the
number of clusters. We have developed cost and performance
models and techniques that use them to make some of these
decisions, given knowledge of the workload as input. 


The efficient utilization of the resources in a multiprocessor
system requires a match between the architecture and the
application. Inter-task and interprocessor communication are 
determined by the algorithms used. Structured programming 
with hierarchical design may improve the mapping to a 
hierarchical multiprocessor. The importance of reducing 
interprocessor communication is well described in [7]. In order 
to examine the general suitability of hierarchical systems, we 
evaluate their ability to emulate the patterns of communica- 
tion encountered in typical code for some important applica- 
tions. We develop some methods to choose a processor 
configuration that is best suited for the patterns considered. 


We present models for the components of a hierarchical 
system in section 2. In section 3, we show how these models 
help to obtain an optimum distribution of costs among the 
components. In section 4, we are concerned with the choice of 
cluster size. For some patterns of communication commonly 
encountered, we show that the cluster size can be optimized 
with respect to average communication delay. In section 5, we 
determine the cluster sizes that will minimize interprocessor 
communication delay for emulating certain processor net- 
works. 


2. COST AND PERFORMANCE MODELS 


2.1. Parameters 


We list some of the parameters that characterize a two-level
hierarchical system (Fig. 2.1).

p - number of processors in the system. 

s - speed of a single processor. 

c - number of processors in a cluster. 

p/c - number of clusters in a two-level system. 
For a single application with a uniform pattern of communica- 
tion, a uniform cluster size, c, is the best choice. 


2.2. Cost and Access Times 


We propose models for the three major components in a
two-level hierarchical system, viz. the processors, the cluster
interconnection, and the global interconnection. These models are
simple since the primary purpose is to develop quantitative
methods to obtain a better understanding of some tradeoffs in
large systems [8]. For example, we show a tradeoff between
processor cost and communication structure cost. Other
models can be substituted for the ones used here, but the basic
analysis technique remains unchanged.


Fig. 2.1. Model of Communication Delay for a
Hierarchical Multiprocessor.
(cluster interconnection with lg c delay to each cluster memory;
global interconnection network with an additional delay of
lg (p/c) = lg p - lg c to the global memory)


2.2.1. Processors 


In order to focus on the configuration issues of concern 
here, a processor is characterized by only a single parameter, 
speed. All processors are required to have the same speed. 


One model for the individual processor cost as a function
of speed is the power function $c(s) = s_0 s^{\alpha}$. With this model,
multiprocessing is a cost-effective option only if $\alpha > 1$. If
$\alpha < 1$, the performance of a uniprocessor system can be
improved at sublinear cost, while a multiprocessor system
with additional processors can result in at most linear
speedup. Such a power function is fairly commonly used [5]
and is mathematically tractable. The boundary beyond 
which multiprocessing is undesirable can be stated concisely. 
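
The role of the cost exponent can be checked numerically. The sketch below assumes the power-function cost of a processor of speed s is s0 times s raised to the exponent alpha, and compares one fast processor with k slower ones of the same aggregate speed. It assumes ideal linear speedup and ignores interconnection cost, so it only illustrates the boundary at alpha = 1; the function names and the sample values are hypothetical.

```python
def processor_cost(speed, s0, alpha):
    """Cost of one processor of the given speed under the power-function model."""
    return s0 * speed ** alpha


def split_cost(total_speed, k, s0, alpha):
    """Cost of k processors whose speeds add up to total_speed."""
    return k * processor_cost(total_speed / k, s0, alpha)
```

Splitting into k processors multiplies the processor cost by k raised to the power (1 - alpha), so it is cheaper when alpha exceeds 1 and more expensive when alpha is below 1.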


2.2.2. Global Interconnection 


There are several strategies for interconnecting a large 
number of processors [9]. The ring and the mesh networks
(Figs. 5.1 and 5.2) are good for certain applications, but they 
have a large diameter. For general purpose multiprocessors, 
multi-stage packet switched networks are usually most suit- 
able for global interconnection and we model such networks 
here.


In such a network, there are p requesters and a similar 
number of memory banks serving these requesters. The 
memory banks together constitute the global memory (Fig. 
2.1). The number of switches in each stage is proportional to 
p. There are lg p stages in the network. The total number 
of switches and the number of wires connecting switches are
thus proportional to p lg p and the cost of the global interconnection
is modeled as p lg p. We assume that each switch
operates in unit time. Processor speed, s, may thus be viewed
as being relative to switch speed.


Since the number of switches that have to be traversed
to reach the server is lg p, the model access time is proportional
to lg p. Another rationale for modeling the access time
or communication delay as lg p is the following. Since the
number of possible destinations is p and the switches are
binary in nature, the data should be switched through at least
lg p switches. Accordingly, the model delay of a link in a
mesh is two, since there are four possible destinations. To 
obtain a fair comparison, it is necessary to penalize, with a 
larger delay, those parallel architectures with many communi- 
cation links per processor. 
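
The global interconnection model reduces to two formulas, which the following sketch evaluates. The function names are hypothetical; the constants of proportionality are taken to be 1, as in the model, and lg denotes the base-2 logarithm.

```python
import math


def global_network_cost(p):
    """Model cost of a multi-stage network for p processors: p lg p."""
    return p * math.log2(p)


def global_access_delay(p):
    """Model access time: one unit-time switch per stage, lg p stages."""
    return math.log2(p)


def link_delay(destinations):
    """Model delay of a link that selects among the given destinations."""
    return math.log2(destinations)
```

With four possible destinations per mesh link, link_delay(4) gives the delay of two used for the mesh in the text.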


2.2.3. Cluster interconnection 


At the cluster level, a wider range of interconnection 
schemes is possible, since the number of processors is small 
[10]. We show that a crossbar for the cluster interconnection 
is not feasible if the cluster size is to be larger than lg p and
we may be constrained to multi-stage interconnection networks
within such a large cluster.


A desirable if not essential characteristic of a large multiprocessor
architecture is scalability. Scalability as it pertains
to cost requires that the cost of a system should not
increase much faster than the number of processors. We have
seen that the cost of the global interconnection increases as
p lg p. We require that the total cost of all cluster interconnections
be similarly bounded.


Consider the choice of a crossbar for the cluster interconnection.
Since there are c² switches in a crossbar of size c, the
cost of the crossbar increases as c². Since there are p/c
crossbars, the total cost of the cluster crossbars is proportional
to p·c. Since this cost must be bounded by p lg p, the
size of a cluster, c, should not increase faster than lg p, as p
increases.


As we shall see for some cluster communication pat- 
terns, cluster size should increase faster than lg p to ensure 
minimum communication delay. A multi-stage interconnec- 
tion network within the cluster may then be more suitable. 
We assume a delay of lg c for a cluster interconnection. 
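
The scalability constraint on cluster crossbars can be checked with a few lines of arithmetic. The sketch below assumes a c-by-c crossbar needs c squared crosspoint switches, that c divides p evenly, and that all constants of proportionality are 1; the function names are hypothetical.

```python
import math


def crossbar_total_cost(p, c):
    """Total crosspoints in p/c cluster crossbars, each of size c by c."""
    return (p // c) * c * c


def crossbar_is_scalable(p, c):
    """True if the cluster crossbars cost no more than the p lg p bound."""
    return crossbar_total_cost(p, c) <= p * math.log2(p)


def cluster_delay(c):
    """Model delay of a multi-stage cluster interconnection: lg c."""
    return math.log2(c)
```

For p = 1024 the bound is lg p = 10, so clusters of 8 pass the test while clusters of 16 do not.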


3. COST TRADEOFFS BETWEEN PROCESSORS 
AND THE COMMUNICATION NETWORK 


The wide disparity in the cost of microprocessors and 
large scientific computers often evokes the suggestion of gen- 
erating supercomputing power from a multi-microprocessor 
system. In a large multiprocessor, the interprocessor commun- 
ication links can account for a significant fraction of the total 
cost. In our model, the cost of the interconnection network 
increases faster than the number of processors. Even if choosing
a larger number of cheaper, slower processors to attain a
required system performance reduces the total processor cost,
the larger interconnection network might increase total cost.
Additionally, as the number of processors increases the utiliza- 
tion of the processors decreases, where utilization is the frac- 
tion of the total number of processors that can be effectively 
used. As a result of these factors, higher levels of performance
are best achieved by using sufficiently fast processors as system
components. In this section, we develop quantitative
methods for evaluating such tradeoffs, e.g. to determine the
number and speed of the processors.


A combined analysis of the factors affecting cluster size,
c, and system size (the number of processors), p, is difficult.
Although the cluster size can affect the cluster interconnection
cost, we approximate the total cluster interconnection cost to
be a function of system size, p. This enables us to find the
system size, p, without a knowledge of the cluster size.
Further, we approximate the interconnection cost to be $p^{\gamma}$.
Here $\gamma$ is chosen so that $p^{\gamma}$ approximates the model cost of
p lg p for the range of system sizes under consideration. This
enables us to develop closed form expressions. Iterative 
solutions can be obtained without this additional assumption. 
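
One way to pick an exponent so that a power of p tracks p lg p over a range of system sizes is sketched below. The paper does not say how its exponent was chosen, so both functions are assumptions: the first gives the exponent that matches exactly at a single size, and the second is a least-squares fit in log space over a list of sizes.

```python
import math


def gamma_at(p):
    """Exponent g for which p**g equals p * lg(p) exactly at this p."""
    return 1.0 + math.log(math.log2(p)) / math.log(p)


def fit_gamma(p_values):
    """Least-squares fit of log(p lg p) = g * log(p) over the given sizes."""
    num = sum(math.log(p) * math.log(p * math.log2(p)) for p in p_values)
    den = sum(math.log(p) ** 2 for p in p_values)
    return num / den
```

The fitted value is a weighted average of the pointwise exponents, so for sizes between 64 and 1024 it falls between gamma_at(1024) and gamma_at(64).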


The parameters of the model are listed below.

$P_0$ - desired performance level.

$s$ - individual processor speed.

$s_0 s^{\alpha}$ - individual processor cost.

$p^{\gamma}$ - communication structure cost.

$p^{-\beta}$ - processor utilization factor.

$C$ - total cost of the system.

We model the utilization factor as $p^{-\beta}$, because the utilization
factor is generally observed to decrease as the number of processors
increases. If we obtain linear speedup, $\beta = 0$ and the
processor utilization factor is one. If there is no speedup,
$\beta = 1$ and this factor is $1/p$. Ordinarily, $\beta$ lies between 0 and
1, where smaller values imply better utilization.


The objective is to attain performance $P_0$ with minimum
total cost, $C$. The total processing capacity is $s \cdot p$, of which a
fraction $p^{-\beta}$ is used. Therefore,

$$P_0 = s \cdot p \cdot p^{-\beta} = s\,p^{1-\beta} \tag{3.1}$$

The cost $C$ is the sum of the processor and the communication
structure costs, namely

$$C = p \cdot s_0 s^{\alpha} + p^{\gamma} \tag{3.2}$$

From (3.1),

$$s = P_0\,p^{\beta-1} \tag{3.3}$$

Substituting (3.3) in (3.2),

$$C = s_0 P_0^{\alpha}\,p^{1-\alpha(1-\beta)} + p^{\gamma} \tag{3.4}$$


Fig. 3.2. Plot of Individual Processor Speed, $s$ (MFlops), vs.
Performance Requirement, $P_0$ (MFlops).
(from (3.7) in (3.3) with $\beta = 0.1$, $\gamma = 1.4$, $s_0 = 25.0$)

Fig. 3.3. Plot of System Size, $p$, vs. Processor Cost Exponent, $\alpha$.
(from (3.7) with $\gamma = 1.4$, $s_0 = 25.0$, $P_0 = 1000$)


Differentiating (3.4),

$$\frac{dC}{dp} = [1 - \alpha(1-\beta)]\,s_0 P_0^{\alpha}\,p^{-\alpha(1-\beta)} + \gamma\,p^{\gamma-1} \tag{3.5}$$

Setting (3.5) to zero, we note that there is a solution only if

$$\alpha(1-\beta) > 1 \tag{3.6}$$

In Fig. 3.1 a uniprocessor solution is best in the region below
the curve. Within this region, if a uniprocessor system with
cost $C'$ achieves performance $P_1$ and a two-processor system
with individual processor cost $C'/2$ achieves performance $P_2$,
then $P_1 > P_2$. This can be shown by substituting for $s$ from
(3.2) in (3.1), assuming that the interconnection cost is zero.


If (3.6) holds, we can set (3.5) to zero and solve for $p$.
We obtain the following closed form solution for $p$.

$$p = \left[ \frac{s_0 P_0^{\alpha}\,\bigl(\alpha(1-\beta) - 1\bigr)}{\gamma} \right]^{\frac{1}{\gamma - 1 + \alpha(1-\beta)}} \tag{3.7}$$


From (3.3) we can now find the processor speeds (Fig. 3.2). 
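
The closed form can be evaluated and checked numerically. The sketch below assumes the model reads as follows: performance P0 = s * p**(1 - beta), processor cost s0 * s**alpha per processor, and interconnection cost p**gamma, so that the total cost as a function of p is s0 * P0**alpha * p**(1 - alpha*(1 - beta)) + p**gamma. The function names are hypothetical, and the values in the usage note combine parameters quoted in the figure captions with an exponent of 1.5 chosen for illustration.

```python
def total_cost(p, P0, s0, alpha, beta, gamma):
    """Total cost after eliminating s with s = P0 * p**(beta - 1)."""
    return s0 * P0 ** alpha * p ** (1 - alpha * (1 - beta)) + p ** gamma


def optimal_system_size(P0, s0, alpha, beta, gamma):
    """Number of processors that minimizes total_cost (closed form)."""
    k = alpha * (1 - beta)
    if k <= 1:
        # No interior minimum: a uniprocessor is the best choice.
        raise ValueError("requires alpha * (1 - beta) > 1")
    return (s0 * P0 ** alpha * (k - 1) / gamma) ** (1 / (gamma - 1 + k))


def processor_speed(P0, p, beta):
    """Speed each of the p processors needs to reach performance P0."""
    return P0 * p ** (beta - 1)
```

For instance, optimal_system_size(1000, 25.0, 1.5, 0.1, 1.4) followed by processor_speed(1000, p, 0.1) gives one point of the tradeoff; the result is a real number and would be rounded to a whole number of processors in practice.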


In Fig. 3.2, we see that higher performance is best
achieved by increasing processor speeds in addition to increasing
the number of processors. For higher processor cost
exponents, $\alpha$, corresponding to rapidly increasing processor
cost versus speed, the same performance is best achieved by a
greater number of slower processors (Figs. 3.2 and 3.3). If the
cost of interconnecting processors is larger (larger $\gamma$), it is more
economical to use fewer and therefore faster, more expensive
processors to reduce the interconnection cost at the expense of
processor cost (Fig. 3.4).


4. CLUSTER SIZES FOR VARIOUS APPLICATIONS 


In the previous section, we developed techniques to determine
the number of processors and their speeds. In a two-level
cluster system, specifying the cluster size completes the
processor configuration. If the cluster interconnection has
nearly a linear cost-size relationship, the choice of cluster size
has little effect on the cost of the system. The choice of cluster
size does, however, have a significant effect on interprocessor
communication delay. If the cluster interconnection is a
multi-stage interconnection network, a cluster size is best
determined from performance considerations. Though otherwise
identical two-level systems with different cluster sizes
may not appreciably differ in cost, there may be a significant
cost difference between such a system and a similar system,
like the NYU Ultracomputer [11], with a single global memory
and no cluster memories. In this section, we propose some
methods for determining the best cluster size.


Fig. 3.4. Plot of System Size, $p$, vs. Interconnection
Cost Exponent, $\gamma$.
(from (3.7) with $\alpha = 1.5$, $\beta = 0.1$, $P_0 = 1000$)


4.1. Cluster Size from Probability Distribution 


We first propose a model for the distribution of references
between processors. We assume that all references to cluster and
global memory are to communicate between processors in the same
cluster and in distinct clusters, respectively. This assumption is
not perfectly correct since the cluster and even the global memory
may be used to avoid duplication of variables among local
memories, which are private to each processor. We choose any
processor $p_i$ and determine the distribution of interprocessor
communication $c_i(p_j)$ between processor $p_i$ and the other
processors $p_j$ in the system. The $c_i(p_j)$ are scaled so that
$\sum_j c_i(p_j) = 1$. We assign an integer rank $r_i(p_k)$ to a
processor $p_k$ so that $r_i(p_k) = l$ if there are $l$
processors, $p_j$, with $c_i(p_j) \ge c_i(p_k)$. We plot the
communication $c_i(p_j)$ versus the rank $r_i(p_j)$ of each
processor. We call this the communication distribution curve and
by definition it is a monotonically non-increasing curve. We
assume that such plots for all the processors are identical, i.e.,
all processors have similar patterns of communication.
Additionally, we assume that if we choose a cluster size of c, we
can obtain a compatible assignment of processors to disjoint
clusters. In a compatible assignment, for each processor $p_i$,
the processors $p_j$ with rank $r_i(p_j)$ less than c belong to
the cluster containing $p_i$. With such a model, we show that
there is an optimum cluster size which minimizes average
communication delay.


Example: Consider a system of four processors, $p_1$ through
$p_4$. Let the fraction of communication between any two
processors be specified by the following table.


Communication between Processors
(total communication of each processor normalized to 1)

            p_1    p_2    p_3    p_4
    p_1      -     0.7    0.2    0.1
    p_2     0.7     -     0.1    0.2
    p_3     0.2    0.1     -     0.7
    p_4     0.1    0.2    0.7     -


Table 4.1. Distribution of Interprocessor Communication.


The first step involves selecting a processor, say $p_3$. The next
step involves ranking the remaining processors. Since the
processor communicating the most with $p_3$ is $p_4$, the rank of
$p_4$, $r_3(p_4)$, is 1. The ranks $r_3(p_1)$, $r_3(p_2)$ of $p_1$
and $p_2$ are 2 and 3 respectively. We can now plot the curve of
fraction of communication versus rank through the points (1,0.7),
(2,0.2) and (3,0.1). Such curves for the other processors are
identical to the one for $p_3$. Choosing a cluster size of 2 and
assigning {$p_1$, $p_2$} to one cluster and {$p_3$, $p_4$} to the
second cluster forms a compatible assignment. We note that, by our
model, the average communication delay with no cluster memory is
lg 4 = 2. With a cluster size of 2, the average communication
delay is 0.7 lg 2 + (0.2 + 0.1) lg 4 = 0.7 + 0.6 = 1.3.
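
The four-processor example can be checked with a short Python sketch. Only the row for the third processor is stated explicitly in the text; the other rows of `COMM` are an assumption, chosen symmetric so that every processor has the same distribution curve. The names `COMM`, `ranks`, and `average_delay` are illustrative only.

```python
import math

# Fractions of communication between processors 1..4 (each row sums to 1).
# Only the row for processor 3 is given in the text; the remaining rows are
# assumed (symmetric, with identical distribution curves).
COMM = {
    1: {2: 0.7, 3: 0.2, 4: 0.1},
    2: {1: 0.7, 3: 0.1, 4: 0.2},
    3: {1: 0.2, 2: 0.1, 4: 0.7},
    4: {1: 0.1, 2: 0.2, 3: 0.7},
}


def ranks(i):
    """Rank the partners of processor i: rank 1 is the heaviest communicator."""
    ordered = sorted(COMM[i], key=lambda j: COMM[i][j], reverse=True)
    return {j: r for r, j in enumerate(ordered, start=1)}


def average_delay(clusters, p):
    """Average delay: lg c for traffic inside a cluster, lg p for the rest."""
    total = 0.0
    for cluster in clusters:
        c = len(cluster)
        for i in cluster:
            for j, frac in COMM[i].items():
                total += frac * (math.log2(c) if j in cluster else math.log2(p))
    return total / p


print(ranks(3))
print(average_delay([{1, 2}, {3, 4}], 4))   # cluster size 2
print(average_delay([{1, 2, 3, 4}], 4))     # one cluster of size p
```

Treating the whole system as one cluster of size p reproduces the lg p delay of the system without cluster memory under this model.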


Given a continuous communication distribution curve $f(q)$, we
can find an optimum cluster size. With cluster size c, the average
communication delay, $D_c$, is given by

$$D_c = \lg c \int_0^{c-1} f(q)\,dq + \lg p \int_{c-1}^{p-1} f(q)\,dq \tag{4.1}$$

From the definition of the communication distribution curve,
it follows that

$$\int_0^{p-1} f(q)\,dq = 1 \tag{4.2}$$

Eqn. (4.1) can thus be rewritten as

$$D_c = \lg p - (\lg p - \lg c) \int_0^{c-1} f(q)\,dq \tag{4.3}$$

The original objective of $\min_c D_c$, given p, is equivalent to

$$\max_c \left[ (\lg p - \lg c) \int_0^{c-1} f(q)\,dq \right] \tag{4.4}$$

If $f(q)$ is known, we can determine the cluster size that
minimizes average communication delay from (4.4).


We examine the changes in optimum cluster size for a class of
communication distribution curves of the form
$f(q) = f_0 \, 2^{-\lambda q}$, where $\lambda$ is a parameter and
$f_0$ is chosen to normalize $f(q)$. Solving

$$\int_0^{p-1} f_0 \, 2^{-\lambda q}\,dq = 1 \tag{4.5}$$

we obtain

$$f_0 = \frac{\lambda \ln 2}{1 - 2^{-\lambda(p-1)}} \tag{4.6}$$

Substituting for $f(q)$ in (4.4), the objective is to maximize

$$(\lg p - \lg c)\,\frac{\lambda \ln 2}{1 - 2^{-\lambda(p-1)}} \int_0^{c-1} 2^{-\lambda q}\,dq \tag{4.7}$$

Differentiating (4.7) with respect to c, simplifying and setting
to zero, we obtain the following implicit relationship.

$$\lg c + \frac{2^{\lambda(c-1)} - 1}{\lambda c \ln^2 2} = \lg p \tag{4.8}$$

Once p and $\lambda$ are known, we can solve (4.8) iteratively for
the optimum cluster size, c.
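
One possible way to carry out the iterative solution is bisection, sketched below in Python. It assumes (4.8) in the form lg c + (2^(λ(c-1)) - 1)/(λ c ln² 2) = lg p, whose left side increases with c; the function names and the overflow guard are illustrative assumptions, not part of the original analysis.

```python
import math

LN2 = math.log(2.0)


def residual(c, p, lam):
    """Left side minus right side of (4.8) for a real-valued cluster size c."""
    expo = lam * (c - 1.0)
    if expo > 1000.0:  # 2**expo would overflow; the residual is certainly positive
        return math.inf
    return math.log2(c) + (2.0 ** expo - 1.0) / (lam * c * LN2 ** 2) - math.log2(p)


def optimum_cluster_size(p, lam, tol=1e-9):
    """Bisection on [1, p]: the residual is negative at c = 1, positive near c = p."""
    lo, hi = 1.0, float(p)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid, p, lam) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for lam in (0.1, 0.5, 1.0, 2.0):
    c = optimum_cluster_size(512, lam)
    print(f"lambda = {lam}: c = {c:.2f}, lg c = {math.log2(c):.2f}")
```

Since c is normally restricted to powers of 2, the real-valued root would in practice be rounded to a neighboring power of 2 and the two candidates compared using (4.3).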


Since the choice of cluster size, c, is ordinarily constrained
to powers of 2, we plot the logarithm of cluster size, lg c,
against the communication exponent, $\lambda$ (Fig. 4.1). When the
communication exponent, $\lambda$, is large, a large proportion of
the interprocessor communication is to a few processors. In this
case, it is best to choose a small cluster size, c. A larger
cluster size will slow down all references and will not
sufficiently increase the fraction of total interprocessor
communication satisfied directly by the clusters.


Fig. 4.1. Plot of Cluster Size, c, vs. Communication
Exponent, $\lambda$ (p = 512). Vertical axis: $\log_2$ of cluster
size, lg c.


4.2. Cluster Sizes for Fixed Degree Networks 


A hierarchical multiprocessor has been proposed as a general
purpose multiprocessor, with respectable efficiency over a range
of applications. We consider a frequently encountered class of
applications, namely problems having a pattern of communication in
which each node communicates with a fixed number of neighboring
nodes. The communication graph for this class of problems is
regular with a fixed degree, d. Additionally, we require that a
network of size p be reducible to a network of size p/c, so that c
nodes in the original network are mapped onto a single node in the
reduced network and the reduced network belongs to the same class
of networks. In [12], the computation factor, $f_c$, is defined to
be the number of nodes mapped onto a single node and the exchange
factor, $f_e$, is defined to be the number of links that map onto
a single link. For the restricted class of networks and reductions
under consideration, the computation factor, $f_c$, is always c
since we map networks of size p to p/c clusters, each with c
processors. For fixed degree networks, the exchange factor, $f_e$,
lies between 1 and c. This technique is described further in [12].


Many common networks satisfy these requirements. For instance,
each node in a mesh (Fig. 5.2) has four neighbors regardless of
the size of the mesh. A mesh can be reduced to a smaller mesh by
coalescing square meshes of size $\sqrt{c}$ by $\sqrt{c}$. The
computation factor $f_c$ is c and the exchange factor $f_e$ is
$\sqrt{c}$. For a mesh, the degree of reduction, c, changes the
compute/communication balance at the global level by a factor of
$f_c/f_e = \sqrt{c}$. In general, the improvement of a
hierarchical system over a multiprocessor system with a single
global memory depends on this ratio for the application. There are
many applications and algorithms having a mesh-like pattern of
communication. It may be best to execute such an application on a
mesh, but a mesh may not be adequate for other algorithms that
have other patterns of communication. A single hierarchical
multiprocessor may adequately support a broad range of other
patterns.


A multiprocessor system with a single global memory has a
communication delay of lg p. Adding a cluster interconnection to
such a system for faster communication between processors within a
cluster involves an additional cost which must be justified by
performance improvement. We require that a hierarchical system
with the same number of processors have an average delay of
(lg p)/k, where the divisor k is intended to compensate for the
additional cost of the cluster interconnection and memory. With
such a requirement, we show that some range of cluster sizes can
be excluded from further consideration for the class of
applications described in the first paragraph of this section. We
are able to exclude both very small and very large cluster sizes
by making a few assumptions on the workload.


We can now measure average communication delay for a
hierarchical system with clusters of size c as

$$D_c = \frac{a_c \lg c + a_g \lg p}{a_c + a_g} \tag{4.9}$$

where $a_c$ and $a_g$ are the number of cluster and global
accesses respectively. We require

$$D_c < \frac{\lg p}{k} \tag{4.10}$$

The nodes within a cluster of size c make a total of $d c$
accesses, since the communication graph has degree d. Since the
reduced communication graph has degree d with $f_e$ links of the
original graph mapped onto each link in the reduced graph, the
mapping of nodes onto clusters causes each cluster to make
$d f_e$ global accesses to other clusters. The number of accesses
that are served within a particular cluster is $d c - d f_e$.
From (4.9) and (4.10),

$$\frac{d(c - f_e) \lg c + d f_e \lg p}{d c} < \frac{\lg p}{k} \tag{4.11}$$

Solving for lg c in (4.11),

$$\lg c < \frac{c - k f_e}{k (c - f_e)} \lg p \tag{4.12}$$

This relationship is always satisfied if k = 1, because the
average communication delay can be no worse than lg p. For
k > 1, it cannot be satisfied when $c > p^{1/k}$. With a cluster
size larger than $p^{1/k}$, an average delay greater than
(lg p)/k is to be expected, because even the cluster reference
delay is more than (lg p)/k. Since $c > f_e$, if $c < k f_e$ the
right hand side in (4.12) is negative, requiring the cluster size
to be less than one (an impossibility). So a lower bound on
cluster size is $k f_e$. In general, $f_e$ is a function of c. For
a mesh, $f_e$ is $\sqrt{c}$ and the lower bound on cluster size is
$k^2$. These results can be summarized as

$$k f_e < c < p^{1/k} \tag{4.13}$$


Tighter upper bounds on the cluster size, c, can be developed
if the exchange factor is known. In Fig. 4.2, we plot upper and
lower bounds on c obtained by an iterative solution of (4.11) with
$f_e = \sqrt{c}$ and k = 1.5. For system sizes smaller than 64, no
choice of cluster size can yield the desired improvement factor.

Fig. 4.2. Feasible Cluster Size Region vs. System Size, p,
for a Mesh with Improvement Factor, k = 1.5.
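
The feasible region for a mesh can be explored numerically. The Python sketch below assumes the mesh values used in this section (cluster accesses proportional to c - sqrt(c), global accesses proportional to sqrt(c), so the degree d cancels) and simply scans real-valued cluster sizes; `mesh_delay` and `feasible_range` are hypothetical helper names.

```python
import math


def mesh_delay(c, p):
    """Average delay from (4.9) with a_c = c - sqrt(c) and a_g = sqrt(c)."""
    fe = math.sqrt(c)
    return ((c - fe) * math.log2(c) + fe * math.log2(p)) / c


def feasible_range(p, k, steps=20000):
    """Scan real c in (1, p); return (min, max) of sizes with D_c < lg p / k."""
    target = math.log2(p) / k
    ok = []
    for n in range(1, steps):
        c = 1.0 + (p - 1.0) * n / steps
        if mesh_delay(c, p) < target:
            ok.append(c)
    return (ok[0], ok[-1]) if ok else None


for p in (32, 256, 1024, 4096):
    print(p, feasible_range(p, 1.5))
```

The scan treats c as continuous; restricting c to perfect squares or powers of 2 would shrink the region further.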


We note that these bounds are independent of the degree 
of the network. In this section, we have been concerned with 
the efficiency with which a hierarchical multiprocessor can 
support the communication requirements of algorithms and 
the extent to which this can be improved by using clusters. A 
related problem is to determine how well such a system can
emulate the communication links of other processor networks. 
The results in section 4.2 are equally applicable to processor 
networks and essentially state that very small and very large 
cluster sizes can be excluded from consideration. 


5. EMULATION OF PROCESSOR NETWORKS 


In this section, we consider the emulation of some com- 
mon processor interconnections on a hierarchical system. We 
show that the average communication delay is minimized by 
an appropriate cluster size. 


We outline the general technique used in the subsequent 
subsections on specific processor networks. In each case, we 
consider a processor network with p processors and a two- 
level hierarchical system of the same size. For some scheme of 
mapping the processors in the network onto the clusters in the 
two-level system, we find the ratio of cluster to global accesses 
as a function of the cluster size. We then use (4.9) to find the 
best cluster size and its average communication delay. In gen- 
eral, it is difficult to find the best mapping from a network to 
a two-level system. However, since the networks considered 
are regular, the best quotient mappings are obvious. 


5.1. Ring Network 


Consider the emulation of a ring of size p on a two-level 
hierarchy. We divide a ring into p/c segments each contain- 
ing c processors. We map the processors in each segment of 
length c onto the processors in a single cluster of size c. Con- 
sider a typical communication pattern in which each processor 
in the ring accesses its left neighbor. There are then c-1 clus- 
ter accesses for each global access (Fig. 5.1). It is easy to see 
that no other mapping scheme can result in a larger ratio of 
cluster to global accesses. 


Fig. 5.1. Mapping a Ring Segment onto a Cluster of Size c = 4.


Substituting for cluster and global accesses in (4.9),

$$D_c = \frac{(c-1) \lg c + \lg p}{c} \tag{5.1}$$

Differentiating (5.1) with respect to c, setting to zero, and
solving for lg p,

$$\lg c + \frac{c-1}{\ln 2} = \lg p \tag{5.2}$$

We can solve (5.2) iteratively for c, or if c >> 1 and
c >> lg c, we can approximate (5.2) as

$$c_{opt} = \ln 2 \cdot \lg p \tag{5.3}$$

With a cluster size of $\ln 2 \cdot \lg p$, the average
communication delay is

$$D_c = \left( \lg\lg p + \frac{\ln\ln 2}{\ln 2} \right) \left( 1 - \frac{1}{\ln 2 \cdot \lg p} \right) + \frac{1}{\ln 2} \tag{5.4}$$

For large systems, this can be approximated as

$$D_c \approx \lg\lg p \left( 1 - \frac{1}{\ln 2 \cdot \lg p} \right) \tag{5.5}$$

The delay here is a logarithmic factor better than in a system
with no clusters and a single global memory.
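
As a numerical check of the ring analysis, the following Python sketch solves lg c + (c - 1)/ln 2 = lg p by bisection and compares the root with the approximation ln 2 · lg p and with a brute-force search over integer cluster sizes. The system size p = 2^20 and the helper names are assumptions made for illustration.

```python
import math

LN2 = math.log(2.0)


def ring_delay(c, p):
    """Average delay for a ring: c - 1 cluster accesses per global access."""
    return ((c - 1) * math.log2(c) + math.log2(p)) / c


def ring_optimum(p, tol=1e-9):
    """Solve lg c + (c - 1)/ln 2 = lg p for real c by bisection on [1, p]."""
    lo, hi = 1.0, float(p)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.log2(mid) + (mid - 1.0) / LN2 < math.log2(p):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p = 2 ** 20
c_star = ring_optimum(p)               # root of the exact condition
c_approx = LN2 * math.log2(p)          # large-c approximation
c_best = min(range(1, 257), key=lambda c: ring_delay(c, p))
print(c_star, c_approx, c_best, ring_delay(c_best, p))
```

The approximation overshoots the exact root somewhat at this size, which is consistent with it being derived for c much larger than lg c.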


5.2. Mesh Network 


Consider the emulation of a toroidal mesh with dimensions
$\sqrt{p}$ by $\sqrt{p}$ on a two-level hierarchical system. We
map a sub-mesh of size $\sqrt{c}$ by $\sqrt{c}$ onto a cluster of
size c. Assume (without loss of generality, by symmetry) that all
the processors in the mesh access their left neighbor (Fig. 5.2).
The number of cluster accesses, $a_c$, is
$\sqrt{c}(\sqrt{c} - 1)$ and the number of global accesses,
$a_g$, by processors in the cluster is $\sqrt{c}$. Substituting in
(4.9),

$$D_c = \frac{\sqrt{c}(\sqrt{c} - 1) \lg c + \sqrt{c} \lg p}{c} \tag{5.6}$$

Fig. 5.2. Mapping a Sub-mesh onto a Cluster of Size, c = 16.

Differentiating with respect to c, setting to zero, and solving
for lg p,

$$\lg c + \frac{2(\sqrt{c} - 1)}{\ln 2} = \lg p \tag{5.7}$$

We can solve (5.7) iteratively for c, or if $\sqrt{c} \gg 1$ and
$\sqrt{c} \gg \lg c$, (5.7) simplifies to

$$c_{opt} = \left( \frac{\ln 2}{2} \right)^2 \lg^2 p \tag{5.8}$$

For large systems, the average delay ignoring small constant
terms is

$$D_c \approx 2 \lg\lg p \left( 1 - \frac{2}{\ln 2 \cdot \lg p} \right) \tag{5.9}$$

In order to obtain a logarithmic improvement in communication
delay, the cluster size has to be quite large compared to the
results for a ring.
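
The mesh case can be checked the same way. This Python sketch assumes the optimality condition lg c + 2(sqrt(c) - 1)/ln 2 = lg p and the large-c approximation ((ln 2)/2)^2 lg^2 p; the chosen system size and function names are illustrative.

```python
import math

LN2 = math.log(2.0)


def mesh_delay(c, p):
    """Average delay for a sqrt(c) x sqrt(c) sub-mesh mapped onto one cluster."""
    s = math.sqrt(c)
    return (s * (s - 1.0) * math.log2(c) + s * math.log2(p)) / c


def mesh_optimum(p, tol=1e-9):
    """Solve lg c + 2(sqrt(c) - 1)/ln 2 = lg p for real c by bisection."""
    lo, hi = 1.0, float(p)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lhs = math.log2(mid) + 2.0 * (math.sqrt(mid) - 1.0) / LN2
        if lhs < math.log2(p):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p = 2 ** 20
c_star = mesh_optimum(p)
c_approx = (LN2 / 2.0) ** 2 * math.log2(p) ** 2
c_best = min(range(1, 1025), key=lambda c: mesh_delay(c, p))
print(c_star, c_approx, c_best, mesh_delay(c_best, p))
```

For the same p, the best mesh cluster is roughly three times larger than the best ring cluster, in line with the remark above.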


5.3. Hypercube Network 


Consider emulating a lg p-dimensional hypercube with p
processors on a hierarchical system. The interconnection scheme in
a hypercube is best described with a binary numbering of the
processor nodes. Processors whose binary numbers differ in exactly
one digit are connected by direct links. Under our model, the
average communication delay of a multiprocessor interconnection
strategy is the $\log_2$ of the number of links per processor.
Since each processor node is connected to lg p processors by
direct links, the hypercube has an average communication delay of
lg lg p.

We map the processors in a hypercube onto a two-level system in
the following manner. A lg p-dimensional hypercube can be split
into two smaller hypercubes of dimension lg p - 1 by removing the
links connecting processors differing in a particular digit (for
instance, all links connecting processors whose numberings differ
in the leftmost digit). Since c is usually a power of 2, the
hypercube can be divided into p/c smaller hypercubes of dimension
lg c and each of these can be assigned to a cluster of size c. In
order to analyze the average communication delay, we assume that a
hypercube is equally likely to use any of its links. Of the lg p
links available to each processor, lg c are mapped within the
cluster. Hence, the average communication delay from (4.9) is

$$D_c = \frac{\lg c \cdot \lg c + (\lg p - \lg c) \lg p}{\lg p} \tag{5.10}$$

By analysis as above, the best choice of cluster size is

$$c_{opt} = p^{1/2} \tag{5.11}$$

The delay for this cluster size is

$$D_c = \frac{3}{4} \lg p \tag{5.12}$$

This delay is not substantially superior to the lg p delay for a
global memory multiprocessor with no clusters, and it is a factor
of $\frac{3 \lg p}{4 \lg\lg p}$ worse than the hypercube itself.
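
A brief Python sketch of the hypercube case, written in terms of x = lg c so that cluster sizes are powers of 2. It assumes the delay expression (x^2 + (lg p - x) lg p)/lg p from (5.10); the value lg p = 20 is an arbitrary example.

```python
import math


def hypercube_delay(lg_c, lg_p):
    """Average delay when lg c of the lg p links per node stay inside a cluster."""
    return (lg_c * lg_c + (lg_p - lg_c) * lg_p) / lg_p


lg_p = 20
best = min(range(0, lg_p + 1), key=lambda x: hypercube_delay(x, lg_p))
delay = hypercube_delay(best, lg_p)
print(best, delay, delay / math.log2(lg_p))  # lg c, delay, ratio to lg lg p
```

The minimum falls at lg c = (lg p)/2, where the delay is three quarters of lg p.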
6. CONCLUSION 


We have developed simple models for the components of a
hierarchical multiprocessor system. A two-level hierarchical
system can be specified by the total number of processors, the
speed of individual processors, and the size of the cluster. Based
on these models, we developed expressions for the optimum number
of processors and their speed. Optimum cluster size is application
dependent. If the communication probability distribution curve for
an application is available, we can find the cluster size which
will minimize interprocessor communication delay. For a broad
range of problems, we find that we can classify certain ranges of
cluster sizes as unsuitable. The choice of a suitable cluster size
can minimize the average communication delay when emulating
certain processor networks.


While we could use more elaborate models in our analysis, we
would then lose the relatively simple expressions we have now.
Some tradeoffs are quantified easily using the current models: the
tradeoff between the cost of the processors and the cost of the
communication structure, and the tradeoff between having a large
cluster size that contains more references within it and having a
smaller cluster size with less communication delay for each
intracluster reference. We feel that even though the models are
simple, they do not unduly favor particular processor
configurations.


Average communication delay is a useful metric in choosing a
cluster size. However, if the average communication delay is
lg lg p, it does not necessarily indicate that the execution times
will be degraded by that factor. Frequently a processor can issue
a read request and then perform useful computation while the
request is being satisfied by the network.
Abstract


The major issue in multiprocessing is not the construction of a
multiprocessor system, but its efficient use in real applications.
This paper describes CAMP, a program-development tool that helps a
programmer in coding problems for multiprocessors. CAMP can
partition programs, insert synchronization primitives, and
simulate program performance. This way, a programmer may
experiment with different algorithms for solving the same problem
and select the one that fits his multiprocessor system the best.


1. Introduction 


In the past several years we have seen many pro- 
posed and commercial multiprocessor architectures, all 
aimed at increasing machine performance by an order of 
magnitude. Although faster hardware is easy to build
these days, there is no agreement on how to achieve this 
performance increase for realistic applications. 


There are basically four schools of thought as to 
what is the most important factor in obtaining higher 
performance in a particular machine. The first school of 
thought believes in faster circuit technology, which will 
allow us to retain present architectures, possibly aug- 
mented with a mechanism for synchronizing parallel 
processes. The second school puts priority on optimizing 
or vectorizing compilers, possibly interactive, that will 
detect parallelism and restructure sequential programs 
into their parallel format. The third school, which believes
in a dramatic increase in performance from new parallel
algorithms, supports development of new languages that
will allow easy conversion of algorithms into programs. 
The fourth school supports new models of computation, 
such as data flow models, that will allow dramatic 
increases in parallelism and can be easily exploited by a 
multiprocessor architecture. 


Although each school deals with one part of the solution, none
of them shows how to optimally combine application requirements
with the capabilities of VLSI technology. The first school uses
the least risky approach by retaining old programming and
architectural models. It takes advantage only of the speed that
the new technology offers, not of its density. The second and
third schools retain old architectural models, which may no longer
be cost effective. Furthermore, the second school retains
programming models developed for pre-VLSI architectures. The
fourth school, although the most progressive, does not take into
account technological limitations. For this reason, machines based
on new models of computation do not exhibit impressive
performance. Instead, they create new problems of their own.


(a) This work was supported in part by the Hughes
Aircraft Company under Contract No. 1-5-37368, and the
IBM graduate fellowship program.


(b) The author's current address: T. J. Watson Research
Center, P.O. Box 218, Yorktown Heights, NY 10598.


0190-3918/86/0000/0475 $01.00 © 1986 IEEE


The current VLSI technology allows building multiprocessors
at low cost. However, this multiprocessor
approach introduces three new requirements not encoun- 
tered in the single processor environment [1]. First, each 
program must be partitioned into tasks executable on 
one or more processors. If a task is executed on more 
than one processor, it must be further partitioned into 
processes. Second, every task and process must be 
scheduled for execution on a particular processor or pro- 
cessors. Third, synchronization must be performed 
between concurrently executing tasks and processes to 
maintain data dependences in the program. 


We support a fifth school of thought, which 
believes that only efficient solutions of the above prob- 
lems will bring an order-of-magnitude improvement in 
multiprocessor performance. There are two extreme 
approaches in achieving this goal. In one, the applica- 
tion programmers (or the problem specialists) are sup- 
posed to solve the problem of partitioning, scheduling, 
and synchronization [2] [3]. Since parallelism is not natural to
the human thinking process, this approach may be too cumbersome
for all but very simple problems. On the other hand, we may use a
restructuring compiler [4] [5] for partitioning, scheduling and
synchronization of sequential programs. This approach assumes that
such a compiler has good knowledge of a wide range of application
domains and that the algorithms used in the programs are optimal
over different multiprocessor architectures.


We advocate an approach in the middle of the above two
extremes. We assume that the programmer is an expert in his
application domain. At the same time, we also recognize the
shortcomings of human thinking in a parallel environment. In our
approach, the user extracts different segments of the program
while a program-development tool helps in partitioning these
program segments into a number of processes and inserting
synchronization primitives where needed. The simulator
incorporated in the tool estimates the performance for different
partitioning and synchronization strategies. This way, the
programmer may use the tool repeatedly to explore several
algorithms for solving the same problem and select one that fits
his multiprocessor architecture. The essence of the computer-aided
multiprocessor programming tool (CAMP) is shown in Figure 1. Note
that scheduling each process to a physical processor is a runtime
procedure. However, partitioning a task into a certain number of
processes, each of which is associated with a virtual processor,
is performed by the tool.


[Figure 1 diagram: Users (Problem Experts); Problem & Algorithm
Design; Program Partitioning & Synchronization Insertion; Parallel
Execution Simulation; feedback path labeled "not satisfactory".]


Figure 1. The program-development tool


In this paper, we select a loop example to demonstrate the
program-development methodology with CAMP. We describe different
partitioning and synchronization methods which have been
implemented in
CAMP for program loops. Most of the parallelism in 
scientific computations comes from loop statements. In 
the parallel execution of a loop, each iteration is usually 
treated as the smallest unit of execution and assigned to 
a processor in the lexicographical order of its index set. 
This method is very effective when all iterations are 
independent. However, when data dependences exist 
between different iterations, a proper partitioning and 
synchronization mechanism must be used to assure fast 
and correct execution. 


The layout of this paper is as follows. In section 2, 
we will describe the basics of CAMP. Then, in the sub- 
sequent sections, we will discuss several methods which 
have been implemented in CAMP. We will describe the 
process alignment for eliminating synchronization 
between concurrent processes, a synchronization method, 
and different partitioning strategies for program loops in 
sections 3, 4, and 5 respectively. Finally, we will present 
results of a walk-through example and propose future 
research in section 6. 


2. CAMP: A Program-development Tool


Figure 2 shows the flowchart of CAMP. Each step in a box is
performed by CAMP, and the steps without boxes are the
programmer's job. All of the programmer's jobs involve either
decision making or the extraction and rewriting of program code,
which we believe humans can do better than a machine. Note that in
this paper we describe only the parts used for developing parallel
code for program loops.


[Figure 2 flowchart. Recoverable labels: Extract loop segments;
Parallel loop?; Process alignment; Totally aligned loop?; Extract
serial code; Determine ARs; Insert synchronisation; Partitioning
without synchronization: find independent execution set;
Partitioning with synchronization: (1) block partition,
(2) interleaved partition; Rewrite code into parallel format;
Timing simulation; Good speedup?; Better algorithm?; END.]


Figure 2. The flowchart of tasks in CAMP


Initially, a programmer extracts a loop segment 
from the original program. He determines whether the 
extracted loop has independent iterations (called a paral- 
lel loop). If so, the programmer can transform the loop 
into parallel format easily and send the parallel code to 
the simulator. If the loop is not a parallel loop, CAMP 
will attempt to align the loop to eliminate synchroniza- 
tion between iterations. If this step is successful, the 
programmer must rewrite the code according to the 
alignment results before using the simulator. 


If the loop alignment is not applicable, CAMP can 
take two alternative approaches. First, it may partition 
the loop iteration space into independent execution sets. 
This method uses the minimum dependence distance in 
each dimension to divide the iteration space. A detailed
description of this method is given in [6]. Second, it may
partition the loop with a selected synchronization 
method. In this approach, CAMP generates the parti- 
tioning ratio and the synchronization information, which 
helps the programmer in rewriting the code into parallel 
execution format. 


A simulator has been developed for evaluating 
different partitioning methods. It generates the simulated
execution time to help the programmer select
the best partitioning strategy. If the original algorithm 
does not contain enough parallelism, the programmer 
may want to write a different algorithm. In the follow- 
ing sections, we will describe the process alignment, the 
bit-map synchronization scheme, and the partitioning of 
a loop with the bit-map method. All these mechanisms 
have been implemented in CAMP. 


3. Process Alignment 


The alignment of a program loop was first intro- 
duced by Padua for the purpose of increasing the 
number of independent partitions of a loop [7]. In this
section, we will demonstrate the essence of this method 
by an example for minimizing the synchronization 
between iterations of a loop. 


In the example of Figure 3 (a), there is a constant dependence
vector [1,-1] between the second and the first statement. This
constant dependence vector is due to the occurrence of A(i,j) in
S2 (producing a datum) and A(i-1,j+1) in S1 (consuming the datum).
We can adjust the subscripts of either occurrence of array A.
Figure 3 (b) shows the adjustment of the first statement by
transforming A(i-1,j+1) into A(i,j), and Figure 3 (c) shows the
adjustment of the second statement by transforming A(i,j) into
A(i-1,j+1).


    DO i = 1, N
      DO j = 1, N
    S1:   B(i,j) = A(i-1,j+1) x C(i) + D
    S2:   A(i,j) = E(i,j) / F(i,j)
      ENDO
    ENDO
    (a)


    DOALL i = 0, N
      DOALL j = 1, N+1
    S2:   IF (1≤i≤N) AND (1≤j≤N) THEN
            A(i,j) = E(i,j) / F(i,j)
    S1:   IF (1≤i+1≤N) AND (1≤j-1≤N) THEN
            B(i+1,j-1) = A(i,j) x C(i+1) + D
      ENDO
    ENDO
    (b)


    DOALL i = 1, N+1
      DOALL j = 0, N
    S2:   IF (1≤i-1≤N) AND (1≤j+1≤N) THEN
            A(i-1,j+1) = E(i-1,j+1) / F(i-1,j+1)
    S1:   IF (1≤i≤N) AND (1≤j≤N) THEN
            B(i,j) = A(i-1,j+1) x C(i) + D
      ENDO
    ENDO
    (c)


Figure 3. A loop example and its alignment code


This code adjustment, called process alignment, makes three
significant changes:


(1) transformed subscripts in the adjusted statement; 

(2) an additional IF test in each statement, and the 
extended loop bounds for the boundary iterations; 

(3) the statement rearrangement to satisfy data 
dependences. 
The details of an improved alignment algorithm are given
in [6]. 


After the process alignment, the dependence vector becomes
[0,0]. Since the production and consumption of the data are now in
the same iteration, the loop can be partitioned by iterations
without synchronization between any partitioned sets.
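
The effect of alignment can be illustrated with a small Python sketch, which is not CAMP's implementation. It runs the original loop of Figure 3 (a) sequentially and the aligned form of Figure 3 (b) with its iterations in a shuffled order, then compares the results. The array sizes, random inputs, and function names are assumptions made for the illustration.

```python
import random


def make_inputs(N, seed=1):
    """Random test data; arrays cover indices 0 .. N+1 so boundary reads are valid."""
    rng = random.Random(seed)
    size = N + 2
    A = [[rng.uniform(1, 2) for _ in range(size)] for _ in range(size)]
    E = [[rng.uniform(1, 2) for _ in range(size)] for _ in range(size)]
    F = [[rng.uniform(1, 2) for _ in range(size)] for _ in range(size)]
    C = [rng.uniform(1, 2) for _ in range(size)]
    D = rng.uniform(1, 2)
    return A, C, D, E, F


def sequential_loop(N, A, C, D, E, F):
    """Figure 3 (a): the original loop, executed in sequential order."""
    A = [row[:] for row in A]
    B = [[0.0] * (N + 2) for _ in range(N + 2)]
    for i in range(1, N + 1):
        for j in range(1, N + 1):
            B[i][j] = A[i - 1][j + 1] * C[i] + D   # S1
            A[i][j] = E[i][j] / F[i][j]            # S2
    return A, B


def aligned_loop(N, A, C, D, E, F, seed=0):
    """Figure 3 (b): aligned loop; iterations run in an arbitrary (shuffled) order."""
    A = [row[:] for row in A]
    B = [[0.0] * (N + 2) for _ in range(N + 2)]
    iterations = [(i, j) for i in range(0, N + 1) for j in range(1, N + 2)]
    random.Random(seed).shuffle(iterations)
    for i, j in iterations:
        if 1 <= i <= N and 1 <= j <= N:             # S2
            A[i][j] = E[i][j] / F[i][j]
        if 1 <= i + 1 <= N and 1 <= j - 1 <= N:     # S1
            B[i + 1][j - 1] = A[i][j] * C[i + 1] + D
    return A, B


inputs = make_inputs(5)
print(sequential_loop(5, *inputs) == aligned_loop(5, *inputs, seed=3))
```

Because each aligned iteration produces and consumes the same element of A, the shuffled order gives the same arrays as the sequential loop.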


4. The Bit-map Synchronization Method


Synchronization of concurrent processes can be implemented
through shared variables or through message passing [8]. There are
different shared-variable synchronization methods such as the
full/empty bit of the HEP [9], the fetch&add of the Ultracomputer
[10], and the synchronization key of the Cedar [11]. In this
section, we will describe a shared-variable method called the
bit-map [12]. In the bit-map method, each synchronization data
element has an attached sync field, and each memory operation
contains a mask value. The data can be accessed only when the mask
matches the sync. This way a proper referencing order to each data
element can be maintained.


The bit-map synchronization method will be explained by using
the QD-algorithm for calculating the distribution of the
eigenvalues of a large positive definite tridiagonal matrix [13]
(Figure 4). This example contains several constant-dependence
vectors: [1,0] for the dependences Q(i,j+1) → Q(i-1,j+1) and
E(i,j) → E(i-1,j), [0,1] for Q(i,j+1) → Q(i,j), and [1,-1] for
E(i,j) → E(i-1,j+1). We omit the dependence vector [0,0] since it
indicates a dependence in the same iteration, which is always
satisfied.


    DO i = 2, N
      DO j = 0, N-1
    S1:   Q(i,j+1) = Q(i-1,j+1) + E(i-1,j+1) - E(i,j)
    S2:   E(i,j) = Q(i-1,j+1) x E(i-1,j) + Q(i,j)
      ENDO
    ENDO


Figure 4. The QD-algorithm


The data dependences must be preserved during a parallel
execution. They can be preserved if and only if the proper order
of memory operations to each memory location is maintained. The
sequence of read and write operations to the same memory location,
called a referencing pattern or RP, is denoted by
(R/W)_n, ..., (R/W)_i, ..., (R/W)_2, (R/W)_1, where R/W indicates
a read or a write operation and subscripts indicate the sequence
number. This sequence of memory operations is sorted by the
indices of the multiple nested loops in which the memory location
is referenced. If the memory location is referenced more than once
in the same iteration, the referencing pattern will be ordered by
the statement number first and inside each statement from right to
left. In other words, the referencing pattern is defined by the
sequential execution of the loop.


In our loop example, the RP of each read/write
variable can be determined by sorting the constants in
the first subscript of all the occurrences of that variable
in decreasing order. If more than one occurrence has
the same constant in the first subscript, we sort them by
the second subscript, and so on. If the constants in all
subscripts of several occurrences are equal, then we sort
the occurrences based on their sequential execution
order. However, for the boundary data elements, some
of the operations in an RP may not be performed.
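
The sorting rule above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the tuple layout (kind, first-subscript constant, second-subscript constant, execution order) and the function name are hypothetical, and the second subscript is assumed to be sorted in decreasing order like the first. The data are the four occurrences of Q in the QD-algorithm loop.

```python
def referencing_pattern(occurrences):
    """Return the occurrences in sequential-execution order (first operation first).

    Each occurrence is (kind, c1, c2, seq): kind is 'R' or 'W', c1 and c2 are
    the constants added to i and j in the subscripts, and seq is the position
    of the occurrence in the sequential execution of one iteration.
    """
    # A larger subscript constant means an earlier iteration touches the
    # element, so sort by c1 descending, then c2 descending (assumed), then
    # by execution order within the iteration.
    return sorted(occurrences, key=lambda o: (-o[1], -o[2], o[3]))


# Occurrences of Q in the QD-algorithm loop. Within a statement the
# right-hand side is read before the left-hand side is written.
Q_OCCURRENCES = [
    ("R", -1, 1, 0),  # S1 reads  Q(i-1,j+1)
    ("W", 0, 1, 1),   # S1 writes Q(i,j+1)
    ("R", -1, 1, 2),  # S2 reads  Q(i-1,j+1)
    ("R", 0, 0, 3),   # S2 reads  Q(i,j)
]

order = referencing_pattern(Q_OCCURRENCES)
# The paper writes an RP with the first operation rightmost.
numbered = list(enumerate(order, 1))
rp_text = " ".join(f"{kind}{n}" for n, (kind, *_) in reversed(numbered))
print(rp_text)  # R4 R3 R2 W1
```

For an interior element of Q this gives one write followed by three reads, which agrees with the bit positions cleared by the read masks 1101, 1011, and 0111 in the parallel code.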


The bit-map synchronization method preserves the
RPs even if different processors in a multiprocessor
system reference the same memory location out of order.
Each synchronized variable has a Sync field and a Data
field. The sync field contains a number of Full/Empty
(F/E) bits. Each bit is used independently in the same
way as the F/E bit in the HEP machine. The sync field
and the data field can be implemented as two separate
words in the memory to eliminate extra hardware cost.
Every synchronization instruction has a Mask field which
contains information for both testing and modifying bits
in the sync field. We define two distinct synchronization
instructions, SREAD and SWRITE, which can be
expressed as:


    SREAD/SWRITE, Address, Mask.


The Address indicates the shared memory location where
the accessed data and its associated sync value can be
found. These instructions are implemented by the
following indivisible sequences of microoperations, where n
represents the length of the Mask and Sync fields and
the registers represent temporary storage in each processor.


For the SREAD instruction:

    If AND over i = 1..n of (Mask_i OR Sync_i):
        Register <- Memory(Address)
        Sync <- Mask AND Sync

For the SWRITE instruction:

    If NOT (OR over i = 1..n of (Mask_i AND Sync_i)):
        Memory(Address) <- Register
        Sync <- Mask OR Sync


Generally speaking, the SREAD instruction tests
whether certain bits of the sync field are ‘1’ by ORing
the sync field with the mask field and fetching data if
the result is all ‘1’s. After data is fetched, tested bits are
cleared by ANDing the sync field with the mask. On the
other hand, the SWRITE instruction tests whether one
or more bits of the sync field are ‘0’ by ANDing the sync
field with the mask field and storing data only if the
result is all ‘0’s. After data is stored, tested bits are set
by ORing the sync field with the mask. Only logical
operations are needed in both instructions. Other types
of synchronization instructions, such as a test without
modifying the sync, can be easily implemented.
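
The two instructions can be modeled in a few lines of Python. This is a minimal single-threaded sketch, not the hardware design: the class and method names are hypothetical, the sync and mask fields are held as integers, and a blocked instruction simply reports failure instead of being retried. In the real machine each test-and-update sequence is indivisible.

```python
class SyncCell:
    """One synchronized memory word: an n-bit sync field plus a data field."""

    def __init__(self, n_bits, sync=0, data=None):
        self.all_ones = (1 << n_bits) - 1
        self.sync = sync
        self.data = data

    def sread(self, mask):
        """Return (True, data) if the read may proceed, else (False, None)."""
        # Proceed only if (Mask OR Sync) is all ones, i.e. every bit that is
        # '0' in the mask is '1' in the sync field.
        if (mask | self.sync) != self.all_ones:
            return False, None
        value = self.data
        self.sync &= mask  # clear the tested bits
        return True, value

    def swrite(self, mask, value):
        """Return True if the write was performed, else False."""
        # Proceed only if (Mask AND Sync) is all zeros.
        if (mask & self.sync) != 0:
            return False
        self.data = value
        self.sync |= mask  # set the tested bits
        return True


# An element whose RP is one write followed by three reads; the write comes
# first, so the initial sync is 0000. Masks are those of the parallel QD code.
cell = SyncCell(n_bits=4, sync=0b0000)
early = cell.sread(0b1011)        # a read arriving before the write is blocked
wrote = cell.swrite(0b1111, 3.5)  # the write sets all four bits
r3 = cell.sread(0b1011)           # the reads may now complete in any order
r4 = cell.sread(0b0111)
r2 = cell.sread(0b1101)
print(early, wrote, r3, r4, r2, format(cell.sync, "04b"))
```

After the three reads the sync field is 0001, so a second attempt at any of the reads would be blocked, which is how the referencing pattern is enforced.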


Using SREAD and SWRITE to enforce RPs allows 
consecutive reads to proceed in any order. In general, 
each read or write needs one bit in the sync field. If that 
bit is ‘1’, the memory location can be read but not written
into, and vice versa. However, since any order of con- 
secutive reads is allowed, the write following these reads 
needs a number of bits to detect if all the reads have 
been finished. The number of bits is equal to the 
number of reads ahead of the write. Whenever a read is 
successful, its corresponding bit is updated. Note that if 
there is only one write in the RP, this write can proceed 
when all the bits for reads become ‘0’. 


For loops with constant dependence vectors, the
number of operations in an RP of a read/write variable is
equal to the number of occurrences of that variable. In
general, each operation requires a bit in the sync field.
Therefore, using a separate word for the sync field
should be sufficient to synchronize any
constant-dependence loop. Figure 5 (a) shows the parallel code of
the QD-algorithm using the bit-map method. With N
= 64, the RPs and syncs of the Q and E arrays are shown in
Figure 5 (b).


    DOALL i = 2, N
      DOALL j = 0, N-1
    S1:   $Q(i,j+1)/1111 = $Q(i-1,j+1)/1011 +
                           $E(i-1,j+1)/1011 - $E(i,j)/1100
    S2:   $E(i,j)/1111 = $Q(i-1,j+1)/0111 ×
                         $E(i-1,j)/0111 ÷ $Q(i,j)/1101
      ENDDO
    ENDDO

(a)

[Table of RPs and syncs; surviving row labels: Q(1,1..64), Q(2..64,0), Q(2..63,1..63), (2..63,64), (64,1..63)]

(b)


Figure 5. (a) The parallel QD-algorithm
(b) The RPs and syncs


The rule for selecting the initial mask and sync
values for this example is very simple. The mask value
for a read operation has ‘0’s in the position
corresponding to this read as well as in the position for the
following write operation. On the other hand, the mask
value for a write operation is equal to all ‘1’s. The
initial sync value of each memory location can be
determined by its RP. The bit position for the first (the
rightmost) operation, such as R_1 of Q(2..64,0), is enabled,
i.e. ‘1’ for a read, ‘0’ for a write. All other operations are
disabled. If there are consecutive reads starting from the first
operation, then all these reads are enabled. The
algorithm for finding the masks and syncs in general cases
was given in [12].


Thus, the bit-map synchronization transforms a
sequential loop into a parallel one by removing the
control dependence and preserving the data dependence.
All iterations can be executed in parallel, as in any
other dataflow model of computation.


5. Program Partitioning on Multiprocessors 


Using the bit-map method, the QD-algorithm can 
be executed in parallel. The next issue is how to run this 
parallel code on a limited number of processors. Two 
factors must be considered when partitioning a loop into 
processes: the amount of parallelism, and the memory access and
synchronization overhead. The best partition should 
exploit the maximum parallelism with the lowest over- 
head. However, this ideal case is usually impossible to 
achieve. The partitioning strategy is a tradeoff between 
these two contradictory goals. 


In partitioning program loops, we make two
restrictions. First, each iteration of the loop is an indivisible
unit of execution, executed sequentially by one processor.
Second, the loop can only be partitioned into regular
patterns, such as by row and/or by column in a
two-dimensional space. Although partitioning at the arithmetic
operation level and/or partitioning along a wave-front
[14] may exploit more parallelism, it introduces
unacceptable coding complexity and performance overhead.
We propose two ways of partitioning the loop: block and
interleaved partitioning.


We now define a program loop and our partitioning
schemes. A program loop is denoted by L =
(I_1, I_2, ..., I_n) (S_1, S_2, ..., S_s), where I_i is the index variable
of the i-th nested level of the loop, 1 ≤ i ≤ n, and S_j is
an assignment or a conditional statement, 1 ≤ j ≤ s.
Each index variable has three attributes,
I_i(a_i, b_i, c_i), where a_i is the initial value, b_i is the boundary
value, and c_i is the increment value of I_i. In a loop L =
(I_1, I_2, ..., I_n) (S_1, S_2, ..., S_s), assume that dimension I_i is
partitioned into p sets. The j-th set of the block
partition is defined to have:

    the initial value a_{i,j} = a_i + ⌊x_i × (j-1)/p⌋ × c_i,

    the boundary value

while the j-th set of the interleaved partition has:

    the initial value a_{i,j} = a_i + (j-1) × c_i,

    the boundary value b_{i,j} = b_i,

    the increment value c_{i,j} = p × c_i.


The block and the interleaved partitioning schemes
are motivated by two different considerations. The former
tries to minimize memory access and synchronization
overhead, since synchronization is only performed at the
block boundary. The latter tries to maximize the
parallelism by interleaving iterations among processors so that
all processors can start at approximately the same time.
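
The two schemes can be compared with a small Python sketch. The interleaved sets follow the definition given above. The block sets are an assumption: they are taken to be p contiguous chunks whose starting positions are floor(x × (j-1)/p), with x the number of iterations, because the boundary-value expression of the block partition is not available here. The function names, the inclusive boundary, and the positive increment are likewise assumptions made for illustration.

```python
def interleaved_partition(a, b, c, p):
    """Sets j = 1..p of the interleaved partition of the index range (a, b, c)."""
    # Set j starts at a + (j-1)*c, keeps the boundary b, and steps by p*c.
    return [list(range(a + (j - 1) * c, b + 1, p * c)) for j in range(1, p + 1)]


def block_partition(a, b, c, p):
    """Sets j = 1..p of a block partition: p contiguous chunks (assumed form)."""
    iterations = list(range(a, b + 1, c))
    x = len(iterations)
    # Chunk j covers positions floor(x*(j-1)/p) .. floor(x*j/p) - 1.
    return [iterations[x * (j - 1) // p: x * j // p] for j in range(1, p + 1)]


# An index running over j = 0, 1, ..., 7, divided among p = 4 sets.
interleaved = interleaved_partition(0, 7, 1, 4)
blocks = block_partition(0, 7, 1, 4)
print(interleaved)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(blocks)       # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With the interleaved sets every processor owns an early iteration, so all of them can start almost immediately; with the block sets neighbouring iterations stay on one processor, so fewer accesses cross a partition boundary.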


5.1. Partitioning with Different Aspect Ratios 


The best way of exploiting parallelism of the 
recurrence loop example is to execute the loop according 
to the wave-front direction. On the other hand, the 
easiest way of partitioning the loop is based on the lexi- 
cographical order of the index set. The former method 
introduces unacceptable overhead while the latter 
method may not utilize all the parallelism in the loop [6]. 


We use a general partitioning method which
divides each level i of a nested set of loops into a different
number of partitions, n_i. Either the block or the
interleaved partitioning scheme can be used in dividing the
loop. The proportion of the number of partitions on
each level is called the aspect ratio (AR). If 4 processors
are assigned to execute the QD-algorithm, then three
ARs (n_1:n_2) are possible: 4:1, 2:2, and 1:4. Figure 6 (a)
shows the block partition and the processor assignment
of the QD-algorithm with AR = 2:2, while Figure 6 (b)
shows the interleaved partition, where 1, 2, 3, and 4
indicate four different processors. Note that the AR of 1:4
with the interleaved partition is equivalent to
partitioning in lexicographical order.


There is no extra coding complexity in using the
aspect ratio with the block and the interleaved partitioning
methods. However, the total number of ARs for a
given number of processors p grows with the number of nested
levels of the loop. In a doubly-nested loop, the number of
ARs is equal to (log_2 p + 1), while in a triply-nested
loop, the number of ARs is equal to ((log_2 p + 1)^2 / 2 +
(log_2 p + 1) / 2), where we assume the number of
processors and the number of partitions are all powers of 2,
and n_i ≤ p.
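
These counts can be checked by brute-force enumeration. The following Python sketch assumes, as the text does, that p and every n_i are powers of 2 and that the n_i multiply to p; the function name is hypothetical.

```python
from itertools import product
from math import prod


def aspect_ratios(p, levels):
    """All tuples (n_1, ..., n_levels) of powers of 2 whose product is p."""
    k = p.bit_length() - 1  # log2(p) for a power of 2
    powers = [2 ** e for e in range(k + 1)]
    return [combo for combo in product(powers, repeat=levels) if prod(combo) == p]


print(aspect_ratios(4, 2))        # [(1, 4), (2, 2), (4, 1)]
print(len(aspect_ratios(16, 2)))  # 5  = log2(16) + 1
print(len(aspect_ratios(16, 3)))  # 15 = 5*5/2 + 5/2
```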


Figure 6. (a) The block and (b) the interleaved
partitioning with AR = 2:2


5.2. Data Storage Scheme 


The parallelism permitted by the data dependences is
crucial to program partitioning. In order to select the
best AR, we need to consider another factor: memory
access and synchronization overhead. Three types of
memory access (local access, local synchronized access,
and network synchronized access) can be specified for
testing the QD-algorithm in a distributed shared-memory
multiprocessor system (Figure 7). The local
access provides the fastest access time. Data, with the
exception of read-only data, can be localized only if it is
not accessed by any other processor. The local memory
coherence problem is thus avoided. Data that is read and
written by more than one processor can only be accessed
through the synchronization instructions described in
section 4.


[Diagram: Interconnection Network]


Figure 7. The model of a multiprocessor system


Because of the delay in the interconnection
network, a synchronized access to any other memory
module takes much longer than a synchronized access
to the local memory. The data storage scheme is very
important in reducing the number of network accesses.
For instance, if 4 processors execute the QD-algorithm
with AR = 1:4 and the interleaved partition is used,
then one processor P will execute iterations (*, 1) and (*, 5), where *
stands for all iterations of the corresponding level of the
loop. In each iteration, P reads the Q array three times
(Q(i-1,j+1), Q(i-1,j+1), and Q(i,j)), and writes it once
(Q(i,j+1)). If we allocate Q(*,1) and Q(*,5) to the local
memory of P, then only one access to the Q array will
be to the local memory. On the other hand, if we
allocate Q(*,2) and Q(*,6) to the local memory of P, then
three out of four accesses to the Q array will be to
the local memory. Thus, data should be allocated to the
shared memory modules according to the allocation of
iterations (i.e. the AR) and the subscript expressions of
the accessed variable.
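
The counts in this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name and the representation of a processor's local memory as a set of column indices are assumptions, and only the second subscript of Q is considered because the partition is along j.

```python
# Second-subscript offsets of the four Q references made in iteration (i, j):
# the reads Q(i-1,j+1), Q(i-1,j+1), Q(i,j) and the write Q(i,j+1).
Q_COLUMN_OFFSETS = [1, 1, 0, 1]


def local_q_accesses(my_columns, local_columns):
    """Count (local, total) Q accesses for one processor.

    my_columns: the j values of the iterations the processor executes.
    local_columns: the second-subscript values of Q held in its local memory.
    """
    local = total = 0
    for j in my_columns:
        for offset in Q_COLUMN_OFFSETS:
            total += 1
            if j + offset in local_columns:
                local += 1
    return local, total


print(local_q_accesses([1, 5], {1, 5}))  # (2, 8): one local access in four
print(local_q_accesses([1, 5], {2, 6}))  # (6, 8): three local accesses in four
```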


In the block partitioning, the type of memory 
access for each variable may vary from one iteration to 
another. If the iteration is not located on the block 
boundary, all the memory accesses in that iteration are 
local. Otherwise, the memory access can be either a 
local synchronized or a network synchronized access. 
The data storage scheme for the interleaved partition is 
applicable to the block partition. 


5.3. Aspect Ratio Selection 


Because the number of ARs in a three-dimensional
loop is proportional to (log_2 p)^2, it is very
time-consuming to simulate all the ARs for any given number
of processors. A heuristic can be used for selecting the
best ARs.


The sequential execution time of each iteration of a
loop can be computed directly. For example, the
starting and ending times for each memory access and
arithmetic operation of the QD-algorithm are shown in
Figure 8. In it we assume every memory access takes the
same amount of time (M), and every arithmetic
operation takes time (A). The time-distance for each
dependence is computed as the data generation time minus
the time at which the data is needed.
In this example, the time-distance for Q(i,j+1) ->
Q(i-1,j+1) is 4M+2A for the occurrence in S1 and 0 for the
occurrence in S2; for E(i,j) -> E(i-1,j) it is 3M+2A; for
Q(i,j+1) -> Q(i,j) it is -(2M+A); and for
E(i,j) -> E(i-1,j+1) it is 7M+4A. The maximum value of
the time-distance is selected for each dependence vector.
Those are 4M+2A for [1,0], -(2M+A) for [0,1], and
7M+4A for [1,-1]. If the time-distance is less than or
equal to 0, there is no delay caused by the
dependence, and the dependence vector can be
omitted. Generally speaking, the time-distance indicates
the ‘delay’ between two iterations with a data
dependence between them. We discussed several rules in [6]
for reducing this time-distance.


    Operation           Starting Time   Ending Time
    Fetch Q(i-1,j+1)    0               M
    Fetch E(i-1,j+1)    M               2M
    Add                 2M              2M+A
    Fetch E(i,j)        2M+A            3M+A
    Subtract            3M+A            3M+2A
    Store Q(i,j+1)      3M+2A           4M+2A
    Fetch Q(i-1,j+1)    4M+2A           5M+2A
    Fetch E(i-1,j)      5M+2A           6M+2A
    Multiply            6M+2A           6M+3A
    Fetch Q(i,j)        6M+3A           7M+3A
    Divide              7M+3A           7M+4A
    Store E(i,j)        7M+4A           8M+4A

    M = Memory access time    A = Arithmetic operation time


Figure 8. The sequential execution of an iteration
of the QD-algorithm
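
The time-distances quoted above can be recomputed from the schedule of Figure 8. The Python sketch below is illustrative only: times are stored as hypothetical (coefficient of M, coefficient of A) pairs, the function names are not from the paper, and the values M = 2, A = 1 used to pick a maximum are made up.

```python
# Start and end times of the memory operations in Figure 8, stored as
# (coefficient of M, coefficient of A).
SCHEDULE = [
    ("fetch", "Q(i-1,j+1)", (0, 0), (1, 0)),
    ("fetch", "E(i-1,j+1)", (1, 0), (2, 0)),
    ("fetch", "E(i,j)",     (2, 1), (3, 1)),
    ("store", "Q(i,j+1)",   (3, 2), (4, 2)),
    ("fetch", "Q(i-1,j+1)", (4, 2), (5, 2)),
    ("fetch", "E(i-1,j)",   (5, 2), (6, 2)),
    ("fetch", "Q(i,j)",     (6, 3), (7, 3)),
    ("store", "E(i,j)",     (7, 4), (8, 4)),
]


def time_distances(source, sink):
    """Store end time of `source` minus the start time of every fetch of `sink`."""
    gen = next(end for op, ref, start, end in SCHEDULE
               if op == "store" and ref == source)
    return [(gen[0] - start[0], gen[1] - start[1])
            for op, ref, start, end in SCHEDULE
            if op == "fetch" and ref == sink]


def as_time(coeffs, M, A):
    """Evaluate an (M, A) coefficient pair for concrete M and A."""
    return coeffs[0] * M + coeffs[1] * A


print(time_distances("Q(i,j+1)", "Q(i-1,j+1)"))  # [(4, 2), (0, 0)]
print(time_distances("E(i,j)", "E(i-1,j)"))      # [(3, 2)]
print(time_distances("Q(i,j+1)", "Q(i,j)"))      # [(-2, -1)]
print(time_distances("E(i,j)", "E(i-1,j+1)"))    # [(7, 4)]

# Dependence vector [1,0] covers Q(i,j+1) -> Q(i-1,j+1) and E(i,j) -> E(i-1,j);
# its delay is the largest of their time-distances.
candidates = (time_distances("Q(i,j+1)", "Q(i-1,j+1)")
              + time_distances("E(i,j)", "E(i-1,j)"))
worst = max(candidates, key=lambda c: as_time(c, M=2, A=1))
print(worst)  # (4, 2), i.e. 4M+2A
```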


We now describe the algorithm which uses the
time-distance for estimating the execution time of a
particular AR. The algorithm is demand-driven. First, the
execution time of the last iteration of the loop is 
demanded. This triggers evaluation of the execution 
time of two groups of iterations. The first group con- 
tains those iterations which produce data for the last 
iteration. The second group contains the iteration which 
must be executed right before the last iteration because 
of the sequential execution imposed in the same proces- 
sor. The first group of iterations can be calculated from 
the data dependence vectors while the iteration of the 
second group is determined by the AR and the partition- 
ing scheme. These triggered iterations, in turn, require 
the execution time of other iterations and so on. This 
process continues until the iteration space is exhausted 
or the iteration space boundary is reached [6]. 
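
A demand-driven evaluation of this kind can be sketched as a memoized recursion. The Python code below is a simplified model under stated assumptions, not the algorithm of [6]: iterations are indexed from 0, only the second loop level is interleaved over p processors, each processor runs its iterations in lexicographic order, and an iteration may start once its same-processor predecessor has finished and each producing iteration started at least one time-distance earlier. All names are hypothetical.

```python
from functools import lru_cache


def estimate_finish_time(n_i, n_j, p, iter_time, dep_delay):
    """Estimated finish time of the last iteration of an n_i x n_j loop.

    dep_delay maps a dependence vector (di, dj) to its positive time-distance.
    The recursion is only suitable for small iteration spaces because of
    Python's recursion limit.
    """

    def previous_on_processor(i, j):
        # Second group: the iteration run just before (i, j) on the same processor.
        if j - p >= 0:
            return (i, j - p)
        if i - 1 >= 0:
            last_j = j + ((n_j - 1 - j) // p) * p  # last column of this processor
            return (i - 1, last_j)
        return None

    @lru_cache(maxsize=None)
    def start(i, j):
        ready = 0
        prev = previous_on_processor(i, j)
        if prev is not None:
            ready = max(ready, start(*prev) + iter_time)
        # First group: iterations that produce data for (i, j).
        for (di, dj), delay in dep_delay.items():
            si, sj = i - di, j - dj
            if 0 <= si < n_i and 0 <= sj < n_j:  # stop at the space boundary
                ready = max(ready, start(si, sj) + delay)
        return ready

    return start(n_i - 1, n_j - 1) + iter_time


# With M = A = 1: one iteration takes 8M+4A = 12, and the delays are
# 4M+2A = 6 for [1,0] and 7M+4A = 11 for [1,-1].
delays = {(1, 0): 6, (1, -1): 11}
for p in (1, 2, 4):
    print(p, estimate_finish_time(4, 4, p, 12, delays))
```

With one processor the estimate equals the sequential time of 16 iterations; with more processors it shrinks until the dependence delays dominate.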


In this heuristic, every memory operation has a 
constant access time. This assumption is valid for the 
interleaved partitioning method, since each occurrence of 
a variable has the same type of memory access for every 
iteration and can be determined according to the data 
storage scheme. However, the type of memory access 
varies for each variable in the block partition. The 
evaluation algorithm assumes all the memory accesses 
are local for the block partition. The results from this 
evaluation indicate the parallelism exploited by different 
ARs. Next, the number of different types of memory 
accesses for every AR can be computed. These numbers 
determine the memory latency. Two ARs are selected
for the block partition. One exploits the best parallelism
and the other has the lowest memory latency. Both
selected ARs, along with the best AR of the interleaved
partition, are run through the simulator to determine the
best partitioning strategy.


6. Conclusion 


We exercised the QD-algorithm with loop bound N
= 64 through CAMP. Our simulator has two options
for estimating the delay for the synchronization accesses 
[6]. In option 1, local synchronized access takes 2 units 
of time and network synchronized access takes 6 units of 
time. In option 2, local and network synchronized 
accesses take 3 and 9 units of time respectively. 


The QD-algorithm has only one independent
execution set and the alignment method is not applicable.
Therefore, the partitioning with the bit-map method
was selected. The ARs for different numbers of
processors were determined by the evaluation heuristic
described in section 5. The simulation results along with
the selected ARs are shown in Table 1 and illustrated in
Figure 9, in which INT represents the interleaved
partition, BKM represents the block partition with minimum
memory latency, and BKP represents the block partition
with maximum parallelism. We can see that when the
number of processors is small, the block partition which
maximizes parallelism provides the fastest execution
time. However, when the number of processors is greater
than 16, the interleaved partition gives the best
performance. The block partition which minimizes memory
and synchronization overhead has the worst
performance.


The sequential execution time = 45056

[Table body illegible; surviving time entries: 36686, 19848, 19906, 18567]

Table 1. The simulated execution time
of the QD-algorithm


[Plot: execution time versus number of processors]

Figure 9. The plot of the execution time


CAMP is written in Pascal and runs on a
VAX-11/780 under the 4.2 BSD UNIX(a) operating system.
Presently, the program-development tool uses only a
limited number of partitioning algorithms and
synchronization primitives. Future plans include additional
partitioning schemes, especially for the partitioning of the
program segments without loops, shown as the dashed
boxes in Figure 2. Other useful synchronization
primitives, such as test&set and the full/empty bit, could be added
to CAMP. Other future work should also include the
partitioning and synchronization of programs for
message-passing multiprocessors [3].


(a) UNIX is a trademark of Bell Laboratories.

THE DESIGN OF A QUEUE-BASED
VECTOR SUPERCOMPUTER 


Honesty C. Young (1)
IBM Almaden Research Center 
650 Harry Road 
San Jose, CA 95120-6099 


Abstract-- To meet increasing demands for computing power, systems 
must exploit parallelism at all levels. A balance between the processing 
speeds of a machine’s subsystems is critical for overall system perfor- 
mance. In this paper, we describe a queue-based vector supercomputer 
(QVC) which reduces the fraction of work that the scalar mode must 
process. 


The proposed queue-based vector supercomputer has the following 
characteristics: (1) two level control is used to support out-of-order 
instruction initiation, (2) multiple classes of registers are provided, (3)
one valid bit per vector element is included to exploit flexible chaining, 
(4) a very short vector startup time (one clock period) is achieved, (5) 
branch instructions are used only for implementing high level language 
control structures, (6) elements of a queue may be read repeatedly, 
and (7) vectorization has been generalized. 


Several hard-to-vectorize loops from the Livermore Loops are used 
as the benchmark programs. The preliminary study suggests that the 
proposed queue-based vector architecture is a cost-effective way to 
implement a supercomputer for array-oriented programs. 


1. Introduction 

Queues have been included in many architectures [3,5,20,23,24], 
intended primarily for high performance numerical applications in 
scalar mode. Most commercial supercomputers [18,21,7] have some 
kind of vector buffers (registers) (the Cyber 205 [15] is one ex- 
ception: vector operations are carried out in memory-memory 
mode). In this paper, we describe a queue-based architecture for 
which the notion of a vector register is simply a special case. 


This paper is organized as follows. Section 2 motivates the 
vector processors. Section 3 describes the desired features for a 
vector processor and presents the proposed queue-based vector 
supercomputer. Section 4 compares the proposed architecture with 
the Cray X-MP [7] using some hard-to-vectorize Livermore Loops
[17,21] as the benchmark programs. Section 5 has the conclusions. 


2. Motivations of Vector Supercomputers 

It has been repeatedly observed [2,4,11,22,24,28] that for a system 
with two processing speed modes, the effective execution speed is 
limited by the slower mode, unless the fraction of workload that 
has to be done in this mode is nominal. Put differently, a good 
balance between different processing speeds is critical to the overall 
system performance. In the numerical supercomputer environment, 
the high-speed mode corresponds to vector operations, the low-speed 
mode to scalar instructions. We must push the limits on both ends 
to meet ever-increasing computing requirements. An ideal high 
performance system must not only have a fast scalar mode, but 
also a vector mode that reduces the fraction of work that must 
be processed in the scalar mode. The latter can be achieved by 
architectural innovations which facilitate the vectorization of most 
array operations. 


Recently, several queue-based processors [3,5,11,20,24] have 
been proposed to achieve fast scalar mode. The advantages of a 
queue-based scalar system have been publicized elsewhere [24,11, 
29]. Many array-oriented programs cannot be vectorized, partly 
because the underlying vector computers do not provide the ade- 
quate primitive instructions to generate vector code for many com- 
monly seen vector operations, such as vector inner product. The 
major goal of the proposed two level control, queue-based vector 
supercomputer (QVC) is to ease the task of the vectorization of 
array-oriented programs. Thus, QVC reduces the fraction of work 
that must be processed in the scalar mode. 

The major considerations behind the vector instructions are the 
following: 

1. Flynn bottleneck [9]. The performance of a computer system
is sometimes limited by the instruction initiation rate. An issue
unit is capable of initiating at most a fixed number (normally,
one) of instructions every clock period. If each instruction
does too simple an operation, this Flynn limit may become a
bottleneck. A decoupled architecture [5,11,20,24] does increase
the bandwidth of the instruction issue capability by including
multiple (two in all the cases listed above) issue units. Another
way of increasing the effective issue bandwidth is to include
powerful instructions. Each powerful instruction is able to do
many operations. Hence, issuing one powerful instruction can
achieve the same effect as issuing multiple relatively simple
ones. The issue condition of a powerful instruction, however,
does not have to be complex. In the QVC, each powerful
instruction (vector instruction) is just a repetition of a simple
scalar operation. These vector instructions still enjoy the simple
issue condition property of their scalar counterparts.

2. Out-of-order execution. The floating point unit of the IBM
360/91 [26] and the scoreboard of the CDC 6600 [25] are
examples of hardware methods to achieve out-of-order
execution. Out-of-order execution is an essential part of vector
instructions with different vector lengths and of the simultaneous
execution of vector and scalar instructions. The vector
supercomputer proposed in this study, however, provides
out-of-order initiation of scalar instructions (almost) free, in the
sense that a scalar instruction is just a special case of a vector
instruction, i.e., a scalar is just a vector of length one.


(1) The work described here was done while this author was at the University
of Wisconsin-Madison.


3. Proposed Architecture 

In this paper, we will emphasize the design of the vector mode 
operation of the QVC. Many functional units are included in the 
QVC. Each of them is implemented by a linear (fully segmented) 
pipeline [13] which eliminates the necessity of doing pipeline reor- 
ganization and avoids most structural hazards (collisions). Each 
function unit may either perform a single task (e.g., floating point 
addition) or be capable of executing several operations (e.g., logical 
operations). There is an issue unit associated with each functional 
unit, which checks for the issue conditions. Except for the Load/ 
Store instructions, all operations are carried out in a register-register 
fashion. The availability of the operands determines the issue 
conditions of the register-register instructions. For performance 
measurement purposes, we compare the QVC with a single CPU 
of a Cray X-MP having an identical set of function units (see 
section 4). 


The terms used in the figure are explained below:
C-R: Constant Registers
S-R: Scalar Registers
Q-R: Queue Registers
M-R: Mask Registers
Q-L: Queue Length Registers
I-C: Instruction Cache
INT: Interconnection Network
1st I-U: First Level Issue Unit
2nd I-U: Second Level Issue Unit
IQ: Instruction Queue
F-U i: the i-th Functional Unit


Figure 1: The Organization of a Processor with n Functional
Units.


The desired vector properties include flexible memory-referencing 
instructions, arithmetic/logic operations, vector editing instructions 
and the easily-vectorizable properties. By easily-vectorizable prop- 
erties, we mean that vector operations are applicable to the pro- 
grams which are not vectorizable on the current supercomputers. 
Regular arrays should be treated effectively, while sparse arrays 
must be handled with reasonable efficiency. We suggest that all 
the registers be shared by all functional units. We also want to be 
able to intermix vector instructions with scalar ones. For example, 
vector A is loaded into a queue-register by a vector load instruction, 
but entries of A are consumed by different scalar instructions. 
Figure 1 illustrates the QVC with n functional units. 

3.1. Vector Instruction Format

A vector instruction has the following format:

    repeat n
        inst_1;
        ...
        inst_m;

The semantics of the above repeat statement is to execute the
instruction group, inst_1 to inst_m, n times, where each inst represents
a simple scalar instruction and all instructions within an instruction
group use the same functional unit. The size of an instruction
group (i.e., m) is a compile time constant, while the length of a
vector (i.e., the number of iterations) is specified by one of the queue
length (QL) registers, and the “repeat” instruction selects the
appropriate queue length register. If the length of the vector is known
to be 1 at compile time, the repeat statement is not needed. This
is how we use the scalar mode of this architecture.

3.2. Operand Register (Queue) Sets

In addition to the status registers (e.g., the queue length register
just mentioned), four sets of operand registers are included:

1. Constant scalar registers. The rationale for these registers is
provided elsewhere [29].
2. Scalar registers. They are just like general purpose registers in
most computer architectures.
3. Queue registers. Queue registers are a set of registers, each of
which is a queue. They are rationalized in the following section.
4. Mask registers. Mask registers are similar to the queue registers
in that they are also implemented as queues. The major
difference is that each element of a mask register has only one
bit.

Any one of the scalar registers or the queue registers can be
used as a destination/source register of memory instructions.
The memory interface is treated as one functional unit.


The mechanism to handle different data dependency hazards
(RAW, WAR, WAW) for queue/mask registers is detailed elsewhere
[29].


3.3. Queue Registers 

Queue registers are a set of registers, each of which is a queue.
Only limited elements (i.e., the head and tail of the queue) are
accessible from outside. A queue register is the hardware
implementation of a “stream”. Thus operands in a queue register must
be accessed in a predefined order. Each queue register is
implemented as an array of registers. There is a valid bit associated with
each element of the queue register to indicate the data availability
of the designated element. Two pointers are associated with each
queue register. They point to the head and the tail of that particular
queue. A queue may have three possible states--Full, Empty, and
Normal--which can be checked by examining the appropriate valid
bits. When a queue is full, an attempt to put an operand onto that
queue will be blocked.


Similarly, reading from an empty queue is also blocked. One
of the advantages in having one valid bit per element is that a
queue register can temporarily hold a vector which is longer than
the queue size, as long as there are other instructions that take
operands out of the same queue register. This valid bit per element
scheme also makes chaining more flexible, because the speed of
the consumer and the producer does not have to be identical, i.e.,
chaining can be performed in an asynchronous fashion. A similar
scheme is used by the Hitachi S-810 [19] to allow flexible chaining.
The QVC has the flavor of dataflow machines [8] except that the
issue unit checks for data availability, rather than the data
availability “firing” the operation. Thus, function units can operate in
parallel as much as possible. We call this organization a control-driven
dataflow computer architecture. This control-driven dataflow scheme
is more flexible than that of most traditional supercomputers in
the sense that data availability is checked on an element by element
basis, even for vector operands. Hence, the operation of chaining
is more flexible than in the case where data availability is
checked at the vector level. On the other hand, the overhead of
the dataflow approach [10] is avoided.

One of the characteristics of queues is that an element of a
queue is discarded after it has been read. There are cases in which
we want to use an operand (a scalar or a vector) more than once.
Therefore, we include three access modes in reading from a queue:
(a) destructive mode; (b) non-destructive mode; and (c) circular mode.
In destructive mode, a read operation removes the first element
from the queue register. In non-destructive mode, the first element
of a queue remains after a read operation, i.e., a queue can be
(non-destructively) read many times while its contents are not
changed. In circular mode, the entire queue remains unchanged
after each element has been accessed. Sometimes, the access mode
is determined at run time based on the (partial) result of an
instruction (e.g., compare two queue registers). Since a queue
register is implemented as an array of elements with two pointers,
a different access mode simply suggests a slightly different
interpretation in updating the pointers.
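
The three access modes can be illustrated with a small software model. The Python sketch below is an assumption-laden illustration, not the hardware design: the class and method names are hypothetical, a blocked operation returns False or None instead of stalling, and circular mode is modeled by moving the element just read from the head to the tail, which is one possible pointer interpretation that leaves the queue unchanged after a full pass.

```python
class QueueRegister:
    """A queue register: an array of elements, valid bits, and two pointers."""

    def __init__(self, size):
        self.size = size
        self.data = [None] * size
        self.valid = [False] * size
        self.head = 0  # next element to read
        self.tail = 0  # next free slot to write

    def put(self, value):
        """Append a value; return False (blocked) if the queue is full."""
        if self.valid[self.tail]:
            return False
        self.data[self.tail] = value
        self.valid[self.tail] = True
        self.tail = (self.tail + 1) % self.size
        return True

    def read(self, mode="destructive"):
        """Read the head element; return None (blocked) if the queue is empty."""
        if not self.valid[self.head]:
            return None
        value = self.data[self.head]
        if mode == "non-destructive":
            return value  # pointers and valid bits are unchanged
        self.valid[self.head] = False  # remove the head element
        self.head = (self.head + 1) % self.size
        if mode == "circular":
            self.put(value)  # assumed: re-append the element behind the tail
        return value


q = QueueRegister(4)
for v in (10, 20, 30):
    q.put(v)
peeks = [q.read("non-destructive") for _ in range(2)]
cycle = [q.read("circular") for _ in range(3)]
drain = [q.read() for _ in range(4)]
print(peeks, cycle, drain)
```

Here the two non-destructive reads both return 10, the circular pass returns 10, 20, 30 and leaves the queue intact, and the destructive reads then drain it, with the fourth read blocked on the empty queue.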


3.4. Two-Level Control

A two-level instruction initiating scheme is adopted in this design.
The global first level issue unit decides branch outcomes and sends 
non-branch instructions to the proper instruction queues which are 
associated with the second level issue unit. For each function unit, 
there is an instruction queue and a second level issue unit. Each 
second level instruction issue unit initiates instructions in the order 
they were sent to the instruction queue (i.e., the original machine 
code textual order). Instructions in different second level instruction 
queues, however, may be initiated in a different order than that 
in which they passed the first level issue unit. The second level 
issue unit checks for data availability and issues instructions ac- 
cordingly. The implication of this two-level initiating scheme is 
that it takes two clock periods to issue a non-branch instruction, 
i.e., the work of instruction initiating is divided into two stages of
the pipeline. However, this scheme does not introduce extensive 
delay. The delay of the additional clock period in the decoding 
logic slows down a section of straight line code by at most one 
clock period. Since the branch decision is made by the first level 
issue unit, there is no execution time penalty even across basic 
blocks, as long as the expressions that participate in the branch 
decision are evaluated earlier, which can normally be done by 
properly scheduling the code (see, for example, [29]). Thus, in the 
best case, the penalty introduced by this two-level control scheme 
is one clock period per program. The worst case penalty, however, 
is one clock period per basic block. On the other hand, the first 
level issue unit sends a partially decoded instruction to the instruc- 
tion queue associated with the appropriate function unit. In other 
words, the decoding is done in two stages (the first level issue unit 
and the second level issue unit). If the instruction decoding time 
of the original design determines the clock period, the two level 
decoding scheme may imply a faster clock rate, because the in- 
struction decoding is done in two pipeline stages. The startup time 
of a vector instruction is just the time for the first level issue unit 
to send a “repeat” statement to the appropriate second level issue 
unit, which normally takes one clock period. 


The first level issue unit is blocked only because of either (a) 
a branch instruction, which can be alleviated by the prepare- 
to-branch [11] scheme, or (b) a full instruction queue of the 
corresponding functional unit. The functionality of this proposed 
two level instruction initiation scheme is similar to Tomasulo’s 
algorithm [26] except that instructions that use the same functional 
unit are executed in program order. In [29], the performance of
the two-level scheme was compared with that of Tomasulo’s
algorithm using the example given by Weiss and Smith [27] in
studying various instruction initiation schemes. In this example,
the two-level scheme and Tomasulo’s algorithm perform equally
well. The former, however, avoids the potentially expensive
associative search used in Tomasulo’s algorithm.


3.5. Vector Load/Store 

Studies [4,16] have shown that the relative ratio of vector load/store
with unit stride, non-unit stride, and random access is about 70%
: 20% : 10%. Thus, the vector load/store instruction has to
support accessing random elements of an array. A repeated load/ 
store with autoincrement can be used to access elements separated 
by a constant stride. A random access is normally represented by 
an additional level of indirection, i.e., the addresses of the needed 
elements are put in another array. This random access is supported 
by the following two vector instructions: (a) put addresses of 
needed elements onto a queue register (call it %Qq); and (b) do
a vector load using addresses in %Qq. 
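
The two kinds of non-unit-stride access can be sketched in software as follows. This is a minimal illustration only: memory is modelled as a Python list, and the function names are hypothetical, not QVC instruction mnemonics.

```python
def load_strided(memory, base, stride, n):
    """Repeated load with autoincrement: n elements separated by a constant stride."""
    result = []
    address = base
    for _ in range(n):
        result.append(memory[address])
        address += stride  # autoincrement of the address register
    return result


def load_indirect(memory, address_queue):
    """Step (b): vector load using the addresses held in a queue register."""
    return [memory[address] for address in address_queue]


memory = list(range(100, 120))   # memory[k] == 100 + k
index_array = [7, 0, 3]          # addresses of the needed elements
address_queue = list(index_array)  # step (a): put the addresses onto %Qq

strided = load_strided(memory, base=2, stride=3, n=4)
gathered = load_indirect(memory, address_queue)
```

Here `strided` holds the elements at addresses 2, 5, 8, and 11, while `gathered` holds the elements at the arbitrary addresses 7, 0, and 3.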


Simultaneous vector load/store operations may cause undesirable 
memory overlap hazard conditions (that is, read before write or 
write before read). One solution for eliminating such hazards is to 
have a smart memory system detect the conflicts. Another way 
is to have the software determine the cases where the hazards may 
occur and assure sequential execution whenever necessary. 


3.6. Comparison Instructions 

The contents of a mask register can be loaded by doing an element- 
by-element vector comparison. Given two vectors A and B of 
length n, the result of the vector comparison is stored in M. The 
semantics are that M_i gets the comparison result of A_i and B_i. The
result of a comparison, then, is treated as an ordinary operand.
We term such a comparison a logical one in the sense that the result of
the comparison, rather than one of the operands, is returned. This
scheme is used by some supercomputers (e.g., Cray-1) to set up a 
mask register. 


The scalar logical comparison is useful in evaluating Boolean 
expressions without using branches, i.e., logical instructions such
as AND, rather than branches, are used to evaluate complex 
Boolean expressions. Branches are needed only to construct high 
level program structures. 


There are cases where one of the operands involved in the 
comparison, rather than the logical comparison result, is needed. 
We propose another set of comparisons called contents comparisons. 
This corresponds to the if-expression in some programming lan- 
guages. The (vector) contents comparison is generally useful in 
coping with sorting-related problems. 
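
The difference between the two kinds of comparison can be shown with a short sketch. The particular relations chosen here (greater-than for the logical comparison, maximum and minimum for the contents comparison) are assumptions made for illustration; the paper does not list the exact instruction set.

```python
def logical_compare_gt(a, b):
    """Logical comparison: returns the comparison result for each element pair."""
    return [x > y for x, y in zip(a, b)]


def contents_compare_max(a, b):
    """Contents comparison: returns one of the operands (here, the larger one)."""
    return [x if x > y else y for x, y in zip(a, b)]


def contents_compare_min(a, b):
    """Contents comparison returning the smaller operand of each pair."""
    return [y if x > y else x for x, y in zip(a, b)]


a = [3, 1, 4]
b = [2, 7, 4]
mask = logical_compare_gt(a, b)    # usable as a mask register
upper = contents_compare_max(a, b)
lower = contents_compare_min(a, b)
```

Taken together, `lower` and `upper` form an element-wise compare-exchange step of the kind used in sorting networks, and no branch is needed to compute them.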

3.7. Vector Editing Instructions 

Different vector editing instructions can be realized by the re- 
peat statements (vector instructions) with slightly different accessing 
modes described earlier. It is also possible to include scalars within 
a vector editing instruction. Merge (combining two vectors into 
one) and split (the opposite of merge) can also be carried out by 
two repeat-statements with proper adjustments to the mask register. 


The different vector editing functions available in the commercial 
supercomputers are nicely surveyed by Hwang and Briggs [12]. 
We make no attempt to specify all vector editing instructions 
completely in this paper. 

3.8. Intrinsic Functions 

Some vector operations, such as vector sum (the summation of all 
elements in a vector), are inherently difficult to vectorize. One 
possible way to cope with these hard-to-vectorize operations is for 
the compiler to generate scalar instructions [6] which implies rel- 
atively low performance. Another possible approach is to include 
a large set of vector macro instructions and hope that most hard- 
to-vectorize operations are covered by the vector macros [15]. In 
the QVC, most vector macros can be composed by several of its 
vector instructions. The realization of the following vector macros 
can be found in [29]: (a) vector sum; (b) inner product; (c) linear 
recurrence; (d) maximum/minimum; (e) search; and (f) sorting. 

We believe that most, if not all, programs can take advantage 
of this queue-based vector supercomputer. However, advanced 
compiler techniques are required to fully utilize the potential par- 
allelism provided by this proposed architecture. 


4. Performance Evaluation 


In this section, we compare the performance of a single CPU of a 
Cray X-MP [7] with a similar architecture with the proposed ex- 
tension. In particular, there are the same number of vector registers 
in the Cray X-MP as the queue registers in the QVC. The identical 
set of function units are included in both the Cray X-MP and the 
QVC. For QVC, we assume two sets of execution times for the 
functional units. One is that the execution time (in terms of clock 
period) of each functional unit in the QVC is identical to that of 
the Cray X-MP. The other is to include an additional delay of 
one clock period when the data is sent through the interconnection 
of the QVC (i.e., two additional clock periods are added to a
register-register operation). One last point is that we ignore the
delays because of memory bank conflict, i.e., we assume the data
are nicely distributed in the memory banks and all memory access
requests are serviced in a predetermined amount of time (i.e., 14
clock periods to do a scalar load). 


Riganati and Schneck [21] summarize the supercomputer performance
reported on Livermore Loops. The execution speed of loops 4, 5, 6,
11, 13, and 14 is slower than that of the other loops. In fact,
the performance of these 6 loops is below 50 megaflops (millions of
floating point operations per second). In other words, if the Livermore
Loops are used as the workload, the execution time of the afore- 
mentioned loops dominates the total execution time. Loop 5 and 
loop 6 are essentially the same provided the compiler does the 
induction variable analysis [1]. Therefore, we use loops 4, 5, 11, 
13, and 14 as the benchmark programs. 


In Table 1, the execution times (in clock periods) of the 5 loops 
on 4 different configurations are shown. There are two columns 
associated with the Cray X-MP. The “Cray” column has the 
execution times of the loops while the scalar scheduling is done 
within the loop boundary -- we do not employ “loop carry-over” 
optimizations, such as loop unrolling, other than the ones specified 
in the source Fortran code. The “Cray(SP)” column has the
execution times where the software pipelining² technique (using the
algorithm described in [29]) is applied. The “QVC” column has
the execution time of this queue-based vector computer. The
“QVC+” column has the execution time of the QVC, while an
additional clock period is needed for an operand to “penetrate”
through the QVC’s interconnection network. We apply some known
optimization techniques, by hand, to the Cray code. In particular,
we try to partially vectorize the loops whenever appropriate. For
example, the floating point multiplication of loop 4 (X(LW)*Y(J))
is carried out using vector instructions. Also, vectors Y & Z of
loop 5, vector Y of loop 11, vectors P(1,IP) & P(2,IP) of loop 13,
and vectors GRD, VX, & XX of loop 14 are block-preloaded into the
appropriate vector registers using the vector load instructions. That
is, we use the vector registers as data buffers between the processor
and the memory system whenever appropriate.


Table 1. The Execution Times for Different Configurations.
[Table layout not recoverable. Surviving entries, in extraction order:
15148; 12275; 13090, 12066, 1386; 22736, 20636, 3645; 18129; 12963;
20261, 6009.]


For loop 4, the QVC outperforms the Cray X-MP primarily
because of the following:
• Mixed mode operation. In the Cray X-MP, scalar instructions
  must be used to get an element from a vector register. Three
  instructions (a vector to scalar move, an increment to the index
  register, and a branch instruction) are needed for each result
  stored in a vector register. In other words, the overhead to get
  an element out of a vector register is on the order of 10 clock
  periods. (The execution time of the vector register to scalar
  register data move instruction [076ijk] is 4 clock periods.)
• Branches. Because we cannot express a linear recurrence in a
  vector form on the Cray X-MP, branch instructions are used to
  carry out the loop. The overhead may be as many as 5 clock
  periods per iteration (for an in-buffer condition).
• Chaining. The first element of the vector operation results, stored
  in a vector register, cannot be taken from a vector register until
  the vector operation has completed. The chaining on the mixed
  mode operation is not flexible enough.


For loops 5 and 11, the difference in the execution time comes 
from the following: 
• Mixed mode operation. Though loop 5 is hard to vectorize due
  to the first order linear recurrence, arrays Y and Z should be
  able to be “block-preloaded” into some vector registers. Because
  the vector mode and the scalar mode on the Cray X-MP are
  not readily compatible (i.e., additional move instructions are
  needed to move each element from a vector register to a scalar
  register), the mixed mode operation cannot be carried out
  efficiently.
• Branches.
• Store. In the Cray X-MP, a store instruction cannot be issued
  until the result is available. In the two level control scheme,
  however, the first level issue unit can issue a store instruction
  as long as the instruction queue associated with the store
  functional unit is not full. Therefore, instructions following the
  store instruction can be issued, by the first level issue unit, even
  before the operand for the store instruction has been computed.
  Thus less instruction blockage will occur in the two level control
  scheme than in the one level control scheme.
• Loop unrolling. The source code of loop 5 is unrolled three
  times. Because of the limited number of S registers in the
  Cray X-MP (8 of them), some instruction issuing blockages come
  from the static register assignment. As stated in [29], queues
  provide the dynamic register assignment property. Thus,
  unnecessary data dependencies, owing to static register assignment,
  can be avoided.
Software pipelining, however, overlaps the time of the mixed mode 
operation (load from a vector register), the branches, and the store 
instruction blockages. Thus, for loops 4, 5 and 11, the performance 
improvement due to software pipelining is significant. 


Table 2. The Relative Performance for Different Configurations. 


2 One simple kind of software pipelining is the pre-loading and post-storing of
operands.


The discrepancy in the execution times of loops 13 and 14 
comes from the following: 
• Memory indirection load. Memory indirection load is not well
  supported in the Cray X-MP.
• Scalar variable expansion [14]. A scalar variable can easily be
  expanded to a vector in the QVC by allocating such a scalar in
  a queue-register. By so doing, the variable from different
  iterations occupies different locations in a queue-register.
Two major constraints that also limit the performance of loops 13
and 14 on the QVC are:
• Number of functional units. The performance is bounded by the
  number of load and floating point addition pipes. This can be
  remedied expensively by adding functional units to the processor.
  The additional functional units, however, may slow the clock
  cycle down.
• Data conversion. It takes several instructions to do data format
  conversion between integer and floating point numbers. This,
  however, is not an essential limit of the architecture. If data
  conversion happens frequently, we can add a special data
  conversion pipe to the architecture.


In Table 2, we compare the relative performance for different
configurations. Looking at the QVC+ : Cray(SP) column, we find
that the Cray X-MP outperforms the QVC+ only for loop 5. The
data dependency graph of loop 5 is very deep, i.e., every instruction 
depends on the result from the previous instruction. In other 
words, the data dependency, rather than the issue unit, limits the 
execution speed. Thus, the QVC with additional delay runs slower 
than the Cray X-MP. In most cases, however, the advantage of 
the two level control and the queue-based vector register still 
outweighs the delay going through the interconnection. Looking at 
the QVC : QVC+ column, we find that for loops 4, 5, and 11,
where the execution time is limited by the data dependencies, the 
performance penalty introduced by the one clock period delay 
going through the interconnection is about one third. (Recall that 
the execution times of a floating point addition and a floating point 
multiplication are 6 clock periods and 7 clock periods, respectively. 
The time to go through two interconnections [one on the input 
side, the other on the output side] is 2 clock periods, which is 
about one third of the execution times of the floating point oper- 
ations mentioned above.) On the other hand, for loops 13 and 14, 
where the execution time is limited by the availability of the 
functional units, the performance penalty caused by the delay in 
the interconnection is between 2% and 3%. 
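
The "about one third" estimate follows directly from the latencies quoted above. The short calculation below reproduces it; the function name and its parameters are introduced only for this example.

```python
def relative_interconnect_penalty(op_latency, crossings=2, delay_per_crossing=1):
    """Extra interconnection delay as a fraction of the functional unit latency.

    op_latency is in clock periods. One crossing is assumed on the input side
    and one on the output side, each costing one clock period (the QVC+ case).
    """
    return crossings * delay_per_crossing / op_latency


FP_ADD_LATENCY = 6       # clock periods, as quoted in the text
FP_MULTIPLY_LATENCY = 7  # clock periods, as quoted in the text

add_penalty = relative_interconnect_penalty(FP_ADD_LATENCY)
multiply_penalty = relative_interconnect_penalty(FP_MULTIPLY_LATENCY)
```

The addition penalty is 2/6 and the multiplication penalty is 2/7, i.e., roughly 33% and 29% of the operation time when a dependence chain exposes the full latency of every operation.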


For highly vectorizable loops (e.g., loop 7), the performance is 
limited by the availability of the functional unit. In this case, the 
extra clock period delay going through the interconnection will 
slow down the entire system by only a small fraction. 


5. Conclusions 

This queue-based vector supercomputer supports out-of-order in- 
struction initiation and flexible vector chaining. Its simple vector 
instruction format implies very short vector startup time (one clock 
period). This simple format is nevertheless very powerful when used
with the queue registers and different access modes. As demonstrated
earlier in this paper, many array-oriented programs are vectorizable 
under this proposed architecture. The destination non-blocking 
interconnection network avoids the unnecessary result bus conflicts 
in the case where a shared result bus is used for all functional 
units (in Cray X-MP, all results going to S registers share the same 
bus). With the logical/contents comparisons, branch instructions
are needed only to carry out the high level language control
structures. Access modes are provided so that it is possible to
read the elements of a queue repeatedly in a variety of ways. 


The preliminary study suggests that this proposed architecture 
is a cost-effective way to implement a processor for array-oriented 
programs. The major remaining problem is the design of compiler 
techniques to fully utilize the suggested features automatically. 
ABSTRACT


To satisfy the growing need for computing power,
a high degree of parallelism will be necessary in
future supercomputers. Up to the late 70s,
supercomputers were either multiprocessors
(SIMD-MIMD) or pipelined monoprocessors. Future
industrial realizations should combine these two
levels of parallelism. In a multiprocessor,
classical pipeline controls become inefficient
because the interdependent behavior of the
processing elements cannot be foreseen either at
compile time or at decode time.

In this paper, we introduce a new model of
pipeline architecture : the Data Synchronized
Pipeline Architecture (DSPA). Based on an
independent sequencing of the functional units, this
model allows a high degree of parallelism in the
pipeline, even in the case of unforeseeable
behaviors of some resource.

I Introduction 


The need for computing power seems unlimited in
various scientific applications. During the last
ten years, tremendous progress has been made in the
domain of component integration. But today’s
supercomputer clocks are of the same order of
magnitude as those of supercomputers ten years ago.
E.g., the clock of the Cray2 (1985) is only three
times faster than that of the Cray1 (1976).

Requirements for performance have led
manufacturers to the design of parallel structures.
The first industrial parallel supercomputers were
pipeline processors (Cray1, CDC Cyber 205, ...).
Today, these pipeline computers can be considered as
the state of the art in monoprocessor architecture.
Since the late 1970’s, many multiprocessor
projects have been initiated [5][6][14][7]. In
future industrial realizations of these ambitious
projects, the elementary processors will be pipeline
processors. Great care must be taken in the
design of these pipeline processors.

In a pipeline computer, the execution of an
instruction stream generates concurrent activities
in several functional units (FUs). These FUs may
be pipelined. Even when the FUs are not pipelined,
the successive FUs crossed by the data form a
macropipeline. The performance of the computer depends
heavily on the overlapping of successive
instructions and is bounded by the throughput of
the instruction decoder. In the case of vector
instructions, simple pipeline control is possible
because of the regularity of the data and
instruction streams. Optimization of the code can
be done at compile time. For example, the performance
of the Cray1 can be considered as near optimum on
classical vector instructions. That is the reason
why pipeline computers are generally referred to as
vector processors. Unfortunately, the performance of
existent pipeline computers on scalar instructions
is not as good as one might expect. The overlapping
of successive instructions cannot always be decided
at compile time because memory access conflicts or
hazards may arise at execution time. Because in a
multiprocessor architecture some hardware is shared,
the behaviors of the processing elements are
interdependent. Moreover, if decode time for vector
instructions is small regardless of the occupation
of the FUs, very fast algorithms have to be used for
pipeline control in scalar mode, because the
decoding delays become critical.

In section II, we recall the problems which exist
on some classical pipeline computers. Then, in the
following sections, we present a new model of
pipeline architecture : the Data Synchronized
Pipeline Architecture (DSPA). This model has been
developed in order to improve the performance of
pipeline computers on scalar code. We also point
out that using this model, a distributed decode of
instructions allows very fast sequencing and
efficient overlap for both vector and scalar
instructions even in the cases where memory
conflicts or hazards may occur.


II Existent pipeline models 


1 Outlines 

Today, a pipeline processor can be used as a
monoprocessor (Cray1, FPS 164, Cray XMP-1, Fujitsu
VP 200, ...) or as an elementary processor in a
multiprocessor system (Cray XMP-4, Cray2, BSP, ...).
The BSP [10] is a SIMD computer; its elementary
processors (PEs) are pipelined. A set of vector
instructions has been defined by the designers.
These instructions are memory to memory instructions
(operands are read from memory and the result is
stored in memory). When the machine was
defined, the behavior of these instructions was
studied and optimized (by introducing delays
between two functional units, for example);
reservation tables [1][9][8] are stored in the
machine. The SIMD structure of the BSP allows global
pipeline control for all PEs. In general, the
distinct FUs do not have the same number of stages
and many cases of memory conflicts may occur; even
with a restricted number of vector instructions, the
number of cases to be studied would increase
exponentially with the number of operands involved
in the instruction. In the BSP, the number of cases
to be studied remains quite limited : all the FUs
have the same number of stages (the reservation
table does not depend on the FUs but only on the
number and the distribution of distinct FUs
concerned with an instruction), and memory conflicts
are almost avoided by a constant increment
definition of a vector and the choice of a prime
number of memory banks [11][12]. On the BSP, a good
overlapping of distinct iterations of the same
vector instruction is performed; but there is no
overlapping of the end of a vector instruction by
the beginning of another instruction : the
pipeline must be filled at the beginning of an
instruction, and emptied at the end. Memory to
memory instructions may be quite efficient on vector
operands, but scalar instruction overlapping has to
be done with other techniques, while the restricted
choice of vector definition increases the ratio of
scalar code (e.g. loop 3 proposed later must be
executed in scalar mode).

In today’s multiprocessors, the behavior of a
shared memory cannot be foreseen either at compile
time, or at execution or decode time : it depends
on the different processes being executed by the
different elementary processors. Even when vector
accesses are synchronized, memory conflicts may
occur at any time; memory behavior depends on the
relations in the distribution of the vector
elements, on the relations between two successive
vector accesses and on the treatment of RAW (Read
After Write) hazards [13] : memory to memory
architectures are not adapted to high speed scalar
execution and asynchronous PE controls.

We present the problems that remain on two 
existent pipelined monoprocessors. 


2 Vector register machines 


In register machines, the operands of an
instruction are loaded into the functional units from
registers and the results are stored in registers.
In some pipeline computers, such as the Cray1 [3], there
are vector registers; a vector register may contain
up to N words (in the Cray1, N=64). A single instruction
with vector register operands may generate up to N
times the same operation on distinct data; the results
are then stored in a vector register. This approach
seems very efficient on pure vector instructions.


Loop 1 :
      DO 1 I=1,N
    1 A(I)=B(I)+C(I)*D(I)

Only seven instructions are necessary to code the
body of this loop for the Cray1.

      Vector load C --> R1
      Vector load D --> R2
      R3 <-- R1*R2
      Vector load B --> R4
      R5 <-- R3 + R4
      Vector store R5 --> A
      Conditional jump



Time to decode such a loop is not critical : up to
256 accesses to memory are generated by these seven
instructions and the Cray1 can decode one
instruction per cycle.

Unfortunately, accesses to memory on the Cray1 are
performed in the order of the decode sequence;
moreover, no advance is taken on the decode
sequence : the next instruction is decoded only
when the present instruction is initiated. An
instruction is initiated only when the presence of
all its operands is guaranteed; as there is no
hardware means to verify whether the i-th element of a
vector is present or not, to initiate a vector
instruction, the following condition must be
guaranteed :
For every i, i cycles later the i-th element will be
present.
These restrictions explain why 353 cycles are
necessary to execute an iteration of loop 1 while
the memory is only busy during 256 cycles. They
also induce the same definition of a vector as in
the BSP. Loop 2 is then treated in scalar
mode :

Loop 2 :
      DO 2 I=1,N
    2 A(I)=B(H(I))+C(I)



The Cray1 is about ten times slower than the Fujitsu
VP200 [15] for the execution of this loop; because
vector registers of addresses can be computed, the 
Fujitsu VP200 executes this loop as a vector loop. 
But loop 3 is executed in scalar mode on both 
machines : 


Loop 3 :
      DO 3 I=1,N
    3 A(I)=A(H(I))+B(I)*C(I)


Hazards may occur in this loop; as H(2) may
be equal to 1, A(1) has to be written before A(H(2))
is read (at least the hardware should avoid reading
A(H(2)) before writing A(1) if H(2)=1). The loop
must be coded in scalar mode.
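
The hazard can be made concrete with a small experiment. The sketch below compares the sequential semantics of loop 3 with a naive vector execution that gathers every A(H(I)) before any element of A is written. It is an illustration only: the function names are invented for this example, and indices are 0-based, so the case H(2)=1 of the text corresponds to h[1] == 0 here.

```python
def loop3_sequential(a, b, c, h):
    """Scalar semantics of loop 3: each iteration sees all earlier writes to A."""
    a = list(a)
    for i in range(len(h)):
        a[i] = a[h[i]] + b[i] * c[i]
    return a


def loop3_naive_vector(a, b, c, h):
    """Naive vector execution: all reads of A happen before any write."""
    gathered = [a[h[i]] for i in range(len(h))]
    return [gathered[i] + b[i] * c[i] for i in range(len(h))]


a = [1, 1, 1]
b = [1, 2, 3]
c = [1, 1, 1]

h_hazard = [0, 0, 1]  # iteration 1 reads a[0], which iteration 0 has just written
h_safe = [0, 1, 2]    # every iteration reads only its own element

sequential_result = loop3_sequential(a, b, c, h_hazard)
vector_result = loop3_naive_vector(a, b, c, h_hazard)
safe_match = loop3_sequential(a, b, c, h_safe) == loop3_naive_vector(a, b, c, h_safe)
```

With `h_hazard` the two executions disagree, whereas with `h_safe` they agree. Since the contents of H are known only at execution time, software alone cannot decide which case applies.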

In scientific programs, some part of the code
always remains scalar. If performance in scalar
mode is low compared to vector performance, the
execution time of the residual sequential code may
become predominant. On the Cray1, for example, when
loop 1 is executed in scalar mode, about 25 cycles
are necessary to decode the loop body : the
theoretical throughput of the memory is then more
than six times higher than the real throughput!
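
The figures quoted for loop 1 can be checked with a few lines of arithmetic. The calculation below assumes that each iteration of loop 1 makes four memory accesses (three loads and one store, consistent with 256 accesses for 64 elements) and that the memory can sustain one access per cycle (consistent with the memory being busy for 256 cycles); the variable names are introduced for this example.

```python
ACCESSES_PER_ITERATION = 4   # loads of B, C, D and the store of A (assumed)
VECTOR_LENGTH = 64           # N for the Cray1 vector registers
PEAK_ACCESSES_PER_CYCLE = 1  # assumed: one memory access per cycle

# Vector mode: 256 accesses are spread over 353 cycles.
vector_accesses = ACCESSES_PER_ITERATION * VECTOR_LENGTH
vector_memory_utilization = vector_accesses / 353

# Scalar mode: about 25 cycles to decode the body of one iteration.
scalar_accesses_per_cycle = ACCESSES_PER_ITERATION / 25
peak_to_scalar_ratio = PEAK_ACCESSES_PER_CYCLE / scalar_accesses_per_cycle
```

Under these assumptions the memory is busy about 72.5% of the time in vector mode, and in scalar mode the peak throughput is 6.25 times the achieved throughput, which matches the "more than six times" of the text.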

Even in the case of a register machine where
advance can be taken at decode [16][17], it is
very difficult to keep all the FUs busy in scalar mode :
parallel decode is quite impossible for register
machines. The decoder will always remain a
bottleneck for performance in scalar mode for these
machines : that is the reason why they are
considered as vector machines.


3 Microcoded fully connected pipelined machines 


In microcoded machines, an instruction word
contains a subinstruction parcel for each of its
FUs. When all operands are loaded from registers,
the required register throughput becomes tremendous and
unrealistic. Operands for a FU may be directly
taken from the output buses of the FUs. On a fully
connected machine, each output of a FU is connected
to each input of the FUs; some connections are not
as useful as others : e.g. the path between the
output of the floating point adder and the input of
the address unit is very rarely used; such paths may
be suppressed (fig. 1).

The FPS 164 of Floating Point Systems [2] may be 
considered as a “nearly” fully connected pipeline 
processor. Each of its FUs can initiate an 
instruction on every cycle. As the decoder can
decode an instruction word on every cycle, the 
FPS 164 can run at full speed on scalar code. 


In an instruction word, the subinstruction
parcel for each FU is always located at the same
place. For example, the subinstruction for the
adder can always be decomposed into three subparcels :
- Origin of the left operand (i.e. the bus on
  which it is present)
- Origin of the right operand
- Operation code


The width of an instruction word for the FPS 164
is only 64 bits; it contains an order for each FU
(see [2]). Controls of the crossbar network are
given by the origins of the operands for the
different FUs; synchronization of the FUs is
explicit : the orders decoded in a cycle are
initiated at the next cycle. A datum must be
present on the output bus of a FU at the foreseen
cycle. Delays to cross most of the FUs are constant
and the time at which a datum exits a FU can be
foreseen. Unfortunately, unforeseen memory bank
conflicts may arise on an interleaved or shared
memory; in this case, one solution consists of
stopping the clock for the other FUs.

It is very difficult to produce efficient code
for the FPS 164 because of the explicit
synchronization by the microcode between the FUs.
All the instructions are scalar, i.e. an instruction
initiates at most one order per FU. This is not
critical because an instruction word can keep all
the FUs busy; but automatic production of dense code is
very difficult. Moreover, efficient code cannot be
produced without unrolling loops. E.g., on the FPS 164,
the loop body of the inner dot product may be coded
as a single instruction loop preceded by about ten
instructions to fill the pipeline and followed by about ten
instructions to empty the pipeline. Unfortunately,
no general compiling techniques can be used to
unroll loops. To allow high performance on the
FPS 164, Floating Point Systems proposes a library
of frequently used functions and procedures which
have been hand coded (BLAS). Performance is
increased by a factor of two when coding LINPACK
using BLAS instead of coding it in classical FORTRAN
[4]. When hazards may occur (e.g.
loop 3), loops cannot be unrolled : reads and
writes cannot be exchanged by software.


We have pointed out the major limitations of
some existent pipeline machines. The performance of
existent vector register machines such as the Cray1 drops
dramatically as soon as “good” conditions disappear
(scalar code, memory bank conflicts, scatter-gather,
possible hazards, ...). In scalar mode, performance
is limited by the decoder : at each cycle, only
one instruction can be decoded and then only one FU
can be activated. Moreover, the decoder cannot make
any advances and no overtaking of writes by reads is
allowed on memory. Fully connected microcoded
pipeline processors can run at full speed in scalar
mode, but they present insurmountable difficulties
for efficient code production. For both models of
machines, parallelism between the FUs is generated
by software : possible hazards (e.g. loop 3)
induce scalar execution, and overlapping of the end of a
loop by the beginning of the next loop is rarely
performed and is always poor.
fig. 1 : A data interconnection scheme in the FPS 164


In section III, we present a new model of
pipeline machine. As in the FPS 164, an instruction may
contain a subinstruction parcel for each FU and, as
the executions of these subinstructions are totally
independent, a distributed decode can be done;
another parallel decoder is also proposed. Vector
and scalar instructions are available for this model
of machine; a good overlapping of distinct
instructions is possible whether they belong to the same
iteration of a loop or not and even when they do not
belong to the same loop.


III The Data Synchronized Pipeline Architecture
(DSPA)

1 Goals


The basic problems which have led us to define a
new model of pipeline machines were developed
in the previous section.

As we have already pointed out, todays’ pipeline 
processors must be designed to be included as 
elementary processors in a multiprocessor computer : 


MIMD and SIMD computers. Using fully connected 
microcoded pipeline processors as_~ elementary 
processors in such architectures would be 
unrealistic : when a datum cannot be delivered by 


the shared memory, the clock would be’ stopped for 
the other FUs of the elementary processor; this is 
unacceptable and has lead us to express a first 
condition that we want to be satisfied by the design 
of a pipeline processor : 


Condition 1 : The sequencings of the different FUs 
of a pipeline processor have to be independent. 

When this condition is satisfied, a delay in the 
delivery of a datum for a FU will not necessarily 
block the other FUs. Also, an instruction is not 
necessarily initiated at the cycle where its 
operands are produced, and the production of the 
operands of an operation does not have to be 
synchronized. 


In section I, we have seen that the decoder 
becomes a bottleneck for performance in scalar mode 
for vector register pipelined machines, so we 
propose a second condition : 

Condition 2 : Parallel decode is necessary on 
pipeline processors. 

We have also noticed that performance on the 
FPS 164 depends heavily on hand coded libraries. We 
wish to design machines able to reach correct 
performance on a very large set of scientific 
programs. This has led us to a third condition : 

Condition 3 : Natural code generation (by a 
compiler) must produce efficient code. 

Two hardware tools seem to favor this goal : 
possible advance on decode and a RAW (read after 
write) detection mechanism to enable reads to 
overtake writes on memory. 
2 The DSPA model 
We have rejected register pipeline machines 
because of the tremendous throughput demand on the 
registers and on the decoder to achieve performance 
in scalar mode. A Data Synchronized Pipeline 
processor is a "nearly" fully connected pipeline 
processor in which a FIFO (First In, First Out) 
queue is associated with each crosspoint on the 
interconnection scheme between the producers 
(outputs of the FUs) and the consumers (entries of 
the FUs). Except for the sequencer, each FU also has 
a FIFO queue of instructions (fig. 2). When a datum 
flows out from a producer P, it is stored in a FIFO 
queue associated with a pair (P,C) where C is a 
consumer, i.e. an input of a FU F : the datum is 
stored in this FIFO queue until F is ready to treat 
it (fig. 3). An instruction for a FU may be 
decomposed in three parcels : 

-Origins of the operands, i.e. names of producers; 
the FIFO queue associated with the path between the 
producer and the input of the FU is referenced. 

-Destination of the result, i.e. the name of a 
consumer. 

-Operation code. 

The sequencing of the FUs is very simple. 

We detail here the sequencing of the adder (fig. 4) : 

1.If the FIFO queue of instructions is empty then 
GOTO 1. 

2.Load the instruction in the adder's sequencer 
and decode it. 

3.If one of the origin FIFO queues is empty then 
GOTO 3. 

4.Load the operands; initiate the operation. 

5.Store the result in the referenced FIFO queue 
and GOTO 1. 


These five steps may be pipelined : 

-Step 5 is always overlapped by the other steps; 

-The next instruction, when existing in the FIFO 
queue, may be decoded during steps 3 and 4; 

-Operands may always be loaded and the operation 
initiated : this operation can be aborted if the 
operands are not valid. 
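
The five sequencing steps can be illustrated by a small software model. The following is only a sketch under stated assumptions: the class name `FunctionalUnit`, the FIFO names such as `"M->+l"`, and the one-instruction-per-call granularity are hypothetical and are not taken from the paper; no pipelining of the steps is modelled.

```python
from collections import deque


class FunctionalUnit:
    """Toy model of one data-synchronized FU (here used as an adder)."""

    def __init__(self, operation):
        self.operation = operation
        self.instructions = deque()   # instruction FIFO queue of the FU

    def step(self, fifos):
        """Try to execute the oldest instruction; return True on progress."""
        if not self.instructions:                 # step 1: no instruction
            return False
        left, right, dest = self.instructions[0]  # step 2: load and decode
        if not fifos[left] or not fifos[right]:   # step 3: wait for operands
            return False
        a = fifos[left].popleft()                 # step 4: load the operands
        b = fifos[right].popleft()
        fifos[dest].append(self.operation(a, b))  # step 5: store the result
        self.instructions.popleft()
        return True


# Hypothetical crosspoint FIFOs: memory to adder inputs, adder to memory.
fifos = {"M->+l": deque(), "M->+r": deque(), "+->M": deque()}
adder = FunctionalUnit(lambda a, b: a + b)
adder.instructions.append(("M->+l", "M->+r", "+->M"))

blocked = adder.step(fifos)    # decoded before its operands exist: it waits
fifos["M->+l"].append(2.0)
fifos["M->+r"].append(3.0)
done = adder.step(fifos)       # operands present: the addition is performed
print(blocked, done, list(fifos["+->M"]))
```

The first call shows that an instruction decoded ahead of the production of its operands simply waits; nothing else in the model is stopped.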


3 Deterministic execution 


In classical machines, instructions are executed 
in the same order they are decoded : this 
guarantees the unique interpretation of a sequence 
of instructions. This constraint is respected on 
the FPS 164 : instructions are immediately executed. 

In our model, the FUs are synchronized by the 
data; an instruction may wait for its operands 
during a few cycles : there is no reason to decode 
the memory load of an operand for an addition before 
decoding the addition; if the adder is not busy, it 
can wait for the datum. The only significant order 
in this model of pipeline machines is the order of 
the data which enter the same FIFO queue. 


An example of DSPA integration scheme 
fig. 2 

Crosspoint FIFOs 
fig. 3 

LI : Left Input 
RI : Right Input 
OP : Operator 
SOP: Operator's Sequencer 
IF : Instruction FIFO 
OUT: Output Bus 

A typical functional unit 
fig. 4 
A FIFO queue is associated with only one 
producing functional unit. The instruction FIFO 
queue of a FU guarantees that this FU initiates its 
operations in the same order it receives its 
instructions : it is reasonable to impose the 
constraint that the results flow out from the FU in 
the same order that their production is initiated : 
in most cases, the delay to cross a FU is constant. 

This constraint guarantees that a code sequence 
has a unique meaning. 


4 DSPA interleaved memory 


In the DSPA model, the memory FU can be 
considered as a producing unit (reads) and also as 
a consuming unit (writes). The DSPA model imposes 
the constraint that the results of the reads flow 
out from the memory in the same order that the reads 
have been decoded. On an interleaved memory, this 
may dramatically decrease the throughput of the 
memory if the reads are really done in the same 
order they are decoded; let us suppose a four-bank 
interleaved memory whose banks are busy during 
four cycles for a read and let us consider the 
following distribution of read requests : 

read 1 and 2 on bank 0 
read 3 and 4 on bank 1 
read 5 and 6 on bank 2 
read 7 and 8 on bank 3 

The results flow out from the banks in good order if 
the reads are initiated in the following order : 

read 1 is initiated on cycle 1 
read 2 is initiated on cycle 5 
(bank 0 is busy on cycles 2, 3, 4) 
read 3 is initiated on cycle 6 
read 4 is initiated on cycle 10 
read 5 is initiated on cycle 11 
read 6 is initiated on cycle 15 
read 7 is initiated on cycle 16 
read 8 is initiated on cycle 20 


In this extreme case, the real throughput of the 
memory is only two fifths of the theoretical 
throughput. 
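
The schedule of this example can be reproduced with a few lines of code. This is a minimal sketch, assuming at most one read is initiated per cycle and that a bank accepts a new read four cycles after the previous one started; the function name `in_order_initiation` is hypothetical.

```python
def in_order_initiation(banks, busy_cycles):
    """Cycle at which each read starts when reads are initiated strictly
    in decode order, at most one per cycle."""
    bank_free = {}      # first cycle at which each bank can start a new read
    starts = []
    cycle = 0
    for bank in banks:
        cycle = max(cycle + 1, bank_free.get(bank, 1))
        starts.append(cycle)
        bank_free[bank] = cycle + busy_cycles
    return starts


requests = [0, 0, 1, 1, 2, 2, 3, 3]        # banks addressed by reads 1 to 8
starts = in_order_initiation(requests, busy_cycles=4)
throughput = len(requests) / starts[-1]     # reads per cycle
print(starts, throughput)
```

Eight reads are initiated in 20 cycles, i.e. two fifths of the one-read-per-cycle theoretical throughput.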

We propose a design of an interleaved memory 
which increases the real throughput of the memory 
and which respects the DSPA philosophy (fig. 5). 
Addresses (and data for the write instructions) flow 
out from the access unit (AU) in good order and 
enter the FIFO queue of requests for the desired 
banks. Bank i loads its requests from its FIFO 
queue, then read data are stored in a FIFO queue 
associated with bank i. Then the reordering unit 
(RU) reads the data from the desired FIFO queues. The 
data flow out from the memory unit on the memory 
output bus (MO) in the desired order. 

One can easily verify that when the sequence of 
reads of the previous example is repeated, the 
asymptotic throughput of the memory reaches the 
theoretical throughput. We will see later that a 
RAW detection hardware mechanism is necessary in the 
memory of a pipeline mechanism; this mechanism can 
be implemented in the AU (global detection) as well 
as in each bank (local detection). 
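
The asymptotic claim can be checked on a simple model of the proposed memory. The sketch below rests on assumptions that are not in the paper: the AU issues one request per cycle, the request and data FIFO queues are unbounded, a datum is available `busy_cycles` after its bank starts the read, and the RU emits at most one datum per cycle. The function name `reordered_memory` is hypothetical.

```python
def reordered_memory(banks, busy_cycles):
    """Cycle at which each read leaves the memory when every bank has its
    own request queue and a reordering unit restores the decode order."""
    bank_free = {}
    out_cycles = []
    last_out = 0
    for issue, bank in enumerate(banks, start=1):    # AU: one request per cycle
        start = max(issue, bank_free.get(bank, 1))   # bank serves its queue in order
        bank_free[bank] = start + busy_cycles
        data_ready = start + busy_cycles
        last_out = max(last_out + 1, data_ready)     # RU: output in decode order
        out_cycles.append(last_out)
    return out_cycles


pattern = [0, 0, 1, 1, 2, 2, 3, 3]
out = reordered_memory(pattern * 100, busy_cycles=4)
print(out[:8], len(out) / out[-1])
```

Under these assumptions, after a start-up of a few cycles one datum leaves the memory on every cycle, so the ratio of reads to cycles tends to 1 as the sequence is repeated.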


RI : Request Input 
DI : Data Input 
MO : Memory Output 
RU : Reordering Unit 
AU : Access Unit 
Bi : Bank i 

A DSPA-compatible interleaved memory 
fig. 5 


5 Instruction decode 
5.1 A distributed decode 


We have already pointed out that the relative 
order of two instructions for two distinct FUs does 
not matter. Many solutions may be imagined for the 
global decoder. For example, as in the FPS 164, an 
instruction may contain a subinstruction parcel for 
the distinct FUs : these subinstructions may always 
be located at the same places in the instruction 
word and then can be directly routed to the distinct 
instruction FIFO queues of the FUs : 

The global decoder is only a bus. 

When, in an arbitrary sequence of instructions, 
the most frequently used FU receives N instructions, 
the whole sequence can be coded on only N 
instruction words; the relative order of the 
subinstructions for the same FU is the only 
significant relation. 
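
The packing argument can be made concrete with a short sketch. It assumes, as in the text, that only the relative order of the subinstructions of one FU matters; the function name `pack_words` and the sample code sequence are hypothetical.

```python
from collections import defaultdict


def pack_words(sequence):
    """Pack (fu, subinstruction) pairs into instruction words holding at
    most one parcel per FU, keeping the per-FU order of the sequence."""
    per_fu = defaultdict(list)
    for fu, sub in sequence:
        per_fu[fu].append(sub)
    n_words = max((len(subs) for subs in per_fu.values()), default=0)
    return [{fu: subs[i] for fu, subs in per_fu.items() if i < len(subs)}
            for i in range(n_words)]


# Hypothetical sequence: the memory FU "M" is the most frequently used FU.
code = [("M", "read A"), ("M", "read B"), ("*", "mul"),
        ("+", "add"), ("M", "write C")]
words = pack_words(code)
print(len(words))
```

The memory FU receives three subinstructions, so the whole sequence fits in three instruction words; the parcels of the other FUs are simply placed in the earliest words.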

Very dense code is obtained without unrolling the 
loops as must be done for the FPS 164. Moreover, in 
the FPS 164 case, a lot of instruction words are 
necessary to fill and to empty the pipeline : in 
our model, these instructions are not necessary; 
all the iterations of a loop have the same code. 
Let us suppose a DSPA processor whose FUs have the 
same characteristics as in the FPS 164. Natural 
code generation will produce only one instruction per 
FU for the loop body of the inner dot product : 
without unrolling loops, a compiler will generate a 
single instruction word loop for this machine. 

Nevertheless, we do not think that this solution 
for the decode is the only possible one : 64 bits are 
necessary to code an instruction word for the 
FPS 164, and more information must be given in 
subinstructions in our model; the width of an 
instruction word may result in an instruction memory 
or cache that is too expensive. 


5.2 Decoder throughput 


We discuss here why the throughput of the decoder 
may be more limited. 


5.2.1 Flexibility of the model 


In the design of the FPS 164 and other machines such 
as the Cray-1, all the FUs are assumed to have the same 
basic cycle : this allows initiation of a new 
instruction on each FU at each cycle. This 
simplifies the microcoding of the FPS 164 and the 
sequencing of the Cray-1. 

But this cannot be considered as a good balance 
between the throughputs of the FUs : most of the 
classical algorithms require at the minimum an 
average of one memory access per floating point 
operation. On the FPS 164, this difficulty has led 
to the introduction of an auxiliary memory having 
the same throughput as the main memory. In the 
design of the successor of the Cray-1, the Cray X-MP, 
two pipelined channels are used to simultaneously 
access the memory : they can be considered as 
distinct FUs which have to respect very strong 
constraints. 

Our model processor can support distinct basic 
cycles for the distinct FUs; e.g. the basic cycles 
of the floating point operators may be two times 
longer than the basic cycle of the memory. Moreover 
the delay in which a FU delivers a result may vary; 
we have only imposed the constraint that results 
flow out from the FU in the order their productions 
have been initiated. 


5.2.2 Vector instructions 


On the FPS 164, vector instructions cannot exist 
because of the explicit synchronizations by the 
microcode at each cycle. In the DSPA model, no 
difficulties exist to prevent vector instructions. 
Executing a vector instruction of length k consists 
of executing the same scalar instruction k times. 
As in vector register machines, the physical support 
of the FIFO imposes a maximum vector instruction 
length. In some cases, as in loop 4, vector 
instructions increase the performance by removing 
strict dependences between the data in the 
pipeline : 

Loop 4 : 
DO 4 I=1,N 
4 A(I)= B(I) + C(I) + D(I) 

In the body of this loop, two instructions 
concern the adder; the second addition cannot be 
initiated before the end of the first one. If the 
delay to cross the adder is k cycles (seven cycles 
in the Cray-1), no addition can be initiated during k-1 
cycles. A vector instruction of length k works on k 
independent flows of data : in loop 4, the adder 
can then be kept busy when the memory throughput is 
sufficient. Our model allows coding many loops as 
vector loops; loops 1 and 2 are coded with vector 
instructions; the following loop 5 is also coded as 
a vector loop. 

Loop 5 : 
DO 5 I=1,N 
5 A(H(I))= B(P(I))*C(Q(I)) 
Loop 3 may be coded by mixing vector 
instructions and scalar instructions; accesses to H, 
B, C, the multiplication and also the addition may 
be coded as vector instructions. In order to detect 
possible RAW hazards, the two accesses to A must be 
scalar coded : an internal scalar loop has to be 
coded in the global loop. 

Let us suppose that the maximum vector length of a 
vector instruction is 8; the body of loop 3 can be 
coded by the following sequence (1) : 

: ?VpRead H > Ma 
: ?VpRead A > *l 
: ?VpRead B > *r 
: ?V*(M,M) > +l 
: ?V+(*,M) > Md 
: ?SiRead A (M) > +r 
: ?SpWrite A (+) 
seq: C←C-1, if C>0 then GOTO S 
seq: N←N-1, if N>0 then (VL=min(N,8); C←VL; GOTO V) 

The floating point additions have been extracted 
from the internal scalar loop : these instructions 
are kept in the adder’s instruction FIFO queue until 
the operands are received. 


The flexibility of the vector instruction definition 
associated with the possibility to access vector 
operands in scalar mode and to concatenate scalar 
operands to form vector operands increases the ratio 
of vector instructions in DSPA programming. 


5.2.3 Towards another proposition for distributed 
decode 


In a loop body of scientific programs, some FUs 
are not as heavily used as others; in the previous 
examples, no registers were used. When an 
instruction word may contain a subinstruction parcel 
for each FU, the ratio of empty instruction parcels 
in a sequence of code may be tremendous. In most 
cases, there is only one instruction for the 
sequencer in a loop body; on the other hand, in all 
classical examples the memory is the critical 
resource : it does not seem natural to decode an 
instruction for the sequencer on every cycle, but it 
may be critical for the memory. Our experiments 
have shown that for a machine with well designed 
FUs, nearly half of the instructions are memory 
accesses. We recall that we consider that the 
memory FU computes postincremented addresses and 
indirect based addresses. 


(1) M is the memory, Ma is the address input of the 
memory, +r is the right input of the adder ... 

?V (resp. ?S) refers to vector (resp. scalar) 
instructions. Memory instructions are p-accesses 
(post-incremented) or i-accesses (indirect). 

A p-access refers to a descriptor containing a 
base address A and an increment R. A vector 
p-access generates VL addresses of the form 
A + iR and leaves the descriptor with a new 
base address A + VL R; a scalar p-access sends 
the address A to the memory and assigns the 
value A+R to A. The address for an i-access is 
computed by adding an internal base address to a 
value read on Ma. 


We have therefore chosen to decode only a few 
instructions in the same cycle. These instructions 
must be applied to distinct FUs. It seems that 
decoding two instructions during a memory cycle 
represents a good balance between the decoder 
throughput and the possible performance of the 
machine. The global decode is very simple : the names 
of the FUs targeted by the instructions are decoded 
and then the two instructions are routed to the 
correct instruction FIFO queues. 


IV RAW hazards 


The DSPA model respects conditions 1 and 2. 
When producing code for a processor of the DSPA 
family, the compiler can forget that this processor 
may be an elementary processor of a multiprocessor 
computer. The code may be generated in the same 
manner as it would have been generated for a 
monoprocessor : the real behavior of the shared 
memory can be ignored. This has allowed us to treat 
loops 2 and 5 as vector loops : this cannot be done 
on the Cray-1 because the behavior of the memory 
cannot be foreseen at compile time. 

Independence of the FUs also allows parallel 
decode; two solutions have been proposed, others may 
be imagined. Code generation may be relatively 
simple : as the instruction flows for two distinct 
FUs are completely independent, the compaction of a 
linear sequence of code is very easy. 

Unfortunately advance on decode is not very 
useful when the effective accesses to memory have to 
be done in the order of their decode; in most 
cases, a loop body begins with a read and ends with a 
write to memory. This prevents the overlapping of 
the end of an iteration of the loop by the beginning 
of the next iteration unless some mechanism is used 
to allow reads to pass writes (when the 
read address and the write address are distinct). 
Software mechanisms have been proposed in the past : 
two distinct flows of memory accesses may be 
generated and reads of one flow overtake the writes of 
the other flow. Many examples where these solutions 
are not efficient can be imagined. 


Loop 6 : 
DO 6 I=1,N 
DO 10 J=1,I 
10 A(J)=A(I)+B(I,J) 
6 CONTINUE 

When executing the internal loop, it is very 
useful to let the reads of A(J+1), A(J+2), ... pass 
the write of A(J). But when I becomes small, 
one cannot ensure that A(1) has been written by the 
previous external iteration. The only software 
solution to prevent hazards in the execution of this 
loop is to empty the pipeline after each iteration 
of the internal loop. 


We think that a hardware detection of RAW hazards 
on memory has to be implemented in a pipeline 
processor; in loop 3, there are possible hazards on 
memory, and no software means can be imagined to treat 
this case. Using one of the two models of decode we 
have presented, the decode of this loop programmed 
with vector instructions and an internal scalar loop 
will take no more than 20 cycles : the decode will 
not be a bottleneck (40 accesses to the memory are 
done). 

If a hardware detection of RAW hazards is 
done, one can hope that when real hazards don't 
occur, performance will only be limited by the 
memory throughput. 

Another advantage of a hardware detection of 
hazards is that it enables correct 
performance on unoptimized code : performance on 
loop 1 when scalar coded can be equal to vector 
performance, even where start-up delays are longer 
if an interleaved memory is used. One can hope for 
correct performance on long loop bodies. To detect 
RAW hazards on a DSPA computer, the memory access 
instruction and the corresponding address are 
extracted from the FIFO queues in the decode order; 
but write addresses are saved in some associative 
memory until the corresponding data are present; read 
addresses are checked against the associative 
memory. 
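
The detection scheme just described can be sketched in software. This is an illustration only, assuming exact address matching and an unbounded associative store; the class `RawDetector` and its method names are hypothetical and do not describe the hardware design.

```python
from collections import Counter


class RawDetector:
    """Toy associative store of write addresses whose data are pending."""

    def __init__(self):
        self.pending_writes = Counter()

    def decode_write(self, address):
        # The write address is known but its datum is not yet present.
        self.pending_writes[address] += 1

    def complete_write(self, address):
        # The datum has arrived and the write has been performed.
        if self.pending_writes[address] > 0:
            self.pending_writes[address] -= 1

    def read_may_proceed(self, address):
        # A read may overtake pending writes only to distinct addresses.
        return self.pending_writes[address] == 0


detector = RawDetector()
detector.decode_write(100)
can_overtake = detector.read_may_proceed(104)    # distinct address
must_wait = not detector.read_may_proceed(100)   # real RAW hazard
detector.complete_write(100)
after_write = detector.read_may_proceed(100)
print(can_overtake, must_wait, after_write)
```

Only a read whose address matches a pending write is delayed, which corresponds to the statement that only real hazards degrade performance.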


V Some limitations of the DSPA model 
1 Registers 


In the examples we have detailed, the registers 
have never been used. In the case where a datum is 
used two times or more in an algorithm, it must be 
explicitly saved in a register and each operation 
requiring this datum as an operand will cost two 
instructions : the instruction to initiate the 
operation and the read of the register. The most 
important case of two uses of the same operand is 
the complex multiplication : a solution may consist 
in implementing a complex mode on the multiplier. 


2 A blocking machine 


In a DSPA processor, an instruction waits for its 
operands; this instruction may have been decoded 
before the decode of the production of its operands 
and if these productions are never done (a bad 
generation of code may lead to this extremity), the 
FU will never be activated, as in example 7 : 

Example 7 : 
7 +: +(M,M) > +l 
seq: GOTO 7 

On the other hand, data may be produced and never 
consumed : 

Example 8 : 
8 M: pRead X > +l 
seq: GOTO 8 

Correct code must be written : when a datum is 
produced by a FU, it must be consumed by another; 
when a datum has to be consumed, its production has 
to be guaranteed. 

We define a correct unbreakable sequence of code 
(CUS) as a sequence such that : 

-No external jump inside the sequence is possible 
unless to its first instruction. 

-No jump out of the sequence is possible unless 
from the last instruction. 

-Each datum produced in the sequence is consumed 
in the sequence. 

-Each datum consumed in the sequence is produced 
in the sequence. 


The loop body of example 3 is a CUS, but the 
internal scalar loop is not a CUS : data which are 
consumed are not produced in the loop. 
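
The two data conditions of the CUS definition can be checked mechanically on a straight-line sequence. The sketch below is an assumption-laden illustration: it ignores the two jump conditions, assumes an instruction reads each FIFO queue at most once, and uses a hypothetical encoding (FU name, list of consumed FIFO queues, list of produced FIFO queues) and a hypothetical function name `data_conditions_hold`.

```python
from collections import defaultdict, deque


def data_conditions_hold(sequence):
    """Start from empty FIFO queues, let each FU run its instructions in
    order, and report whether every instruction executes and no datum
    is left in a queue at the end."""
    queues = defaultdict(deque)
    for fu, consumed, produced in sequence:
        queues[fu].append((consumed, produced))
    tokens = defaultdict(int)        # data currently waiting in each FIFO
    progress = True
    while progress:
        progress = False
        for fu_queue in queues.values():
            if fu_queue and all(tokens[f] > 0 for f in fu_queue[0][0]):
                consumed, produced = fu_queue.popleft()
                for f in consumed:
                    tokens[f] -= 1
                for f in produced:
                    tokens[f] += 1
                progress = True
    return not any(queues.values()) and not any(tokens.values())


never_fed = [("+", ["M>+l", "M>+r"], ["+>+l"])]      # operands never produced
never_read = [("M", [], ["M>+l"])]                    # datum never consumed
vector_add = [("M", [], ["M>+l"]), ("M", [], ["M>+r"]),
              ("+", ["M>+l", "M>+r"], ["+>Md"]), ("M", ["+>Md"], [])]
print(data_conditions_hold(never_fed),
      data_conditions_hold(never_read),
      data_conditions_hold(vector_add))
```

The first two sequences are in the spirit of the blocking and never-consumed cases discussed in this section; the third is a hypothetical read, read, add, write body that satisfies both data conditions.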


The following loop body of an inner dot product 
is not a CUS : 
S M: pRead A > *l 
M: pRead B > *r 
*: *(M,M) > +r 
+: +(+,*) > +l 
seq: C←C-1, if C>0 then GOTO S 

When executing this loop, the first left operand 
for the adder must have been previously produced in 
order to initiate the pipeline; this corresponds to 
the initialization of the sum (note that p subsums 
may be computed by p initializations). 

The following condition must be respected to 
guarantee the execution of all the decoded 
instructions : 

Each sequence of code must be included in a CUS. 

There are no problems for a compiler to produce 
correct code (e.g. a solution may consist in 
treating DO loops as CUSs, addresses of jumps 
delimiting distinct CUSs), but hand optimizations of 
the code may be dangerous : if the previous 
condition is not respected, the machine will be 
blocked. But this is not a real problem : machines 
on which badly coded programs produce the desired 
results don't really exist. 


VI Conclusion 


We have presented an original model of 
architecture for pipeline processors. In this 
model, the decoder does not remain a bottleneck for 
performance in scalar mode as in existing pipeline 
processors : independence of the FUs and 
distributed decode of their instructions allow very 
fast sequencing. Synchronization of the FUs is done 
by the data. Such a processor can be integrated in 
synchronous multiprocessor architectures as well as 
in classical MIMD structures. 

Generating code for machines based on our model 
will be very easy; natural sequential code can first 
be generated, instruction by instruction, then 
independent codes for the distinct FUs can be 
extracted. For classical pipeline computers, 
complex reordering algorithms are applied to code at 
compile time to ensure performance at execution 
time; these algorithms may also be applied to the 
distinct codes for the FUs but this is less 
necessary than for classical architectures. When 
hazards are possible on memory or registers, 
there is no means to optimize code for classical 
pipeline computers; on a DSPA architecture only real 
hazards may degrade performance. 

At present, we are studying a real implementation 
of a pipeline processor of this family. Many 
solutions can be adopted : buses can be shared by 
several FUs, this sharing may be static or 
dynamic ... Great attention has to be paid to the 
design of the FUs; for example, in this paper we 
have already presented some characteristics of the 
memory FU (computation of the addresses by the FUs, 
need of a RAW hazard detection mechanism). Another 
important point to be optimized in the design of the 
machine is the interconnection scheme : connections 
can be quite expensive in this model because a FIFO 
queue is associated with each crosspoint on this 
interconnection scheme; infrequently used paths 
between the FUs can be suppressed; several FIFO 
queues may be implemented on the same support. 

In the theoretical model, arbitrarily large FIFO 
queues are used. In a real machine, the sizes of 
the FIFO queues are limited by the physical support, 
which limits the length of vector instructions. All 
these questions will be discussed in future papers. 
Abstract 


Multipipeline networking is a generalized 
technique to exploit maximum parallelism in vector 
computations. A pipeline net can be viewed as a two- 
level pipelined systolic array, which is dynamically 
reconfigurable for evaluating various vector compound 
functions. In other words, the pipeline net can best 
match the various data dependency relationships in 
program graphs which can be vectorized. Multiple 
vector streams flow through the net to achieve 
significantly higher throughput than conventional 
pipeline chaining. The reconfigurability provides 
flexibility in various vector processing applications. 
This paper discusses design issues of the pipeline net 
and presents techniques for converting program graphs 
into pipeline nets. 


1. Introduction 


This paper addresses the design of scientific 
supercomputers which involve heavy vector 
computations. In such an environment, a 
computational job can often be decomposed into a 
number of communicating tasks. Each task may 
comprise several vector compound function evaluations 
[8, 13, 21]. Henceforth, we define a vector compound 
function (VCF) as a collection of linked scalar 
operations, which will be repeatedly processed many 
times in a looping structure. So far, these VCFs have 
been realized in array processors (Illiac-IV, MPP), 
pipelined uniprocessors (Cray-1, Cyber-205, and 
FPS-164/max) and multiprocessors (Cray X-MP, 
Cyberplus, FPS Tesseract, Remps, and Cedar) 
[4, 6, 9, 10, 12, 14, 20]. Most commercially available 
supercomputers are equipped with multiple pipelines in 
each central processor. Pipelining has been proven 
effective in implementing linked vector computations. 
However, only linear chaining has been implemented in 
commercial machines. This paper generalizes the 
conventional linear pipelining to a multipipeline 
networking approach. The goal is to further speed up 
the evaluation of VCFs for future pipelined 
multiprocessors. 


As illustrated in Fig.1, the pipeline networking 
concept originates from the internal forwarding 
technique used in IBM 360/91 and the dynamic linking 
of functional units in CDC 7600 [12]. The concept of 
two-level pipelining has been practised in Cray-1 and 
Cray X-MP in the form of pipeline chaining [6]. 
However, only linearly connected patterns appear in 
these linked vector operations. In the Cyberplus [4], 
fifteen FPs in each processor can be linked by a 
crossbar network. In FPS-164/max or the recent FPS 
Tesseract, dot-product operations are executed by a 
multiplier pipeline cascaded with an adder 
pipeline [10, 20]. The networking concept generalizes 
the chaining practice to a two-level, reconfigurable 
systolic approach. 


[Figure 1 diagram: IBM 360/91 (internal forwarding) and CDC 7600 
(linking of multiple functional units); systolic array (one-level); 
Cray-1 (pipeline chaining); Cray X-MP (multiple chaining); systolic 
array (two-level); Cyberplus (linked vector operations); FPS 164/max 
(matrix accelerator); multipipeline networking (pipeline nets)] 

Figure 1. Historical evolution of the concept of pipeline networking 


A pipeline net consists of multiple functional 
pipelines (FPs) interconnected by a buffered switching 
network, which is itself pipelined. Whenever a new 
VCF is to be evaluated, the pipeline net is reconfigured 
into a topology that best matches the dataflow 
pattern in the program graph of the VCF. The 
evaluation is then performed with multiple operand 
streams flowing through the pipeline net in a 
synchronous fashion. Thus we can view a pipeline net as 
a two-level pipelined, dynamically reconfigurable 
systolic array. Using a single physical pipeline net, we 
can provide many virtual systolic arrays to support a 
large collection of application algorithms. 


In this paper, we first characterize the functional 
structure of pipeline nets and describe their operational 
principles. We define program graphs and discuss 
their basic properties. We model VCFs with program 
graphs and provide a unified theory for converting 
program graphs into pipeline nets. Two pipeline 
networking techniques are presented, one using the 
concept of cut sets and the other using a retiming 
technique modified from Leiserson, Rose, and 
Saxe [18, 19]. 


2. The Concept of Pipeline Net 


A pipeline net is made of three types of hardware 
resources, namely multiple functional pipelines (FPs), a 
buffered crossbar network, and a set of data registers, 
as illustrated in Fig.2. All FPs used in a net are 
identical and multifunctional. Different operations can 
be performed by the same FP at different times. They 
may, however, use different pipeline stages and thus 
require different amounts of pipeline delay. The 
registers are used as interface latches for holding 
operand and result data. The buffered crossbar network 
is used to provide dynamic interconnection paths among 
the FPs and the registers. 


A crucial component of the pipeline net is the 
buffered crossbar network. We choose the crossbar over 
multistage packet switching networks [5] due to the 
demand for full connections in pipeline nets. In general 
purpose computations, pipeline nets with many different 
topologies may be needed. Thus the crossbar network 
should support arbitrary 1-to-1 and 1-to-many 
mappings. (Many-to-1 mappings are not allowed in 
pipeline networking operations.) Apart from the full 
connectivity, crossbar switching has the advantage of 
being easy to set up, which is very important since the 
network setup time is a major source of the startup 
overhead of the pipeline net. The main criticism of the 
crossbar network is its hardware complexity, which is 
O(m^2) for an m by m network. However, if the number 
of FPs is relatively small (say 64), a crossbar switch is 
feasible with today's VLSI and packaging technology. 
In fact, several crossbar designs have been reported to 
have up to hundreds of inputs and outputs [2, 3, 7]. The 
recent progress in optical interconnection [22] holds 
out the hope of even larger optical crossbar networks. 


Figure 2. The schematic of a pipeline net 


The pipeline net supports arbitrary connections 
among multiple pipelines. Thus local connections, as 
necessary in a systolic array, are no longer a structural 
constraint in a pipeline net. However, the systolic flow 
of data through pipelines is still preserved. For example, 
when two operand streams arrive at a certain pipeline, 
they may have traversed some data paths with 
different delays. These path delays must be equalized in 
order to have the correct operand pairs arriving at the 
right place at the right time. In the pipeline net, delay 
matching is handled by the crossbar network. Some 
programmable buffers (latches) are provided on each 
output port of the crossbar network, so that proper 
noncompute delays can be inserted on each data path. 
An example design is the LINC chip [11], which is an 8- 
by-8 crossbar with up to 32 units of programmable 
delays on each data path. 
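
Delay matching can be illustrated on a small acyclic graph. The sketch below is not the paper's conversion technique: the function `match_delays`, the node names, and the example graph are hypothetical, the pipeline delays (4 for a multiplier, 2 for an adder, 6 for a divider) are assumed values, and the nodes are assumed to be listed in topological order.

```python
def match_delays(nodes, edges):
    """nodes: {name: pipeline delay}, listed in topological order.
    edges: list of (source, destination) pairs.
    Returns the cycle at which each node's output appears and the
    noncompute delay to insert on each edge so that all operands of a
    node arrive together."""
    inputs = {name: [] for name in nodes}
    for src, dst in edges:
        inputs[dst].append(src)
    ready = {}
    padding = {}
    for name, delay in nodes.items():
        latest = max((ready[src] for src in inputs[name]), default=0)
        for src in inputs[name]:
            padding[(src, name)] = latest - ready[src]
        ready[name] = latest + delay
    return ready, padding


# Hypothetical graph: two products feed an adder, one of them also feeds
# a third multiplier, and both results meet in a divider.
nodes = {"mul1": 4, "mul2": 4, "add": 2, "mul3": 4, "div": 6}
edges = [("mul1", "add"), ("mul2", "add"), ("mul2", "mul3"),
         ("add", "div"), ("mul3", "div")]
ready, padding = match_delays(nodes, edges)
print(padding[("add", "div")], ready["div"])
```

Here the adder result is available two cycles before the third product, so two units of noncompute delay are inserted on the path from the adder to the divider.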


The pipeline net performs computations based on 
the pipeline networking concept, which is a natural 
generalization of pipeline chaining in Cray-1 and Cray 
X-MP. The idea is that after the operand data are 
loaded into the register file, a pipeline net consisting of 
vector registers, functional pipelines, and a crossbar 
network is dynamically set up. A block of operands 
(which is called a wavefront in Kung [17]) may 
traverse multiple pipelines via the network, before it 
finally returns to the register file. Intermediate results 
flow directly from one pipeline to another without 
memory accesses. The pipeline net best matches the 
dataflow pattern of the VCF to be evaluated, thus 
operand fetching, arithmetic/logic operations, 
intermediate result routing, and final result storing are 
executed concurrently in a pipelined fashion, as 
illustrated by the following example. 


Example 1. Consider the following Fortran loop 
representing a VCF. Each iteration of the loop is 
characterized by the program graph shown in Fig.3a. 

DO 10 I=1, 400 
T(I) = BC(I)#C(I) 

10 E(I) = (ACI) *B(I)+T(I))/(T(1) # (C(I) *D(1))) 
Suppose that the add, multiply, and divide pipelines 
require 2, 4, and 6 pipeline stages respectively. This 
graph can be systematically mapped into an equivalent 
pipeline net shown in Fig.3b. Noncompute delays are 
inserted into data paths connecting FP3 and FP4 to 
FP5 and FP6. All inter-pipeline data paths are 
provided by the crossbar network as shown in Fig.3c. 
After the pipeline net is configured, the VCF is 
evaluated by passing 400 operand blocks (wavefronts) 
from the vector registers A, B, C, D through the net. 
The final results are stored in register 
E. Note that no temporary storage is needed for the 
intermediate result T. 


3. Properties of Program Graphs 
A program graph $G=(V,E,f_e,f_v)$ is a weighted 
directed graph, where 

1. $V$ is a set of nodes representing 
arithmetic/logic operations; 

2. $E$ is a set of edges representing data 
dependency; 

3. $f_e$ is a function from $E$ to {0,1,2,...} 
denoting the edge delays; 

4. $f_v$ is a function from $V$ to {0,1,2,...} 
denoting the nodal delays; and 

5. There are two specific nodes $v_{in}, v_{out} \in V$ for 
handling I/O operations. Note that $v_{in}$ has 
no inbound edges and $v_{out}$ has no outbound 
edges. 


A synchronous program graph is one in which 
every cycle has at least one nonzero node or edge delay. 
Since asynchronous graphs involve nondeterministic 
behavior, we only consider synchronous graphs in this 
paper. In the following, the term graph refers to 
synchronous graphs only. A $(k_1,k_2)$-graph is a program 
graph in which $k_1 \le f_v(v) \le k_2$ for all $v \in V$. A 
$(k_1,k_2)$-graph is abbreviated as a k-graph when 
$k_1=k_2=k$. Note that 0-graphs and 1-graphs are special 
cases of k-graphs. A systolic graph is a 0-graph with 
positive edge delays (i.e., $f_e(e)>0$ for all $e \in E$). We 
shall denote such a systolic graph as $G_0^+$. 


Figure 3. An example of multipipeline networking: 
(a) the program graph; (b) the pipeline net; 
(c) the network implementation 


When a graph is implemented by a hardware 
circuit, the delay of an edge represents the number of 
delay registers on that edge. A computing node with 
delay k corresponds to a k-stage linear pipeline which 
has k latches. At any time, the graph has a state which 
is defined to be the contents of all edge registers and 
pipeline latches. All input data are issued from the 
input node $v_{in}$, and all results are sent to the output 
node $v_{out}$. A set of input data which is issued at the 
same time is called a wavefront (operand block). A 
computational task is performed by a program graph G 
synchronously: Suppose at time $t_0$ the initial state of G 
is $s_0$. A sequence of wavefronts, I, is input from the 
input node at time $t_0+ia$ for $i=1,2,\ldots,n$, which 
produces an output sequence O at the output node at 
time $t_0+d+ia$ for $i=1,2,\ldots,n$. Note that n is the 
number of wavefronts feeding into G from the input 
node, and a, called the spacing, is defined as the 
number of clock periods elapsed between two successive 
wavefronts. 


Two program graphs G and G' are equivalent, 
denoted as G = G', if the same input sequence 
produces the same output sequence on both graphs. In 
other words, equivalent graphs have the same 
input/output behavior. However, the nodal and edge 
delays and the spacings between wavefronts on each 
graph may be different. 


The following lemmas provide basic tools for 
transforming a program graph into an equivalent 
pipeline net, as illustrated in Fig.4. Detailed proofs of 
these lemmas can be found in [15]. A cut-set in a 
graph G is a minimal set of edges the removal of which 
partitions the graph into two isolated subgraphs, called 
the left subgraph (which contains $v_{in}$) and the right 
subgraph (which contains $v_{out}$). All edges from the left 
subgraph to the right subgraph are called rightbound 
edges, and those in the reverse direction are called 
leftbound edges. 


Lemma 1. Adding k delays to any node in a 
program graph and then subtracting k delays from all 
inbound edges (or all outbound edges) of that node will 
produce an equivalent graph. An equivalent graph can 
also be obtained if "adding" and "subtracting" are 
exchanged in the above operation. (Fig.4b) 


Lemma 2. An equivalent graph is generated, if 
all nodal and edge delays and the spacing are multiplied 
by a positive integer. (Fig.4c) 


Lemma 3. For any cut-set of a program graph, 
adding k delays to all leftbound (or rightbound) edges 
and at the same time subtracting k delays from all 
rightbound (or leftbound) edges will result in an 
equivalent graph. (Fig.4d) 


Lemma 4. For any 0-graph, an equivalent graph 
is obtained by splitting any node into a linear cascade 
of k nodes with zero nodal and edge delays in the 
cascade. (Fig.4e) 
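
The delay-shifting and scaling lemmas can be illustrated with a short sketch. The dictionary encoding of nodal and edge delays and the function names below are assumptions made for illustration; the code only shows that Lemma 1 leaves the total delay around a cycle unchanged, while Lemma 2 multiplies it by the scaling factor.

```python
def lemma1(node_delay, edge_delay, v, k, side="in"):
    """Add k delays to node v and subtract k from its inbound
    (side="in") or outbound (side="out") edges. A negative k moves
    delays in the opposite direction."""
    nd, ed = dict(node_delay), dict(edge_delay)
    nd[v] += k
    for (a, b) in ed:
        if (side == "in" and b == v) or (side == "out" and a == v):
            ed[(a, b)] -= k
    if nd[v] < 0 or min(ed.values()) < 0:
        raise ValueError("transformation would create a negative delay")
    return nd, ed

def lemma2(node_delay, edge_delay, factor):
    """Multiply all nodal and edge delays by a positive integer."""
    nd = {n: d * factor for n, d in node_delay.items()}
    ed = {e: d * factor for e, d in edge_delay.items()}
    return nd, ed

def cycle_delay(node_delay, edge_delay, cycle):
    """Total delay around a cycle given as a list of nodes."""
    total = sum(node_delay[n] for n in cycle)
    for i, n in enumerate(cycle):
        total += edge_delay[(n, cycle[(i + 1) % len(cycle)])]
    return total

# Hypothetical two-node loop a -> b -> a.
nd = {"a": 0, "b": 0}
ed = {("a", "b"): 3, ("b", "a"): 1}
nd1, ed1 = lemma1(nd, ed, "b", 2, side="in")   # move 2 delays into b
nd2, ed2 = lemma2(nd, ed, 2)                   # scale the graph by 2
print(cycle_delay(nd, ed, ["a", "b"]),
      cycle_delay(nd1, ed1, ["a", "b"]),
      cycle_delay(nd2, ed2, ["a", "b"]))
```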


4. Mapping Program Graphs to Pipeline 
Nets 

A user task is specified with a program graph $G_0$. 
Usually $G_0=(V,E,f_e,f_v)$ is given as a 1-graph if we 
assume each arithmetic/logic operation takes one time 
unit to complete; or it is specified as a 0-graph if it is 
given as a signal flow graph [17]. We hope to transform 
it into an equivalent program graph $G_k=(V,E,f_e',f_v')$, 
which corresponds to the required pipeline net. Note 
that the transformed graph $G_k$ has the same topology 
as $G_0$, but the delays and the spacing may be modified. 
The new nodal delay function $f_v'$ is determined as 
follows: for any node $v_i$ other than $v_{in}$ and $v_{out}$, if $v_i$ is 
implemented by a pipeline of $k_i$ stages in the required 
pipeline net, then $f_v'(v_i)=k_i$. 


Figure 4. Illustrating Lemmas 1-4 with an example graph: 
(c) after applying Lemma 2 with $\delta=2$; 
(d) after applying Lemma 3 to the cut set shown; 
(e) after applying Lemma 4 to Fig.4(b) 


Leiserson et al. [18, 19] studied the problem of 
minimizing the clock period of a synchronous system by 
a technique called "retiming". They provided 
algorithms to transform a synchronous system into a 
systolic array. Similar work is also reported in [17], 
where a systolization procedure is used to transform a 
signal-flow graph into a systolic array. In this paper we 
choose a different approach for a different purpose. All 
the previous researchers transform a program graph G 
into a single-level systolic array, $G_0^+$, while our purpose 
is to convert the graph G into a k-graph $G_k$, which can 
be implemented by a two-level pipeline net. 


Two systematic methods are presented below to 
perform the network conversion. We first present the 
cut-set method based on Lemmas 1, 2, and 3. This method 
is closer to what has been reported in [16], where a 
cut-set rule is used to transform feedback-free program 
graphs into two-level pipelined systolic arrays. We 
investigate more general cases which allow feedback in 
the program graphs. We then extend Leiserson's 
retiming method to generate two-level pipelined systolic 
arrays. These network conversion methods are 
compared with previous approaches in Fig.5. The 
following theorem is a direct consequence of the two 
network conversion algorithms to be presented shortly. 


Theorem. Any synchronous graph $G_0$ can be 
transformed into a pipeline net characterized by a 
k-graph $G_k$ such that $G_0 = G_k$. 


In what follows, the delay of a cycle $c_j$ in a 
program graph G is denoted as $d(c_j)$. The sum of all the 
nodal delays of the same cycle $c_j$ in $G_k$ is denoted as 
$d_k(c_j)$. 


A chain is a graph whose nodes can be arranged 
in a linear cascade as exemplified in Fig.6a. The 
rightbound (leftbound) edges connecting distant nodes 
are called forward (backward) edges as shown in Fig.6a. 
A multichain consists of several chains linked by 
forward or backward edges as shown in Fig.6b. 

Given a graph G, we define a maximal acyclic 
subgraph (MAS) of G as a cycle-free subgraph of G 
which contains all the nodes of G and to which 
reattaching any one removed edge yields a cyclic graph. For 
example, an MAS of the cyclic graph shown in Fig.7a is 
depicted in Fig.7b. Before describing the main algorithm, 
we recall that any acyclic graph can be converted into a 
chain by topological sorting [1]. In the following 
procedure, we first transform a given graph G into a 
multichain. Then we use cut sets, which separate 
successive adjacent nodes in the multichain, to convert 
G into $G_k$. 
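
The two preparatory steps, finding an MAS and topologically sorting it, can be sketched as follows. This is one possible construction, not necessarily the one used by the authors: a depth-first search drops every edge that points back to a node still on the search stack, which guarantees that reattaching any dropped edge closes a cycle. The node names and the example graph are hypothetical.

```python
def maximal_acyclic_subgraph(nodes, edges):
    """Split edges into (kept, removed) so that the kept edges are
    cycle-free and reattaching any removed edge creates a cycle."""
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
    state = {n: 0 for n in nodes}   # 0 = unvisited, 1 = on stack, 2 = done
    removed = []

    def visit(u):
        state[u] = 1
        for v in succ[u]:
            if state[v] == 1:        # back edge: it closes a cycle
                removed.append((u, v))
            elif state[v] == 0:
                visit(v)
        state[u] = 2

    for n in nodes:
        if state[n] == 0:
            visit(n)
    kept = [e for e in edges if e not in removed]
    return kept, removed

def topological_order(nodes, edges):
    """Kahn's algorithm on an acyclic edge list."""
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        u = ready.pop(0)
        order.append(u)
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return order

# Hypothetical graph with one feedback edge v3 -> v1.
nodes = ["in", "v1", "v2", "v3", "out"]
edges = [("in", "v1"), ("v1", "v2"), ("v2", "v3"),
         ("v3", "v1"), ("v3", "out")]
kept, removed = maximal_acyclic_subgraph(nodes, edges)
order = topological_order(nodes, kept)
position = {n: i for i, n in enumerate(order)}
# After reattaching, edges that point backwards in the chain are leftbound.
leftbound = [(u, v) for u, v in edges if position[u] > position[v]]
print(order, leftbound)
```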


Cutset Conversion Algorithm: This algorithm 
transforms a program graph G into a k-graph $G_k$ such 
that $G = G_k$. 


1. Find an MAS of G. Note that if G is a cyclic 
graph, then some edges are removed. 


2. Perform topological sorting to obtain a 
multichain $v_1, v_2, \ldots, v_m$. 


3. Reattach all edges removed in Step 1. 


4. Apply Lemma 1 to obtain a 0-graph $G_0$ by 
moving all nodal delays to the 
corresponding inbound edges. 


5. For $i=1,2,\ldots,m+1$, apply Lemma 3 to the 
cut set $S_i$ consisting of all edges between 
$v_{i-1}$ and $v_i$. Note that $v_0$ and $v_{m+1}$ 
represent $v_{in}$ and $v_{out}$, respectively. 


a. Let $e_1, e_2, \ldots, e_r$ be all the rightbound 
edges in cut set $S_i$, and let $w=\min\{$delay of $e_j$ 
for $1 \le j \le r\}$. If $w=k_i$, then do 
nothing. 


b. If $w>k_i$, then add $w-k_i$ to the delay 
of all leftbound edges in $S_i$ and 
subtract $w-k_i$ from the delay of all 
rightbound edges in $S_i$. 


c. If $w<k_i$, then we must consider two 
cases. If the cut set contains leftbound 
edges and there is a forward edge e in 
the cut set with delay w, examine 
whether e belongs to any previously used 
cut set without leftbound edges. If so, 
apply Lemma 3 to raise the delay of e 
to $k_i$ and go to Step 5c. Otherwise add 
$k_i-w$ to the delay of all rightbound 
edges and subtract $k_i-w$ from the 
delay of all leftbound edges. 


At the end of Step 5, a new edge delay 
function is obtained. Denote it as $f_e'$. 


6. If all inbound edges of $v_i$ have delay $\ge k_i$ 
for $i=1,2,\ldots,m$, go to Step 7. Otherwise 

Figure 5. Various methods for converting program graphs 
to systolic arrays or to pipeline nets. Program graphs G 
are converted to systolic arrays $G_0^+$ by the Systolic 
Conversion Theorem (Leiserson & Saxe 83) and the 
Systolization Procedure (S.Y. Kung 84), and to pipeline 
nets $G_k$ by the Cutset Conversion Method (Hwang & Xu 85) 
and the Retiming Method (extended from Leiserson & Saxe 83). 


Figure 6. Chain and multichain of functional pipelines: 
(a) a chain with forward edges (dotted arcs) and 
backward edges (dashed arcs); (b) a multichain 
consisting of two chains linked by distant edges 
(dashed arcs) 


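a. Compute the scale-up factor 

$$\delta = \max_{c_j \in C} \left\lceil \frac{d_k(c_j) - f_e'(e_j)}{d(c_j)} \right\rceil$$ 

where $d_k(c_j)$ is the sum of the nodal delays of 
cycle $c_j$ in $G_k$, $d(c_j)$ is the delay of $c_j$ in G, 
$e_j$ is a leftbound edge in 
cycle $c_j$, and C is the set of all cycles. 


b. Apply Lemma 2 to scale up the 0-graph 
$G_0$ obtained in Step 4 $\delta$ times. 
Go to Step 5. 


7. Transfer $k_i$ delays to $v_i$ from its inbound 
edges for all $v_i \in V$; we obtain the required 
pipeline net corresponding to $G_k$. 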


Example 2. Consider the program graph in 
Fig.7a. We want to convert it into a 4-graph with 
$k_1=2$, $k_2=k_3=3$, and $k_4=4$. By removing edges 
(v,,0,); and (05,0); we obtain an MAS in Fig.7b. 
Applying topological sorting, reattaching the removed 
edges, and transferring nodal delays to corresponding 
inbound edges, we obtain the graph in Fig.8a. Applying 
Lemma 3 to cut sets $S_1$, $S_2$, $S_3$, $S_4$ and $S_5$ in Fig.8a, the 
edge delays are modified as shown in Fig.8b. Now the 
two leftbound edges in Fig.8b, having delay less than 
$k_1=2$, create 4 cycles in the multichain. The scale-up 
factor is $\delta=\max\{\lceil 5/3\rceil, \lceil 14/8\rceil, \lceil 11/6\rceil, \lceil 11/6\rceil\}=2$. Scaling up the 
graph in Fig.8a two times, we obtain the graph in 
Fig.8c. We then repeat Step 5 once more to obtain the 
graph in Fig.8d. Now all edges have delay greater than 
or equal to the required value. By moving $k_i$ delays 
from the inbound edges of $v_i$ into $v_i$, we obtain the 
required 4-graph in Fig.8e, which is explicitly redrawn 
as a pipeline net in Fig.8f. 


Figure 7. A sample program graph and the maximally acyclic subgraph obtained 
after step 1 of the network conversion algorithm: 
(b) a maximally acyclic subgraph of G 


Figure 8. Successive graphs obtained in applying 
the network conversion algorithm: 
(f) the final pipeline net corresponding to the 4-graph 


5. Designing Pipeline Nets by Retiming 


Leiserson and Saxe [19] studied the problem of 
mapping synchronous graphs into systolic graphs. Their 
results are modified below for generating pipeline nets. 
Before describing our algorithm, we restate their 
retiming technique using our terminology. 


Retiming Lemma. Let G be a synchronous 
0-graph and lag be a function that maps the I/O nodes to 
zero and any other node to an integer. Suppose that 
for every edge $e=(u,v)$, the value $f_e(e)+lag(v)-lag(u)$ is 
nonnegative. Let G' be the 0-graph obtained by 
replacing the delay $f_e(e)$ of every edge $e=(u,v)$ by 
$f_e(e)+lag(v)-lag(u)$. Then G = G'. 


Systolic Conversion Theorem (Leiserson and 
Saxe). Let G be a synchronous 0-graph, and let $G-1$ be the 
graph obtained by subtracting 1 from all edge delays of G. 
Suppose that $G-1$ has no cycles of negative delay; then 
there exists a systolic graph $G_0^+$ which is equivalent to 
G. 


A constructive proof of the above theorem can be 
found in [19]. Given any 0-graph G, they first derive 
$G-1$. Then they specify the lag function as follows: for any 
node v, there exists in $G-1$ a path of minimal delay from v 
to the output node. lag(v) is then determined as the 
delay of any such shortest path. They proved that by 
applying the Retiming Lemma to G with the defined lag 
function, G can be mapped into an equivalent systolic 
graph if $G-1$ has no cycle of negative delay. The 
following algorithm applies the above results to map a 
program graph $G_0$ into a pipeline net. 
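
The constructive procedure can be sketched in a few lines. The code below is an illustrative reading of the text, under these assumptions: the graph is given as a dictionary of edge delays, every node can reach the output node, shortest paths in the graph with all edge delays reduced by one are found with Bellman-Ford, and nonpositive delays left on the outbound edges of the input node are raised together (the Lemma 3 correction on the cut set around the input node). Function and node names are hypothetical.

```python
def retime_to_systolic(nodes, edges, v_in="in", v_out="out"):
    """edges: dict (u, v) -> delay of a 0-graph.
    Returns (lag, new_edges) with every new edge delay >= 1."""
    inf = float("inf")
    dist = {n: inf for n in nodes}
    dist[v_out] = 0
    # Bellman-Ford towards v_out, using weights f(e) - 1.
    for _ in range(len(nodes) - 1):
        for (u, v), d in edges.items():
            if dist[v] + d - 1 < dist[u]:
                dist[u] = dist[v] + d - 1
    for (u, v), d in edges.items():
        if dist[v] + d - 1 < dist[u]:
            raise ValueError("reduced graph has a cycle of negative delay")

    lag = dict(dist)
    lag[v_in] = 0                     # I/O nodes are mapped to zero
    new = {(u, v): d + lag[v] - lag[u] for (u, v), d in edges.items()}

    # Raise all outbound edges of the input node together if needed.
    out_of_input = [e for e in new if e[0] == v_in]
    shortfall = 1 - min(new[e] for e in out_of_input)
    if shortfall > 0:
        for e in out_of_input:
            new[e] += shortfall
    return lag, new

# Hypothetical 0-graph with a feedback loop a -> b -> a of delay 2.
nodes = ["in", "a", "b", "out"]
edges = {("in", "a"): 0, ("a", "b"): 0, ("b", "a"): 2, ("b", "out"): 0}
lag, systolic = retime_to_systolic(nodes, edges)
print(lag, systolic)
```

The loop delay is preserved (it is still 2 after retiming), while every edge now carries at least one register.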


Retiming Conversion Algorithm: This 
algorithm converts a program graph G into a pipeline 
net $G_k$. 


1. Transfer all nodal delays to inbound edges 
to obtain a 0-graph $G_0$. 


2. Compute $\delta = \max_c \lceil d_k(c)/d(c) \rceil$ over all 
cycles c in $G_0$, where $d_k(c)$ is the sum of the nodal 
delays of c in $G_k$ and $d(c)$ is the delay of c in $G_0$. 
Scale up the graph $G_0$ $\delta$ 
times using Lemma 2. 


3. For every node $v_i$, if it corresponds to a 
$k_i$-stage pipeline in the required pipeline net 
$G_k$, then apply Lemma 4 to split it into a 
linear cascade of $k_i$ nodes with zero nodal 
and edge delays. Denote the obtained graph 
as G'. 


4. Apply the Systolic Conversion Theorem to 
G' to obtain a systolic graph $G_0^+$. Apply 
Lemma 3 to eliminate negative or zero 
delays on all outbound edges of $v_{in}$, if there 
are any. 


5. Merge the split nodes in each cascade 
into a single node by assigning $k_i-1$ delays 
to $v_i$ for all $v_i \in V$. 


6. Transfer one delay from all inbound edges of $v_i$ 
into $v_i$ for all $v_i \in V$; we obtain the required 
pipeline net corresponding to $G_k$. 


Example 3. Consider the conversion of the 
program graph of Fig.7a to an equivalent 4-graph. After 
Steps 1 and 2, we obtain the 0-graph shown in Fig.9a. 
Note that the multiplying factor $\delta$ is found to be 2. The 
0-graph is then expanded into a 14-node graph G' as 
shown in Fig.9b. The Systolic Conversion Theorem is 
then applied to obtain an equivalent systolic graph $G_0^+$ 
as shown in Fig.9c. Note that Lemma 3 has been 
applied to the outbound edges of the input node to 
eliminate negative delays. Grouping every cascade of 
nodes into a single node, we obtain the graph shown 
in Fig.9d. The final pipeline net, corresponding to the 
graph $G_k$, is shown in Fig.9e. 


Figure 9. Successive graphs obtained in applying 
the retiming conversion algorithm: 
(c) applying the retiming lemma in step 4; 
(d) grouping of nodes in step 5; 
(e) the final pipeline net after step 6 


6. Conclusions 


This paper generalizes the conventional linear 
pipelining principle. Pipeline networking offers a new 
design methodology for vector multiprocessor 
supercomputers. Basic techniques are provided for 
mapping program graphs into pipeline nets. These 
techniques are useful for designing two-level pipelined 
systolic arrays. By matching the multiprocessor 
configuration with the dataflow patterns of user 
programs, and by combining pipelining with 
parallelization, the pipeline net architecture provides 
higher flexibility and higher throughput, especially for 
compound vector/matrix operations. 
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A Parallel Vector Reduction Architecture 
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Abstract 


The ability to perform fast vector reduction 
evaluations, such as inner product and vector 
summation, is very important in many scientific 
and engineering computer applications. Most vector 
reductions are performed by using a single arith- 
metic unit in a repetitive way. In this paper the 
architecture and performance of a pipelined, par- 
allel vector reduction processor is discussed. 

The performance of the parallel reduction 
processor is given as a function of the size of 
the vector to be reduced and the number of pipe- 
line segments of the reduction processors in- 
volved. In addition to that, the number of re- 
quired reduction processors can be determined as a 
function of the length of the vector to be reduced 
and a prespecified space-time criterion based on a 
doubling improvement ratio. 


I. Introduction 


Vector reduction techniques for arithmetic 
pipelines have been proposed by Kuck [1], Kogge 
[2], and Ni and Hwang [3]. In essence, the vector 
reduction operation is the application of the 
dyadic operator on the elements of a vector, such 
that the order of evaluation is immaterial, i.e. 
the operations are associative and commutative. 
Let "o" be the dyadic operator, $X[i]$, $i=1,\ldots,n$ 
the vector, and Z the scalar output; the vector 
reduction operation can then be formulated as 

$$Z = X[1] \circ X[2] \circ \cdots \circ X[n] \qquad (1)$$ 
The vector reduction operation can be performed by 
a pipelined reduction processor (Fig. 1), where q 
is the number of sections in the pipeline. To 
establish the performance of a vector reduction 
method M, it is convenient to partition the 
process into three phases: 

$$T = T_f + T_M + T_d \qquad (2)$$ 

where $T_f$ is the number of cycles needed to enter 
all elements of the vector X into the reduction 
processor (feed phase), $T_M$ is the number of cycles 
needed to merge the partial results in the 
pipeline to a single result (merge phase), and $T_d$ is 
the number of cycles to drain the pipeline (drain 
phase). 
It can easily be observed that 
$T_f + T_d = n+q-1$ for all methods M, leading to 

$$T = n + q - 1 + T_M \qquad (3)$$ 
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Therefore, the main difference between the methods 
in [1-3] is the merging time $T_M$. The smallest 
values of $T_M$ are obtained by applying the methods 
of Ni and Hwang [3], which are further improve- 
ments of Kuck's and Kogge's method [1,2]. They 
developed two different methods: a symmetric and 
an asymmetric method. 

In Fig. 1 the hardware configuration for both 
methods is shown. In the first min{n,q} cycles the 
elements of X are merged with a constant C into 
the reduction processor (C=0 for vector summation 
and inner product). If $n \le q$, then the merging phase 
is started. If $n>q$, the next elements of X are 
paired with the results leaving the pipeline. This 
procedure continues until no element of X is left. 

In the symmetric method (SM-method) two 
consecutive productive segments are merged when 
leaving the pipeline. A segment is said to be 
productive if it contains data which is part of 
the overall reduction operation. The register is 
used to hold a partial result until the next 
partial result leaves the pipeline. These two partial 
results are then entered again into the reduction 
processor. If the number of productive segments is 
odd, the last partial result is merged with a 
dummy result, keeping the distance between the 
productive segments always $2^i$ after the i-th 
iteration. The merging time $T_{SM}$ then equals [3] 

$$T_{SM} = \begin{cases} q\lceil\log q\rceil + 2^{\lceil\log q\rceil} - q, & \text{if } n \ge q \\ q\lceil\log n\rceil + 2^{\lceil\log n\rceil} - n, & \text{if } n < q \end{cases} \qquad (4)$$ 
The asymmetric method (AM-method) differs 
from the symmetric method in the treatment of the 
last partial result in the case of an odd number 
of productive segments. In this situation the last 
partial result is not merged with a dummy partial 
result, but is left in the register waiting for 
the next partial result leaving the pipeline. The 
merging time of the asymmetric method equals [3] 

$$T_{AM} = \begin{cases} q\lceil\log q\rceil - 2^{\lceil\log q\rceil} + q, & \text{if } n \ge q \\ q\lceil\log n\rceil - 2^{\lceil\log n\rceil} + n, & \text{if } n < q \end{cases} \qquad (5)$$ 

Clearly, the asymmetric method is faster than 
the symmetric method, with the exception of $q=2^k$, 
$n \ge q$, when both methods have equal performance. 
However, the control sequence of the asymmetric 
method is more difficult to implement, because it 
cannot be expressed in a simple formula [3]. 
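
The two merging times can be compared numerically with a short sketch. It assumes base-2 logarithms and the reading of the merge-time formulas in which the n < q branch replaces q by n inside the logarithm, the power of two, and the final term; the function names are illustrative only.

```python
def clog2(x):
    """Ceiling of log2(x) for an integer x >= 1, without floating point."""
    return (x - 1).bit_length()

def merge_time_sm(n, q):
    """Merging time of the symmetric method for n partial results
    in a q-segment pipeline (assumed reading of the formula)."""
    m = min(n, q)
    return q * clog2(m) + 2 ** clog2(m) - m

def merge_time_am(n, q):
    """Merging time of the asymmetric method (assumed reading)."""
    m = min(n, q)
    return q * clog2(m) - 2 ** clog2(m) + m

# q = 8 is a power of two: both methods need the same number of cycles.
print(merge_time_sm(100, 8), merge_time_am(100, 8))
# q = 7 is not: the asymmetric method is faster.
print(merge_time_sm(100, 7), merge_time_am(100, 7))
```

For q=7 the sketch gives 22 cycles for the symmetric method and 20 for the asymmetric one, while for q=8 both give 24.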

In this paper a parallel processing architecture 
for performing the vector reduction operation, 
consisting of a linear array of pipelined 
processors, is presented. The performance of the 
proposed architecture is determined. A formula is 
derived, by which the number of required reduction 
processors can be determined as a function of the 
size of the vector to be reduced and a prespecified 
space-time criterion. 


Fig. 1. Pipelined reduction processor 


II. Architecture of the parallel reduction processor 


The parallel reduction processing structure 
considered in this paper is the linear array. The 
linear array consists of N reduction pipelines 
(Fig. 2), every reduction pipeline having its own 
local memory. In order to be able to produce a 
scalar output value, the N pipelines are connected 
by means of an interconnection network. A single 
reduction pipeline consists of q segments. Some 
values of q in actual systems are q=7 (o=*) and 
q=6 (o=+) for the CRAY-1 [5], and q=7 (o=*), q=8 
(o=+) for the Cyber 205 [6]. 

In the following, we will consider the reduction 
of a vector of dimension n, by using r of the 
N reduction pipelines. 

The vector is divided into r subvectors, 
each with $\lceil n/r \rceil$ elements. Each of the r subvectors 
is fed to one of the processors $P_1, P_2, \ldots, P_r$. The 
merging phase can be done in two different ways: 


method-1: All subvectors are merged into a single 
result in each of the r processors. These partial 
results are then transferred, one by one, from $P_2$ 
until $P_r$ to $P_1$. In fact this can be viewed as the 
reduction of an r-element vector into a single 
result. The process is depicted in Fig. 3a. The 
total number of processing cycles is equal to 

$$T = \lceil n/r \rceil + r + 2(q-1) + T_M(\min\{\lceil n/r \rceil, q\}) + T_M(\min\{r, q\}) \qquad (6)$$ 


method-2: After feeding the 
i-th subvector to the $P_i$-th processor, the 
$\min\{\lceil n/r \rceil, q\}$ results in each pipeline are pairwise 
merged: $P_1$ and $P_2$ in $P_1$, $P_3$ and $P_4$ in $P_3$, etc. 
Then $P_1$ and $P_3$ are merged in $P_1$, $P_5$ and $P_7$ in $P_5$, 
etc. This process continues until all results are 
transferred to $P_1$, where they are finally merged 
to a single result (Fig. 3b). The transfer and 
merge process from $P_2, \ldots, P_r$ to $P_1$ takes $q\lceil\log r\rceil$ 
cycles. The total number of processing cycles 
equals 

$$T = \lceil n/r \rceil + q\lceil\log r\rceil + q - 1 + T_M(\min\{\lceil n/r \rceil, q\}) \qquad (7)$$ 

An alternative way of achieving the same 
result as in (7) is shown in Fig. 3c. The 
$\min\{\lceil n/r \rceil, q\}$ results are first merged to a single 
result. Then the results are pairwise merged, 
taking $q\lceil\log r\rceil$ cycles. Therefore, with the 
inclusion of the feed phase and the final drain phase, 
the same number of processing cycles, as denoted 
in Eq. (7), is derived. However, this version of 
method-2 has the important advantage that only 
$\lceil\log r\rceil$ instead of $\min\{\lceil n/r \rceil, q\} \cdot \lceil\log r\rceil$ results 
have to be transferred between the reduction 
processors. 

The difference between method-1 and method-2 
equals 

$$DIFF(M) = T(\text{method-1}) - T(\text{method-2}) = r + T_M(\min\{r, q\}) + q - 1 - q\lceil\log r\rceil \qquad (8)$$ 

For $r \le q$, we substitute the AM-method, yielding 
$DIFF(AM) = 2r - 2^{\lceil\log r\rceil} + q - 1$. The worst case is 
obtained for $r=2^k+1$, resulting in $DIFF(AM)=q+1$. Therefore, 
method-2 is faster than method-1 for $r \le q$. 

For $r>q$ we get from Eq. (8) 
$DIFF(AM) = r + q\lceil\log q\rceil - 2^{\lceil\log q\rceil} + 2q - 1 - q\lceil\log r\rceil$. 
Substitution of $r = k \cdot q$ leads in the worst case to 
$k q + q\lceil\log q\rceil - 2^{\lceil\log q\rceil} + 2q - 1 - q\lceil\log(k q)\rceil \ge k q - 2^{\lceil\log q\rceil} + 2q - 1 - q\lceil\log k\rceil$. 
Since $k > \lceil\log k\rceil$ and $2^{\lceil\log q\rceil} \le 2q-2$, it follows that 
$k q - 2^{\lceil\log q\rceil} + 2q - 1 - q\lceil\log k\rceil > 0$. Therefore, it is 
concluded that method-2 is faster than method-1. 


Fig. 2. The partitioned linear array of reduction 
processors 


Fig. 3. a) Processing according to method 1, 
b) Processing according to method 2, 
c) alternative for method 2. 
(Timing diagrams for q=5 showing the feed, merge, 
transfer, and drain phases on each processor.) 
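
The comparison between the two methods can be checked numerically. The sketch below assumes the asymmetric merging time with base-2 logarithms and takes the cycle counts of the two methods to be: feed, merge, drain, then a second reduction of the r partial results for method-1; and feed, merge, a pairwise transfer-and-merge tree of depth ceil(log r), then drain for method-2. The function names are illustrative.

```python
def clog2(x):
    """Ceiling of log2(x) for an integer x >= 1."""
    return (x - 1).bit_length()

def merge_time_am(n, q):
    """Asymmetric-method merging time (assumed reading of the formula)."""
    m = min(n, q)
    return q * clog2(m) - 2 ** clog2(m) + m

def time_method1(n, r, q):
    per = -(-n // r)                       # ceil(n / r) elements per pipeline
    return (per + r + 2 * (q - 1)
            + merge_time_am(per, q) + merge_time_am(r, q))

def time_method2(n, r, q):
    per = -(-n // r)
    return per + q * clog2(r) + q - 1 + merge_time_am(per, q)

def diff(n, r, q):
    """Cycles saved by method-2 relative to method-1."""
    return time_method1(n, r, q) - time_method2(n, r, q)

# Worst case for r <= q is r = 2**k + 1; with q = 8 and r = 5 the
# saving should be q + 1 cycles.
print(diff(1000, 5, 8))
```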


A possible realization of the interconnection 
system between the N pipelines is shown in Fig. 2. 
By means of a linear chain of multiplexors the 
interconnection system can easily be partitioned 
into $1, 2, \ldots, \lceil r/2 \rceil$ separate buses. Since the 
inter-pipeline transfers do not overlap in time, this 
simple system is adequate. The interconnection 
network in Fig. 2 has an advantage over 
non-partitionable linear array approaches, since the 
latter need O(N) cycles to merge the N partial 
results from the individual reduction pipelines, 
whereas by using the approach of Fig. 2, this 
requires only O(log N) cycles. 


Performance of the binary tree 


In the following, we consider the performance 
of a pipelined binary tree processor for vector 
reduction. The structure of such a processor is 
illustrated in Fig. 4. 

Suppose a subtree with r pipelined units of 
the (maximal) N pipelined units is used to reduce 
an n-element vector, $r=1,2,4,\ldots,N$. To evaluate 
Eq. (1), the n elements of the vector are divided into 
$\lceil n/r \rceil$ groups, G(j), for $j=1,2,\ldots,\lceil n/r \rceil$. When n 
is not an integral multiple of r, $\lceil n/r \rceil \cdot r - n$ 
zero-elements are added to the end of the vector, such 
that each group consists of r elements. Group G(j) 
is fed into the uppermost r/2 pipelined units at 
the j-th clock cycle, for $j=1,2,\ldots,\lceil n/r \rceil$. After 
$\lceil n/r \rceil + q\lceil\log r\rceil$ cycles, $\min\{q, \lceil n/r \rceil\}$ partial 
results are obtained in the segments of the 
cumulative pipeline unit. Next, these $\min\{q, \lceil n/r \rceil\}$ 
partial results in the cumulative unit must be 
merged into the final result; the reduction time 
of this group-merging phase is given in Eq. (4) 
and Eq. (5). Finally, q-1 cycles are required to 
drain the last pipeline. Thus the total reduction 
time is 

$$T_{rt}(n,r) = \lceil n/r \rceil + q\lceil\log r\rceil + T_M(\min\{\lceil n/r \rceil, q\}) + q - 1 \qquad (9)$$ 

(rt = reduction-tree) 


Fig. 4. Reduction on a binary tree 

From Eq. (7) and Eq. (9), we can see that the 
reduction time is the same for the partitioned 
linear array and the binary reduction-tree. However, 
the proposed approach is much simpler to 
implement. It is also very regular, so this 
approach is suited to VLSI implementation. For large 
chains, the multiplexors will give some extra 
delay. However, if the method of Fig. 3c (method-2) 
is followed, only one result per bus 
connection has to be transferred. Another solution 
is to construct a system with a large number of 
processors in a modular and hierarchical way. For 
example, every $p=2^k$ processors form a cluster. 
Within the cluster the interconnection is as 
described above; between every p clusters a similar 
type of (multiplexor) interconnection network 
is established, etc. In vector reduction, the data 
only has to be transferred between the "bottom" 
pipelines of each cluster ($P_1$ in Fig. 3). Consequently, 
the maximal interconnection distance 
is 2+log(p) multiplexors. The partitioned linear 
array is also fault tolerant. If one of the 
processors fails, it can easily be isolated. When the 
global controller of the system has detected a 
faulty processor, it just "shunts" the corresponding 
multiplexor to allow the incoming data to be 
passed on to the next processor. Of course, the 
performance will degrade, since the data for the 
faulty processor must be fed to the correctly 
operating processors within the cluster. However, 
the decrease in performance is graceful. In this 
way, any malfunctioning processors can be isolated. 
Compared with a fault-tolerant approach for 
a binary tree structure [9], the partitioned linear 
array is much simpler; there 
is no need for extra processors and connections. 


III. Minimal computation time 


For the determination of the minimal computation 
time, we will only consider method-2. Suppose 
we have a maximum of N pipelines in a reduction 
system. Then the minimal evaluation time of an 
n-term vector can be expressed as follows: 

$$T_{min} = \min\{\, T(n,r) \mid r = 1, 2, \ldots, N \,\} \qquad (10)$$ 

We are interested in the value of r for which 
Eq. (10) is reached. This is not so simple, since 
the function in Eq. (7) has many discontinuities 
and is difficult to handle analytically. Instead, 
we approximate Eq. (10) by the 
following method. First, Eq. (10) is replaced by a 
continuous function by dropping the 
ceiling signs. Substitution of the SM-method or 
AM-method then gives 

$$T(n,r) = \frac{n}{r} + q\log r + q - 1 + \begin{cases} q\log q, & \text{if } n/r \ge q \\ q\log(n/r), & \text{if } n/r < q \end{cases} \qquad (11)$$ 

$$\frac{dT(n,r)}{dr} = \begin{cases} -\dfrac{n}{r^2} + \dfrac{q}{r\ln 2}, & \text{if } n/r \ge q \\ -\dfrac{n}{r^2}, & \text{if } n/r < q \end{cases} \qquad (12)$$ 

The minimum of (11) is reached at either 
$r_1 = \min\{(n\ln 2)/q, N\}$ or $r_2 = \min\{n, N\}$ (remember 
that N is the maximum number of available 
pipelines). Whether $r_1$ or $r_2$ is chosen depends on 
the minimum of $T(n,r_1)$ and $T(n,r_2)$. The 
approximation of r is denoted as 

$$r_{appr}(n) = opt(r_1, r_2) = opt\left(\min\left\{\frac{n\ln 2}{q}, N\right\}, \min\{n, N\}\right) \qquad (13)$$ 

where $opt(r_1,r_2) = \{r \mid T(r) = \min(T(r_1), T(r_2))\}$. It is 
shown in Fig. 5a and 5b that equation (13) is a 
reasonable approximation. In these curves the 
optimal and approximated values of r are shown as 
a function of n, with parameter values N=16 and 
q=7. Fig. 5a also shows a peculiar behavior of the 
reduction system. The optimal and approximated 
optimal value of r behave quite discontinuously. 
The curve jumps up and down and gives large 
differences in r for some nearby values of n. In Fig. 
5b it is shown that despite the wild behavior of 
the curve in Fig. 5a, the number of processing 
cycles is a monotone increasing function of n. If 
we fix the choice of r to $r_{appr}=r_1$, the curves 
of Fig. 6a and 6b result. The latter choice of $r_{appr}$ 
is a monotone function of n, but the deviation 
from the minimal computation time is slightly 
larger. Simulations with other values of the 
parameters N and q globally give the same results 
[7]. 
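
The approximation can be compared with an exhaustive search in a few lines. The sketch below makes several assumptions beyond the text: base-2 logarithms, the symmetric merging time, the method-2 cycle count, and rounding of the continuous optimum to the nearest integer of at least 1 (the text does not say how a fractional value is turned into a number of pipelines). All function names are hypothetical.

```python
from math import log

def clog2(x):
    """Ceiling of log2(x) for an integer x >= 1."""
    return (x - 1).bit_length()

def merge_time_sm(n, q):
    m = min(n, q)
    return q * clog2(m) + 2 ** clog2(m) - m

def total_time(n, r, q):
    """Method-2 cycle count for n elements on r pipelines."""
    per = -(-n // r)                      # ceil(n / r)
    return per + q * clog2(r) + q - 1 + merge_time_sm(per, q)

def best_r(n, q, N):
    """Exhaustive search for the r in 1..N with minimal time."""
    return min(range(1, N + 1), key=lambda r: total_time(n, r, q))

def approx_r(n, q, N):
    """Pick the better of the two candidate values r1 and r2."""
    r1 = max(1, min(round(n * log(2) / q), N))   # rounding is an assumption
    r2 = min(n, N)
    return min((r1, r2), key=lambda r: total_time(n, r, q))

print(best_r(320, 8, 32), approx_r(320, 8, 32))
```

For n=320, q=8 and N=32 both the exhaustive search and the approximation select r=32.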
IV. Computation time and efficiency, the DI-ratio 


We will now face the question whether the 
choices of r as derived in the previous section 
are efficient choices of r. The only thing that 
formula (13) guarantees is that the choice of 
r is such that the computation time is within a 
certain bound of the minimal computation time. 
This is not necessarily an efficient choice of r. 
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Fig. 7 shows the behavior of the computation time 
of a vector with length n=320 on r pipelines, with 
q segments each. It shows that the computation 
time decreases rapidly with an increase of r in the 
first part of the curve, but soon the curve 
flattens and the improvement becomes insignificant. 
This implies that in order to obtain a 
minimal computation time, many more pipelines are 
required beyond the point where the computation 
time decreases significantly as a result of an 
increase in the number of pipelines. For example, 
if q=8 and n=320, the optimal choice of r is r=32 
(with N=32); however, above the values of r in the 
region (4,8), no significant improvements of the 
computation time are obtained. At r=8, the 
computation time is T(320,8)=95, and at r=32, the 
computation time is T(320,32)=81, yielding a 10% 
improvement in speed at the cost of 4 times the 
number of pipelines. Therefore, the minimum 
computation time criterion leads to a very low 
processor efficiency. 

For practical purposes a different criterion
for the choice of r is needed. We will try to
establish a criterion for determining r=r_D such
that beyond the point r_D no acceptable improvement
in speed can be made.

Normally, we could use an efficiency measure,
such as η(r)=T(1)/(r·T(r)). However, η(r) is
strongly nonlinear and the above definition leads
to cumbersome formulas. Moreover, an excessive
number of processors may be needed to meet an
efficiency criterion (see also the discussion at
the end of point 2 of this section). Instead, we
introduce the Doubling Improvement ratio
(DI-ratio). We are interested in the relative
speed improvement (DI-percentage) when the number
of pipelines is doubled. The DI-improvement can be
expressed as follows:

\[ k_{DI} = \frac{T(n,r) - T(n,2r)}{T(n,r)} \qquad (14) \]

Thus, an improvement in speed of k_DI·100% is
obtained by doubling the number of pipelines. In
general, k_DI will be in the range (0, 0.5).
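
The DI-ratio of Eq. (14) can be evaluated numerically once a reduction-time function is fixed. The sketch below is a minimal illustration, not the authors' code. It assumes that for ⌈n/r⌉ > q the SM-method reduction time has the form T(n,r) = ⌈n/r⌉ + q⌈log r⌉ + q⌈log q⌉ + 2^⌈log q⌉ − 1 (base-2 logarithms), the form that appears as the denominator in case 3 of this section; this assumption reproduces the values T(320,8)=95 and T(320,32)=81 quoted above. All function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a/b for positive integers."""
    return -(-a // b)


def ceil_log2(x: int) -> int:
    """Ceiling of log2(x) for an integer x >= 1."""
    return (x - 1).bit_length()


def reduction_time(n: int, r: int, q: int) -> int:
    """Assumed SM-method reduction time, valid only for ceil(n/r) > q."""
    m = ceil_div(n, r)
    if m <= q:
        raise ValueError("this sketch only covers ceil(n/r) > q")
    return m + q * ceil_log2(r) + q * ceil_log2(q) + 2 ** ceil_log2(q) - 1


def di_ratio(n: int, r: int, q: int) -> float:
    """Relative gain in reduction time when r pipelines become 2r (Eq. 14)."""
    t_r = reduction_time(n, r, q)
    t_2r = reduction_time(n, 2 * r, q)
    return (t_r - t_2r) / t_r


if __name__ == "__main__":
    # Tabulate T(320, r) and the DI-ratio for q = 8.
    for r in (1, 2, 4, 8, 16):
        print(r, reduction_time(320, r, 8), round(di_ratio(320, r, 8), 3))
```

Under this assumption the DI-ratio shrinks steadily as r grows for fixed n, which is the flattening of the curve described for Fig. 7.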

The use of the DI-ratio as an efficiency
measure has the advantage that it better matches
the behavior of the computation time curve of a
vector reduction, smooths discontinuities, and
leads to formulas that can be analyzed.

In the following, the behavior of k_DI for a
system using the SM-method of reduction will be
analyzed first. Three cases are considered:

1. 1≤⌈n/r⌉≤q, 2. ⌈n/2r⌉≤q<⌈n/r⌉, and 3. ⌈n/2r⌉>q.

1. 1≤⌈n/r⌉≤q. For ⌈n/r⌉=1, substitution of Eq. (4)
and Eq. (7) into (14) leads to a negative
improvement, so we are only interested in the
range 2≤⌈n/r⌉≤q. It can be proved that
⌈log r⌉ − ⌈log 2r⌉ = −1 and, for ⌈n/r⌉>1,
⌈log⌈n/r⌉⌉ − ⌈log⌈n/2r⌉⌉ = 1 [7]. By using these
properties, substitution of Eq. (4) and (7) into
Eq. (14) leads to

[Fig. 5b. The reduction time T(n,r) as a function of the vector length n.]

[Fig. 7. The reduction time as a function of the number of reduction processors for several values of q.]


Let t=⌈log⌈n/r⌉⌉; differentiating k_DI in Eq. (15)
with respect to t yields dk_DI/dt>0 for t>0. Thus,
k_DI is maximal at ⌈n/r⌉=q for 1≤⌈n/r⌉≤q:

\[ k_{DI} \le \frac{2^{\lceil \log q \rceil - 1}}{2^{\lceil \log q \rceil} + q\lceil \log q \rceil + q\lceil \log r \rceil + q - 1} \qquad (16) \]

It also holds that 2^⌈log q⌉ ≤ 2q−2 [7]; therefore,
from Eq. (16) it follows

\[ k_{DI} \le \frac{q - 1}{3q - 3 + q\lceil \log r \rceil + q\lceil \log q \rceil} \qquad (17) \]

For r=1 and q≥2, k_DI≤0.2, and for q≥4, k_DI<0.18.
For r≥2, this figure is k_DI≤0.14 for both q≥2 and
q≥4. Therefore, no impressive improvements can be
achieved.

Furthermore, since the inequality (17) gives
only an upper bound, the actual values of k_DI may
be much smaller. Thus, we may summarize the above
consideration as follows: for 1≤⌈n/r⌉≤q, doubling
the number r to 2r leads to a speed-up of less
than 20%.
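
The numerical bounds quoted for case 1 can be checked directly. The sketch below assumes that inequality (17) reads k_DI ≤ (q−1)/(3q−3+q⌈log r⌉+q⌈log q⌉) with base-2 logarithms; this reading is a reconstruction from the surrounding text, and the function names are hypothetical.

```python
def ceil_log2(x: int) -> int:
    """Ceiling of log2(x) for an integer x >= 1."""
    return (x - 1).bit_length()


def case1_bound(q: int, r: int) -> float:
    """Assumed upper bound on the DI-ratio for 1 <= ceil(n/r) <= q."""
    return (q - 1) / (3 * q - 3 + q * ceil_log2(r) + q * ceil_log2(q))


if __name__ == "__main__":
    # Largest value of the bound over a range of segment counts q.
    for r in (1, 2, 4):
        worst = max(case1_bound(q, r) for q in range(2, 65))
        print(f"r={r}: largest bound over q=2..64 is {worst:.3f}")
```

With this reading the bound equals 0.2 at r=1, q=2 and is roughly 0.14 at its largest for r≥2, in line with the figures stated in the text.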


2. ⌈n/2r⌉≤q<⌈n/r⌉. For ⌈n/2r⌉≤q<⌈n/r⌉, after some
substitution, equation (14) can be transformed
into

\[ k_{DI} = \frac{\lceil n/r \rceil - 2^{\lceil \log \lceil n/2r \rceil \rceil} - q\lceil \log \lceil n/2r \rceil \rceil + 2^{\lceil \log q \rceil} + q\lceil \log q \rceil - 2q}{\lceil n/r \rceil + q\lceil \log r \rceil + 2^{\lceil \log q \rceil} + q\lceil \log q \rceil - 1} \qquad (18) \]

From ⌈n/2r⌉≤q<⌈n/r⌉, it follows that q/2<⌈n/2r⌉; this
means that either a. ⌈log⌈n/2r⌉⌉=⌈log q⌉−1 or
b. ⌈log⌈n/2r⌉⌉=⌈log q⌉.

a. Substitution of ⌈log⌈n/2r⌉⌉=⌈log q⌉−1 in
Eq. (18) leads to

\[ k_{DI} = \frac{\lceil n/r \rceil + 2^{\lceil \log q \rceil - 1} - q}{\lceil n/r \rceil + q\lceil \log r \rceil + 2^{\lceil \log q \rceil} + q\lceil \log q \rceil - 1} \qquad (19) \]

Since r≥1, ⌈n/r⌉ ≤ 2⌈n/2r⌉ ≤ 2·2^⌈log⌈n/2r⌉⌉ = 2^⌈log q⌉,
and (19) is an increasing function of ⌈n/r⌉, it
follows that

\[ k_{DI} \le \frac{2^{\lceil \log q \rceil} + 2^{\lceil \log q \rceil - 1} - q}{2^{\lceil \log q \rceil} + 2^{\lceil \log q \rceil} + q\lceil \log q \rceil - 1} \qquad (20) \]

For q≥2, q can be written as q=2^s+t with s≥0
and 1≤t≤2^s; substitution of q=2^s+t in (20)
leads to

\[ k_{DI} \le \frac{2^{s+1} - t}{2 \cdot 2^{s+1} + (2^s + t)(s+1) - 1} \le \frac{2^{s+1} - 1}{2^{s+2} + (2^s + 1)(s+1) - 1} \qquad (21) \]

After some calculation, it follows that for s≥0
the maximum of (2^{s+1}−1)/{2^{s+2}+(2^s+1)(s+1)−1}
is 0.233 at s=2. Therefore, it holds that k_DI≤0.233.

b. Substitution of ⌈log⌈n/2r⌉⌉=⌈log q⌉ in (18)
leads to

\[ k_{DI} = \frac{\lceil n/r \rceil - 2q}{\lceil n/r \rceil + q\lceil \log r \rceil + 2^{\lceil \log q \rceil} + q\lceil \log q \rceil - 1} \qquad (22) \]

From ⌈n/2r⌉≤q<⌈n/r⌉ and ⌈n/r⌉≤2⌈n/2r⌉, it
follows that k_DI≤0.

From a and b, it has been shown that k_DI≤0.233 for
q≥2 and ⌈n/2r⌉≤q<⌈n/r⌉. Therefore, we may now
conclude that for 1≤⌈n/r⌉≤q and ⌈n/2r⌉≤q<⌈n/r⌉,
the DI-percentage is less than 23.3% for r≥1 and
q≥2. For r≥2 and q≥2, this DI-percentage is even
smaller, having a value of less than 20%.
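
The location of the maximum quoted for case 2a can be verified with exact rational arithmetic. The sketch below assumes the expression being maximized is (2^{s+1}−1)/(2^{s+2}+(2^s+1)(s+1)−1), which is how the garbled formula in the source is read here; the function name is hypothetical.

```python
from fractions import Fraction


def case2a_bound(s: int) -> Fraction:
    """Assumed worst-case bound (t = 1, q = 2**s + 1) as an exact fraction."""
    numerator = 2 ** (s + 1) - 1
    denominator = 2 ** (s + 2) + (2 ** s + 1) * (s + 1) - 1
    return Fraction(numerator, denominator)


if __name__ == "__main__":
    # Search a generous range of s for the largest bound.
    best_s = max(range(0, 20), key=case2a_bound)
    print(best_s, case2a_bound(best_s), float(case2a_bound(best_s)))
```

Under this reading the largest value is 7/30, about 0.233, attained at s=2, which agrees with the bound of 23.3% stated in the text.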


From the above consideration, we conclude
that in the cases 1≤⌈n/r⌉≤q and ⌈n/2r⌉≤q<⌈n/r⌉,
the DI-percentages for the SM-method are below
23.3% for r≥1 and q≥2, and below 20% for r≥2 and
q≥2. For the AM-method, a similar analysis can be
made. It can be shown that in this case the
DI-percentage is below 20% for all q≥2 [7].


To interpret the consequences of such a low
DI-percentage, we briefly discuss the relationship
between the DI-improvement and the
processor-efficiency. It is a well-known fact that
the processor-efficiency in a multi-processor
system is lower than in a single-processor system
as a result of the intercommunication overhead
(which is the price paid for greater speed). In
the following, we will use the reduction time
T(r=1) as the reference to define the
processor-efficiency as

\[ \eta_M(r) = \frac{T_M(1)}{r \cdot T_M(r)} \qquad (23) \]

where M denotes the reduction method, and T_M(r) is
the reduction time under method M when r pipelines
are involved in the computation. The relationship
between the DI-percentage and the
processor-efficiency can be expressed as

\[ \eta(2r) = \frac{\eta(r)}{2(1-k)} \qquad (24) \]

Further, it is shown in [7] that η(r)≤65.2%
for r≥2, 1≤⌈n/r⌉≤2q and q≥2 for both the SM- and
the AM-method. Therefore, from the fact that
k≤23.3%, it follows that η(2r)<0.652/(2·0.767)=42.5%
for r≥2. Thus, the processor-efficiency is less
than 42.5% for 1≤⌈n/r⌉≤q and ⌈n/2r⌉≤q<⌈n/r⌉ with
q≥2 and r≥2. To reduce a vector of length n, the
number of (dyadic) operations is equal to (n−1).
If we use the definition η(r)=(n−1)/(r·T_M(r)),
then we will have η(r)<50% for 1≤⌈n/r⌉≤2q and q≥2
for both the SM- and the AM-method, leading to
η(2r)<30.6% for r≥1. This processor-efficiency is
far too low, so it should be avoided. Thus, for a
given problem with a vector length n, we should
choose r such that ⌈n/2r⌉>q, in order to obtain a
reasonable processor-efficiency (in the case of
n≤2q, r=1 is an obvious choice).
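
The relation between efficiency and DI-ratio follows from T(2r) = (1−k)·T(r) together with the efficiency definition η(r) = T(1)/(r·T(r)). The sketch below is a minimal numerical check of that relation, read here as η(2r) = η(r)/(2(1−k)); the function name is hypothetical.

```python
def efficiency_after_doubling(eta_r: float, k: float) -> float:
    """Efficiency at 2r pipelines, given efficiency eta_r at r and DI-ratio k.

    Uses eta(2r) = eta(r) / (2 * (1 - k)), which follows from
    T(2r) = (1 - k) * T(r) and eta(r) = T(1) / (r * T(r)).
    """
    if not 0.0 <= k < 1.0:
        raise ValueError("k must lie in [0, 1)")
    return eta_r / (2.0 * (1.0 - k))


if __name__ == "__main__":
    # Values taken from the text: eta(r) = 65.2% and k = 23.3%.
    print(round(efficiency_after_doubling(0.652, 0.233), 3))
```

With η(r)=0.652 and k=0.233 this gives approximately 0.425, the 42.5% ceiling quoted in the text.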


3. ⌈n/2r⌉>q. Substitution of Eq. (7) with the
SM-method into Eq. (14) yields

\[ k_{DI} = \frac{\lceil n/r \rceil - \lceil n/2r \rceil - q}{\lceil n/r \rceil + q\lceil \log r \rceil + q\lceil \log q \rceil + 2^{\lceil \log q \rceil} - 1} \qquad (25) \]

The asymptotic value of k_DI is 50% as n/r→∞.
However, the requirement k_DI=50% is not realistic,
and is even undesirable for the sake of computing
speed. If we require that k_DI>0.45, then the usage
of more than one reduction pipeline is only
possible when n is very large. So, our aim is to
determine a reasonable compromise between speed and
processor-efficiency.

We want to determine a choice r=r_D as a
function of n under a given condition k_DI=k
(k∈[0.25,0.5)). It is not trivial to solve Eq. (25)
and to express the DI-choice r_D as an explicit
function of k_DI. Furthermore, it is desirable to
have a simple formula for the determination of r_D.

By canceling all the ceiling functions in Eq.
(25), and substituting r=r_D and k_DI=k, the
following approximation is obtained:

\[ r_D = \frac{(0.5 - k)\,n}{q\,(1 + k + k \log r_D + k \log q)} \qquad (26) \]

Eq. (26) is an implicit function of r_D. We want to
find a simple formula from which an approximation
to r_D can be easily calculated.
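
Because the DI-choice is defined only implicitly, it has to be solved numerically. The sketch below uses a plain fixed-point iteration, a technique swapped in here for brevity rather than the method of [7]. It assumes that Eq. (26) reads r_D = (0.5−k)n / (q(1+k+k log r_D + k log q)) with base-2 logarithms; the function name, starting value and tolerance are hypothetical.

```python
import math


def di_choice(n: int, q: int, k: float, tol: float = 1e-9, max_iter: int = 200) -> float:
    """Solve r = (0.5 - k) * n / (q * (1 + k + k*log2(r) + k*log2(q))) for r."""
    if not 0.0 < k < 0.5:
        raise ValueError("k must lie in (0, 0.5)")
    r = 1.0  # starting guess: a single pipeline
    for _ in range(max_iter):
        denom = q * (1.0 + k + k * math.log2(r) + k * math.log2(q))
        if denom <= 0.0:
            raise ArithmeticError("iteration left the region where the formula is meaningful")
        r_next = (0.5 - k) * n / denom
        if abs(r_next - r) <= tol * max(1.0, r):
            return r_next
        r = r_next
    raise ArithmeticError("fixed-point iteration did not converge")


if __name__ == "__main__":
    print(di_choice(320, 8, 0.25))
```

For n=320, q=8 and k=0.25 the iteration settles at r_D=4, since (0.25·320)/(8·2.5)=4 satisfies the equation exactly; this lies at the lower end of the region (4,8) discussed for Fig. 7.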

Substitution of r_D=αn into Eq. (26) gives

\[ \alpha = \frac{0.5 - k}{q\,(1 + k + k \log \alpha + k \log n + k \log q)} \qquad (27) \]

where α is an unknown parameter which is a
function of k, q and n.

For a given vector of length n, α in Eq. (27)
can be computed by means of one of the known
numerical root-finding methods (e.g., Newton's
method). However, it is impractical to iterate α
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before every vector reduction operation (it may 
even take more time than the reduction itself!). 
From Eq. (27), we have learned that log α varies
very smoothly as n increases; it is approximately
a function of log(log n). One solution is to
approximate log α by a+b·log(log(n)), where the
parameters a and b can be determined by means of
the least-squares error method [7].

Thus, by using the approximation function of
log α, Eq. (26) is transformed to

\[ r_D = \frac{c_1 n}{c_2 + c_3 \log(\log(n)) + c_4 \log(n)} \qquad (28) \]

where c_1=(0.5−k), c_2=q·(1+k+k·a+k·log q), c_3=q·k·b
and c_4=q·k with k=k_DI. All c_i are constants once
a choice of k is made. Simulations show that the
approximation formula (28), together with the
calculated values of a and b, gives satisfactory
results (the error between Eq. (28) and Eq. (26) is
less than 5%). Furthermore, Eq. (28) is valid for
both the SM-method and the AM-method.

The computation of log(n) and log(log(n)) can
be done in a simple way by means of the following
approximation scheme: determine the position P_MSB
of the most significant bit (MSB) of n; then P_MSB
is an approximation of log(n) with an error of
less than 1. Another approximation, with an error
of less than 0.585, can be obtained by looking at
the bit next to the MSB: if this bit is 1 then
P_MSB+1 is used as the approximation for log(n);
otherwise, P_MSB is used. Applying the same
process to the obtained result for log(n), an
approximate value of log(log(n)) is determined.
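
The bit-inspection scheme described above can be sketched in a few lines. This is an illustrative software rendering with hypothetical function names; a hardware realization would use a priority encoder instead. Rounding up when the bit next to the MSB is set keeps the error below log2(1.5), roughly 0.585.

```python
import math


def approx_log2(n: int) -> int:
    """Approximate log2(n) from the MSB position and the bit next to it."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    p_msb = n.bit_length() - 1          # position of the most significant bit
    if p_msb >= 1 and (n >> (p_msb - 1)) & 1:
        return p_msb + 1                # next bit set: n >= 1.5 * 2**p_msb
    return p_msb


def approx_log2_log2(n: int) -> int:
    """Apply the same scheme twice to approximate log2(log2(n)); needs n >= 2."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return approx_log2(approx_log2(n))


if __name__ == "__main__":
    worst = max(abs(approx_log2(n) - math.log2(n)) for n in range(1, 1 << 16))
    print(f"largest error for n < 2**16: {worst:.4f}")
```

For example, n=320 has its MSB at position 8 with the next bit clear, so the approximation of log(n) is 8 and the approximation of log(log(n)) is 3.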

Fig. 7 shows a plot of the reduction time
(T(n,r)) as a function of the number of pipelines
(r) involved in the processing of a given vector
of length n. The points related to several
different k_DI values are indicated by markers.
The numerical results in Fig. 7 also show that a
choice of k_DI<25% will only greatly increase the
number of pipelines used and result in a very low
processor-efficiency, as has already been pointed
out earlier.


V. Performance 


Generally, in vector processing a parallel
pipelined system performs better than a parallel
non-pipelined system. Suppose a non-pipelined
processor has a cycle time of t_np=s ns; then
with the same technology a pipeline cycle time of
t_pipe=(s/q+c) ns can be realized, where q is the
number of segments of the pipelined processor, and
c represents the delay of a latch which is
introduced to hold the partial result between each
segment. For example, consider a 6-segment adder
(like the one in the Cray X-MP) having a cycle
time of t_pipe=9.5 ns; then in the ideal case a
non-pipelined adder has a cycle time of
t_np=45 ns (i.e., assuming that the delay of a
latch is about 2.0 ns for the pipelined case).
However, in reality, the ratio R_v of the maximum
vector processing rate and the maximum scalar
processing rate is usually in the order of 10
(e.g., R_v=15 for the Cray-1 [8]), where R_v is
defined as R_v=r_∞v/r_∞s [8], and r_∞v and r_∞s are
the maximal vector and scalar processing rates
respectively. Thus, in reality a non-pipelined
(general-purpose) processor will have a cycle
time much greater than the ideal cycle time of 45
ns in our example. This is due to the fact that
instructions must be fetched and decoded for each
operation before performing the operation itself,
whereas this is not the case for a pipelined
processor. The performances of the non-pipelined
approach and the pipelined approach are shown in
Table I for a system of 8 processors. The
computation speeds are listed for different values
of n. In Table I, the performance of both the
ideal parallel non-pipelined system and the
practical non-pipelined system with R_v=10 are
shown. S_pl is the speed of the pipelined
approach; S_i and S_p are the speeds of the ideal
non-pipelined approach and the practical
non-pipelined approach respectively. From the
table, it is clear that the pipelined system
outperforms the ideal non-pipelined system, except
for very small vector lengths, and that the
parallel pipelined system outperforms the
practical non-pipelined system in all the cases
shown in the table.
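
The cycle-time relation used in this section, a pipelined cycle of s/q plus one latch delay c, is easy to check numerically. The sketch below is illustrative only; it takes s=45 ns, q=6 and c=2.0 ns from the adder example, and the function names are hypothetical.

```python
def pipeline_cycle_time(s_ns: float, q: int, latch_ns: float) -> float:
    """Cycle time of a q-segment pipeline built from an s_ns non-pipelined unit."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return s_ns / q + latch_ns


def ideal_segment_speedup(s_ns: float, q: int, latch_ns: float) -> float:
    """Ratio of non-pipelined to pipelined cycle time (peak throughput gain)."""
    return s_ns / pipeline_cycle_time(s_ns, q, latch_ns)


if __name__ == "__main__":
    print(pipeline_cycle_time(45.0, 6, 2.0))
    print(round(ideal_segment_speedup(45.0, 6, 2.0), 2))
```

The latch delay is what keeps the peak gain below the segment count: with these numbers a 6-segment pipeline delivers a result every 9.5 ns rather than every 7.5 ns.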


VI. Conclusion


In this paper, we have studied vector-reduction
techniques in a parallel arithmetic pipeline
processing environment. It has been shown that
with a simple partitionable linear array
structure, reduction algorithms can be implemented
efficiently. The proposed approach performs the
same as a binary-tree network in the reduction of
an n-element vector.

Regularity and fault tolerance of a system
are important criteria for VLSI implementations;
the partitioned linear array approach satisfies
these two criteria.

The minimal reduction time in a system
consisting of N pipelines and an approximation
method to achieve the minimal reduction time have
also been analyzed. It is known that parallelism
and processor-efficiency must be traded off
against each other. In a multi-pipeline system,
the issue of processor-efficiency is a more
important problem than in the non-pipelined case,
since the time overhead in merging the partial
results of each pipeline (e.g., see Fig. 3) is
larger. The DI-ratio has been introduced as a
means of making a trade-off between computing
speed (parallelism) and processor-efficiency.

The DI-ratio can be used to calculate a 
proper choice of the number of pipelined pro- 
cessors, given the typical maximal vector length 
of the operations the system will be dealing with, 
and a required processor efficiency. 
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Abstract 


This paper discusses pipelining in the context of mul- 
tiprocessor systems. It is proposed that for such an archi- 
tecture, it is possible to design near-optimal pipeline ma- 
chines by coupling pipelines, using buffers to balance instruc-
tion/data flows, concurrently processing several instruction
streams, and allowing out-of-order execution of instructions 
within a stream. We have investigated the issues involved in 
the design of such a machine by using a uniprocessor pipeline 
machine as a basis for the type of architecture proposed. 


Introduction 


Pipelining has long been a method of obtaining high 
throughput in uniprocessor systems by overlapping the exe- 
cution of several instructions; an example of such an arrange- 
ment being that shown in the block diagram of Figure 1 in 
which each stage of the pipeline performs one of the phases 
in the sequence involved in processing an instruction. Ideally 
the throughput of such a system is one result per pipeline 
beat. In practice, however, it is rarely the case that this ideal 
throughput is achieved. Instructions requiring the transfer of 
control usually cause severe disruption in the flow of instruc- 
tions; similar effects occur, although to a lesser extent, in the 
operand-fetch stage, as a result of the methods employed to 
resolve conflicts, and a multitude of other (relatively minor) 
reasons. 

In addition to pipelining, performance may be gained 
through the use of multiple processors; examples of machines 
which combine extensive pipelining and multiple processors 
being the S-1 [14] and the HEP [11]. This paper deals with
the subject of pipelining in the context of multiprocessor sys- 
tems. In such a case, we believe that it is possible to over- 
come many of the problems inherent in pipelining by “shar- 
ing” pipelines between different instruction streams and us- 
ing buffers to balance flows through the pipelines. Such an 
architecture is shown, at a fairly high level, in the diagram 
in Figure 2. The main idea here is to take several pipelines 
of the type shown in Figure 1 and to couple them through 
inter-stage buffers (capable of holding several instructions) 
in such a manner that during peak processing rates, the flow 
into the buffer exceeds the flow from the buffer. Thus when 
processing instructions that would normally cause a disrup- 
tion in the rate of inter-stage flow, the overall effect would 


be that while the flow out of some stage would fall to below 
maximum, the succeeding stage would continue to accept in- 
structions at the maximum rate (by processing instructions 
buffered during peak flow); strictly, therefore, buffers are not 
necessary at the output of every stage as implied by Figure 
2. The ability to process several instruction/data streams 
also has many benefits which are discussed below. The same 
techniques can be applied to data streams in order to ensure 
that pipelined vector processing units do not experience any 
performance degradations as a result of disruptions in the 
delivery of data. 

The type of architecture most closely related, in as far 
as the use of buffers goes, to what we propose is the HEP al- 
though essentially the same reasoning also lies behind the use 


0190-391 8/86/0000/0511 $01.00 © 1986 IEEE 




of queues in PIPE [2] and a related scheme, in which several 
independent processors (cyclically) share a buffer, has also 
been described by Goto and Shimizu [3]. We have attempted 
to identify the issues that are involved in the design of a more 
practical machine than that depicted by the block diagram 
of Figure 2 and have been concerned with the complexity
of such a task relative to that of designing a multiprocessor 
system consisting of independent pipelines. For this purpose 
we chose a practical pipelined uniprocessor, the MU5 [8], as 
a starting point and attempted to develop a shared-pipeline 
multiprocessor from this. 


The MU5 Processor 


In this section we briefly describe the uniprocessor that 
has formed the basis of our studies [9]. A high-level organi- 
zation of the MU5 processor is shown in Figure 3. It con- 
sists of two major pipelines, the Primary Operand Pipeline 
(PROP) which is largely concerned with the accessing of pri- 
mary (named) operands and the Secondary Operand Pipeline 
(SEOP) which is largely concerned with the accessing of sec-
ondary operands such as data structure elements. Instruc-
tion fetches are initiated by the Instruction Buffer Unit
(IBU) to which the store returns up to eight instructions at a
time; IBU then supplies the Primary Operand Unit at the
rate of about one instruction per beat (peak). IBU also
contains a branch prediction mechanism, the Jump Trace.
The B-Unit executes fixed point arithmetic instructions, such
as those required for computing array indices, and some organi-
sational instructions. The A-Unit executes floating point and
fixed point arithmetic instructions.

PROP is a five-stage pipeline: Initial Decode, in which
sufficient decoding is done to isolate the name and select a
base register (Stack Pointer, Name Base, etc.); Add Name to
Base Register, in which a name is added to the contents of
a base register; Associate Address, which is the first stage of
accessing the Name Store (a “cache” for named operands);
Read Value, the second part of Name Store access; and As-
semble Operand, in which the operand is assembled and the
Program Counter is incremented.


SEOP consists of three main stages: Descriptor Address- 
ing Unit (Dr) generates secondary addresses using named de- 
scriptors from PROP and the output of the B-Unit as mod- 
ifier, Operand Buffer Store (OBS) makes store requests and 
buffers operands returned, and Descriptor Operand Access- 
ing Unit (Dop) performs masking and shifting to select the 
required element. 


A Shared Pipeline Multiprocessor 


In this section we present a shared-pipeline multipro- 
cessor. The philosophy underlying the architecture is very 
similar to that of the HEP computer [11]; it centers around 
the idea of processing several instruction/data streams in a 


single pipeline and can be viewed as one that develops the


multiprocessing-pipelines concept to its ultimate conclusion 
in that several instructions from the same stream may be 
processed concurrently and possibly out of sequence. 

As previously stated, the main driving problem here is 


that of pipeline flow disruptions. Conventional pipelined ma- 
chines have dealt with the problem of flow disruptions by 
designing hardware to limit the effects of these within indi- 
vidual instruction streams; an example in point being branch 
prediction strategies [7,10]. The success of these has, how- 
ever, been limited by the inherent nature of pipelining and 
of the instructions being processed. Considering conditional
control transfers, for example, it can be observed that when
both the instruction setting condition codes and the control
transfer order fall within the so-called Gulf of Ignorance [5],
i.e. the separation between the pipeline entry and the exe-
cution units, the latter instruction is unresolvable and such
mechanisms as double-fetching (employed in the IBM 360/91 
[1]) or the prediction Jump Trace of MU5 must necessarily 
be of limited success. Pipeline sharing is based on the ob- 
servation that disruptions within an individual stream are 
acceptable as long as some other stream can be processed to 
fill the gap created. The main design issue is, therefore, that 
of being able to switch streams rapidly; in fact, ideally, the 
delay incurred should be nil. This is precisely what the inter- 
leaving of instructions from different streams is intended to 
achieve. There are also gains to be made if the instructions 
in a particular stream can be executed out of their initiation 


order if the data dependencies so permit; this, for example,


reduces the number of active processes (streams) that are 
needed in order to sustain maximum throughput. 

From the above description and Figure 3, it will be ob- 
served that in an MU5-like pipeline, the points at which the 
averaged input rate into a stage may exceed the averaged 
output rate out of the stage are: 


• At the output of IBU as a result of processing branch
instructions and other organisational instructions. This
is the case even with branch prediction and prefetching.


• At the Name Store, since some instructions will give non-
equivalence on association.


• As with the Name Store, an instruction may fail to find
its operand in the Operand Buffer Store; hence the rate
of flow into the A-Unit may be disrupted on certain oc-
casions.


• At the Control-Point/Assemble-Operand stage. Since
some instructions are executed at this stage, the flow
into the stage will, on the average, exceed the outward
flow into the arithmetic units.


Thus recalling the diagrams of Figures 1 and 2, buffers 


can be inserted at these points to balance the flow into and 
out of the various stages. This would be the obvious orga-
nization and represents the architecture that was considered
initially. A much better organization turned out to be that
shown in Figure 4. In this, all the buffers that one would
expect in the primary stages of the pipelines still exist, al-
though in a different form, and they have, in effect, been moved
into the IBUs. This excludes buffering at the end of the
primary pipelines, which we have not, for various reasons,
investigated at present. To keep the diagram simple, not
all the relevant features of the architecture are shown, as will
be apparent from the following discussions. Such omissions
include connections from the Instruction Issue to the IBUs
(via the IBU-PSB network). Although only two inter-stage
buffers per stage are shown, a practical system would include
as many buffers as necessary to balance the average flows 
without any undue performance requirements on the hard- 
ware. It is also to be expected that the number of Scalar 
Execution Units will be much smaller than the number of 
Vector Execution Units; reasonable numbers being, say, 4 
and 32 respectively. In general, the number of units in the 
primary pipelines can be altered independently of the num-
ber of units in the secondary pipelines. No claim is made
that Figure 4 of itself depicts a practical machine; at this
point, we have emphasized concepts over practical matters
whenever conflicts existed.

There are fundamental differences between the organiza-
tion shown and one that would be obtained from a straight-
forward extension of an MU5-like machine to a multiprocessor
system. By far the most apparent is the decision to separate
the single primary store into a Program Store and a Data 
Store; in fact the latter is further split into a Scalar Store 
and a Vector Store. The main reason for this is to reduce 
the bandwidth requirements of the various interconnections. 
With a single (interleaved) store an interconnection system 
with enough bandwidth to support the accessing rates of all 
the IBUs, the Scalar Units, and the Vector Units may be 
an impractical proposition. The separation of a (possibly) 
single data store into two separate stores arose from the de- 
cision to have highly pipelined vector processing units; the 


use of a separate Vector Store, apart from reducing band-
width requirements, also allows vector processing orders to
be implemented in a more natural way. 


Multiple Instruction Buffer Units 


Essentially, the system runs several processes concur- 
rently with instructions from each stream being initiated in a 
different IBU. A Decode module cycles through the IBUs con- 
nected to it and interleaves instructions from several streams 
into the next stage, after the necessary decoding. Whenever 
an instruction that would normally cause a disruption, e.g. 
a branch instruction, is decoded, the corresponding IBU is 
(temporarily) abandoned until it is known that valid instruc- 
tions may be taken from the IBU, say when a transfer of 
control actually takes place. 

The important point about the use of several IBUs per
Decode is that since instructions from different streams are
continually being interleaved, the end effect is that streams
(as seen by the pipeline stages) are also being changed at the
same rate, which is also the maximum rate possible. Thus
when a disruption occurs within some stream, the necessary
change of stream is obtained, in effect, “for free”.
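
The effect of cycling through several IBUs can be illustrated with a toy model. The sketch below is a deliberately simplified assumption-laden simulation, not a description of the actual Decode hardware: each stream is a list of instruction labels, the label "B" stands for a branch, and a stream that issues a branch is skipped for a fixed number of beats (`branch_penalty`, a hypothetical parameter) while other streams fill the pipeline.

```python
from collections import deque


def interleave(streams, branch_penalty):
    """Round-robin issue from several streams; returns one entry per beat.

    Each entry is (beat, stream_index, instruction); a bubble is recorded
    as (beat, None, None) when no stream can supply an instruction.
    """
    queues = [deque(s) for s in streams]
    blocked_until = [0] * len(queues)   # first beat at which a stream may issue again
    issued = []
    beat = 0
    start = 0                           # stream to try first on this beat
    while any(queues):
        for offset in range(len(queues)):
            i = (start + offset) % len(queues)
            if queues[i] and blocked_until[i] <= beat:
                instr = queues[i].popleft()
                issued.append((beat, i, instr))
                if instr == "B":        # abandon this stream until the branch resolves
                    blocked_until[i] = beat + 1 + branch_penalty
                start = (i + 1) % len(queues)
                break
        else:
            issued.append((beat, None, None))
        beat += 1
    return issued


def count_bubbles(issued):
    """Number of beats on which nothing was issued."""
    return sum(1 for _, stream, _ in issued if stream is None)


if __name__ == "__main__":
    program = ["A", "B", "A"]
    print(count_bubbles(interleave([program], 2)))
    print(count_bubbles(interleave([program, program, program], 2)))
```

In this model a single stream with a two-beat branch hold-off loses two beats, whereas three interleaved copies of the same stream lose none, which is the sense in which the change of stream comes at no cost.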

No attempt is made to predict control transfers or to 
explicitly trap loop instructions. Regarding the former, the 
implications of trying to do otherwise are quite clear: identi- 
fying the instructions belonging to some stream for which an 
incorrect prediction was made and flushing these out is likely 
to be very messy at best. Hence all instructions entering the 
pipelines must be guaranteed to complete. Loop-trapping on 
the other hand would be quite easy to include since in the case 
of loops an IBU tends to contain the same set of instructions 
most of the time. 


Name-Associate Units and Name Caches 


At the Name-Associate Stage those instructions causing 
a non-equivalence are held-up until it can be guaranteed that 
their operands are available; other instructions may also be 
held-up at this stage in order to resolve conflicts. We have ex- 
tensively modified [9] the original MU5 Name Store (Cache) 
to allow a very large number of instructions to execute out 
of sequence. In addition to modifications to the Name Store
itself, an associatively addressed buffer has been added to
each cache unit to store instructions that are held-up in or-
der to resolve conflicts and a queue has also been added to
hold instructions with pending store requests. The end re-
sult, admittedly, is a rather complex organisation that in its
present form needs to be simplified.

Instructions for which the association is successful pro- 


ceed to the next stage, tagged with a specification of a Name 
Cache unit holding the operand, and have their operands read 
out. At the Control Points, branch instructions and some or- 
ganisational instructions are executed and other instructions 
are issued to a Scalar Execution Unit or a Vector Associate 
Unit (VAU) and to a Vector Execution Unit, as appropriate. 


Vector Processing Modules 


For vector instructions, the VAUs compute addresses
and perform the necessary associate operations, operands are
obtained from the Vector Cache, (possibly after a store fetch) 
and the instruction and operands are forwarded to a Vector 
Execution Unit for processing. The results of a VEU opera- 
tion are stored in one of the two data caches, depending on its 
type, while those of a scalar operation are stored in the Name 
Cache. The modules involved with the processing of vector 
orders are also another area where we have had to make sig- 
nificant changes to the original organisation. Modifications in 
the former have been made to allow out-of-sequence execution,
and are similar to the changes made to the Name Store, and 
to allow data-driven communications such as are to be found 
in the MU6-V prototype [6]. 


Miscellaneous 


The Instruction Issue Units (IIUs), which are also the loca- 
tion of Control Points, have also turned out to be more complex 
than their counterpart in the MU5 PROP; a major addition has 
been to include a unit resembling the CDC 6600 Scoreboard 
[13]. Some of the added complexity arose from the attempt 
to maintain compatibility with the MU5 instruction set, prim- 
itive architecture, and general philosophy. An example of this 
is that while compiler-generated code assumes (for simplicity),
for example, that only one Scalar Execution Unit exists, the IIU, on
the other hand, schedules instructions according to the
availability of execution units. This creates many difficulties,
particularly when code assumes that the stack will be used to
hold temporary values. The same applies to many orders that
explicitly use the base registers; since these registers are lo-
cated in the early stages of the primary pipelines while part
of the execution takes place in the IIUs or the Scalar Execu-
tion Units, there is some difficulty in providing the necessary
communication. Our solution at the moment is to connect
the IIU-SEU/VAU network and the PSB-IBU network and to
communicate through these, but this is clearly a solution that
leaves much to be desired. It, therefore, appears that any solu- 
tion short of changing the instruction set and/or changing the 
primitive architecture is likely to end up being rather complex. 

Other significant architectural changes that have been ne- 
cessitated by the sharing of pipelines are: Certain machine reg- 
isters, such as the Processor Status Register, Program Counter, 
and Base Registers have to be replicated so that there is one 
register of each type for each IBU. As a consequence of this, 


and of the need to provide communication (which takes place 
via the PSB-IBU network) between Control Points and IBUs, 
it is necessary for instructions to be tagged both with a pro- 


cess identifier and the identifier of the IBU in which they are 
initiated. 


CONCLUSIONS 


The original inspiration for the architecture described 


came from [4] although it may now be hard to discern many 
similarities with what is described therein. The abstract archi-
tecture is one that combines pipelining and multiprocess-
ing and whose main features are: shared-buffered pipelines with
almost no overhead in stream switching and no disruptions in 


the inter-stage flow of instruction/data and, hence, no perfor-
mance degradation on all types of orders, ability to execute in-
structions out-of-sequence across a wide window in the stream,
and high-performance vector processing. It is based on a prac-
tical uniprocessor machine and the relative complexity/major
issues of realizing such an abstract machine have been partially
evaluated (detailed in [9]). The main lessons that have been
learned from the exercise so far are that: (1) such a machine
probably needs to have both a “tailor-made” instruction set
and primitive architecture as its basis and (2) there are inher-
ent difficulties in some of the underlying ideas and these need
to be re-examined carefully; the splitting of the data store, for
example, has resulted in far more difficulties than were antici-
pated. These represent the major avenues for any further work.
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Abstract 


Instruction sets of Register-Storage (RS) architectures allow func- 
tional operations to take one operand from memory and one from a 
register. In contrast, Register-Register (RR) architectures always take
both operands from registers. A simple, generic instruction set is
defined for an RS architecture. A basic RS pipeline structure is
developed for studying design and performance characteristics. The 
resolution of register hazards, placement of the memory access, branch 
instructions, and handling of stores to memory are examined in detail. 
For purposes of comparison, a basic RR architecture and pipelined 
implementation are also developed. The RR and RS models are used to 
calculate the performance of several of the Lawrence Livermore Loops 
to gain additional insight into the effect that the architectures have on 
performance. 


1. Introduction 


Register-Storage (RS) architectures are based on instruction sets 
that allow functional operations (e.g. floating point addition) to take one 
operand from memory and one from a register. The result is placed in a 
register. In contrast, Register-Register (RR) architectures always take 
both function operands from registers; explicit loads and stores to or
from registers must always be used to communicate with central
memory. A third class of architectures, those that permit Storage- 
Storage (SS) operations, can also be defined, although we do not study 
them here. 


The suitability of RR architectures for high performance computation
has been clearly demonstrated. The CDC6600 [THO70], CDC7600
[BON69], and the more recent CDC CYBER 180 [CDC84] architectures
are based on an RR architecture. In addition the scalar section of the 
CYBER 205 [CDC81] is an RR architecture, as are both the scalar and 
vector sections of the CRAY-1S [CRA80], CRAY X-MP [CRA82], and 
CRAY-2 [CRA85]. Furthermore, the suitability of RR architectures for 
high performance implementations has not escaped the attention of 
researchers studying high-speed microprocessor architectures [PAT85], 
[RAD82]. 


RS architectures are widely used in practice; the IBM 370 architecture
can generally be considered an RS architecture, although it does
support some SS operations. Unfortunately, the IBM 370 architecture
also contains some features that make high performance implementations
difficult (e.g. implicitly-set condition codes, self-modifying code,
and precise interrupts) [AND67]. Consequently, based on real
implementations, it is difficult to clearly visualize the fundamental structures
of pipelined RS architectures and to see the relative advantages and
disadvantages of RS architectures for high performance implementations.


In this paper, we propose a simple RS architecture, suitable for
pipelining, in order to clearly study the advantages and disadvantages of
RS architectures for high performance implementations. To permit
comparison, we also describe and discuss RR pipelines. The discussion
in this paper is primarily in the context of numeric (floating point)
applications, although many of the results and observations apply to
other applications.


2. Model Architectures 


A fundamental RS instruction couples a memory access with a 
data operation. An RS instruction is of the form
Ri <- Rj op (Rk + disp). The content of register Rk is added to the
displacement to form the memory address from which one of the 
operands is fetched. There are several other ways that a memory 
address could be formed, but we use only this mode. An operation, 
signified by op, is performed between the memory operand and a 
second operand taken from register Rj. The result is placed in register 
Ri. 

Also included as part of a basic RS architecture are simple register
loads, register-register operations, register stores, and branches. The
register-register instructions are of the form Ri <- Rj op Rk and
Ri <- Rj op data. Here the second operand comes from a register or
directly from the instruction. Loads and stores are of the form
Ri <- (Rk + disp) and (Rk + disp) <- Ri respectively. For single
operand instructions, the operand can come from a register,
memory, or the instruction. Conditional branches are of the form
BrCond Rk + disp where a branch condition is specified and the target
address is computed by adding the contents of register Rk to the
displacement. In all cases the register R0 is tested and the branch decision
is based on this test. Condition codes are not used.


As a model RR architecture, we use the RR subset of the above. 
Hence, it is like the scalar CRAY-1 architecture, except one uniform set 
of registers is used, not two types. 
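

The relation between the two model instruction sets can be illustrated with a minimal sketch in Python (not part of the original study). It expands an RS instruction with a memory operand into the equivalent RR pair: an explicit load into a temporary register followed by a register-register operation. The `Instr` record, its field names, and the `rs_to_rr` helper are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instr:
    """One instruction of the model architecture (illustrative encoding)."""
    op: str                      # "load", "store", or a data operation such as "*f"
    dest: Optional[str] = None   # result register Ri (None for a store)
    src1: Optional[str] = None   # register operand Rj
    src2: Optional[str] = None   # second register operand, if any
    base: Optional[str] = None   # address register Rk of a memory operand
    disp: int = 0                # displacement added to Rk


def rs_to_rr(instr: Instr, temp: str) -> List[Instr]:
    """Expand Ri <- Rj op (Rk + disp) into a load plus a register-register op.

    Loads, stores, and register-register instructions already belong to the
    RR subset and are returned unchanged.
    """
    has_memory_operand = instr.op not in ("load", "store") and instr.base is not None
    if not has_memory_operand:
        return [instr]
    load = Instr(op="load", dest=temp, base=instr.base, disp=instr.disp)
    compute = Instr(op=instr.op, dest=instr.dest, src1=instr.src1, src2=temp)
    return [load, compute]


# R5 <- R2 *f (R1 + 8) becomes R6 <- (R1 + 8) followed by R5 <- R2 *f R6.
rr_code = rs_to_rr(Instr(op="*f", dest="R5", src1="R2", base="R1", disp=8), temp="R6")
```

The extra load is the reason an RR program can contain more instructions than its RS counterpart; the choice of temporary register is left to the caller here and would be a compiler decision in practice.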


3. Pipeline Structures 


3.1. The Basic RS Pipeline 


Fig. 1 shows a typical implementation. An instruction fetch 
sequence consists of address generation, memory reference, and instruc- 
tion decode. The instruction fetch step may contain an instruction 
cache. 


The operand fetch phase consists of reading address registers, 
operand address calculation, and memory access. Reading registers 
requires checking for hazards which will be detailed later. In the 
memory access step, an operand cache may be used. In fact, an 
operand cache is an important aspect of RS designs and will be
discussed further later.


Before an instruction can be issued to the execution unit, register 
operands must also be obtained. This can be performed either in paral- 
lel with or after the memory operand fetch. In our basic pipeline, 
instructions are placed in a queue, shown in Fig. 1, after completing the 
memory operand fetch. Note that the queue may be of length 1. The 
instruction at the front of the queue checks for register hazards, reads 
register operands, and then issues to the execution unit. 


The execution unit may take many forms. It may be a single
non-pipelined unit, several parallel non-pipelined units, or several
parallel pipelined units. Note however that the type of execution unit used is
independent of the type of architecture being implemented. The type of
execution unit used is based on other issues such as performance and
cost. In our study, a parallel pipelined execution unit is assumed to
minimize execution time and expose other delays that are inherent in
the implementation of the architecture.


Fig. 1 A typical Register-Storage implementation model


under Grant ECS-8207277.


Once execution has completed the result needs to be stored. Most 
instructions return their result to a register, so storage is a simple regis- 
ter file write. However, stores to memory can take place at this point 
also. In a later section, we consider the tradeoffs involved with per- 
forming memory stores at different points in the pipeline. 


3.2. A Basic RR Pipeline Structure 


The basic pipeline phases are much the same as in the RS model. 
Here, the instruction fetch step is followed by either memory operand 
access or execution, but not both. Storage to memory is done at the 
same point as operand access. Fig. 2 shows a typical RR model. As in 
the RS pipeline, instructions flow through the instruction fetch unit 
where they are decoded. Following instruction fetch, all register inter- 
locks are cleared and then an instruction is issued to either the execu- 
tion unit or the memory unit. All registers are read at the same point in 
the pipeline whether they are for addressing or contain operands. Each 
of these units returns its result to the register file. 


4. Design Considerations 


4.1. Register Hazards 


The way register hazards are handled is an important difference
between RS and RR implementations. In the RS implementation, there
are two points at which registers are read. The first follows the
instruction fetch phase; at this point registers for addressing are read. The
second follows the memory access phase; here register operands are
read. We refer to these as issue points. To be more specific, the point
where address registers are read will be referred to as address issue and
the point where operand registers are read will be referred to as
execution issue.


Fig. 2 A typical Register-Register implementation model


In an RR pipeline, where all registers are read at one point, one 
bit is typically used for each register to indicate that the register is 
reserved. When an instruction is ready to be issued, it checks the 
reserved bits for all the registers it uses. It is blocked from issuing until 
they are all free. A result register is reserved when an instruction that 
writes the register is issued. The reserved bit is cleared when the result 
register is written. 
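

The reserved-bit scheme just described can be sketched as follows. This is a minimal illustration under the rules stated above, not the hardware design of any particular machine; the class name `ReservedBits`, its method names, and the register count of 8 are assumptions made for the example.

```python
from typing import Iterable, List


class ReservedBits:
    """One reserved bit per register, checked at the single RR issue point."""

    def __init__(self, num_regs: int = 8) -> None:
        self.reserved: List[bool] = [False] * num_regs

    def can_issue(self, sources: Iterable[int], dest: int) -> bool:
        # An instruction is blocked until every register it uses is free.
        return not any(self.reserved[r] for r in (*sources, dest))

    def issue(self, sources: Iterable[int], dest: int) -> bool:
        """Try to issue; on success, reserve the result register."""
        if not self.can_issue(sources, dest):
            return False
        self.reserved[dest] = True
        return True

    def write_back(self, dest: int) -> None:
        # The bit is cleared when the result register is written.
        self.reserved[dest] = False


bits = ReservedBits()
first = bits.issue(sources=(2, 3), dest=1)    # R1 <- R2 op R3 issues
second = bits.issue(sources=(5, 1), dest=4)   # R4 <- R5 op R1 must wait for R1
bits.write_back(1)                            # R1 is written
retry = bits.issue(sources=(5, 1), dest=4)    # the dependent instruction issues
```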


We first consider a simple-minded extension of the above method 
for an RS pipeline. Place a separate set of reservation bits at both issue 
points. The reservation bits at each issue point are handled as with the 
RR architecture. The first set is held at the address issue point, and is
checked and set for all result registers as instructions pass. These bits are
also checked for address registers as part of address issue. The second
set is held at the execution issue point and is also set by all
instructions that write a register. These bits are checked for both operand
and result registers for functional operations before instructions issue to
the execution phase. The reservation bits in both sets are simultaneously
cleared when the appropriate result register is written.


However, this method limits performance. This can be seen in
the following sequence of instructions:


(1) R1 <- R2 op R3
(2) R4 <- R5 op R1
(3) R1 <- R6 op R7


Register R1 is reserved by instruction (1) at address issue. Instruction 
(2) does not read R1 until execution issue and is therefore not held up at 
address issue. The problem occurs with instruction (3). Because R1 is 
already reserved, this instruction must wait at address issue until the 
register is no longer reserved. Otherwise, if it is not held, the reserved
bit for R1 will be cleared when instruction (1) finishes, when in fact
R1 should still be reserved for instruction (3). Because only one bit is
used to reserve a register at address issue, only one instruction that 
writes R1 can be beyond the address issue point at any given time. 


This problem can be corrected by using a small reservation
counter for each register at address issue instead of a single reserved
bit. The counter keeps track of the number of instructions that have
passed address issue and will write the register but have not yet done
so. The counter is incremented at address issue by an
instruction that writes the register, and is decremented when the register
is actually written. When an instruction needs to read an address
register, it must wait until the counter becomes zero. When an instruction
needs to write a register, it does not have to wait at address issue; it
simply increments the counter and proceeds. This method enables the
two issue points to make decisions independently, and performance is
not restricted. In our later performance study, we use the method just
proposed. The counter method for handling register reservations at the
address issue point was proposed in [AND67]. There are, of course,
other methods for handling register conflicts, but they will not be
discussed.
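

As a rough illustration of the counter method, the sketch below replays the three-instruction sequence given earlier. It is a simplified model written for this discussion; the class name `AddressIssueCounters`, its method names, and the register count are hypothetical.

```python
from typing import List


class AddressIssueCounters:
    """Per-register count of writers that have passed address issue
    but have not yet written their result."""

    def __init__(self, num_regs: int = 8) -> None:
        self.pending_writes: List[int] = [0] * num_regs

    def can_read(self, reg: int) -> bool:
        # An address register may be read only when no write is outstanding.
        return self.pending_writes[reg] == 0

    def pass_writer(self, dest: int) -> None:
        # A writer never waits at address issue; it just bumps the counter.
        self.pending_writes[dest] += 1

    def write_back(self, dest: int) -> None:
        if self.pending_writes[dest] == 0:
            raise ValueError("write-back without a matching reservation")
        self.pending_writes[dest] -= 1


counters = AddressIssueCounters()
counters.pass_writer(1)                      # (1) R1 <- R2 op R3
counters.pass_writer(4)                      # (2) R4 <- R5 op R1
counters.pass_writer(1)                      # (3) R1 <- R6 op R7, not held up
counters.write_back(1)                       # instruction (1) writes R1
still_reserved = not counters.can_read(1)    # R1 remains reserved for (3)
counters.write_back(1)                       # instruction (3) writes R1
free_again = counters.can_read(1)
```

With a single reserved bit, the first write-back would have marked R1 free while instruction (3) was still in flight; the counter keeps the reservation until the last outstanding writer completes.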


To summarize, register reservation logic is more complex in an 
RS pipeline than in an RR pipeline. This is because registers must be 
read in two different places rather than one. In our RS implementation, 
there is a set of reservation counters, as well as a set of reservation bits; 
there is only a single set of reservation bits in an RR pipeline. This 
additional RS complexity is in terms of the amount of logic required, 
not in terms of the time required to make control decisions. That is, the 
length of time needed to make an issue decision regarding the availabil- 
ity of registers should be no greater with an RS pipeline. 


4.2. Memory Access 


Because memory accesses are often tied to data operations in an 
RS architecture, it would seem to be more difficult to schedule code to 
hide memory access delays. In fact, the organization of the RS pipeline 
with the memory access phase preceding the execution phase can
actually improve performance. In the RS pipeline, instructions may go
through the memory access phase and obtain memory operands while
previous instructions are waiting for execution. In effect, an RS
instruction is given a head start by beginning the memory access ahead of
execution operations occurring earlier in the program. In a typical RR
implementation this does not occur. 


Because load operations sometimes get a head start, cache misses 
also get a head start, reducing their degradation effect on performance. 
Fig. 3 shows an example of code in which this occurs. In this example, 
assume the second load instruction has a cache miss. The first load is 
issued to memory at the same time in both implementations. In the RR 
implementation the second instruction waits for the load to complete 
while in the RS implementation the second instruction follows the load 
through memory. This causes the second load to wait in the instruction
fetch unit in the RR implementation, whereas in the RS implementation
the second load is issued to the memory unit without any wait.
This gives the cache miss a head start over the same cache miss in the 
RR implementation. 


To counter the above advantage of an RS implementation, there is
a corresponding disadvantage that sometimes occurs at the beginning of
a segment of code. Fig. 4 shows an example in which this occurs. In
the RR implementation, the second instruction in this example starts
execution immediately following the load. But in the RS implementation,
the second instruction follows the load through the memory unit
before starting, causing a delay. Because the three floating point
operations are all dependent, the initial delay in the RS implementation is
retained, and the entire sequence is slower. As we shall see later, there
is a similar effect involving conditional branches. This effect can be
reduced by shortening the length of the memory access phase; this is
most easily done by using an operand cache.


        RS architecture                      RR architecture

start:  R3 <- (R1 + disp1)           start:  R3 <- (R1 + disp1)
        R2 <- R3 +f R4                       R2 <- R3 +f R4
        R5 <- R2 *f (R1 + disp2)             R6 <- (R1 + disp2)
                                             R5 <- R2 *f R6


Fig. 3 Code in which RS memory access/cache miss gets a head start.


        RS architecture                      RR architecture

start:  R5 <- (R1+2(12))             start:  R5 <- (R1+2(12))
        R6 <- R2 *f R3                       R6 <- R2 *f R3
        R7 <- R6 *f R5                       R7 <- R6 *f R5
        R8 <- R7 +f R5                       R8 <- R7 +f R5


Fig. 4 A code segment with lower RS performance.


It should be noted that in the examples given, we have taken a 
small piece of code out of its original context. If R2 or R3 are depen- 
dent on an instruction occurring just before the sequence given in Fig. 4 
begins, then the second instruction may be delayed by the same amount 
in both sequences. 
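

The effect in Fig. 4 can be reproduced with a toy timing model. The sketch below is only an illustration and is not the method used for the measurements in this paper: it assumes in-order issue of one instruction per clock period, functional unit times of 7 clock periods for a floating point multiply and 6 for a floating point add (CRAY-1-like values chosen for the example), no conflicts on result registers, and that every RS instruction spends the full memory access time in the memory access phase. The names `rr_time`, `rs_time`, and `FIG4` are hypothetical.

```python
from typing import Dict, List, Tuple

# (destination register, source registers, kind); kind is "load", "mul", or "add"
Instr = Tuple[str, Tuple[str, ...], str]

# Assumed functional unit times in clock periods (not taken from the paper).
LATENCY = {"mul": 7, "add": 6}

FIG4: List[Instr] = [
    ("R5", ("R1",), "load"),
    ("R6", ("R2", "R3"), "mul"),
    ("R7", ("R6", "R5"), "mul"),
    ("R8", ("R7", "R5"), "add"),
]


def rr_time(prog: List[Instr], mem: int) -> int:
    """Clock period at which the last result is ready in the RR model."""
    ready: Dict[str, int] = {}
    issue = -1
    for dest, srcs, kind in prog:
        # Single issue point: in order, and only when all sources are ready.
        issue = max([issue + 1] + [ready.get(s, 0) for s in srcs])
        ready[dest] = issue + (mem if kind == "load" else LATENCY[kind])
    return max(ready.values())


def rs_time(prog: List[Instr], mem: int) -> int:
    """Clock period at which the last result is ready in the RS model."""
    ready: Dict[str, int] = {}
    addr_issue = -1
    exec_issue = -1
    for dest, srcs, kind in prog:
        # Only loads read (address) registers at the address issue point.
        addr_regs = srcs if kind == "load" else ()
        addr_issue = max([addr_issue + 1] + [ready.get(s, 0) for s in addr_regs])
        arrive = addr_issue + mem  # every instruction transits the memory phase
        if kind == "load":
            exec_issue = max(arrive, exec_issue + 1)
            ready[dest] = exec_issue
        else:
            # Operand registers are read at the execution issue point.
            exec_issue = max([arrive, exec_issue + 1] + [ready.get(s, 0) for s in srcs])
            ready[dest] = exec_issue + LATENCY[kind]
    return max(ready.values())


gap_no_cache = rs_time(FIG4, mem=11) - rr_time(FIG4, mem=11)
gap_with_cache = rs_time(FIG4, mem=3) - rr_time(FIG4, mem=3)
```

Under these assumptions the RS sequence finishes later than the RR sequence for both memory access times, and the gap narrows from 8 to 3 clock periods when the memory access time drops from 11 to 3, consistent with the remark that an operand cache reduces the effect.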


It is our observation from the code we have examined that
overall performance speedups for RS architectures as typified in Fig. 3
are more common than performance losses as in Fig. 4. To summarize,
placing the memory access phase ahead of the execution phase leads to
performance advantages because memory accesses may be started
earlier than they would with an RR pipeline. A disadvantage arises
occasionally when execution must wait longer because of an
unnecessary transit through the memory access phase.


4.3. Stores to Memory 


In an RS pipeline, there are several points at which stores can be 
issued to memory. First, they can be issued at the same point as loads. 
This has the advantage of keeping all memory accesses in program 
order, preventing many hazards involving memory. It also eliminates 
conflicts for a single memory path; the path is required at only one 
point in the pipeline (disregarding the instruction fetch path). A major 
disadvantage is that a store, waiting for the data to be stored, blocks at 
an early point in the pipeline and holds all the following instructions
until the data becomes available. A second disadvantage is that it
makes implementation of precise interrupts more difficult because stores 
may complete before previous instructions in the instruction stream are 
known to be error-free. 


Second, stores can be issued at the same point at which function 
execution begins. With this method, the store instruction can read its 
address registers and "reserve" its result address in the memory access 
phase, then wait for its data at the execution issue point. When the store 
data are available, the instruction must then "steal" a memory access 
cycle to perform the store. The act of "reserving" its result address 
amounts to saving the address in a table so that hazards involving later 
loads from the same address can be resolved. With this second method, 
the primary advantage is that instructions following the store are not 
blocked as early in the pipeline. The disadvantage is that control com- 
plexity is increased. Memory hazards must be resolved with additional 
logic, and the memory access path may be used from two different 
pipeline phases so conflicts may occur. 
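

The table of reserved store addresses used by the second method might look like the following sketch. It is a minimal illustration of the idea only; the class name `PendingStores`, its methods, and the use of plain integers for addresses are assumptions, and a real design would bound the table size and handle partial overlaps.

```python
from typing import List


class PendingStores:
    """Addresses reserved by stores that are still waiting for their data."""

    def __init__(self) -> None:
        self.addresses: List[int] = []   # kept in program order

    def reserve(self, address: int) -> None:
        # Done in the memory access phase, once the store address is known.
        self.addresses.append(address)

    def load_must_wait(self, address: int) -> bool:
        # A later load from a reserved address has a hazard with the store.
        return address in self.addresses

    def complete(self, address: int) -> None:
        # Called when the store "steals" a memory cycle and writes its data;
        # removes the oldest matching reservation.
        self.addresses.remove(address)


table = PendingStores()
table.reserve(0x100)                          # store reserves its result address
blocked = table.load_must_wait(0x100)         # later load from the same address
unrelated = table.load_must_wait(0x200)       # load from another address
table.complete(0x100)                         # store data written to memory
released = not table.load_must_wait(0x100)
```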


Third, stores may be sent to memory following the execution 
phase. Here, registers are read at the same point as in the second 
method. This method further increases the control complexity of
resolving memory hazards by delaying the store to memory. But all
instructions would complete in order, making the implementation of precise
interrupts easier.


4.4. Branches 


Up to this point we have said very little about branch instructions,
which are a very important aspect of pipeline design. Both RR and RS
architectures can be made to handle branches in a similar way. Because
of the variety of ways of performing branches, we have chosen a
specific one for illustrative purposes. Conditional branches all test
register R0. In an RR pipeline, this test occurs at the single issue point,
and in an RS pipeline, at the address issue point. This assumes that
there is hardware in the instruction fetch phase to perform the test of R0
for branches. This eliminates the delay of issuing the instruction to a
functional unit to perform the test.

The primary performance advantages and disadvantages are very
similar to those already noted for the placement of the memory access
phase. The advantage for an RS pipeline occurs when R0 is set early
and an independent instruction, preceding the branch, is blocked due to
data dependencies. In this case, an RR pipeline holds the branch
instruction behind the blocked instruction. In the RS pipeline, the blockage of
the independent instruction may occur at the execution issue point, so
that the branch is not blocked and may be performed earlier.


A potential performance loss occurs when an instruction
modifying R0 is followed closely by a conditional branch testing R0. In this
case, the branch instruction may block waiting for R0 to be written.
The wait can take longer in an RS architecture if the instruction
modifying R0 must pass through the memory access phase before being
executed, as in our implementation model. This can be viewed as part of
the penalty one pays in return for earlier execution of conditional
branches as described above and for early accessing of memory data as
was pointed out in section 4.2.


For the FORTRAN code we have examined, R0 is often set well
in advance of a conditional branch that tests it. This is because the
branch is typically based on a loop counter which can be incremented
early in a loop. Hence, the performance penalty for these kinds of
branches is low. Nevertheless, for other types of code where a test is
made closely before a branch, an RS pipeline could suffer significant
delays for conditional branches.


5. Experimental Results 


Following are the results of a preliminary study we have per- 
formed to test some of the conclusions given above. Because of the lim- 
ited set of benchmarks and the methods used, no general conclusions 
should be based solely on these results. 


Using the above-defined architectures and the implementation
models, the execution times for eight of the Lawrence Livermore Loops
[MCM72] were calculated. These are a standard set of small
FORTRAN kernels. These eight loops were chosen because a hand
compilation method was used and, quite simply, these were the easiest to hand
compile. Compiler output from the Cray FORTRAN Compiler was
used as a guide, and our hand compilation is at the same level of
optimization.

We assumed execution times similar to those in the CRAY-1. For 
models with no operand cache we assume an 11 clock period memory 
access time. For models with operand cache, we assume a 3 clock 
period memory access. We assume the instruction fetch unit contains 
an instruction cache with a 100% hit ratio. The code is scheduled 
separately for each architecture and each memory speed to give 
optimum performance for each case. Given these assumptions, we cal- 
culated the time required for one iteration of each loop. The results of 
the timings are given in Table 1. 


These results indicate that six of the eight loops executed in the
same time for both architectures regardless of the memory speed used.
It would seem that there is little difference between the performance of
the two architectures based on these timings. However, two of the loops
did have different execution times. Loop 2 is a case where the RS
architecture is able to take advantage of issuing fewer instructions to
perform better than the RR architecture. The RS architecture had five
fewer instructions to issue in this loop. Loop 3 also had a different
execution time. In this loop, the RS architecture is faster because it reads
operand registers at a different place than it executes branch
instructions. Since there is no store near the end of this loop, the branch at the
end of the loop is not held up waiting for the final result to be stored.
In the RR implementation the branch must wait for the last data
operation to obtain its operand data before it can be issued. In the RS
implementation the branch is able to execute sooner because the last data
operation reads operand registers at execution issue and does not hold
up the branch instruction.


Table 1 Results of Timing of Models. Execution time is in clock
periods. M is the memory access time in clock periods.


6. Conclusions 
RS architectures have three primary performance advantages.


(1) There are fewer instructions; in some code sequences, the
    limitation of issue rate to one per clock period is important, and in
    these cases, RS architectures can give better performance.


(2) Placing the execution segment of the RS pipeline after the
    memory access segment permits memory load instructions to
    proceed even though an earlier instruction may be blocked at the
    execution units due to a data dependency.


(3) Placing the execution segment of the RS pipeline after the
    conditional branch logic permits some conditional branches to proceed
    even though an earlier instruction may be blocked at the
    execution units due to a data dependency.


Occasionally, the last two advantages are offset by a performance loss
when an execution operation must pass through the memory access
segment of the pipeline before it can begin. Nevertheless, it is our
observation that, at least for scientific code, the performance advantages
outweigh the disadvantages. Our experimental results show a relatively
small difference in performance between the two architectures.
Whenever there is a difference, however, the RS pipeline is faster.


The primary disadvantage of RS pipelines is the additional
implementation complexity. This comes about because registers are read from
at least two different points in the pipeline. This at least doubles the
amount of register interlock logic that is needed. On the other hand, the
control complexity at the individual stages is not significantly increased,
so the clock frequency does not necessarily need to be reduced.
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ABSTRACT -- Ways of coordinating large numbers of 
processors to execute a parallel program as fast as possible 
are of key importance to the design and efficient use of paral- 
lel processor systems. In this paper we discuss issues on pro- 
gram parallelism and scheduling for parallel processor sys- 
tems. We break the scheduling process into three basic activi- 
ties, and focus on processor assignment to parallel loops. 
Optimal processor assignment algorithms are presented for 
simple and complex nested parallel loops. These algorithms 
can be applied at compile-time, or can be implemented as 
hardware modules to solve the processor assignment problem 
optimally at run-time. Speedup measurements for EISPACK 
and IEEE DSP subroutines that result from the optimal
assignment of processors to parallel loops are also presented.
These measurements indicate that optimal assignments result
in almost linear speedups on parallel processor machines with
a few tens of processors, and significantly high speedups for 
machines with hundreds or thousands of processors. 
1. INTRODUCTION 


It is becoming increasingly clear that modern and 
future supercomputers will be based on the shared memory 
parallel processor architecture. The flexibility, scalability and 
high potential performance offered by parallel processor 
machines are simply necessary "ingredients" for any high per- 
formance system. The performance potential for these 
machines is indeed greater than that of single array processor 
computers [7], and they can run a larger family of programs
more efficiently. However, we have little experience in
efficiently using a large number of processors. This is
especially true when maximum (single) program speedup (as
opposed to high throughput) is our objective. This
inexperience in turn is reflected in the small number of processors
used in modern commercially available multiprocessors such
as the CRAY X-MP and Alliant FX/8 systems.


Truly parallel languages, parallel algorithms, and ways
of defining and exploiting program parallelism are still in
their infancy. Only recently have these problems attracted
the appropriate attention. Several factors should be
considered when designing high performance supercomputers [7].
Parallel algorithms, carefully designed parallel architectures
and powerful programming environments including
sophisticated restructuring compilers, all play equally important
roles in program performance. This unified approach has
been adopted in the design of the Cedar multiprocessor at
the Center for Supercomputing Research and Development of
the University of Illinois. In addition, several crucial problems,
such as scheduling, and synchronization and communication
overheads, must be adequately solved in order to take full
advantage of the inherent flexibility of parallel processor
systems.


This work was supported in part by the National Science Foundation


Loops are the largest source of parallelism but the prob- 
lem of using several processors for the fast execution of com- 
plex parallel loops had not been given enough attention until 
recently [10], [1]. A key problem in designing and using 
large parallel processor systems is finding out how to 
schedule independent processors to execute a parallel pro- 
gram as fast as possible. There is more to be said on ways of 
exploiting loop parallelism, both from the system and the 
compiler point of view. From the system’s point of view we 
know little about coordinating large numbers of processors to 
execute multiply nested parallel loops; and from the
compiler’s point of view no significant work has been done so 
far to adequately solve this problem. This paper addresses 
the second part of this problem namely, compile-time 
scheduling of arbitrarily complex parallel loops on parallel 
processors. We discuss the different activities of the problem 
traditionally referred to as scheduling of a parallel program 
on a parallel processor system, and present algorithms that 
generate optimal loop schedules at compile-time. Some exper- 
iments we conducted to illustrate the speedups resulting 
from such schedules are also presented.


The rest of this paper is organized as follows. In Sec- 
tion 2 we present some background for the following discus- 
sion, including basic definitions and the structure of the 
Parafrase compiler used for our experiments. In Section 3 we 
briefly consider the general problem of scheduling parallel 
programs on parallel processor systems. In Section 4 we dis- 
cuss processor assignment issues for parallel loops, and pro- 
pose an optimal processor assignment algorithm based on 
dynamic programming. In Section 5 we present some experi- 
mental results for EISPACK [13], and IEEE DSP subroutines 
[2], and finally Section 6 gives the conclusion of this paper. 

2. BACKGROUND AND BASIC CONCEPTS 


In this paper we consider parallel Fortran programs. By 
parallel, we mean programs that have been written using 
language extensions or programs that have been restructured 
by an optimizing compiler. For our purposes we use output 
generated by the Parafrase restructurer [6], [15]. Parafrase is
a restructuring compiler which receives as input Fortran
programs and applies to them a series of machine independent
and machine dependent transformations. The structure of
Parafrase appears in Figure 1. The first part of the compiler
consists of a set of machine independent transformations
(passes). The second part consists of a series of machine
dependent optimizations that can be applied on a given
program. Depending on the architecture of the machine we
intend to use, we choose the appropriate set of passes to
perform transformations targeted to the underlying
architecture. Currently Parafrase can be used to transform programs
for execution on four types of machines: Single Execution
Scalar or SES (uniprocessor), Single Execution Array or
SEA (array/pipeline), Multiple Execution Scalar or MES
(multiprocessor), and Multiple Execution Array or MEA
(multiprocessor with vector processors) architectures [7].


Figure 1: The structure of PARAFRASE


In a restructured Fortran program we observe several 
types of parallelism and all of them can be potentially util- 
ized by an MES machine. We can roughly classify the 
different types of parallelism into two categories: Fine grain 
parallelism and Coarse grain parallelism. Fine grain paral- 
lelism includes the parallel execution of different statements 
of the program on different processors, or even different 
operations of the same statement on different processors or 
functional units. Coarse grain parallelism arises from the 
parallel execution of independent disjoint modules of the pro- 
gram, or from parallel loops. For the purposes of this paper 
we ignore branching statements (IFs and GOTOs) without
loss of generality. In Parafrase the branches of each such
statement are assigned branching probabilities either by the
user or by the compiler automatically. We can therefore view
a program as consisting of a sequence of blocks of assignment
statements (BASs), with each BAS having a weight associated
with it.


To expose program parallelism, Parafrase builds for 
each program the data dependence graph [5] which indicates 
data and control dependences between statements of the
program [4]. A series of machine independent and machine
dependent optimizations is then applied to the program using
the data dependence graph as a guide. During parallel execution
of a program, data and control dependences must be observed
in order to preserve the semantics of the program. One of the
most vital passes in Parafrase involves the recognition of
several types of parallel loops.


In the transformed programs, loops can be of one of the
following four types: serial, reductions (e.g., vector
sum/product), do all or DOALL (all iterations of the loop
can execute in parallel) [9], and do across or DOACR
(successive iterations can be partially overlapped) [9], [1]. The
restructured programs are to be executed on a multiprocessor
such that DOALLs and DOACRs can be distributed across
many processors. Furthermore, execution of many loops of
different kinds may also be overlapped. Given a Fortran
program that executes on a $P$-processor multiprocessor system,
we define program speedup $S_P$ as follows: $S_P = T_1 / T_P$,
where $T_1$ is the serial execution time of the program and $T_P$
the execution time when it runs on $P$ processors. The
efficiency of the execution of a parallel program that achieves
a speedup of $S_P$ on $P$ processors is then defined as
$E_P = S_P / P$, and obviously $0 < E_P \le 1$.
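The two definitions above can be illustrated with a minimal sketch. The function names and the timings are hypothetical and serve only to show how speedup and efficiency relate.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Program speedup: serial time divided by parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, processors: int) -> float:
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / processors


# Hypothetical timings: 120 time units serially, 20 units on 8 processors.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
print(s, e)
```

With these made-up numbers the speedup is 6 and the efficiency is 0.75, which lies in the interval (0, 1] as required.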

3. PROCESSOR ALLOCATION, 
ASSIGNMENT, AND SCHEDULING 


Our approach to program scheduling is a combination
of compile and run-time schemes. At compile-time we decide
how to partition a given program into independent tasks. A 
task graph called a Task Flow Graph (TFG) is then con- 
structed by the compiler. Each node of the TFG corresponds 
to a program segment, and arcs represent data and control 
dependences. A task node for instance, may correspond to a 
block of assignment statements, or a nested parallel do loop. 
Nodes of the TFG are executed in order so that data and 
control dependences are satisfied. Each node of the task 
graph may execute serially or it may be distributed across 
different processors. At compile-time it is decided whether 
the iterations of a given loop will execute in a pre-scheduled 
or a self-scheduled fashion as explained below. 


The scheduling of a restructured program (TFG) on a 
parallel processor system is performed in three phases. As 
mentioned above the first phase decides whether a loop will 
run in a pre- or self-scheduled fashion. During the second 
phase which we call processor allocation, we decide how to 
distribute the available processors to the nodes of the graph. 
Each node receives a number of processors by the second 
scheduling phase, and it is up to the processor assignment 
(third) phase to decide how to use the allocated processors 
for each particular node. Recall that a single node of the task 
flow graph can be an arbitrarily complex nested parallel loop. 
Processor assignment decides how to assign the allocated 
processors to the parallel loops of each node (if any). 


Allocation is usually done by the run-time system by 
taking a subset of the available processors (real or virtual), 
and devoting them to a running program or to a part of it. 
Allocation requires a decision on the number of processors.
This decision could be made entirely by the run-time system.
However, this is inefficient. To
circumvent this problem, the user program could advise the
system on the number of processors (or range of them) that
it can efficiently use. Processor allocation may be performed
at load-time, and remain constant until execution completes.
On the other hand, processors may be allocated at different
times during execution. The decision on how often allocation
is to be performed depends on the amount of overhead
involved in processor allocation, since frequent allocation is
clearly better if overhead is ignored.


After allocation, a decision has to be made on the 
number of processors to assign to the different program com- 
ponents. For the purposes of this paper we are interested in 
processor assignment to do loops. The assignment is trivial 
if the process consists of a sequence of singly nested parallel 
loops and sequential code. The assignment may be more 
complex if multiply nested parallel loops are present. The 
algorithms of this paper deal precisely with this problem. If 
the loop bounds and the number of processors to be allocated 
at run-time, are not known at compile-time, then the algo- 
rithms described below will have to be executed at run-time 
once these values are known. A hardware device will 
accelerate the computation required for assignment. 


Loops may be executed in a pre-scheduled or self-scheduled
fashion. This decision can be made when the program
is written or at compile-time. Pre-scheduled means
that the order and the iterations to be executed on each
assigned processor are decided at compile-time. The output
from the compiler can be parametrized on the number of
processors assigned to the loop by producing a blocked loop
with a variable blocking factor. Loops can also be executed
in a self-scheduled way. This means that each processor
upon completion of an iteration will continue by executing 
the next iteration not yet executed or being executed. If p 
processors are assigned, the first p iterations are executed 
first, one by each processor. Pre-scheduling has the advan- 
tage that the local memory of a processor (including its 
registers) can be used to transfer information from one itera- 
tion to subsequent ones executed in the same processor, so 
that in certain cases interprocessor communication cost may 
be greatly reduced. Consider for example two disjoint adja- 
cent loops at the same nest level. Both loops will be allocated 
the same number of processors and corresponding iterations 
will execute on the same physical processors. When data are 
communicated from one loop to the other they can be tem- 
porarily stored into a local memory, thus eliminating the 
need for interprocessor communication. On the other hand, 
self-scheduling is useful if the execution time of the different 
iterations varies widely. These factors should be taken into 
consideration by the compiler to decide how to compile a 
given loop. Also, the presence or not of hardware for self- 
scheduling should be taken into account. It should be noted 
that pre-scheduling and self-scheduling can be combined in a 
program, and even in a multiply nested loop. Thus, the outer
loop could be scheduled in one way, and the inner loop in
another.
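The trade-off between the two schemes can be made concrete with a small simulation. This is only a sketch under simplifying assumptions that are not in the text: iteration times are known integers, pre-scheduling uses contiguous blocks, and neither scheduling overhead nor communication cost is modeled. All names are illustrative.

```python
import heapq
from math import ceil


def prescheduled_makespan(times, p):
    """Blocked pre-scheduling: each processor runs one contiguous block."""
    block = ceil(len(times) / p)          # blocking factor fixed in advance
    return max(sum(times[j:j + block]) for j in range(0, len(times), block))


def selfscheduled_makespan(times, p):
    """Self-scheduling: an idle processor takes the next unexecuted iteration."""
    finish = [0] * p                      # time at which each processor is free
    heapq.heapify(finish)
    for t in times:                       # iterations are handed out in loop order
        start = heapq.heappop(finish)     # earliest available processor
        heapq.heappush(finish, start + t)
    return max(finish)


uniform = [4] * 8                         # iterations of equal length
skewed = [1, 1, 1, 1, 1, 1, 9, 9]         # iteration times vary widely

print(prescheduled_makespan(uniform, 4), selfscheduled_makespan(uniform, 4))
print(prescheduled_makespan(skewed, 4), selfscheduled_makespan(skewed, 4))
```

For uniform iterations both schemes finish at the same time, so the locality benefit of pre-scheduling comes for free. For the skewed loop the blocked schedule puts both long iterations on one processor, while self-scheduling spreads them out.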


4. OPTIMAL PROCESSOR ASSIGNMENTS 


It has been shown that in most programs parallel loops 
are the source of the greatest percentage of parallelism [7]. In 
this section we will investigate the problem of processor 
assignment to parallel loops. This problem becomes especially
important when we deal with nested parallel loops
where inefficient assignment algorithms may result in an exe- 
cution time far worse than the optimal. In programs with 
several nested parallel loops the efficiency may then drop 
down to unacceptable levels. We can informally define the 
limited processor assignment problem as follows: Given an 
arbitrary multiply nested loop which contains serial and 
parallel (DOACR, DOALL) loops and a number of P proces- 
sors, find the (optimal) way of assigning the P processors to 
the loops so that the parallel execution time of the entire 
module is minimized. For loops with very few nest levels 
and systems with a small number of processors, exhaustive 
search might be affordable at compile-time. But as the 
number of processors increases, the number of processor-loop 
combinations grows exponentially. Moreover, loops with 
large nest levels are not very uncommon in scientific compu- 
tations. As an example, 10 to 17 nested parallel loops were 
observed in several subroutines of the restructured (by 
Parafrase) IEEE Digital Signal Processing Package. 


4.1. Simple Assignments to DOALLs 


In the following sections we discuss processor assignment
schemes. A metric called the efficiency index is
defined in this section. The usefulness of this metric is twofold.
First it makes it easier to formulate the processor
assignment problem, and secondly it allows us to observe
several interesting properties of the problem that are otherwise
hidden in modular arithmetic.


A processor assignment algorithm (OPTAL) is proposed 
that solves the general problem optimally. The optimal pro- 
cessor assignment is guided by the use of a function called 
the assignment function. The assignment function can be 
easily defined to measure parallel execution time. 


Before we discuss processor assignment issues we need
to introduce some notation and definitions. To simplify the
notation each loop is assumed to be normalized and denoted
by the upper bound of its iteration space. Thus, $N_i$ denotes a
DOALL whose loop body is executed $N_i$ times, and
$L = (N_1, N_2, \ldots, N_m)$ denotes an $m$-level nested
DOALL where loop $N_i$ is surrounded by $N_{i-1}$ and surrounds
$N_{i+1}$, $i = 2, \ldots, m-1$ ($N_1$ and $N_m$ are the outermost and
innermost loops respectively).


In what follows the number of available processors $P$ is
always assumed to be "useful", that is, less than or equal to the
maximum number of processors that a loop $L$ can fully utilize.
As mentioned in Section 3, processor allocation (phase
two) allocates a number of processors to each outermost loop
(task node). Theorems and lemmas given in this paper are
stated without proof [11].

Definition 4.1. For a DOALL with $N_i$ iterations that has
been assigned $p_i$ processors we define $\varepsilon_i^{p_i}$, the efficiency index
or EI of $N_i$, as follows:

$$\varepsilon_i^{p_i} = \frac{N_i / p_i}{\lceil N_i / p_i \rceil} \qquad (1)$$

The efficiency index is an indicator of how efficiently a loop
runs on a given number of processors. The higher the EI the
higher the efficiency (as defined in Section 2). Some other
properties of the efficiency index that will be used directly or
indirectly in the following sections are:
P1: For any DOALL $N_i$ and any number of processors $p$ we
have $0 < \varepsilon_i^p \le 1$.
P2: For any $N_i$, $\varepsilon_i^1 = 1$.
P3: For $N_i > p$, $\varepsilon_i^p > 1/2$.
It should also be noted that $p \ne q$ does not necessarily
imply $\varepsilon_i^p \ne \varepsilon_i^q$.

Definition 4.2: For a nested DOALL $L = (N_1, N_2, \ldots, N_m)$,
a number of $P = p_1 p_2 \cdots p_m$ processors and a particular
assignment of $P$ to $L$ we define the efficiency index vector
$\omega = (\varepsilon_1^{p_1}, \varepsilon_2^{p_2}, \ldots, \varepsilon_m^{p_m})$ of $L$, where $\varepsilon_i^{p_i}$ is the EI for loop
$N_i$ and for $p_i$ processors.
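As a minimal sketch, the efficiency index and the efficiency index vector can be computed with exact rational arithmetic. The function names are illustrative, and the brute-force loop merely spot-checks properties P1 to P3 on a small range rather than proving them.

```python
from fractions import Fraction


def efficiency_index(n: int, p: int) -> Fraction:
    """EI of a DOALL with n iterations on p processors: (n/p) / ceil(n/p)."""
    rounds = -(-n // p)                  # ceil(n / p) in integer arithmetic
    return Fraction(n, p * rounds)


def efficiency_index_vector(loops, procs):
    """EI of every loop under its own share of the processors."""
    return [efficiency_index(n, p) for n, p in zip(loops, procs)]


# Spot-check of properties P1-P3 over a small range of loop sizes.
for n in range(1, 60):
    assert efficiency_index(n, 1) == 1                   # P2
    for p in range(1, 60):
        e = efficiency_index(n, p)
        assert 0 < e <= 1                                # P1
        assert n <= p or e > Fraction(1, 2)              # P3

# Different processor counts may share one EI: 15 iterations on 2 or on 4
# processors both give 15/16.
print(efficiency_index(15, 2), efficiency_index(15, 4))
print(efficiency_index_vector((15, 17, 17, 25), (16, 1, 1, 2)))
```

Using `Fraction` avoids floating-point ties when efficiency indexes are compared later on.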


In what follows the terms "assignment of $P$" and
"decomposition of $P$" are used interchangeably. Any assignment
of $P$ to $L$ defines implicitly a decomposition of $P$ into
factors $P = p_1 p_2 \cdots p_m$, where each of the $m$ different
loops receives $p_i$, $i = 1, 2, \ldots, m$, processors. A processor
assignment profile $(p_1, p_2, \ldots, p_m)$ can also be described by
its efficiency index vector as defined above.


Definition 4.3: For an assignment $\omega = (\varepsilon_1^{p_1}, \ldots, \varepsilon_m^{p_m})$ of
$P = p_1 p_2 \cdots p_m$ processors to $L$, we define $E_L$, the compound
efficiency index (CEI) of $L$ as

$$E_L = \prod_{i=1}^{m} \varepsilon_i^{p_i} \qquad (2)$$

For any $L$, $P$, we also have $0 < E_L \le 1$. Let $T_1$ be
the serial execution time of a perfectly nested DOALL $L$.
Next suppose that $L$ is executed on $P$ processors and let $T_P$
and $T'_P$ denote the parallel execution times for two different
assignments $\omega = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m)$ and $\omega' = (\varepsilon'_1, \varepsilon'_2, \ldots, \varepsilon'_m)$
of $P$ to $L$, where $P = p_1 \cdots p_m = p'_1 \cdots p'_m$. We can
express the parallel execution time $T_P$ of $L$ in terms of its
CEI as follows:

$$T_P = \prod_{i=1}^{m} \lceil N_i / p_i \rceil \, B
      = \frac{\prod_{i=1}^{m} (N_i / p_i)}{\prod_{i=1}^{m} \frac{N_i / p_i}{\lceil N_i / p_i \rceil}} \, B
      = \frac{\prod_{i=1}^{m} N_i}{\prod_{i=1}^{m} p_i} \cdot \frac{1}{E_L} \, B
      = \frac{N B}{E_L P} \qquad (3)$$

where $B$ is the execution time of one iteration of the loop,
and $N = \prod_{i=1}^{m} N_i$. The following is a direct application of
(3).


Lemma 4.1: $T_P < T'_P$ iff $E_L > E'_L$.
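Equation (3) and Lemma 4.1 can be checked numerically. The sketch below assumes a unit loop body (B = 1), uses the loop bounds of Figure 2(b), and compares two hand-picked decompositions of 32 processors; the helper names are illustrative.

```python
from fractions import Fraction
from math import prod


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def parallel_time(loops, procs, body=1):
    """Parallel time of a perfect DOALL nest: body * prod(ceil(N_i / p_i))."""
    return body * prod(ceil_div(n, p) for n, p in zip(loops, procs))


def compound_ei(loops, procs):
    """Compound efficiency index: product of the per-loop efficiency indexes."""
    return prod((Fraction(n, p * ceil_div(n, p)) for n, p in zip(loops, procs)),
                start=Fraction(1))


loops = (15, 17, 17, 25)                      # loop bounds of Figure 2(b)
P = 32
first, second = (16, 1, 1, 2), (8, 2, 1, 2)   # two decompositions of 32

for procs in (first, second):
    # Equation (3) with B = 1: T_P = N / (E_L * P).
    assert parallel_time(loops, procs) == Fraction(prod(loops), P) / compound_ei(loops, procs)

# Lemma 4.1: the assignment with the larger CEI has the smaller time.
print(compound_ei(loops, first), parallel_time(loops, first))
print(compound_ei(loops, second), parallel_time(loops, second))
```

The first decomposition has the larger compound efficiency index and, consistently with the lemma, the shorter parallel time.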


In the next few sections we show how the efficiency 
index can be used to direct the efficient assignment of proces- 
sors to perfectly nested parallel loops. 


Given a nested loop $L$ and a number of processors $P$
we call a simple processor assignment one that assigns all $P$
processors to a single loop $N_i$ of $L$. A complex processor
assignment on the other hand is one that assigns two or
more factors of $P$ to two or more loops of $L$.


Theorem 4.2. The optimal simple processor assignment
over all simple assignments of $P$ processors to loop $L$ is
achieved by assigning $P$ to the loop with $\varepsilon = \max_{1 \le i \le m} \{\varepsilon_i^P\}$.
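Theorem 4.2 translates directly into a one-pass search. The following sketch (function names are illustrative) applies it to the loop bounds shown in Figure 2(a) with 32 processors.

```python
from fractions import Fraction


def efficiency_index(n: int, p: int) -> Fraction:
    """(n/p) / ceil(n/p), computed exactly."""
    return Fraction(n, p * -(-n // p))


def best_simple_assignment(loops, P):
    """Position and EI of the loop that should receive all P processors."""
    eis = [efficiency_index(n, P) for n in loops]
    best = max(range(len(loops)), key=eis.__getitem__)
    return best, eis[best]


# Loop bounds of Figure 2(a); position 0 is the outermost loop.
print(best_simple_assignment((63, 7, 31, 20), 32))
```

The outermost loop (63 iterations) wins with an efficiency index of 63/64, in agreement with the discussion of Figure 2(a).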

For the next lemma and most of what follows we assume 
that processors are assigned in units that are equal to pro- 
ducts of the prime factors of P unless stated otherwise. 
Therefore each loop is assigned a divisor of P including one. 


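Lemma 4.3. If $N$ is a (single) DOALL loop,
$P = p_1 p_2 \cdots p_m$ is the number of available processors
and $\varepsilon_i$, $i = 1, 2, \ldots, m$, are the efficiency indexes for assigning
$p_1$, $p_1 p_2$, $p_1 p_2 p_3$, ..., $p_1 p_2 \cdots p_m$ processors respectively, then

$$\varepsilon_1 \ge \varepsilon_2 \ge \varepsilon_3 \ge \cdots \ge \varepsilon_m.$$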


From Lemma 4.1 we conclude that the optimal processor
assignment of $P$ to $L$ is the one that maximizes $E_L$. Each
assignment defines indirectly a decomposition of $P$ into a
number of factors less than or equal to the number of loops
in $L$. As $P$ grows the number of different decompositions of
$P$ into factors grows very rapidly. From number theory we
know that each integer is uniquely represented as a product
of prime factors. Theorem 4.4 below can be used to prune
(eliminate from consideration) several decompositions of $P$,
or equivalently, several assignment profiles of $P$ to $L$ that
are not close to optimal. From several hand generated tests
we observed that the use of Theorem 4.4 in a branch and
bound algorithm for determining the optimal assignment of
processors eliminated more than ninety percent of all possible
assignments. In some instances all but the optimal
assignments were pruned by the test of Theorem 4.4.


Again let $L = (N_1, \ldots, N_m)$ be a perfectly nested
DOALL that executes on $P$ processors and
$P = p_1 p_2 \cdots p_k$ be any decomposition of $P$ where
$k \le m$. Now let $\varepsilon = \max_{1 \le i \le m} \{\varepsilon_i^P\}$ be the maximum
efficiency index over all simple assignments of $P$ to $L$, and
$\hat{\varepsilon}_i = \max_{1 \le j \le m} \{\varepsilon_j^{p_i}\}$, $i = 1, 2, \ldots, k$, be the maximum
efficiency indexes (over all loops of $L$) for the factors
$p_1, p_2, \ldots, p_k$ of $P$ respectively (where
$\varepsilon_j^{p_i} = (N_j / p_i) / \lceil N_j / p_i \rceil$). Note that here we do not
perform any actual assignment of processors to loops but
simply compute the maximum efficiency index for each factor
$p_i$ of $P$ over all loops of $L$ excluding the loop that
corresponds to $\varepsilon$. If $T_s$ and $T_c$ are the parallel execution
times for $L$ corresponding to the optimal simple assignment
of $P$ and the optimal complex assignment of the specific
factors of $P$ respectively, and $S_s$ and $S_c$ their respective
speedups, we have the following theorem.


Theorem 4.4. If there exists $i \in \{1, 2, \ldots, k\}$ for which
$\varepsilon \ge \hat{\varepsilon}_i$ then $T_s \le T_c$ and thus $S_s \ge S_c$. In other words
if one of the factors of $P$ has a maximum efficiency index
equal to or less than the maximum efficiency index of $P$, then
we gain more speedup by assigning the entire $P$ to a single
loop than from any complex assignment of the factors of $P$
(including the optimal).


Thus, given any decomposition of $P$ into factors
$P = p_1 \cdots p_k$, a necessary (but not sufficient) condition for
a complex assignment to be better than the best simple
assignment is $\varepsilon < \hat{\varepsilon}_i$ for all $i = 1, 2, \ldots, k$ (where $\hat{\varepsilon}_i$ is the
maximum efficiency index for factor $p_i$ over all loops of $L$).
Obviously if $\varepsilon = 1$ then the optimal simple assignment is
the overall optimal as well. An example of the application of
Theorem 4.4 is shown in Figure 2(a), where the simple
assignment of 32 processors to the outermost loop is also the
optimal one.
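The pruning test of Theorem 4.4 can be sketched as follows. This is one reading of the statement in the text: for every factor the best efficiency index is taken over all loops except the one that achieves the best simple assignment. The function names are illustrative, and the loop bounds are those of Figure 2(a).

```python
from fractions import Fraction
from math import prod


def efficiency_index(n: int, p: int) -> Fraction:
    """(n/p) / ceil(n/p), computed exactly."""
    return Fraction(n, p * -(-n // p))


def simple_assignment_wins(loops, factors) -> bool:
    """True when the test of Theorem 4.4 rules out this decomposition of P."""
    P = prod(factors)
    simple = [efficiency_index(n, P) for n in loops]
    best = max(range(len(loops)), key=simple.__getitem__)
    eps = simple[best]                    # EI of the best simple assignment
    others = [n for i, n in enumerate(loops) if i != best]
    # Best EI of each factor over the remaining loops, compared against eps.
    return any(max(efficiency_index(n, f) for n in others) <= eps
               for f in factors)


loops = (63, 7, 31, 20)                   # loop bounds of Figure 2(a)
pruned = [simple_assignment_wins(loops, f) for f in [(2, 16), (4, 8), (2, 4, 4)]]
print(pruned)
```

The decompositions 2 x 16 and 4 x 8 are eliminated immediately. The decomposition 2 x 4 x 4 survives this particular test, which illustrates that the condition is necessary but not sufficient.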


Corollary 4.5. If $N = \prod_{i=1}^{m} N_i$, $\varepsilon$ is the efficiency index
of the optimal simple assignment, $\varepsilon_i$, $i = 1, 2, \ldots, m$, is the
efficiency index for the $i$-th loop in a complex optimal assignment
and $E_L$ the corresponding compound efficiency index,
then any optimal complex assignment should satisfy

$$\varepsilon \le \varepsilon_i \le 1, \quad i = 1, 2, \ldots, m, \quad \text{and} \quad \varepsilon \le E_L \le 1.$$

Let $E_0 = \dfrac{N / P}{\lceil N / P \rceil}$, where $N = \prod_{i=1}^{m} N_i$.
Then any optimal assignment of $P$ to $L$ satisfies

$$E_L \le E_0 \qquad (4)$$

where $E_L$ is the CEI of an optimal assignment. Only in special
cases would there be an optimal assignment of $P$ to $L$
for which the equality in (4) holds. A compiler transformation
called "loop-coalescing" can be applied to certain types
of loops and always achieves $E_L = E_0$ [11].


Corollary 4.5 can be used to check whether a given
complex assignment is better than an optimal simple
assignment. It would be useful however to be able to answer the
question about the existence of such an assignment. That is,
given a loop $L$ and a number of $P$ processors, is there an
optimal complex assignment better than the optimal simple
assignment? If for a particular loop the answer is negative,
the optimal simple assignment is chosen and therefore the
problem for that loop is solved in constant time, assuming
the efficiency indexes have been computed. Proposition 4.6
below provides the test for the existence of an optimal
complex assignment.


For each loop $N_i \in L$ we define the critical capacity $g_i$
of $N_i$ as the maximum number of processors that can be
assigned to $N_i$ with its efficiency index remaining strictly
greater than $\varepsilon$ (the maximum efficiency index of $P$). In other
words, for each $N_i$, $g_i$ is chosen to satisfy

$$\varepsilon_i^{g_i} > \varepsilon \quad \text{and} \quad \varepsilon_i^{g_i + r} \le \varepsilon$$

for any $r \ge 1$. Then we have the following proposition.


Proposition 4.6. A necessary condition for the existence of
a complex assignment of $P$ to $L$ which is better than the
corresponding optimal simple assignment is

$$\prod_{i=1}^{m} g_i \ge P.$$
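A sketch of the existence test of Proposition 4.6 follows. It rests on one assumption that should be stated clearly: the critical capacity is searched only among the divisors of P, in line with the earlier convention that each loop is assigned a divisor of P. The function names are illustrative.

```python
from fractions import Fraction
from math import prod


def efficiency_index(n: int, p: int) -> Fraction:
    """(n/p) / ceil(n/p), computed exactly."""
    return Fraction(n, p * -(-n // p))


def critical_capacity(n: int, P: int, eps: Fraction) -> int:
    """Largest divisor g of P whose EI on a loop of n iterations exceeds eps.

    Returns 0 when no divisor qualifies (this happens only if eps == 1).
    """
    return max((d for d in range(1, P + 1)
                if P % d == 0 and efficiency_index(n, d) > eps), default=0)


def complex_assignment_possible(loops, P: int) -> bool:
    """Necessary condition of Proposition 4.6: product of capacities >= P."""
    eps = max(efficiency_index(n, P) for n in loops)   # best simple assignment
    return prod(critical_capacity(n, P, eps) for n in loops) >= P


print(complex_assignment_possible((63, 7, 31, 20), 32))    # Figure 2(a)
print(complex_assignment_possible((15, 17, 17, 25), 32))   # Figure 2(b)
```

Under this reading the loop of Figure 2(a) fails the test (the capacities are 1, 1, 1 and 4), so the simple assignment can be chosen at once, while the loop of Figure 2(b) passes it and a complex assignment has to be searched for.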
The obvious approach for optimally solving the processor 
assignment problem is exhaustive search. For small nested 
loops and a very small number of processors exhaustive 
search would probably be tolerable at compile-time. For 
medium size loops and a few tens of processors however, the 
cost of exhaustive search becomes intolerable even at 
compile-time. For example, the number of different assignments
of 50 processors to 15 nested loops is $4.8 \times 10^{13}$. If it
takes 1000ns (on a fast machine) to process each different 
assignment it would take more than 555 days CPU time to 
find the optimal assignment of 50 processors to 15 loops. 
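The order of magnitude of this estimate can be reproduced with a short calculation. The counting model is an assumption, since the text does not say how the assignments are counted: the sketch counts the ways of distributing 50 indistinguishable processors over 15 loops (a stars-and-bars count).

```python
from math import comb

# Assumed counting model: distributions of 50 identical processors over
# 15 loops, i.e. C(50 + 15 - 1, 15 - 1).
assignments = comb(50 + 15 - 1, 15 - 1)

seconds = assignments * 1000e-9        # 1000 ns per candidate assignment
days = seconds / 86_400

print(f"{assignments:.2e} assignments, about {days:.0f} days")
```

The exact count is close to 4.8 x 10^13, and at one microsecond per candidate the search takes on the order of 550 days. Rounding the count to 4.8 x 10^13 first gives the slightly larger figure of about 555 days.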


Using the results of this section however, we can design 
a branch and bound algorithm that greatly reduces the 
number of candidate optimal assignments. In several cases 
the tests of this section can prune all possible assignments 
but the optimal and in practice such a branch and bound 
algorithm would have polynomial complexity for most cases. 
The problem remains unsolved though since we can never 
guarantee polynomial complexity and we can always come 
up with an example loop which can make even the branch 
and bound algorithm run in exponential time. 


In the next section we present an optimal processor 
assignment algorithm that has a low polynomial complexity 
and finds the optimal assignment for all types of loops and 
any number of processors. 


4.2. Optimal Complex Assignments 


In order to better illustrate the ideas of this section we
start by considering perfectly nested DOALLs and $P = 2^k$
processors. As we proceed the concepts are generalized to
include more complex loop structures such as nonperfectly
nested combinations of serial, DOALL, and DOACR loops.


Let us consider an $m$-level nested DOALL
$L = (N_1, N_2, \ldots, N_m)$ and a number of $P = 2^k$ processors.


(a)
      DOALL 1 I1=1,63
      DOALL 2 I2=1,7
      DOALL 3 I3=1,31
      DOALL 4 I4=1,20
    4 CONTINUE
    3 CONTINUE
    2 CONTINUE
    1 CONTINUE

(b)
      DOALL 1 I1=1,15
      DOALL 2 I2=1,17
      DOALL 3 I3=1,17
      DOALL 4 I4=1,25
    4 CONTINUE
    3 CONTINUE
    2 CONTINUE
    1 CONTINUE


Figure 2. (a) An application of Theorem 4.4.
(b) The nested loop of Example 4.1.1.


Algorithm OPTAL which is analytically described below will
give us the optimal assignment profile of the $P$ processors to
the $m$ loops of $L$. For each loop $L$ we compute the
efficiency table $M$. Each column $j$ of $M$ corresponds to a
loop $N_j$ of $L$ and each row $i$ corresponds to a number of
$2^i$, $i = 0, 1, \ldots, k$, processors. An entry $(i, j)$ of table $M$
contains the efficiency index for assigning $2^i$ processors to loop
$N_j$. This $(m \times k)$ efficiency table will be used repeatedly by
OPTAL to obtain the optimal assignment of $P$ processors to
loop $L$.
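The efficiency table can be built in a few lines. In this sketch (names are illustrative) the table also keeps the trivial row for a single processor, so it has k + 1 rows; the loop bounds are those of Figure 2(b).

```python
from fractions import Fraction


def efficiency_index(n: int, p: int) -> Fraction:
    """(n/p) / ceil(n/p), computed exactly."""
    return Fraction(n, p * -(-n // p))


def efficiency_table(loops, k):
    """Row i holds the EIs for 2**i processors; column j belongs to loop j."""
    return [[efficiency_index(n, 2 ** i) for n in loops] for i in range(k + 1)]


M = efficiency_table((15, 17, 17, 25), 5)
for i, row in enumerate(M):
    print(2 ** i, [str(e) for e in row])

# Each column is nonincreasing from top to bottom for this example.
columns_sorted = all(M[i][j] >= M[i + 1][j]
                     for i in range(len(M) - 1) for j in range(len(M[0])))
```

Storing the entries as exact fractions keeps later comparisons between products of efficiency indexes free of rounding effects.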


From Lemma 4.3 we observe that each column of $M$ is
ordered in nonincreasing order. If the loops are ordered by
size then each row of $M$ is also ordered in nonincreasing
order. Therefore if $\varepsilon_{ij}$ is the element of $M$ in the $i$-th row
and $j$-th column, then

$$\varepsilon_{ij} \ge \varepsilon_{kw} \quad \text{for } k \ge i \text{ and } w \ge j.$$


It is clear that in any assignment of $P$ to $L$ there can be at
most one entry of the lower half of $M$ involved in that
assignment. Let us give an outline of the basic steps of the
algorithm. We start by assigning the $P$ processors to the
innermost or outermost loop, and let us always start from
the innermost in our case. The second step finds the optimal
assignment of $P$ to the two innermost loops. In the process
we also need to compute the optimal assignment of
$1, 2, 2^2, \ldots, 2^k$ processors respectively to the two innermost
loops. These assignments however are computed only
once for each loop and stored for later use by successive
steps.


In general, after the $(m - i)$-th step OPTAL has
found the optimal assignment of $1, 2, 2^2, \ldots, P$ processors to
loops $L_i = (N_i, N_{i+1}, \ldots, N_m)$. The next $(m - i + 1)$-th
step considers loop $N_{i-1}$ and finds the optimal assignment of
$1, 2, 2^2, \ldots, P$ processors to loop $(N_{i-1}, L_i)$ possibly by
reassigning processors from $L_i$ to $N_{i-1}$. All possible
assignments for $N_{i-1}$ are considered. Note that all possible
assignments for $L_i$ have already been computed. At the end of the
$m$-th step OPTAL outputs the profile of the optimal
assignment of $P = 2^k$ to loop $L = (N_1, N_2, \ldots, N_m)$.
Based on Lemma 4.1 the optimal assignment of $P$ to $L$
would be the one that maximizes $E_L$. This is precisely what
OPTAL does.


4.2.1. The Perfectly-Nested Loop Case 


In this section we describe processor assignment for
perfectly nested DOALLs and $P = 2^k$. We use this case as
an example of the application of the general algorithm
described in the next section. The simplicity of this case
helps illustrate the algorithm clearly. It is followed by a
simple example that describes the details of computing the
optimal assignment. The heart of OPTAL is a recursive
function $G$ that is defined as follows: Given $P = 2^k$ and $L$
an $m$-way nested loop as previously, we define $G_i(q)$ as the
product of efficiency indexes of the optimal assignment of $q$
processors to loops $(N_i, N_{i+1}, \ldots, N_m)$. More specifically a
closed form expression of function $G_i(q)$ is given by

$$G_i(q) = \max_{1 \le p_j \le q} \left\{ \prod_{j=i}^{m} \varepsilon_j^{p_j} \right\}$$

such that $q = \prod_{j=i}^{m} p_j \le P$. The recursive definition
of $G_i(q)$ that we will be using from now on is given by (5):

$$G_i(P) = \max_{0 \le r \le k} \left\{ \varepsilon_i^{2^r} \, G_{i+1}(P / 2^r) \right\} \qquad (5)$$

or

$$G_i(P) = \max \left\{ G_{i+1}(P),\ \varepsilon_i^{2} G_{i+1}(P/2),\ \varepsilon_i^{4} G_{i+1}(P/4),\ \varepsilon_i^{8} G_{i+1}(P/8),\ \ldots,\ \varepsilon_i^{P} G_{i+1}(1) \right\}$$

where $\varepsilon_i^q$ is the efficiency index for assigning $q$ processors to
loop $N_i$ (available from table $M$). The optimal assignment of
$P$ processors to loops $(N_i, N_{i+1}, \ldots, N_m)$ can be found by
selecting from all assignments of $2^r$ processors to $N_i$
and $2^{k-r}$ processors to $(N_{i+1}, \ldots, N_m)$, $r = 0, 1, \ldots, k$, the
one that maximizes $\prod_{j=i}^{m} \varepsilon_j$ (from (5)).


The function in (5) is computed for $i = m, m-1, \ldots, 1$
and for each $i$ we also compute
$G_i(1), G_i(2), G_i(2^2), \ldots, G_i(P = 2^k)$. The optimal assignment
of the $P$ processors to loop $L$ will be given at the end of the
$m$-th step by $G_1(2^k)$. Initially (first step) for $i = m$ we have
$G_m(q) = \varepsilon_m^q$. For each $G_i(q)$ the corresponding processor
assignment profile is stored and when $G_1(2^k)$ is computed
the profile for the optimal assignment is available.
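The recursion (5) can be sketched as a small dynamic program. This is an illustrative implementation rather than the authors' code: it assumes a perfectly nested DOALL given outermost loop first, a processor count that is a power of two, and exact rational arithmetic; the name `optal` and the returned profile layout are choices made for this sketch.

```python
from fractions import Fraction


def efficiency_index(n: int, p: int) -> Fraction:
    """(n/p) / ceil(n/p), computed exactly."""
    return Fraction(n, p * -(-n // p))


def optal(loops, k):
    """Return (G_1(2**k), profile) for a perfect DOALL nest, outermost first."""
    powers = [2 ** s for s in range(k + 1)]
    # First step: all q processors go to the innermost loop.
    best = {q: (efficiency_index(loops[-1], q), (q,)) for q in powers}
    # Remaining steps, moving outwards: give r processors to the current
    # loop and reuse the stored optimum of q / r processors for the rest.
    for n in reversed(loops[:-1]):
        step = {}
        for q in powers:
            candidates = (
                (efficiency_index(n, r) * best[q // r][0], (r,) + best[q // r][1])
                for r in powers if r <= q
            )
            step[q] = max(candidates, key=lambda c: c[0])
        best = step
    return best[2 ** k]


value, profile = optal((15, 17, 17, 25), 5)     # loop of Figure 2(b), P = 32
print(value, profile)
```

For the loop of Figure 2(b) the sketch returns a compound efficiency index of 375/416 with 16 processors on the outermost loop and 2 on the innermost one. The intermediate dictionary also holds the optima for every smaller power of two, which is what makes the later steps cheap.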


The algorithm completes in $m$ steps. In each of the $m$
steps, $k = \log P$ function evaluations are performed and each
of the $r = 1, 2, \ldots, k$ function evaluations involves the
computation of the maximum of $r$ values. The overall complexity of
the algorithm is therefore $O(m \log^2 P)$. Using the results of
the previous sections we can easily avoid unnecessary
computations and further reduce the complexity of OPTAL.


The explicit processor assignment vector (with the
exact number of processors assigned to each loop) is
computed as a side effect of the computation of $G_i$. When a
particular $G_i$ is chosen as optimal the corresponding assignment
vector can be trivially reconstructed. In order to illustrate
the computational details of the algorithm we give below a
simple example involving four DOALLs and $2^5$ processors. It
should be noted that this approach not only finds the
optimal assignment of the given $P$ processors to a particular
loop nest, but it also finds the optimal assignments of
$P/2, P/4, P/8, P/16, \ldots, 1$ processors to the same loop. We
can therefore determine the minimum number of useful
processors with little extra cost.

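Example 4.1.1: Consider the loop
$L = (N_1 = 15, N_2 = 17, N_3 = 17, N_4 = 25)$ of Figure
2(b) and let $P = 2^5$ be the available processors. The optimal
assignment of $P$ to $L$ is computed as follows: First the $5 \times 4$
efficiency matrix $M$ is computed. At the first step for $i = 4$
we have $G_4(2^r) = \varepsilon_4^{2^r}$ for $r = 0, 1, \ldots, 5$. The computations
for the remaining three steps are shown analytically below.


Step 2
$G_3(2) = \max\{G_4(2),\ \varepsilon_3^{2} G_4(1)\}$
$G_3(4) = \max\{G_4(4),\ \varepsilon_3^{2} G_4(2),\ \varepsilon_3^{4} G_4(1)\}$
$G_3(8) = \max\{G_4(8),\ \varepsilon_3^{2} G_4(4),\ \varepsilon_3^{4} G_4(2),\ \varepsilon_3^{8} G_4(1)\}$
$G_3(16) = \max\{G_4(16),\ \varepsilon_3^{2} G_4(8),\ \varepsilon_3^{4} G_4(4),\ \varepsilon_3^{8} G_4(2),\ \varepsilon_3^{16} G_4(1)\}$
$G_3(32) = \max\{G_4(32),\ \varepsilon_3^{2} G_4(16),\ \varepsilon_3^{4} G_4(8),\ \varepsilon_3^{8} G_4(4),\ \varepsilon_3^{16} G_4(2),\ \varepsilon_3^{32} G_4(1)\}$

Step 3
$G_2(2) = \max\{G_3(2),\ \varepsilon_2^{2} G_3(1)\}$
$G_2(4) = \max\{G_3(4),\ \varepsilon_2^{2} G_3(2),\ \varepsilon_2^{4} G_3(1)\}$
$G_2(8) = \max\{G_3(8),\ \varepsilon_2^{2} G_3(4),\ \varepsilon_2^{4} G_3(2),\ \varepsilon_2^{8} G_3(1)\}$
$G_2(16) = \max\{G_3(16),\ \varepsilon_2^{2} G_3(8),\ \varepsilon_2^{4} G_3(4),\ \varepsilon_2^{8} G_3(2),\ \varepsilon_2^{16} G_3(1)\}$
$G_2(32) = \max\{G_3(32),\ \varepsilon_2^{2} G_3(16),\ \varepsilon_2^{4} G_3(8),\ \varepsilon_2^{8} G_3(4),\ \varepsilon_2^{16} G_3(2),\ \varepsilon_2^{32} G_3(1)\}$

Step 4
$G_1(2) = \max\{G_2(2),\ \varepsilon_1^{2} G_2(1)\}$
$G_1(4) = \max\{G_2(4),\ \varepsilon_1^{2} G_2(2),\ \varepsilon_1^{4} G_2(1)\}$
$G_1(8) = \max\{G_2(8),\ \varepsilon_1^{2} G_2(4),\ \varepsilon_1^{4} G_2(2),\ \varepsilon_1^{8} G_2(1)\}$
$G_1(16) = \max\{G_2(16),\ \varepsilon_1^{2} G_2(8),\ \varepsilon_1^{4} G_2(4),\ \varepsilon_1^{8} G_2(2),\ \varepsilon_1^{16} G_2(1)\}$
$G_1(32) = \max\{G_2(32),\ \varepsilon_1^{2} G_2(16),\ \varepsilon_1^{4} G_2(8),\ \varepsilon_1^{8} G_2(4),\ \varepsilon_1^{16} G_2(2),\ \varepsilon_1^{32} G_2(1)\}$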


In each case the maximum element appears in bold letters. 
The optimal assignment in this example is therefore the one
that assigns 16 processors to loop $N_1 = 15$ and 2 processors
to loop $N_4 = 25$. The processor assignment profile is
reconstructed as follows. The number of processors assigned to a
loop $N_i$ is equal to the superscript of the $\varepsilon$ factor of the
maximum term in $G_i(P)$. If an $\varepsilon$ factor does not exist, the
corresponding loop receives 1 processor. First we look at the
maximum element of $G_1(32)$. This element is $\varepsilon_1^{16} G_2(2)$, which
tells us that loop $N_1$ receives 16 processors, and the remaining
processors are allocated to $G_2(2)$. The maximum element
of entry $G_2(2)$ is $G_3(2)$, which indicates that loop $N_2$ receives
1 processor. Continuing in the same way, the maximum
element of entry $G_3(2)$ is $G_4(2)$, which again tells us that loop
$N_3$ is assigned 1 processor, and therefore loop $N_4$ is assigned
the remaining 2 processors.


4.2.2. The General Algorithm

Although most real multiprocessor systems have
$P = 2^k$ for some integer $k$, OPTAL can be used to generate
optimal processor assignments for any integer $P$. It also
handles arbitrarily complex nested parallel loops. Before we
describe the details of the general algorithm however, we
need to define the concepts of DOACR and loop nesting
more precisely. The details about the DOACR [9], [1]
construct are beyond the scope of this paper and we only define
terms used in later discussion.


Figure 3. A nested loop and its tree representation. Squares and leaves denote BASs.


As mentioned in Section 2, a DOACR loop can be
informally defined as a parallel loop in which data dependences
allow for partial overlap of successive iterations during
execution on an MES system. In other words, if iteration $i$
starts at time $t$ on a given processor, iteration $(i+1)$ can
start at time $t + d$, where $d$ is a constant. Constant $d$ is
called delay and represents the execution time of a subset of
loop statements whose data dependence graph forms a cycle.
If $B$ is the size (serial execution time) of the loop body, then
$d/B$ is defined as the percentage of overlap (or doacross
percentage). When $d = B$ the loop is serial while if $d = 0$ the
loop is said to be DOALL. In DOALL loops all iterations
may start execution simultaneously. DOALL and serial
loops are therefore special cases of DOACR loops. The
parallel execution time of a DOACR loop with $N_i$ iterations,
$d_i$ delay and a body size of $B_i$ that executes on $P$ processors
is given by the following [12].

$$T_P(B_i) = \left( \left\lceil \frac{N_i}{P} \right\rceil - 1 \right) \max\{B_i,\ P d_i\} + d_i \, ((N_i - 1) \bmod P) + B_i \qquad (6)$$


To simplify the notation in the following discussion, we
assume that a block of assignment statements (BAS) can be
considered as a DOACR loop with $N_i = 1$ and $d_i = 0$.
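Equation (6) is easy to exercise on its special cases. The sketch below is a direct transcription of the formula as reconstructed here; the function name and the sample numbers are illustrative.

```python
def doacr_time(n: int, body, delay, p: int):
    """Equation (6): parallel time of a DOACR loop on p processors.

    n     -- number of iterations
    body  -- serial execution time B of one iteration
    delay -- delay d between the starts of successive iterations
    """
    rounds = -(-n // p)                              # ceil(n / p)
    return (rounds - 1) * max(body, p * delay) + delay * ((n - 1) % p) + body


print(doacr_time(10, 3, 3, 4))   # serial loop (d = B): n * B
print(doacr_time(10, 3, 0, 4))   # DOALL (d = 0): ceil(n / p) * B
print(doacr_time(1, 7, 0, 5))    # a BAS: one iteration, no delay
print(doacr_time(10, 4, 1, 4))   # partial overlap between iterations
```

The first three calls reproduce the limiting cases named in the text: a serial loop gains nothing from extra processors, a DOALL runs in rounds of the body time, and a BAS always costs exactly its own execution time.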


An arbitrarily complex nested loop can be uniquely
represented as a $k$-level tree where $k$ is the maximum nest
depth. The leaves of the tree correspond to BASs and
intermediate nodes correspond to (DOACR) loops. Obviously the
total number of nodes in a loop tree is $\lambda + \mu$ where $\lambda$ is the
number of individual loops in the structure and $\mu$ the number
of BASs. An example of a nested loop and its tree
representation are shown in Figure 3. Intermediate tree nodes at
level $m$ correspond to loops at nest depth $m$. We assume
that individual loops in an arbitrarily nested loop are
numbered increasingly, in lexicographic order.


In the general case loops are not perfectly nested and 
therefore the efficiency index as defined in Section 4 is not 
useful. We can redefine the efficiency index for the general 
case but it is more convenient to define the assignment func- 
tion so that it measures directly parallel execution time. 
Therefore the max term of the assignment function becomes 
min in this case since our objective here is to minimize exe- 
cution time and thus maximize speedup. 


The steps of the general algorithm are almost identical
to the case of perfectly nested loops. The example of Figure
3 is used whenever it helps illustrate the computations
involved. A $\lambda \times P$ table can be used to store intermediate
values ($\lambda$, $P$ are the numbers of loops and processors
respectively). During the first step we compute the parallel
execution time of the DOACR loops at level $k$ on the tree,
where $k$ again is the maximum nest level. This is done as
follows:

$$G_i^k(q) = T_q(B_i), \qquad (7)$$

for $q = 1, 2, \ldots, P$ and for all leaves $i$,
where $T_q$ is given by (6). The general step is defined
recursively as in the perfectly nested loop case. The optimal
assignment of $P$ processors to loops in levels $i$ through $k$
($i < k$), (assuming the optimal assignment of $P$ to loops at
level $i+1$ is known), is then generated by:

$$G_j^i(q) = \min_{1 \le r \le q} \left\{ T_r\!\left( \sum_{n \text{ child of } j} G_n^{i+1}(\lfloor q / r \rfloor) \right) \right\} \qquad (8)$$

for $q = 1, 2, 3, \ldots, P$,
where (8) is computed for all nodes (loops) $j$ at level $i$, and
$T_r(\cdot)$ is given by (6). The summation in (8) accounts for all
nodes at level $i + 1$ that are descendants of node $j$, that is,
all loops nested inside loop $N_j$. The optimal assignment of
$P$ processors to a given loop is given by $G_1^1(P)$. Recall from
the example of the previous section that the detailed
processor assignment vector is automatically constructed during
the evaluation of (8). For each loop the number of processors
assigned to that loop corresponds to the minimum term in (8).


It should be noted that all optimal assignments of
1, 2, ..., P-1 processors to a loop L are computed as intermediate
results of the computation of $G_1(P)$. We therefore
have the following.


Lemma 4.7. The maximum number of useful processors
given P for a loop L is the minimum Q such that
$1 \le Q \le P$ and $G_1(Q) = G_1(P)$.

Theorem 4.8. For any loop L of maximum nest depth k, 
and any integer P, OPTAL terminates after k iterations and 
generates the optimal assignment of P processors to L. 
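
Lemma 4.7 and the speedup ratio G(1)/G(P) can be illustrated with a short sketch. It assumes that the values G(1), ..., G(P) for one loop are already available as a list. The numbers are made up for illustration, the function names are hypothetical, and this is not the OPTAL pass itself.

```python
def max_useful_processors(g):
    """g[q-1] holds G(q), the optimal parallel time with q processors.

    Returns the smallest Q with G(Q) == G(P), where P = len(g).
    """
    p = len(g)
    for q in range(1, p + 1):
        if g[q - 1] == g[p - 1]:
            return q
    return p  # not reached: q == p always satisfies the test


def speedup(g):
    """Speedup S_P = G(1) / G(P)."""
    return g[0] / g[-1]


# Hypothetical execution times for P = 6 processors.
G = [120.0, 60.0, 40.0, 30.0, 30.0, 30.0]
print(max_useful_processors(G))  # 4
print(speedup(G))                # 4.0
```

In this made-up table the fifth and sixth processors do not shorten the execution time, so only four processors are useful.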


The complexity of the algorithm can be easily determined.
The assignment function $G$ is computed P times for
each node (loop) in the tree, or a total of $\lambda P$ times. Each
evaluation of the assignment function also involves finding
the minimum of an average of P/2 terms. The complexity
therefore (without counting additions) is $O(\lambda P^2 / 2)$. The
complexity can be reduced to $O(\lambda P \log P)$, and OPTAL can
be used to implement a systolic array control unit that consists
of $P \log P$ nodes and determines the optimal assignment
of P processors to a given loop in $\lambda$ steps [11]. The
speedup resulting from the optimal assignment of P processors
to a loop L is given by $S_P = G_1(1) / G_1(P)$.


An interesting point of this approach is that although
loops at the same nest level are allocated the same total
number of processors, each loop manages (assigns) its own
processors in an independent way. For example, suppose that
loops 3 and 6 of Figure 3 are allocated 8 processors each. A
possible assignment then may assign 1 processor to loops 3
and 4, and 8 processors to loop 5, while in the second case
we may have 2 processors assigned to loop 6, and 4 processors
to each of the loops 7 and 8. It is clear that loops on the
same nest level must be assigned the same total number of
processors when executing on a parallel processor system.
Otherwise we have suboptimal parallel execution times since
some processors will be forced to remain idle.
5. EXPERIMENTS 


We implemented this processor assignment algorithm as 
a pass in the Parafrase compiler. Processor assignment is 
performed after DOALL and DOACR loops are recognized 
and delays computed. In our experiments we measured 
speedup values for some subroutines of the EISPACK and
IEEE DSP packages; analysis of the entire packages
is under way.


Speedup values were computed as discussed in Section 
2. In our case Tp, the parallel execution time, was measured 
for P=32, P=256, and P=2048 processors, and for loop 
bounds set to 40. In some EISPACK subroutines where loop 
bounds correspond to the bandwidth of band-matrices, we 
used loop bounds of 1 or 4. The speedup values measured for 
the three different numbers of processors are shown in Tables
1 and 2. The subroutines from the two packages used in
these experiments were randomly selected. 


From the speedup values we observe that for 32 proces- 
sors the average speedup is almost linear for both EISPACK 
and IEEE subroutines. For 256 processors the average 
speedup for EISPACK subroutines is about 137, or more 
than P/2. In other words, we have an efficiency of more than 
50% for P =256. For the IEEE subroutines we observe an 
even higher average efficiency for the same number of proces- 
sors. The third column in the tables corresponds to an 
unlimited number of processors. Since most of the
EISPACK subroutines deal with square matrices, for 40×40
arrays the maximum expected speedup is 1600. Taking
into account several loops with bounds of 1 or 4 and the
number of one-dimensional loops, the average maximum
speedup should be expected to be considerably lower than
1600. The average speedup of the third column of Table 1 is
about 310, which corresponds to an average efficiency of
about 15%. Since at most 1600 processors would be useful
for most of the EISPACK routines, in reality we would have
an efficiency of about 20%. The corresponding values for the
third column of Table 2 are considerably higher than those of
EISPACK. Generally, supercomputers deliver a wide range
of performances from program to program. This is true of
real machines [16], and has been observed in our earlier
experimental work [7]. It appears, from the experiments we


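[Table 1, EISPACK subroutines (speedup values not legible): ELMBAK,
ELMHES, ELTRAN, HQR2, TRED1, MINFIT, TRED2, CBABK2]

[Table 2, IEEE DSP subroutines (speedup values not legible): WFTA,
TRBIZE, PCORP, POWER, COSYFP, FREDIC, FLPWL, DIINIT, SRINIT, SMINVD,
DEFIN4, FFT, LOAD, COVARI, CLHARM, FLCHAR, REMEZ, D, LPTRN]

Tables 1 and 2: Speedup values for 32, 256, and 2048 processors
for EISPACK (Table 1) and IEEE DSP subroutines (Table 2).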


have conducted so far, that when OPTAL is used there is
very little variation when programs are run with a limited
number of processors (i.e., when the number of processors is
proportional to array sizes).
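
The efficiency figures quoted above follow from dividing the average speedup by the number of processors. The small check below uses only the averages stated in the text; the function name is an illustrative choice.

```python
def efficiency_percent(speedup, processors):
    """Efficiency is speedup divided by the processor count, as a percentage."""
    return 100.0 * speedup / processors


print(round(efficiency_percent(137, 256), 1))   # 53.5  (more than 50% for P = 256)
print(round(efficiency_percent(310, 2048), 1))  # 15.1  (about 15% for P = 2048)
print(round(efficiency_percent(310, 1600), 1))  # 19.4  (about 20% with 1600 useful processors)
```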


Considering the fact that efficiencies in the range of
20% are considered very satisfactory on modern supercomputers,
we can claim that optimal assignments to parallel
loops result in high speedups for most cases. Processor
allocations to independent code segments can increase the
average speedup at least by a factor of two [14], [11].

6. CONCLUSION 


The problem of coordinating several parallel processors 
to execute a parallel program as fast as possible is a key 
problem to the efficient use of large parallel processor 
machines, and the design of multiprocessors with hundreds 
or even thousands of processors. It is very important to be 
able to identify parallelism in a program and use all the 
available processors to maximize speedup. This problem how- 
ever is complex and should not be shifted entirely to the 
user. Compilers, operating systems, and special hardware 
modules can be used to solve the problem satisfactorily with 
little or no user assistance. 




In this paper we discussed issues of program parallelism
and scheduling for parallel processor systems. We divided the
scheduling process into three basic activities and focused on
processor assignment to parallel loops. An algorithm was
presented that gives an optimal solution at compile-time for
processor assignments to complex parallel loops. This algorithm
can also be implemented as a hardware device to perform
processor assignment at run-time. We also presented
some speedup measurements for EISPACK and IEEE subroutines
that result from the optimal assignment of processors
to parallel loops. These measurements indicate that optimal
processor assignments result in almost linear speedups on
medium scale parallel processor machines.
Abstract


Processor self-scheduling is a useful scheme in a mul- 
tiprocessor system if the execution time of each iteration in a 
parallel loop is not known in advance and varies substantially, 
or if there are multiple nestings in parallel loops which makes 
static scheduling difficult and inefficient. By using efficient 
synchronization primitives, the operating system is not needed 
for loop scheduling. The overhead for the processor self- 
scheduling is small. 


We present a processor self-scheduling scheme for a
single-nested parallel loop, and extend the scheme to
multiple-nested parallel loops. Barrier synchronization
mechanisms in the processor self-scheduling schemes are also
discussed.


1. Introduction 


Processor scheduling is a problem that must be solved in 
order to run a program on a multiprocessor system efficiently. 
The multiprocessor system considered here is a collection of
identical processors with a global shared memory. Cray X-MP 
[1], Cedar [2], Ultracomputer [3] and RP3 [4] are some exam- 
ples of such systems. 


Parallel loops in a program, whose iterations can be executed
concurrently on different processors, provide the greatest
potential parallelism to be exploited by multiprocessor systems.
A parallel loop is called a DOALL-loop if its iterations
are independent. If there are data dependences across
iterations of a DO-loop, its iterations can still be executed
concurrently on different processors provided that the data
dependences are enforced by synchronization across the processors
during the execution [8]. This kind of parallel loop is
called a DOACROSS-loop [11]. Both DOALL-loops and
DOACROSS-loops can be nested in many levels. In this paper,
we only consider DOALL-loops.


DOALL-loops can either be recognized by an optimizing
compiler like Parafrase [10] or explicitly specified by a
programmer. Processors need to be scheduled properly so that
the execution time of a DOALL-loop is minimized. A scheduling
task here is one or several iterations of a DOALL-loop.
We will only consider non-preemptive scheduling schemes, i.e.,
once a processor is assigned an iteration, it will continue to
execute the iteration until its completion.


If the execution time of each iteration of a DOALL-loop is
the same, the optimal scheduling is to assign iterations evenly
among processors. This scheduling can be done before
program execution and is called static scheduling.

If the execution time of each iteration is different and is not
known until program execution, the scheduling of iterations to
processors is more efficient if it can be done dynamically during
the program execution, i.e., by assigning a new iteration to a
processor whenever it becomes available. The total number of
iterations assigned to each processor may not be equal, but the
workload of each processor tends to be balanced. This scheduling
is called dynamic scheduling. In this paper, we only consider
dynamic scheduling.


Dynamic scheduling will incur scheduling overhead at run 
time. One technique to reduce overhead is to schedule several 
iterations (called a chunk of iterations) to a processor at a 
time. As long as the number of chunks is large enough, work- 
load among processors can still be balanced. 


Scheduling overhead can be very large if dynamic
scheduling is done by system calls to the operating system.
One way to reduce scheduling overhead is to use processor
self-scheduling [5] [6] [7]. Rather than issuing a system call to
the operating system for scheduling, processors can schedule
themselves by fetch-and-adding a shared variable to get loop
indices of a chunk of iterations. If a multiprocessor system
has efficient hardware-implemented synchronization primitives
like those in Cedar [8], Ultracomputer [9] and RP3 [4],
scheduling overhead for a chunk of iterations can be reduced
quite significantly.
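
The idea of self-scheduling by fetch-and-adding a shared loop index can be sketched in software. The sketch below is only an illustration under stated assumptions: Python threads stand in for processors, a lock emulates the hardware fetch-and-add, and the names `SharedCounter` and `self_schedule` are hypothetical.

```python
import threading


class SharedCounter:
    """Shared loop index with an atomic fetch-and-add (emulated by a lock)."""

    def __init__(self, start=1):
        self._value = start
        self._lock = threading.Lock()

    def fetch_and_add(self, amount):
        with self._lock:
            old = self._value
            self._value += amount
            return old


def self_schedule(num_iterations, num_workers, chunk, body):
    """Run body(i) for i = 1..num_iterations; workers grab chunks themselves."""
    index = SharedCounter(1)

    def worker():
        while True:
            first = index.fetch_and_add(chunk)
            if first > num_iterations:
                return  # no iterations left
            last = min(first + chunk - 1, num_iterations)
            for i in range(first, last + 1):
                body(i)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


done = []
done_lock = threading.Lock()


def record(i):
    with done_lock:
        done.append(i)


self_schedule(num_iterations=10, num_workers=3, chunk=4, body=record)
print(sorted(done))  # every iteration 1..10 appears exactly once
```

Each worker pays one access to the shared counter per chunk rather than per iteration, which is the overhead reduction that chunking is meant to provide.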


However, so far, all processor self-scheduling schemes
only deal with the outermost parallel DO-loop [5] [6] [7]. The
rest of the parallel loops nested inside are treated as serial
DO-loops. Hence, parallelism of the nested DOALL-loops is
not fully exploited.


In this paper, we present a self-scheduling scheme for
nested DOALL-loops using Cedar synchronization instructions.
Cedar synchronization instructions [8] are briefly introduced in
section 2. In section 3, we describe the barrier synchronization
mechanisms needed in processor self-scheduling and processor
self-scheduling schemes for single-nested DOALL-loops. In
section 4, we present a processor self-scheduling scheme for
multiple-nested DOALL-loops. An estimation of overhead for
scheduling an iteration and the performance of the proposed
processor self-scheduling schemes are given in section 5. In
section 6, we have some concluding remarks.

2. Cedar Synchronization Primitives


In a Cedar Multiprocessor System, a variable can be
declared as a synchronization variable. A synchronization
variable x has two fields: KEY and DATA. The KEY field is
for storing synchronization information (which is an integer)
and the DATA field is for storing the value of the variable
(which can be a floating point number). The format of a
Cedar synchronization instruction is as follows:


    {x; test on KEY;
        operation on KEY; operation on DATA}.


Here x is the name (or the address) of the synchronization
variable. The "test on x.KEY" specifies the condition to be
tested between the KEY field of x (denoted by x.KEY) and a
key provided by the instruction (denoted by i.KEY). The test
includes >, ≥, <, ≤, =, ≠ and NULL. The NULL test means
that no test is needed and therefore the result of the test is
always true. The operation on x.KEY can be Increment,
Decrement, Add, Fetch, Fetch & Add, Store, Fetch & Increment,
Fetch & Decrement and No Action. The operation on
DATA can be Fetch, Store and No Action. The execution of
the whole synchronization instruction, i.e. test on x.KEY,
operations on x.KEY and x.DATA, is done in globally shared
memory modules and is indivisible [8]. The operation on
x.KEY and the operation on x.DATA are executed only when
the result of the test is true. The memory module will inform
the processor of a "failure" if the condition of the test is not
satisfied, or a "success" if the condition is satisfied and execution
of the instruction is completed. For some applications, the
test on x.KEY has to be done repeatedly until the test condition
becomes true. A star on the test condition (as shown
below) is used to indicate this situation. In other words,


    {x; (test on KEY)*;
        operation on KEY; operation on DATA}

is equivalent to

    1: {x; test on KEY;
           operation on KEY; operation on DATA}
       if failure then goto 1


Cedar synchronization primitives are very effective in
handling low level synchronizations required in numerical
computations like enforcing data dependences across loop
iterations [8]. In those applications, the KEY field stores loop iteration
numbers and the DATA field usually stores the value of the data
(which is usually a floating point number). In this paper, we
use these synchronization primitives mostly for non-numerical
scheduling problems. Hence, only the KEY field is used.


3. Barrier Synchronization


Following are several assumptions used in our processor
self-scheduling schemes:
(1) We assume that the operating system assigns a certain
    number of processors, say P processors, to a program
    before its execution. After that, the operating system will
    not be involved in scheduling.
(2) Code for processor self-scheduling is embedded in a user's
    program, and each processor will execute the same program.
(3) For the simplicity of the discussion, we assume that a
    task in our scheduling is only one iteration. A task which
    has several iterations can be similarly implemented.


The iterations of a DOALL-loop are scheduled through a
shared variable, and after all of the iterations are scheduled, a
barrier synchronization is needed for the completion of the
DOALL-loop. In this section, we present two barrier synchronization
mechanisms and processor self-scheduling
schemes for a single-nested DOALL-loop.
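
The test-then-operate semantics of a Cedar synchronization instruction, as described in Section 2, can be modeled in a few lines. This is a software sketch only and not the Cedar hardware interface: the class `SyncVar` and its method names are hypothetical, only the KEY field is modeled, and a lock stands in for the indivisible execution in the memory module.

```python
import threading


class SyncVar:
    """Software model of a synchronization variable (KEY field only)."""

    def __init__(self, key=0):
        self.key = key
        self._lock = threading.Lock()  # models the indivisible memory-module operation

    def execute(self, test, key_op):
        """One instruction {x; test on KEY; operation on KEY}.

        Returns (success, fetched). The KEY operation runs only if the
        test holds; `fetched` is the KEY value seen before the operation.
        """
        with self._lock:
            if not test(self.key):
                return False, None
            fetched = self.key
            self.key = key_op(self.key)
            return True, fetched

    def execute_starred(self, test, key_op):
        """The starred form: repeat the instruction until the test succeeds."""
        while True:
            ok, fetched = self.execute(test, key_op)
            if ok:
                return fetched


x = SyncVar(key=2)
print(x.execute(lambda k: k > 1, lambda k: k - 1))          # (True, 2), i.e. {x; >1; Decrement}
print(x.execute(lambda k: k > 1, lambda k: k - 1))          # (False, None): failure, KEY stays 1
print(x.execute_starred(lambda k: k > 0, lambda k: k - 1))  # 1, i.e. {x; (>0)*; Decrement}
print(x.key)                                                # 0
```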


In the first barrier synchronization mechanism, processors 
will be blocked at the barrier until all of the processors com- 
plete their tasks and arrive at the barrier. After that, the bar- 
rier will open and all of the processors can pass through. 
There is a counter which counts the number of arriving pro- 
cessors, and it must be reinitialized before the barrier can be 
reused for the second time. Algorithm 3.1 is a processor self- 
scheduling scheme for a DOALL-loop using this barrier syn- 
chronization mechanism. 


Algorithm 3.1


/* J: loop index for the DOALL-loop with M iterations
   A: a counter to count the number of processors
      that have arrived at the barrier
   B: a variable acting as a barrier */

/* Initially J=1, A=P and B=0, where P is the
   total number of the processors assigned to
   execute the program. */

L1: {J; ≤M; Fetch(LOCJ)&Increment}
    if failure then goto L2
      .
      .   /* Original DO loop-body with index LOCJ */
      .
    goto L1
    /* Processors go back to L1
       to get another iteration. */

L2: {A; >1; Decrement}
    if failure then
    begin
      J:=1; A:=P; B:=P
    end
    {B; (>0)*; Decrement}
    /* B acts as a barrier and is reinitialized
       after P processors pass through. */


If a DOALL-loop is enclosed in an outer serial loop as
shown in Figure 3.1, the implementation of the outer serial
loop is quite simple. Since all of the processors are synchronized
at the barrier of the second DOALL-loop $J_2$, each processor
can have a local copy of loop index variable LOCI and
update it after the execution of the DOALL-loop $J_2$. The
processor self-scheduling code for the whole program is illustrated
in Figure 3.2.


However, a barrier synchronization based on the number
of arriving processors has a problem. If the actual number of
processors assigned to the program is less than that specified
in the program (i.e. P processors), the barrier of $J_1$ will never
open and all of the processors will be stuck there. In other
words, the operating system must assign exactly the same
number of processors as specified in the program; otherwise,
we will have a deadlock.


However, the precedence relation only requires that all of
the iterations of the first DOALL-loop be completed before the
second DOALL-loop can be started. In the following, a barrier
is controlled by the number of completed iterations of the
DOALL-loop instead of by the number of arriving processors.
Counter A is initialized to M, the total number of iterations,
instead of P. When all of the iterations are completed, the
barrier will be opened. Thus, this barrier synchronization
mechanism will work even if the operating system assigns
fewer than P processors to the program. The processor
self-scheduling scheme using this barrier synchronization
mechanism is shown in Algorithm 3.2.


Algorithm 3.2


/* C: a lock to block processors after P processors
      have entered the DOALL-loop
   J: loop index for the DOALL-loop with M iterations
   A: a counter to count the number of completed iterations
   B: a variable acting as a barrier
   T: a counter to detect the last processor,
      which is responsible for reinitializing
      all of the synchronization variables */

/* Initially C=P, J=1, A=M, B=0 and T=P.
   P is the number of processors assigned
   to the program. */

    {C; (>0)*; Decrement}
    /* This is to block processors after P processors
       have entered the DOALL-loop. */
L1: {J; ≤M; Fetch(LOCJ)&Increment}
    if failure then goto L2
      .
      .   /* Original DO loop-body with index LOCJ */
      .
    {A; >1; Decrement}
    if failure then B:=1
    /* Counter A counts the number of completed
       iterations and controls the opening of
       the barrier. */
    goto L1
    /* Processors go back to L1 to get
       another iteration. */
L2: {B; (>0)*; No Action}
    /* B acts as a barrier. */
    {T; >1; Decrement}
    if failure then
    /* This is to reinitialize all of the
       synchronization variables. */
    begin
      T:=P; B:=0; A:=M;
      J:=1; C:=P
    end


Note that all of the synchronization variables in the algorithm
will be reinitialized after P processors have passed the
variable T. If there are fewer than P processors assigned to the
program, some processors must come back and make up the
discrepancy, and the synchronization variables will be
reinitialized eventually.
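
A single pass through Algorithm 3.2 can be simulated to check the behavior described above. The sketch below is an illustration under assumptions: Python threads play the role of processors, a lock-protected integer models a KEY-only synchronization variable, the test on J is taken to be "≤ M", and all Python names are hypothetical.

```python
import threading
import time


class SyncVar:
    """KEY-only synchronization variable; the lock models indivisibility."""

    def __init__(self, key):
        self.key = key
        self._lock = threading.Lock()

    def try_op(self, test, op):
        with self._lock:
            if not test(self.key):
                return False, None
            old = self.key
            self.key = op(old)
            return True, old

    def spin_op(self, test, op):
        while True:
            ok, old = self.try_op(test, op)
            if ok:
                return old
            time.sleep(0)  # yield so that other threads can make progress


def run_doall(P, M, workers, body):
    """One pass of Algorithm 3.2 executed by `workers` threads (workers <= P)."""
    C, J, A, B, T = SyncVar(P), SyncVar(1), SyncVar(M), SyncVar(0), SyncVar(P)

    def dec(k):
        return k - 1

    def processor():
        C.spin_op(lambda k: k > 0, dec)                # entry lock
        while True:
            ok, locj = J.try_op(lambda k: k <= M, lambda k: k + 1)
            if not ok:
                break                                  # goto L2
            body(locj)
            more, _ = A.try_op(lambda k: k > 1, dec)
            if not more:
                B.key = 1                              # last completed iteration opens the barrier
        B.spin_op(lambda k: k > 0, lambda k: k)        # L2: barrier, No Action
        ok, _ = T.try_op(lambda k: k > 1, dec)
        if not ok:                                     # P-th processor reinitializes
            T.key, B.key, A.key, J.key, C.key = P, 0, M, 1, P

    threads = [threading.Thread(target=processor) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C.key, J.key, A.key, B.key, T.key


executed = []
state = run_doall(P=4, M=10, workers=4, body=executed.append)
print(sorted(executed) == list(range(1, 11)))  # True
print(state)                                   # (4, 1, 10, 0, 4): all variables reinitialized
```

Calling `run_doall(P=4, M=10, workers=2, ...)` also terminates, because the barrier opens on completed iterations rather than on arriving processors; in that case the variables are left un-reinitialized after a single pass.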


Lock C at the beginning of the code is used to prevent
the race problem caused by the delay in reinitializing the barrier.
Note that there is a delay between the time when all of
the P processors pass the barrier and the time it is reinitialized.
During this period of time, it is possible that some processors
may revisit this DOALL-loop for the next iteration
of the outer serial loop. If there is no such lock to block
those fast processors, they will pass the barrier for the second
time and start executing the following DOALL-loop before this
DOALL-loop is restarted.


The disadvantage of the barrier synchronization mechanism
based on the number of completed iterations is that the
implementation of the outer serial DO-loop is more complicated.
Because processors are no longer synchronized after
each iteration of the outer serial DO-loop, the number of
times each processor will traverse the outer serial DO-loop
may be different. A shared index variable for the outer serial
DO-loop is thus needed. The branch node needed to implement
the outer serial DO-loop is given in Algorithm 3.3. Figure
3.3 illustrates the processor self-scheduling code for the
entire program in Figure 3.1 using this barrier synchronization
mechanism.


Algorithm 3.3


/* C: a lock to block processors after P processors
      have entered the branch node
   S: a semaphore to guarantee that only one
      processor can control the outer serial loop
   B: a variable acting as a barrier
   T: a counter to detect the last processor,
      which is responsible for reinitializing
      all of the synchronization variables */

/* Initially C=P, S=1, B=0 and T=P, where P is
   the number of processors assigned to the program. */

    {C; (>0)*; Decrement}
    /* This is to block processors after P processors
       have entered the branch node. */
L1: {S; >0; Decrement}
    if failure then goto L2
    /* This is to allow only one processor
       to execute the following code. */
    if I<N then
    begin
      COND:= true; I:=I+1
    end
    else
    begin
      COND:= false; I:= 1
    end
    /* Update the shared index variable I
       and resolve the branch condition,
       which is stored in COND */
    B:=1
    /* Open the barrier B */
L2: {B; (>0)*; No Action}
    MYCOND:=COND
    /* Save the branch condition */
    {T; >1; Decrement}
    if failure then
    begin
      T:=P; B:=0; S:=1; C:=P
    end
    /* This is to reinitialize all of the
       synchronization variables. */
    if MYCOND=true then goto START
    /* START is the head of the serial DO-loop. */


4. Processor Self-Scheduling for Nested DOALL-loops


In this section, we extend the processor self-scheduling 
schemes to nested DOALL-loops. Before going further, we 
have to define a few terms which will be used later. 


If each inner DOALL loop is surrounded immediately by
an outer DOALL loop with no scalar code between them, this
nest of DOALL loops is a perfectly-nested
DOALL structure.


Otherwise, a nested DOALL structure is called a
non-perfectly-nested DOALL structure. Figure 4.1 shows an
example of a non-perfectly-nested DOALL structure. There, we
use a left bracket "[" to denote each DOALL-loop nesting. In a
non-perfectly-nested DOALL structure, each innermost
DOALL loop nesting will contain a loop body. The loop-body
with all of its surrounding DOALL-loop nestings forms a
nested DOALL component. For example, the
non-perfectly-nested DOALL structure in Figure 4.1 has three
nested DOALL components. A perfectly-nested DOALL structure
contains only one nested DOALL component.


We only discuss the self-scheduling scheme for a
non-perfectly-nested structure. A perfectly-nested DOALL structure
is a special case of a non-perfectly-nested DOALL structure.


Assume that a non-perfectly-nested DOALL structure
like the one in Figure 4.1 consists of m nested DOALL components.
Each nested DOALL component can be characterized
by two vectors: an index variable vector and a loop-bound
vector. The index variable vector of a nested DOALL component,
$(I_1, I_2, \ldots, I_d)$, is a vector which consists of the index
variables of the loop nestings. Here, $I_1$ is the index variable of
the outermost nesting and d is the depth of the loop nestings.
The loop-bound vector $(N_1, N_2, \ldots, N_d)$ is a vector of the
corresponding loop-bounds. For example, the index variable
vectors for the three nested DOALL components in Figure 4.1
are $(I_1, I_2)$, $(I_1, I_3, I_4)$ and $(I_1, I_3, I_5)$, respectively, and their
corresponding loop-bound vectors are (3,7), (3,2,3) and (3,2,4).

For a nested DOALL component with index variable vector
$(I_1, I_2, \ldots, I_d)$ and loop-bound vector $(N_1, N_2, \ldots, N_d)$, there
are a total of $N_1 N_2 \cdots N_d$ iterations for the component. Each of
them can be identified by a sequence number from 1 to
$N_1 N_2 \cdots N_d$. If we split the d loop nestings at level k
($1 \le k \le d$) into $(I_1, \ldots, I_k)$ and $(I_{k+1}, \ldots, I_d)$, we can notice that
there are $N_{k+1} \cdots N_d$ iterations of the component that share
common loop indices of the k outermost nestings $(I_1, \ldots, I_k)$. An
iteration with $I_1 = i_1$, $I_2 = i_2$, ..., $I_k = i_k$ is called an iteration
at level k. The sequence number for the iteration is defined
to be

$$t = (i_1 - 1) N_2 \cdots N_k + \cdots + (i_{k-1} - 1) N_k + i_k .$$

There are $N_1 \cdots N_k$ iterations at level k, each of which can
be identified by a sequence number from 1 to $N_1 \cdots N_k$.
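
The sequence number of an iteration at level k can be evaluated with Horner's rule. The sketch below is an illustration; the function name is hypothetical, and the loop bounds (3,2,4) are those of the third component in Figure 4.1.

```python
def sequence_number(indices, bounds):
    """Sequence number t of the level-k iteration (i_1, ..., i_k).

    `indices` = (i_1, ..., i_k) and `bounds` = (N_1, ..., N_k), all 1-based.
    Computes t = (i_1-1) N_2...N_k + ... + (i_{k-1}-1) N_k + i_k.
    """
    t = 0
    for i, n in zip(indices, bounds):
        t = t * n + (i - 1)
    return t + 1


print(sequence_number((1, 1, 1), (3, 2, 4)))  # 1
print(sequence_number((2, 1, 3), (3, 2, 4)))  # 11
print(sequence_number((3, 2, 4), (3, 2, 4)))  # 24
```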
Let $(I_1, I_2, \ldots, I_{d_j})$ and $(N_1, N_2, \ldots, N_{d_j})$ be the index variable
vector and the loop-bound vector of the j-th component. It
has two parameters $k_j$ and $l_j$ ($1 \le j \le m$, $1 \le k_j \le d_j$, $1 \le l_j \le d_j$),
defined as follows: if its deepest common loop nesting with the
(j-1)-th component is at level (p-1), we will have $k_j = p$.
Similarly, if the deepest common loop nesting it shares with
the (j+1)-th component is at level (q-1), we have $l_j = q$.


From the definition, we have $l_j = k_{j+1}$, $1 \le j < m$. We also
define that $k_1 = 1$ and $l_m = 1$. For the non-perfectly-nested
DOALL structure in Figure 4.1, we have $k_1=1$, $l_1=2$, $k_2=2$,
$l_2=3$, $k_3=3$, $l_3=1$.
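
The parameters k_j and l_j can be derived mechanically from the index variable vectors of consecutive components. The sketch below is illustrative only: the function names are hypothetical, and the index variable names are assumed labels chosen to give the nesting shape implied by the k and l values quoted for Figure 4.1.

```python
def common_depth(u, v):
    """Number of leading loop index variables shared by two components."""
    depth = 0
    for a, b in zip(u, v):
        if a != b:
            break
        depth += 1
    return depth


def k_and_l(components):
    """Return lists (k, l) with k[j-1] = k_j and l[j-1] = l_j."""
    m = len(components)
    k = [1] * m  # k_1 = 1 by definition
    l = [1] * m  # l_m = 1 by definition
    for j in range(1, m):
        shared = common_depth(components[j - 1], components[j])
        l[j - 1] = shared + 1  # l_j = q when the common nesting is at level q-1
        k[j] = shared + 1      # k_{j+1} = l_j
    return k, l


# Assumed index variable vectors with the nesting shape of Figure 4.1.
components = [("I1", "I2"), ("I1", "I3", "I4"), ("I1", "I3", "I5")]
print(k_and_l(components))  # ([1, 2, 3], [2, 3, 1])
```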


Assume that the deepest common loop nesting for the
(j-1)-th and the j-th components is at level p-1, i.e. $l_{j-1} =
k_j = p$ (see Figure 4.2). According to the semantics of nested
DOALL-loops, an iteration at level p-1 for the j-th component
cannot be started until the corresponding iteration for the
(j-1)-th component is completed. We use a variable $B^{j-1}$ to
enforce this precedence constraint. When $B^{j-1} = t$, the first t
iterations at level p-1 for the (j-1)-th component are completed,
and, hence, we can start to schedule the j-th component
for those iterations.


The components of a DOALL loop structure will be
scheduled in sequence, and the iterations of each component
will be scheduled in the order of their sequence numbers.
Thus, the processor self-scheduling schemes proposed here for
nested DOALL-loops can also be applied to nested
DOACROSS-loops or a mixture of nested DOALL-loops and
DOACROSS-loops without causing deadlocks.


Processor self-scheduling for the j-th nested DOALL component
is realized by fetch-and-incrementing an index control
variable $I^j$. The initial value of $I^j$ is 1, which
corresponds to the first index vector (1,1,...,1). The maximum
value of $I^j$ is $M_j = N_1 N_2 \cdots N_{d_j}$, which corresponds to the
last index vector $(N_1, N_2, \ldots, N_{d_j})$. Let $F_{N_1, N_2, \ldots, N_{d_j}}$ be a function
that maps the sequence number of an iteration into its
index vector. Table 4.1 shows an example of such a function.


Function $F_{N_1, N_2, \ldots, N_{d_j}}$ can be formulated as follows:

$$F_{N_1, N_2, \ldots, N_{d_j}}(x) = (i_1, i_2, \ldots, i_{d_j}),$$

where for $1 \le k \le d_j$, we have

$$i_k = \left\lfloor \frac{x-1}{N_{k+1} \cdots N_{d_j}} \right\rfloor \bmod N_k + 1.$$

$\lfloor t \rfloor$ is the largest integer not exceeding t. This computation
can be done locally in a processor after it fetches a sequence
number from $I^j$. However, as mentioned before, if the
sequence number of an iteration at level p-1 for the j-th component
is larger than $B^{j-1}$, it cannot be started. Given a
sequence number x of an iteration, the sequence number of the
iteration at level p-1 is

$$y = \left\lfloor \frac{x-1}{N_p \cdots N_{d_j}} \right\rfloor + 1.$$

So, before starting to execute an iteration of the component, a
processor needs to check if $B^{j-1} \ge y$. If $B^{j-1} < y$, a processor
has to wait until $B^{j-1}$ becomes larger.
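
The mapping from a sequence number to its index vector, and to the sequence number of the enclosing iteration at level p-1, can be written directly from the two formulas above. The sketch below is illustrative: the function names are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from math import prod


def index_vector(x, bounds):
    """Map sequence number x (1-based) to the index vector (i_1, ..., i_d)."""
    d = len(bounds)
    return tuple(
        # bounds[k + 1:] holds N_{k+1}, ..., N_d for the 1-based level k + 1;
        # the product of an empty slice is 1.
        (x - 1) // prod(bounds[k + 1:]) % bounds[k] + 1
        for k in range(d)
    )


def level_sequence_number(x, bounds, p):
    """Sequence number y of the enclosing iteration at level p-1."""
    return (x - 1) // prod(bounds[p - 1:]) + 1


bounds = (3, 2, 4)  # loop-bound vector of the third component in Figure 4.1
print(index_vector(1, bounds))               # (1, 1, 1)
print(index_vector(11, bounds))              # (2, 1, 3)
print(index_vector(24, bounds))              # (3, 2, 4)
print(level_sequence_number(11, bounds, 3))  # 3
```

A processor would compare the value returned by `level_sequence_number` with the barrier variable of the preceding component before starting the iteration.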


The processor self-scheduling code for the non-perfectly- 
nested DOALL structure is given in Algorithm 4.1. 


Algorithm 4.1


     {C; (>0)*; Decrement}

/* The code for the j-th component starts here.
   Initially I^j = 1, B^j = 0,
   A^j(i_1, ..., i_{l_j - 1}) = N_{l_j} ... N_{d_j}
   (1 ≤ i_1 ≤ N_1, 1 ≤ i_2 ≤ N_2, ..., 1 ≤ i_{l_j - 1} ≤ N_{l_j - 1}) */

L^j: {I^j; ≤ M_j; Fetch(x)&Increment}
     /* M_j = N_1 N_2 ... N_{d_j} */
     if failure then goto L^{j+1}
     /* L^{j+1} is the beginning of the next component */
     y := ⌊(x-1)/(N_{k_j} ... N_{d_j})⌋ + 1
     {B^{j-1}; (≥ y)*; No action}
     for k := 1 to d_j do
       LOCI_k := ⌊(x-1)/(N_{k+1} ... N_{d_j})⌋ mod N_k + 1
     /* Compute the index vector from the sequence
        number x using function F */
       .
       .   /* original loop-body with index vector
       .      (LOCI_1, LOCI_2, ..., LOCI_{d_j}) */
     {A^j(LOCI_1, LOCI_2, ..., LOCI_{l_j - 1}); >1; Decrement}
     if failure then
     begin
       A^j(LOCI_1, ..., LOCI_{l_j - 1}) := N_{l_j} ... N_{d_j}
       z := ⌊(x-1)/(N_{l_j} ... N_{d_j})⌋ + 1
       {B^j; (= z-1)*; Increment}
     end
     goto L^j
/* The j-th component ends here. */

L^{m+1}: {B^m; (=1)*; No action}
     /* B^m acts as a barrier. */
     {T; >1; Decrement}
     if failure then
     begin

Algorithm 4.1 consists of three portions. The first portion is just a
single statement

    {C; (>0)*; Decrement}

which functions as a lock at the beginning of the code to
prevent the race problem. Lock C is initialized to P, the number
of processors assigned to the program. The second portion is
the main self-scheduling code for each nested DOALL component.
The code is the same for every component except for parameters
such as the loop-bound vector, $k_j$ and $l_j$. We only present the
code for the j-th component.


The main self-scheduling code for each component consists
of three parts: loop self-scheduling, the original loop-body
and bookkeeping of the completed iterations. We already
discussed the "self-scheduling" part. In bookkeeping of the
completed iterations, recall that for each iteration at level q-1,
there are a total of $N_q \cdots N_{d_j}$ iterations for the $(d_j - q + 1)$
innermost loop nestings of the component. We use an element of
array $A^j$ to record the total remaining iterations. It will be
decremented by 1 after each iteration of the innermost
$(d_j - q + 1)$ loops has been completed.


When $N_q \cdots N_{d_j}$ iterations of the innermost $(d_j - q + 1)$
nestings are completed, the element of $A^j$ is reinitialized.
Given the sequence number x of the iteration fetched from $I^j$,
the sequence number z of the corresponding iteration at level
q-1 is

$$z = \left\lfloor \frac{x-1}{N_q \cdots N_{d_j}} \right\rfloor + 1.$$

$B^j$, which is used to control the execution of the (j+1)-th
component, is then updated. Note that $B^j$ is incremented only
when its value is one smaller than the calculated sequence
number. Thus, when $B^j = s$, it is guaranteed that the iterations
at level q-1 with the sequence number from 1 to s have
been completed.


The last portion of Algorithm 4.1 is for barrier synchronization
and reinitialization. $B^m$ acts as a barrier for the entire
non-perfectly-nested DOALL structure. In fact, since $l_m = 1$,
$B^m$ has only two values: 0 and 1. As in Algorithm 3.2, T is
used to detect the last processor leaving the entire DOALL
structure. That processor is responsible for reinitializing
variables $B^j$, $I^j$ ($1 \le j \le m$), T and C. Array $A^j$ ($1 \le j \le m$) is
reinitialized in the code for each component in the second portion.
The barrier synchronization used here is based on the
number of completed iterations. If the nested DOALL structure
is within an outer serial loop, one has to use Algorithm
3.3 to implement the branch node.

Notice that iterations for the first nested DOALL component
in a non-perfectly-nested DOALL structure can be
scheduled without restriction. It is not necessary to check the
value of $B^0$. This is why we set $k_1 = 1$ and $B^0$ is not used.


The entire processor self-scheduling code for the whole 
nested DOALL structure in Figure 4.1 is shown in Figure 4.3. 


5. Overhead and Performance 


Processor self-scheduling allows us to have better utiliza- 
tion of processors. During the program execution time, when- 
ever a processor becomes available, it will grab another itera- 
tion and start executing the new iteration as long as the pre- 
cedence relation in the program is not violated. Hence, the 
program execution time can actually be improved. 


In a large multiprocessor system, global memory accesses
take a significant amount of time. Hence, we consider only
global memory accesses (recall that Cedar synchronization
primitives are global memory accesses) when we
estimate the scheduling overhead. In Algorithm 3.1 and
Algorithm 3.2, the overhead of scheduling an iteration of a
single-nested parallel loop is only one Cedar synchronization
instruction, which is one global memory access to the index control
variable $J$. The overhead of scheduling an iteration of
multiple-nested parallel loops is two global memory accesses, as
shown in Algorithm 4.1: one for testing and fetch-and-adding the
index control variable $J^j$ and one for checking variable $B^{j-1}$.
The latter is not needed in the case of perfectly nested parallel
loops; i.e., the scheduling overhead for an iteration of a
perfectly nested parallel loop is only one global memory access.
This overhead is very small compared to the overhead that
would be incurred if scheduling were done by the operating system.


It is very important to note that the barrier synchronizations
at the end of parallel loops are also needed in the static
scheduling schemes. Hence, they are not extra overhead for the
self-scheduling schemes.


Let us compare the self-scheduling scheme with the static
scheduling scheme proposed in [12]. Assume that the execution
time of each iteration of the nested DOALL structure is the
same.


For a single-nested DOALL loop with $N$ iterations, both
schemes will schedule $N$ iterations over $P$ processors evenly.
The program execution time will be the same for both
schemes, i.e., it will be $\lceil N/P \rceil$ times the execution time of an
iteration. However, processor self-scheduling requires one
extra global memory access and, hence, has slightly
more overhead than the static scheduling scheme.

However, for multiple-nested DOALL loops, things can
be quite different. Assume that we have a perfectly-nested
DOALL-loop structure with loop-bound vector
$(N_1, N_2, \ldots, N_d)$. Using the static scheduling scheme, we have to
find the optimal decomposition $P = P_1 \cdots P_d$, where $P_i$ is
the number of processors assigned to the $i$-th loop nesting,
such that $\lceil N_1/P_1 \rceil \cdots \lceil N_d/P_d \rceil$ is minimized [12]. The
completion time for this static scheduling is

$$T_s = \tau \, \lceil N_1/P_1 \rceil \cdots \lceil N_d/P_d \rceil, \qquad (5.1)$$

where $\tau$ is the execution time of an iteration. Using processor
self-scheduling in Algorithm 4.1, we have

$$T_{ss} = (\tau + \sigma) \, \lceil N/P \rceil, \qquad (5.2)$$

where $\sigma$ is the extra time needed for 1 extra Cedar synchronization
instruction, and $N = N_1 N_2 \cdots N_d$. Note that

$$\lceil N/P \rceil \le \lceil N_1/P_1 \rceil \cdots \lceil N_d/P_d \rceil. \qquad (5.3)$$

The equality in (5.3) holds only for some special cases. Let us
ignore those special cases and assume that
$\lceil N_1/P_1 \rceil \cdots \lceil N_d/P_d \rceil - \lceil N/P \rceil = k$, $k \ge 1$. $T_{ss}$ is less than
$T_s$ when

$$\tau > \frac{\sigma}{k} \, \lceil N/P \rceil.$$

For the non-perfectly-nested DOALL-loop structure in
Figure 4.1, assume that there are $P=8$ processors. Using the
static scheduling scheme in [12], we get the optimal decomposition
$P = 1 \times 2 \times 4$, i.e., using 1 processor for the outermost loop
nesting, 2 processors for the second outermost loop nesting
and 4 processors for the innermost loop nesting. The completion
time is $9\tau$. Using self-scheduling, the completion time is
$8(\tau + 2\sigma)$. In the self-scheduling scheme, a processor is not tied to
any specific loop nesting; thus, processors can be better
utilized.
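
As a rough illustration of the comparison for a perfectly nested structure, the following Python sketch counts iteration steps under both schemes. It assumes the reading that static scheduling takes the product of the per-loop ceilings ⌈N_i/P_i⌉ for the best factorization of P, and that self-scheduling takes ⌈N/P⌉ steps at a cost of τ + σ each. All function names and the example bounds (3, 5) with P = 4 are hypothetical and do not come from the paper.

```python
from math import prod


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def factorizations(p, parts):
    """Yield every ordered way to write p as a product of `parts` positive integers."""
    if parts == 1:
        yield (p,)
        return
    for f in range(1, p + 1):
        if p % f == 0:
            for rest in factorizations(p // f, parts - 1):
                yield (f,) + rest


def static_steps(bounds, p):
    """Smallest product of ceil(N_i / P_i) over all decompositions P = P_1 ... P_d."""
    return min(
        prod(ceil_div(n, q) for n, q in zip(bounds, fac))
        for fac in factorizations(p, len(bounds))
    )


def self_steps(bounds, p):
    """ceil(N / P), where N is the total number of iterations."""
    return ceil_div(prod(bounds), p)


def self_scheduling_wins(bounds, p, tau, sigma):
    """True when (tau + sigma) * ceil(N/P) is below tau times the static step count."""
    return (tau + sigma) * self_steps(bounds, p) < tau * static_steps(bounds, p)


bounds, p = (3, 5), 4
print(static_steps(bounds, p), self_steps(bounds, p))
```

For these example values the static scheme needs 5 steps and self-scheduling needs 4, so k = 1 and self-scheduling is faster exactly when τ exceeds 4σ.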


6. Concluding Remarks 


We have presented processor self-scheduling schemes for both
single- and multiple-nested parallel loops. Two different barrier
synchronization mechanisms are discussed. The scheduling
overheads for these schemes are quite small if synchronization
primitives are supported in the system, as in Cedar [2].


Processor utilization can be improved over the static scheduling
scheme, which can lead to better program execution time
if the execution time of each loop iteration varies substantially,
or if there are multiple loop nestings.
DOSERIAL I=1,N
   DOALL J_1=1, M_1

   ENDDOALL
   DOALL J_2=1, M_2

   ENDDOALL
ENDDOSERIAL

(a)

[(b): graph of the loop structure with nodes J_1 and J_2]

Figure 3.1 An example of a parallel program


       LOCI := 1
START:
       DOALL J_1
       (Algorithm 3.1)

       DOALL J_2
       (Algorithm 3.1)

       if LOCI<N then
       begin
          LOCI := LOCI + 1
          goto START
       end

Figure 3.2 Outer serial loop control (scheme 1)


START: DOALL J_1
       (Algorithm 3.2)

       DOALL J_2
       (Algorithm 3.2)

       Branch Node
       (Algorithm 3.3)

Figure 3.3 Outer serial loop control (scheme 2)

Figure 4.1 A non-perfectly-nested DOALL structure

Figure 4.2 p-1 outermost common loop nestings


      {C; (>0)*; Decrement}

L^1:  {J^1; <=3x7; Fetch(x)&Increment}
      if failure then goto L^2
      (LOCI_1, LOCJ_1) <- f(x)

      . /* loop body 1 with (LOCI_1, LOCJ_1) */

      {A^1(LOCI_1); >1; Decrement}
      if failure then
      begin
         A^1(LOCI_1) := 7
         z := floor((x-1)/7) + 1
         {B^1; (=z-1)*; Increment}
      end
      goto L^1

L^2:  {J^2; <=3x2x3; Fetch(x)&Increment}
      if failure then goto L^3
      y := floor((x-1)/(2x3)) + 1
      {B^1; (>=y)*; No action}
      (LOCI_2, LOCJ_2, LOCK_2) <- f(x)

      . /* loop body 2 with (LOCI_2, LOCJ_2, LOCK_2) */

      {A^2(LOCI_2, LOCJ_2); >1; Decrement}
      if failure then
      begin
         A^2(LOCI_2, LOCJ_2) := 3
         z := floor((x-1)/3) + 1
         {B^2; (=z-1)*; Increment}
      end
      goto L^2

L^3:  {J^3; <=3x2x4; Fetch(x)&Increment}
      if failure then goto L^4
      y := floor((x-1)/4) + 1
      {B^2; (>=y)*; No action}
      (LOCI_3, LOCJ_3, LOCK_3) <- f(x)

      . /* loop body 3 with (LOCI_3, LOCJ_3, LOCK_3) */

      {A^3; >1; Decrement}
      if failure then
      begin
         A^3 := 3x2x4
         z := 1
         {B^3; (=z-1)*; Increment}
      end
      goto L^3

L^4:  {B^3; (=1)*; No action}
      {T; >1; Decrement}
      if failure then

Figure 4.3 Self-scheduling code for DOALL
structure in Figure 4.1
Abstract


This paper discusses some advanced topics 
related to interchanging DO loops in numerical 
programs. DO loop interchanging is used for 
many reasons; one well-known use of loop 
interchanging is to vectorize an outer DO loop 
when the inner DO loop must be left serial. 

In this paper we study loop interchanging as 
an end in itself. First, we describe the 
KAP/Design, a special program which, when 
given a nest of DO loops, prints out all the 
ways in which those DO loops can be 
interchanged. We describe some of the special
transformations that are used by the
KAP/Design. Second, a new type of loop
interchanging, interchanging of imperfectly 
nested loops, is described. 


1. Introduction 


This paper discusses interchanging of DO 
loops in numerical programs. Loop 
interchanging is often used to uncover 
parallel operations in nested DO loops; when 
an inner DO loop cannot be vectorized, for 
instance, loop interchanging is used to bring 
a different loop to the innermost nest level 
which perhaps can be vectorized [1,2]. Loop 
interchanging has also been shown to be useful 
to modify the way arrays are accessed (to 
reduce bank conflicts or to reduce page faults 
[1,3,4]) or to change the way registers are 
allocated (vector registers in particular, 
allowing "super-vector" performance on vector 
pipeline machines [2]). 


The KAP is a family of powerful 
retargetable vectorizers which use loop 
interchanging heavily [5,6,7]. We have 
studied loop interchanging and its 
applications extensively. This paper presents 
some of the results of this research. 


The second section describes a special 
version of the KAP, called the KAP/Design, 
which prints out all legal ways to interchange 
a nest of DO loops. Examples of its use are 
given with an explanation of the problems we 
encountered while developing the KAP/Design. 
The third section describes a technique for 
interchanging imperfectly nested DO loops; 
this technique is not as simple as 
distributing the outer DO loop, but is 
something totally new. 
2. KAP/Design


As pointed out in [8], different 
formulations of an algorithm sometimes differ 
only in having the DO loops interchanged. 
Some algorithm designers even stated that it 
might be useful to have a program source 
translator that printed out all the ways in 
which a DO loop nest might be interchanged. 
The KAP vectorizer includes a powerful loop 
interchanging algorithm, and it was a small 
matter to make a version that would 
interchange a nest of loops all possible ways 
and print each interchanged loop nest out. 
This became KAP/Design; an example of 
KAP/Design output on a simple matrix multiply 
program is shown in Figure 1. Normally the 
KAP/Design would be hardly noteworthy, but 
some of the problems that were encountered 
while designing and using the KAP/Design are 
interesting. 


2.1 Simple Loop Interchanging 


This section explains briefly how the 
KAP/Design tests for the legality of loop 
interchanging. A nest of DO loops, such as 
the two-nested loop in Figure 2(a), can be 
thought of as traversing a two-dimensional 
iteration space, shown in Figure 2(b). The 
arrows in Figure 2(b) represent how the serial
DO loops execute the statements in the loop
for iteration (I=1,J=1) first, then (1,2),
(1,3), ..., (1,5), then incrementing the I
index to go to (2,1), (2,2), ..., (2,5), ...,
(5,1), ..., (5,5). Interchanging these two DO
loops means changing the order in which the 
iteration space is traversed; by interchanging 
the loop as in Figure 3(a), the iteration 
space would be executed in the order shown in 
Figure 3(b). 
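
The change of traversal order can be made concrete with a small sketch. It is written in Python for illustration only; the bound of 3 is an arbitrary choice, smaller than the 5 used in Figures 2 and 3.

```python
N = 3

# Original nest: I is the outer loop, J the inner loop.
original = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]

# Interchanged nest: J is the outer loop, I the inner loop.
interchanged = [(i, j) for j in range(1, N + 1) for i in range(1, N + 1)]

print(original[:4])
print(interchanged[:4])
```

Both lists contain the same set of (I, J) iterations; only the order in which they are visited differs, which is why legality depends on the dependences between iterations.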


There may also be data dependence 
relations between the iterations of a DO loop. 
Simple data dependence relations in scalar 
code are represented in the following program 
segment:


S1:   X = ... Z ...
S2:   ... = ... X ...
S3:   Z = ...
S4:   X = ...

In this short program segment, we say that
statement S2 depends on S1 (flow-dependence)
since the value of X used in S2 is assigned in
S1; this means that statement S2 cannot be
moved above S1 without changing the answers.


We say that S3 depends on S1 (anti-dependence)
since the value of Z assigned in S3 is not the
value used by S1; this means that S3 cannot be
moved above S1 without changing the answers.
Finally, we say that S4 depends on S1
(output-dependence) since the value assigned
to X in S4 is assigned after the value in S1
is assigned; again, S4 cannot be moved above
S1 without changing the program. More on data
dependence can be found in [9,10,11,12,13].
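
The three kinds of dependence can be sketched as set intersections over the variables each statement writes and reads. This Python fragment is an illustration only; the function name and the (writes, reads) representation are assumptions of the sketch, and only the variables named in the text are listed for each statement.

```python
def dependences(earlier, later):
    """Return the kinds of dependence of `later` on `earlier`.
    Each statement is a pair (writes, reads) of sets of variable names."""
    e_writes, e_reads = earlier
    l_writes, l_reads = later
    kinds = []
    if e_writes & l_reads:
        kinds.append("flow")    # later reads a value the earlier statement wrote
    if e_reads & l_writes:
        kinds.append("anti")    # later overwrites a value the earlier statement read
    if e_writes & l_writes:
        kinds.append("output")  # both statements write the same variable
    return kinds


S1 = ({"X"}, {"Z"})   # S1 assigns X and uses Z
S2 = (set(), {"X"})   # S2 uses X
S3 = ({"Z"}, set())   # S3 assigns Z
S4 = ({"X"}, set())   # S4 assigns X

for name, stmt in (("S2", S2), ("S3", S3), ("S4", S4)):
    print(name, dependences(S1, stmt))
```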


For interchanging loops, the KAP/Design 
is not so much interested in the data 
dependence relations between the statements in 
a loop as between the iterations of a loop. 

In the loop in Figures 2 and 3, the value of 
A(1,2) used in iteration (I=1,J=2) was 


      SUBROUTINE MATMUL2( A,B,C,N,M,P )
      REAL A(N,M), B(N,P), C(P,M)
      INTEGER N,M,P,I,J,K
C
C     MATRIX MULTIPLICATION LOOP
C
      DO 100 I = 1,N
      DO 100 J = 1,M
      DO 100 K = 1,P
  100 A(I,J) = A(I,J) + B(I,K)*C(K,J)
C
C     MATRIX MULTIPLICATION LOOP
C
      DO 101 I=1,N
      DO 101 K=1,P
      DO 101 J=1,M
  101 A(I,J) = A(I,J) + B(I,K) * C(K,J)
C
C     MATRIX MULTIPLICATION LOOP
C
      DO 102 J=1,M
      DO 102 I=1,N
      DO 102 K=1,P
  102 A(I,J) = A(I,J) + B(I,K) * C(K,J)
C
C     MATRIX MULTIPLICATION LOOP
C
      DO 103 J=1,M
      DO 103 K=1,P
      DO 103 I=1,N
  103 A(I,J) = A(I,J) + B(I,K) * C(K,J)
C
C     MATRIX MULTIPLICATION LOOP
C
      DO 104 K=1,P
      DO 104 I=1,N
      DO 104 J=1,M
  104 A(I,J) = A(I,J) + B(I,K) * C(K,J)
C
C     MATRIX MULTIPLICATION LOOP
C
      DO 105 K=1,P
      DO 105 J=1,M
      DO 105 I=1,N
  105 A(I,J) = A(I,J) + B(I,K) * C(K,J)
      END
Figure 1.
assigned in iteration (I=1,J=1). In fact, the
value used in iteration (i,j) was assigned in
iteration (i,j-1) (except for boundary
values); this relation is shown in the
iteration space dependence graph in Figure 4.
Notice that the pattern of the dependence flow
in the iteration space can be characterized by
the direction or the distance of the flow with
respect to the loop index variables. In this
example, the dependence distance would be
called (0,-1), since the distance in the I
loop is zero, and the distance in the J loop
is -1. The KAP/Design saves the sign of the
dependence distance; here it would save (0,-),
or (=,<), to characterize the data dependence
in the loop ((=,<) is the data dependence
direction vector).
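
The direction vector for the loop of Figure 2 can be found by brute force, as in the following Python sketch. This is not how the KAP/Design computes dependences; it simply replays the iterations in serial order and records, for each read of A(I,J), the iteration that last wrote that element. The function names are hypothetical. Each direction entry compares the source (writing) iteration with the sink (reading) iteration, so "<" means the source index is smaller.

```python
def sign(src, dst):
    """Direction symbol comparing a source index with a sink index."""
    return "<" if src < dst else ("=" if src == dst else ">")


def direction_vectors(n=5):
    """Flow-dependence direction vectors for
           DO I = 1,n / DO J = 1,n / A(I,J+1) = A(I,J) + B(I,J)."""
    last_writer = {}   # array element -> iteration (i, j) that last wrote it
    found = set()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            src = last_writer.get((i, j))        # read of A(I,J)
            if src is not None:
                found.add(tuple(sign(s, d) for s, d in zip(src, (i, j))))
            last_writer[(i, j + 1)] = (i, j)     # write of A(I,J+1)
    return found


print(direction_vectors())
```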


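      DO 100 I = 1,5
      DO 100 J = 1,5
  100 A(I,J+1) = A(I,J) + B(I,J)
(a)
[(b): 5-by-5 iteration space, J = 1..5 across and I = 1..5 down; arrows show execution proceeding along each row, with J varying fastest]
Figure 2.

      DO 100 J = 1,5
      DO 100 I = 1,5
  100 A(I,J+1) = A(I,J) + B(I,J)
(a)
[(b): 5-by-5 iteration space, J = 1..5 across and I = 1..5 down; arrows show execution proceeding down each column, with I varying fastest]
Figure 3.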


The possible directions in a two 
dimensional iteration space (corresponding to 
a doubly-nested DO loop) are shown in Figure 
5. Figure 5(a) shows the data dependence 
directions that are preserved by loop 
interchanging; Figure 5(b) shows the data 
dependence directions that prevent loop 
interchanging. A pair of loops with a (<,>) 
data dependence direction vector cannot be 
interchanged (without interaction from outer 
loops; see [1,2]). 


The KAP/Design discovers all the data 
dependence relations in the DO loop, and saves 
the corresponding data dependence direction 
vectors. To perform DO loop interchanging on 
more than two loops, it does repeated pairwise 
interchanging. 
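
A minimal sketch of a direction-vector legality test is given below, in Python. It is not the KAP/Design algorithm: the text says the KAP/Design performs repeated pairwise interchanging, whereas this sketch tests a whole permutation at once by checking that, after reordering, the first entry of every direction vector that is not "=" is "<". The function names are hypothetical, and rectangular loop bounds are assumed.

```python
from itertools import permutations


def is_legal(vectors, order):
    """True if reordering the loops by `order` keeps every dependence pointing forward."""
    for vec in vectors:
        for pos in order:
            if vec[pos] == "<":
                break            # dependence is carried forward by this loop
            if vec[pos] == ">":
                return False     # dependence would run backward
    return True


def legal_orders(names, vectors):
    """List every loop order, as a string of loop names, that passes the test."""
    return [
        "".join(names[p] for p in order)
        for order in permutations(range(len(names)))
        if is_legal(vectors, order)
    ]


# Matrix multiply: A(I,J) is read and written only across K iterations.
print(legal_orders("IJK", [("=", "=", "<")]))
# A (<,>) direction vector prevents interchanging the pair of loops.
print(legal_orders("IJ", [("<", ">")]))
```

For the matrix multiply direction vector all six orders pass, which agrees with the six loop nests printed in Figure 1.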


2.2 Triangular Loops 


One problem is that most studies of loop
interchanging ignore the loop bounds [1,2,3].
Detailed discussions of how to test whether
two loops can be interchanged explain the data
dependence conditions, but assume that the
inner DO loop bound is invariant in the outer
loop. Many programs include triangular DO
loop bounds, such as the loop nest in Figure
6(a). Here, the upper bound of the inner DO J
loop is a simple function of the outer loop
index. The iteration space traversed by this
DO loop pair is drawn in Figure 6(b); it is
easy to see why this is called a triangular
loop. To properly interchange these loops it
is necessary to modify the loop bounds, as in
Figure 6(c). This is no more difficult than
modifying the bounds of a double summation
when interchanging the summations. The types
of triangular loop bounds are classified by
the shape of the triangle in the iteration
space; the triangle in Figure 6(b) is a
lower-left triangle. The four types of
triangles are the lower/upper-left/right; four
more types may be distinguished by including
or excluding the diagonal (Figure 6(b)
excludes the diagonal). These are
characterized by the loop bounds:


               with diagonal:          without diagonal:

lower left:
               DO 100 I = M,N          DO 100 I = M,N
               DO 100 J = M,I          DO 100 J = M,I-1
upper right:
               DO 100 I = M,N          DO 100 I = M,N
               DO 100 J = I,N          DO 100 J = I+1,N
upper left:
               DO 100 I = M,N          DO 100 I = M,N
               DO 100 J = M,N-I+M      DO 100 J = M,N-I+M-1
lower right:
               DO 100 I = M,N          DO 100 I = M,N
               DO 100 J = N-I+M,N      DO 100 J = N-I+M+1,N


Interchanging a triangular loop can be
thought of as transposing the iteration space
about its major diagonal. Thus, a lower left
triangle would be transposed into an upper
right triangle, and vice versa. The lower
right and upper left triangles would be
transposed into themselves.


[Iteration space dependence graph for the loop in Figure 2: within each row I, an arrow runs from each iteration to the next iteration in J]
Figure 4.

[(a): data dependence directions preserved by loop interchanging; (b): data dependence directions that prevent loop interchanging]
Figure 5.

[(a): triangular loop nest with statement 100 A(I,J) = A(I,J) + B(I,J); loop bounds illegible]
[(b): lower-left triangular iteration space, diagonal excluded, J = 1..N across and I down]

      DO 100 J = 1,N
      DO 100 I = J+1,N
  100 A(I,J) = A(I,J) + B(I,J)
(c)
Figure 6.
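
The bound modification can be checked by enumerating both iteration spaces, as in this Python sketch. The original bounds used here, I = 1,N with J = 1,I-1, are the lower-left form without the diagonal from the list above with M = 1; this is an assumption of the sketch, since it is only inferred from the interchanged nest J = 1,N with I = J+1,N shown in Figure 6(c). The function names are hypothetical.

```python
def lower_left(n):
    """Iterations of  DO I = 1,n / DO J = 1,I-1."""
    return {(i, j) for i in range(1, n + 1) for j in range(1, i)}


def interchanged(n):
    """Iterations of  DO J = 1,n / DO I = J+1,n."""
    return {(i, j) for j in range(1, n + 1) for i in range(j + 1, n + 1)}


same = all(lower_left(n) == interchanged(n) for n in range(0, 8))
print(same, len(lower_left(5)))
```

Both nests visit the same n(n-1)/2 iterations; the interchange only changes the order in which they are visited, just as swapping the order of a double summation does.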
2.3 Data Dependence in Triangular Loops


Most triangular loops in real programs, 
however, are not so easy to interchange. Most 
triangular loops refer to both the triangle 
and the diagonal, as in Figure 7(a), or to the 
triangle and its transpose, as in Figure 7(b). 
In these cases, the fact that the loop bounds 
are triangular must be taken into account when 
the data dependence graph is built. If the 
inner loop bound were replaced by N in these 
two loops, for instance, then there would be a 
data dependence cycle which would prevent the 
loops from being interchanged. With the upper 
bound of "I-1", there is no dependence cycle, 
since J is always less than I for any I. A 
harder example, from a kernel of a real 
program, is shown in Figure 8(a). To find the 
dependence from A(I,J) to A(K,J), the I and K 
indices must be considered together with the J 
index, since both I and K are triangular in J. 


To handle cases like this, the KAP/Design 
includes an exact data dependence algorithm; 
this algorithm either proves that a data 
dependence relation exists or proves 
independence, if the following common 
conditions are satisfied: 


1. Loop increments are constant, 


2. Array subscripts are a linear 
expression involving loop index 
variables and constants, and 


3. Loop upper and lower bounds are 
linear expressions involving outer 
loop index variables and constants, 
or are unknown. 


When loop bounds are unknown, this test may 
compute a data dependence relation that does 
not exist for the loop bounds used in a 
particular execution of the loop, but the 
dependence can be proven to exist for some 
possible values of the loop bounds. The new 
data dependence algorithm is an extension of 
the test in [9]. Figure 8(b) shows all the 
ways to interchange the loop from Figure 8(a), 
as discovered using this perfect data 
dependence test.


2.4 Trapezoidal Loop Interchanging


The triangular loop bounds shown above 
actually appear quite often in numerical 
algorithms. Less frequently, other loop 
bounds which are functions of outer loops 
appear; some of these have little hope of loop 
interchanging, as the one in Figure 9. Others 
can be interchanged, but not with the simple 
rules for triangular loops. Such loop bounds 
may not appear in the original formulation of 
the algorithm, but may show up after some 
triangular loops have been interchanged. For 
instance, the loop in Figure 10(a) is coded 
with some simple triangular loop bounds. 
      DO 100 I = 1,N
      DO 100 J = 1,I-1
  100 A(I,J) = A(I,J) / A(I,I)
(a)

      DO 100 I = 1,N
      DO 100 J = 1,I-1
  100 A(I,J) = A(J,I)
(b)
Figure 7.


      DO 100 J = 1,N
      DO 100 I = J+1,N
      DO 100 K = 1,J-1
  100 A(I,J) = A(I,J) + A(I,K)*A(K,J)
(a)


      REAL A(N,N)
      INTEGER N,I,J,K

      DO 100 J = 1,N
      DO 100 I = J+1,N
      DO 100 K = 1,J-1
  100 A(I,J) = A(I,J) + A(I,K)*A(K,J)

      DO 101 J=1,N
      DO 101 K=1,J-1
      DO 101 I=J+1,N
  101 A(I,J) = A(I,J) + A(I,K) * A(K,J)

[loop nest 102, in the order I, J, K: loop bounds illegible]
  102 A(I,J) = A(I,J) + A(I,K) * A(K,J)

      DO 103 I=1,N
      DO 103 K=1,I-2
      DO 103 J=K+1,I-1
  103 A(I,J) = A(I,J) + A(I,K) * A(K,J)

      DO 104 K=1,N-1
      DO 104 J=K+1,N
      DO 104 I=J+1,N
  104 A(I,J) = A(I,J) + A(I,K) * A(K,J)

      DO 105 K=1,N-1
      DO 105 I=K+1,N
      DO 105 J=K+1,I-1
  105 A(I,J) = A(I,J) + A(I,K) * A(K,J)
      END
(b)
Figure 8.

After interchanging the DO K and DO J loops, 
the bounds are modified as shown in Figure 
10(b); the inner DO I loop, which was 
originally triangular with respect to the DO K 
loop, no longer is. The DO K loop bounds were 
changed when it was interchanged with DO J. 
Even though DO I and DO K do not form a
triangular loop nest, the loop bounds can be 
modified to allow the loops to be 
interchanged; the result is shown in Figure 
10(c). Note the use of the MIN function in 
the upper bound of DO K (had the lower bound 
needed to be changed, a MAX function would 
have been used). The iteration space 
described by the inner two loops of Figure 
10(b) (assume that J is fixed) is shown in 
Figure 10(d); this is a trapezoid. In order 
to interchange trapezoidal loop bounds, the 
same type of loop bound modifications as used 


      DO 100 I = 1,N
      DO 100 J = 1,LP(I)**2
  100 A(I) = A(I) + B(I,J)
Figure 9.

      DO 100 K = 1,N
      DO 100 J = K+1,N
      DO 100 I = K+1,N
  100 A(I,J) = A(I,J) + A(I,K)*A(K,J)
(a)

      DO 100 J=1,N
      DO 100 K=1,J-1
      DO 100 I=K+1,N
  100 A(I,J) = A(I,J) + A(I,K) * A(K,J)
(b)

      DO 103 J=1,N
      DO 103 I=2,N
      DO 103 K=1,MIN (J-1, I-1)
  103 A(I,J) = A(I,J) + A(I,K) * A(K,J)
(c)

[(d): iteration space of the inner two loops of (b) for fixed J, with I = 1..N across and K = 1..J-1 down; row K holds the points I = K+1..N, forming a trapezoid]
Figure 10.
in triangular loop interchanging is used,
except that a MIN or MAX function is used to
cut off the point of the triangle and make the
trapezoid. The KAP/Design includes triangular
loop bound interchanging and trapezoidal loop
interchanging; the KAP/Design output for the
loop in Figure 10(a) is shown in Figure 11.
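
The role of the MIN function can be checked by enumeration. The Python sketch below compares, for a fixed J, the inner two loops of Figure 10(b) (K = 1,J-1 and I = K+1,N) with those of Figure 10(c) (I = 2,N and K = 1,MIN(J-1,I-1)). The function names are hypothetical and the sketch only compares iteration sets; it says nothing about data dependences.

```python
def before(n, j):
    """(K, I) pairs of  DO K = 1,J-1 / DO I = K+1,n."""
    return {(k, i) for k in range(1, j) for i in range(k + 1, n + 1)}


def after(n, j):
    """(K, I) pairs of  DO I = 2,n / DO K = 1,MIN(J-1,I-1)."""
    return {(k, i) for i in range(2, n + 1) for k in range(1, min(j - 1, i - 1) + 1)}


same = all(before(n, j) == after(n, j) for n in range(1, 8) for j in range(1, n + 1))
print(same)
```

Without the MIN, the interchanged K loop would run to J-1 even for small I and would visit points above the slanted edge of the trapezoid, such as K = 2 with I = 2.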


3. Imperfectly Nested DO Loops 

When imperfectly nested DO loops are to 
be interchanged, such as the loops in Figure 
12(a), the usual process is to distribute the 
outer loop as in Figure 12(b); this creates 
perfectly nested DO loops, which can then be 
interchanged normally, as in Figure 12(c). 

The iteration space of the original loop can 
be drawn as shown in Figure 12(d); notice that 
there are two loop bodies that are important: 
statement s1, which is enclosed only in the

DO I loop, and statement s2, which is enclosed 
in both loops. The execution order of the 
iterations is shown. Distributing and 
interchanging the loops as described modifies 
the execution ordering to the one in Figure 
12(e); instead of execution flowing from left 


      REAL A(N,N)
      INTEGER N,I,J,K
C
C
      DO 100 K = 1,N
      DO 100 J = K+1,N
      DO 100 I = K+1,N
  100 A(I,J) = A(I,J) + A(I,K)*A(K,J)
C
C
      DO 101 K=1,N
      DO 101 I=K+1,N
      DO 101 J=K+1,N
  101 A(I,J) = A(I,J) + A(I,K) * A(K,J)
C
C
      DO 102 J=1,N
      DO 102 K=1,J-1
      DO 102 I=K+1,N
  102 A(I,J) = A(I,J) + A(I,K) * A(K,J)
C
C
      DO 103 J=1,N
      DO 103 I=2,N
      DO 103 K=1,MIN (J-1, I-1)
  103 A(I,J) = A(I,J) + A(I,K) * A(K,J)
C
C
      DO 104 I=1,N
      DO 104 K=1,I-1
      DO 104 J=K+1,N
  104 A(I,J) = A(I,J) + A(I,K) * A(K,J)
C
C
      DO 105 I=1,N
      DO 105 J=2,N
      DO 105 K=1,MIN (I-1, J-1)
  105 A(I,J) = A(I,J) + A(I,K) * A(K,J)

      END

Figure 11.


to right, then top to bottom, it flows from 
top to bottom, then left to right. This is 
exactly the same as the modification to the 
execution ordering for interchanging of 
perfectly nested loops, as was shown in 
Figures 2 and 3. 
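
The distribute-then-interchange process can be sketched for the loop described as Figure 12, reading s1 as X(I) = X(I) + 1 and s2 as B(I,J) = X(I) + A(I,J). The Python below is an illustration only, with hypothetical function names and zero-based indexing; it compares the imperfectly nested loop with the distributed and interchanged version.

```python
def imperfect_nest(x, a):
    """s1 inside the I loop only; s2 inside both the I and J loops."""
    n = len(x)
    x = list(x)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        x[i] = x[i] + 1.0                  # s1
        for j in range(n):
            b[i][j] = x[i] + a[i][j]       # s2
    return x, b


def distributed_interchanged(x, a):
    """All of s1 first, then s2 with J as the outer loop and I as the inner loop."""
    n = len(x)
    x = list(x)
    b = [[0.0] * n for _ in range(n)]
    for i in range(n):
        x[i] = x[i] + 1.0                  # s1, in its own loop
    for j in range(n):
        for i in range(n):
            b[i][j] = x[i] + a[i][j]       # s2, loops interchanged
    return x, b


x0 = [1.0, 2.0, 3.0]
a0 = [[float(3 * i + j) for j in range(3)] for i in range(3)]
print(imperfect_nest(x0, a0) == distributed_interchanged(x0, a0))
```

Distribution is safe here because s2 only reads the value of X(I) that s1 produced for the same I, and s1 never reads anything that s2 writes.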


Sometimes, however, the loop distribution
that is required by this method is not legal.
Take, for example, the loop in Figure 13(a);
the value of A(I,K) used in s2 is assigned in
s1, and the value of A(I-1,K+1) used in s1 is
assigned on the previous I iteration in s1.
The iteration space dependence graph for this
loop is shown in Figure 13(b); distributing
the outer DO I loop would violate the data


          DO 100 I = 1,N
s1:          X(I) = X(I) + 1.
             DO 100 J = 1,N
s2: 100         B(I,J) = X(I) + A(I,J)
(a)

          DO 101 I = 1,N
s1: 101      X(I) = X(I) + 1.
          DO 100 I = 1,N
          DO 100 J = 1,N
s2: 100      B(I,J) = X(I) + A(I,J)
(b)

          DO 101 I = 1,N
s1: 101      X(I) = X(I) + 1.
          DO 100 J = 1,N
          DO 100 I = 1,N
s2: 100      B(I,J) = X(I) + A(I,J)
(c)

                  J=1          J=2          J=3
I=1   s1(1) --> s2(1,1) --> s2(1,2) --> s2(1,3)
I=2   s1(2) --> s2(2,1) --> s2(2,2) --> s2(2,3)
I=3   s1(3) --> s2(3,1) --> s2(3,2) --> s2(3,3)
(d)

                  J=1          J=2          J=3
I=1   s1(1)     s2(1,1)     s2(1,2)     s2(1,3)
I=2   s1(2)     s2(2,1)     s2(2,2)     s2(2,3)
I=3   s1(3)     s2(3,1)     s2(3,2)     s2(3,3)
[execution flows from top to bottom within each column, then left to right]
(e)
Figure 12.
dependence conditions, since all the
iterations of s1 would be executed before any
iterations of s2, changing the value used for
A(I-1,K+1). These two loops can be
interchanged, however; if the execution of
statement s1 could somehow be moved so it ran
along the DO J axis, then the iteration space
dependence graph would be as shown in Figure
13(c). In this iteration space, the DO I loop
would have to be the inner loop; this
corresponds to the program in Figure 13(d).
Notice that not only have the DO loops been
interchanged, but the references to the I loop
index in s1 have been replaced by J.


This type of loop interchanging is legal 
when no data dependence relations are 
violated. The data dependence relations that 
are preserved are: 


(a) s1(i) --> s1(i') where i<=i'
(b) s2(i,j) --> s2(i',j') where i<=i', j<=j'
(c) s1(i) --> s2(i',j') where i<=i', i<=j'
(d) s2(i,j) --> s1(i') where i<i', j<i'
          DO 100 I = K+1,N
s1:          A(I,K) = A(I-1,K+1) * A(K+1,K+1)
             DO 100 J = K+1,N
s2: 100         A(I,J) = A(I,J) + A(I,K)*A(K,J)
(a)

[(b): iteration space dependence graph for the loop in (a), with J across and I down; s1 lies at the left of each row of s2 iterations]

[(c): iteration space dependence graph with s1 moved to run along the J axis, above the s2 iterations]

          DO 100 J = K+1,N
s1:          A(J,K) = A(J-1,K+1) * A(K+1,K+1)
             DO 100 I = K+1,N
s2: 100         A(I,J) = A(I,J) + A(I,K)*A(K,J)
(d)
Figure 13.


Test (a) above is obvious, and test (b) is the 
same as the test for normal interchanging. 


These conditions divide the iteration
space into two types of regions, as shown in
Figures 14(a) and 14(b). These regions are
subsets of each other: R1(2) includes R1(1)
and is a subset of R1(3), and so on; for the
R2 regions, R2(2) includes R2(3) and is a
subset of R2(1), and so on. This new type of
imperfectly-nested loop interchanging will
preserve dependences in the iteration space
dependence graph from s1(k) to any iteration
of s2 in region R2(k), and dependences to
s1(k) from any iteration of s2 in region
R1(k-1). Figures 15(a) and 15(b) show data
dependences in the iteration space that will
be preserved by this type of loop
interchanging, and Figures 15(c) and 15(d)
show data dependences that will be violated.


Testing for legality of this type of loop
interchanging is easier when the loop bounds
are triangular, as in Figure 16(a); the
iteration space for this loop is shown in
Figure 16(b). Since the relation I < J is
always true for any iteration of s2 (since the
DO J loop starts at I+1), the interchange
conditions (c) and (d) above become


(c') s1(i) --> s2(i',j') where i<=i'
(since i'<j', i<=i' implies i<=j')


(d') s2(i,j) --> s1(i') where j<i'
(since i<j, j<i' implies i<i')
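
The simplification of conditions (c) and (d) under triangular bounds can be checked exhaustively on a small index range. The Python sketch below is only an illustration with hypothetical function names; it compares the full and simplified conditions for every dependence whose s2 endpoint satisfies I < J.

```python
def cond_c(i, i2, j2):
    """(c)  s1(i) --> s2(i',j') is preserved when i <= i' and i <= j'."""
    return i <= i2 and i <= j2


def cond_c_tri(i, i2, j2):
    """(c') simplified form for triangular bounds: i <= i'."""
    return i <= i2


def cond_d(i, j, i2):
    """(d)  s2(i,j) --> s1(i') is preserved when i < i' and j < i'."""
    return i < i2 and j < i2


def cond_d_tri(i, j, i2):
    """(d') simplified form for triangular bounds: j < i'."""
    return j < i2


n = 6
idx = range(1, n + 1)
mismatches = []
for a in idx:
    for b in idx:
        for c in idx:
            # (c): s2 endpoint is (i', j') = (b, c), which requires i' < j'
            if b < c and cond_c(a, b, c) != cond_c_tri(a, b, c):
                mismatches.append(("c", a, b, c))
            # (d): s2 endpoint is (i, j) = (a, b), which requires i < j
            if a < b and cond_d(a, b, c) != cond_d_tri(a, b, c):
                mismatches.append(("d", a, b, c))
print(mismatches)
```

An empty list means the simplified tests agree with the full ones whenever the s2 iteration lies inside the triangular iteration space.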


[(a): the iteration space, with J across and I down and s1 at the left of each row of s2 iterations, divided into nested regions R1(1), R1(2), R1(3), R1(4), ...; (b): the same iteration space divided into nested regions R2]
Figure 14.

[(a), (b): data dependences between s1 and s2 iterations that are preserved by this type of loop interchanging; (c), (d): data dependences that are violated]
Figure 15.

          DO 100 I = 1,N
s1:          .....
             DO 100 J = I+1,N
s2: 100         .....
(a)

      J---->
  I   s1  s2  s2  s2
  |       s1  s2  s2
  V           s1  s2
                  s1
(b)
Figure 16.

A future version of KAP/Design will include
interchanging of imperfectly nested DO loops.


In conclusion, loop interchanging is an
interesting subject for its own sake. Loop
interchanging has many applications in
optimizing compilers for high speed computers,
and further studies in this field will
certainly prove useful in the long run.


References


[1] Wolfe, M. J., "Optimizing Supercompilers
for Supercomputers," Ph.D. Thesis, Univ.
of Ill. at Urb.-Champ., Dept. of Comp.
Sci. Rpt. No. 82-1009, Oct. 1982.

[2] Allen, J. R. and K. Kennedy, "Automatic
Loop Interchange," in Proc. of the ACM
SIGPLAN '84 Symposium on Compiler
Construction, pp. 233-246, June 1984.

[3] Abu-Sufah, W. A., "Improving the
Performance of Virtual Memory
Computers," Ph.D. Thesis, Univ. of Ill.
at Urb.-Champ., Dept. of Comp. Sci. Rpt.
No. 78-945, Nov. 1978.

[4] Abu-Sufah, W. A., D. J. Kuck and D. H.
Lawrie, "On the Performance Enhancement
of Paging Systems Through Program
Analysis and Transformations," IEEE
Trans. on Computers, Vol. C-30, No. 5,
pp. 341-356, May 1981.

[5] Huson, C. et al, "The KAP/205: An
Advanced Source-to-Source Vectorizer for
the Cyber 205 Supercomputer," Proc. of
the 1986 International Conference on
Parallel Processing, Aug. 1986.

[6] Macke, T. et al, "The KAP/ST-100: A
Fortran Translator for the ST-100
Attached Processor," Proc. of the 1986
International Conference on Parallel
Processing, Aug. 1986.

[7] Davies, D. et al, "The KAP/S-1: An
Advanced Source-to-Source Vectorizer for
the S-1 Mark IIa Supercomputer," Proc.
of the 1986 International Conference on
Parallel Processing, Aug. 1986.

[8] Dongarra, J. J., F. G. Gustavson and A.
Karp, "Implementing Linear Algebra
Algorithms for Dense Matrices on a
Vector Pipeline Machine," SIAM
Review, Vol. 26, No. 1, pp. 91-112, Jan.
1984.

[9] Banerjee, U., "Speedup of Ordinary
Programs," Ph.D. Thesis, Univ. of Ill.
at Urb.-Champ., Dept. of Comp. Sci. Rpt.
No. 79-989, Oct. 1979.

[10] Banerjee, U., "Data Dependence in
Ordinary Programs," M.S. Thesis, Univ.
of Ill. at Urb.-Champ., Dept. of Comp.
Sci. Rpt. No. 76-837, Nov. 1976.

[11] Banerjee, U., S. C. Chen, D. J. Kuck and
R. A. Towle, "Time and Parallel Processor
Bounds for Fortran-Like Loops," IEEE
Trans. on Computers, Vol. C-28, No. 9,
pp. 660-670, Sep. 1979.

[12] Allen, J. R. and K. Kennedy, "Automatic
Translation of Fortran Programs to
Vector Form," Rice Technical Report COMP
TR84-9, Rice University, Houston, July
1984.

[13] Kuck, D., The Structure of Computers and
Computations, Vol. I, John Wiley and
Sons, Inc., New York, NY, 1978.

COMPILER GENERATED SYNCHRONIZATION FOR DO LOOPS 
Abstract -- This paper presents methods for the compile
time generation of synchronization instructions for parallel loops 
running on a multiprocessor. A synchronization instruction set 
and architecture are defined, and using these it is shown how to 
synchronize FORTRAN source loops so that dependences on the 
parallel loop are enforced. Construction of an applicable depen- 
dence graph is covered. A program transformation is given that 
increases the parallelism that can be utilized in a parallel loop 
containing nested loops. Finally, a generalizable technique is 
presented for reducing the number of dependences needing syn- 
chronizing in a parallel loop. 


Introduction 


When programs are executed on a multiprocessing computer, 
parallelism can be supported at several levels. The University of 
Illinois Cedar Project machine [7], the CRAY XMP series [3] 
(using micro-tasking) and the NYU Ultra-Computer [6] among 
others allow intra-loop parallelism, i.e. parallelism that exists 
between different iterations of a loop, to be exploited. This paral- 
lelism is realized by allowing different iterations of a loop to run 
simultaneously on different processors of the multiprocessor. 
When one iteration of a loop produces data that another iteration 
uses, or there is a danger of an iteration overwriting data prema- 
turely, it is necessary to constrain the execution order of iterations 
so that the hazards are avoided. In this paper, we show how this 
can be done automatically, at compile time, using synchronization 
instructions. Strategies are presented for the placement of syn- 
chronization instructions in a FORTRAN source program loop, to 
increase the amount of parallelism realized from the source pro- 
gram, and to reduce the amount of synchronization necessary. 


It is assumed that programs that are synchronized by the 
methods presented here have been passed through an automatic 
program restructurer such as Parafrase [17], [11] and [10]. We do 
not discuss parallel loop detection, but rather the synchronization 
of dependences in loops that have previously been recognized as 
parallel. 


Although some of the transformations and optimizations 
mentioned in this paper are applicable to a wide range of architec- 
tures, for brevity’s sake we restrict ourselves to a single architec- 
ture. 


The target architecture, and the sorts of parallelism that can 
be exploited by it are the first topic of this paper. We then turn 
to a discussion of dependences and dependence graphs (DG), 
which form the basis of the actions of this paper. Since we are 
concerned only with synchronization within a loop, a DG will be 
built in turn for each loop to be synchronized in the program. 
The loop is then transformed, and the next loop in the program is 
dealt with until the entire program is transformed.


Next we define the synchronization instruction set used, and 
show how it can be used to synchronize dependences. Because of 
the semantics of the synchronization instructions, the DG can now 
be modified to reduce the amount of parallelism in inner loops lost 
to the synchronization scheme. 


This work was supported in part by the National Science Founda- 
tion under Grant No. US NSF DCR84-10110, the US Department 
of Energy under Grant No. US DOE DE-FG02-85ER25001, and 
the IBM Donation. 
After nested loops have been dealt with, the DG is appropriate
to guide the generation of synchronization instructions, and we
give an example of a loop nest with synchronization instructions 
added. We can choose, however, to remove from the DG redun- 
dant dependence arcs, thereby reducing the amount of synchroni- 
zation needed. We give a method to reduce the number of depen- 
dence arcs, and by examples show its effects on synchronization. 


Although we use a single synchronization instruction set in 
this paper, the methods used can be extended to other instruction 
sets. In particular, the method presented in this paper for remov- 
ing redundant dependence arcs is extensible to any architecture. 


Architecture 
The architecture of this paper is a multiprocessor. Figure 1
shows a block diagram of the target architecture. Neither processor
nor memory modules are distinguishable from their neighbors.


M_i is Global Memory Module i

P_i is Processor i


Figure 1. System Architecture


Each memory module is capable of being accessed by any
processor through the interconnection network. Access is assumed
to be deterministic in that fetches and stores from a single
processor are processed in the order they are issued. This means
that either fetches and stores within a processor are done serially,
or that an equivalent ordering is imposed on data accesses.


Parallelism is realized by spreading iterations of a loop 
across multiple processors. A loop so executed will be referred to 
as a dospread loop. All other loops will be referred to as DO 
loops. A dospread loop can be either a doall [5], [13], [15], or a 
doacross loop [4], [15], [16]. We place the following restriction on 
dospread loops: they may contain no nested dospread loop. In [14] 
and [16] algorithms are given for deciding what loop in a nest of 
potential dospread loops should be run as a dospread loop. 


Dospread loop iterations are either spread horizontally [8]
across processors or self scheduled. Horizontal spreading means
that given P processors, processor p will execute iterations
p, p+P, p+2P, .... Figure 2 shows an example of horizontal
spreading on a loop with sixteen iterations. Horizontal scheduling
can be accomplished through loop blocking [8]. Self scheduling
means that whenever a processor becomes idle it will execute
the next iteration of the dospread loop.
processor =   1    2    3    4

        i =   1    2    3    4
              5    6    7    8
              9   10   11   12
             13   14   15   16


Figure 2. Horizontal Spreading of dospread Iterations
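
The assignment shown in Figure 2 can be sketched in a few lines of Python. This is only an illustration of horizontal spreading as described above; the function name horizontal_spread and the 1-based numbering of processors and iterations are assumptions made for the sketch, not part of the architecture definition.

```python
def horizontal_spread(num_iterations, num_processors):
    """Map each processor p (1-based) to the iterations p, p+P, p+2P, ...

    Hypothetical helper: P is num_processors and iterations run from 1 to
    num_iterations, as in a normalized loop.
    """
    return {
        p: list(range(p, num_iterations + 1, num_processors))
        for p in range(1, num_processors + 1)
    }


# Sixteen iterations on four processors, the case drawn in Figure 2.
for processor, iterations in horizontal_spread(16, 4).items():
    print(processor, iterations)
```

Self scheduling is not modeled here, since under it the iteration a processor receives depends on run time behavior rather than on a fixed formula.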
Dependences and Dependence Graphs


The restrictions on program execution mentioned previously
are necessary to enforce dependences. We now informally define
four types of dependences. For more formal definitions and discussion
of dependence testing, see [2], [9] and [17]. In the definitions
that follow, S_i refers to a statement in the program. If there are
two statements, S_i and S_j, and i < j, then S_i lexically precedes
S_j.


i   flow dependence: If data fetched by S_s1 may be generated by
    S_s0, then a flow dependence exists from S_s0 to S_s1.


ii  output dependence: If S_s1 may write to some memory location
    after S_s0 has written to that same memory location, then an
    output dependence exists from S_s0 to S_s1.


iii anti dependence: If S_s0 fetches from a memory location that
    may subsequently be overwritten by S_s1, then S_s1 is anti
    dependent on S_s0.


iv  control dependence: If whether S_s1 executes depends on the
    outcome of the execution of S_s0, then a control dependence
    exists from S_s0 to S_s1.


In each of the four cases, S_s0 is called the source of the
dependence and S_s1 the sink. If the sink of a dependence is after
the source, i.e. s0 < s1, the dependence is a lexically forward
dependence (LFD). Otherwise the dependence is a lexically backward
dependence (LBD).


Dependences extend over zero or more iterations. The
number of iterations a dependence extends across is known as the
dependence distance, Δ [17]. Figure 3a shows a loop nest. Values
of A, in S_1, written in an iteration are read three iterations later
by S_2. Therefore a flow dependence exists from S_1 to S_2 and
Δ = 3.


A dependence can be thought of as a relation between two
statements. This relation can be represented graphically on a DG,
where nodes are statements and a directed arc from S_s0 to S_s1
represents a dependence. The arcs are labeled with the value of Δ
for the dependence. Nested DO loops are considered to be a single
statement, and are represented by a single node in the DG.
Control cycles caused by backwards GOTOs are also represented
by a single node in the DG. Figure 3b gives a DG for the loop of
Figure 3a. If two or more dependences, all with the same source
and sink, but with different distances, exist, we can remove all the
dependence arcs except for the one with the minimum distance.
The reason for this is that all arcs but the one with the shortest
distance are redundant when using the synchronization instructions
of this paper. An example showing this is given in the section
Dependence Arc Elimination. An arc is considered redundant
if the dependence it enforces will be honored even if it is not
explicitly synchronized. What arcs are redundant is dependent on the
synchronization instruction set and control structure of the
architecture.


      DOSPREAD 10 I = 1, N
        A(I) = B(I) + C(I)          S_1
        D(I) = A(I-3) + E(I)        S_2
   10 CONTINUE


(a) A Loop Nest


      S_1 --> S_2     (arc labeled 3)


(b) DG for the Loop of (a)


Figure 3. A Loop and Its DG


Dependences can have Δs that are greater than zero, less
than zero, and equal to zero. If Δ > 0, as in Figure 3b, the
dependence is from an earlier iteration to a later iteration. If
Δ = 0, then the dependence is within a single iteration, from a
statement to a later one. If Δ < 0, the dependence is from a later
iteration to an earlier one. This can only happen if the dependence
is on a loop nested within another loop. Figure 4a shows such a loop
nest, and Figure 4b shows the DG. There are two distances attached to
the dependence arc. The first, 1, is the distance of the dependence on
the outer I loop and the second, -2, is the distance on the inner J
loop.


      DOSPREAD 20 I = 1, N
        DO 10 J = 1, N                      L_1
          A(I,J) = B(I,J) + C(I,J)
          D(I,J) = A(I-1,J+2) - E(I,J)
   10   CONTINUE
   20 CONTINUE

(a) A Loop Nest

      L_1 --> L_1     (self arc labeled 1, -2)


(b) The DG for (a)


Figure 4. A Loop Nest with Δ > 0 and Δ < 0 Dependences


The DG is the major data structure for determining the
location of synchronization instructions in a program. For our
architecture, when the dependence graph is built only dependences
with Δ > 0 are included. A Δ less than zero has to be considered
only if two nested loops are to execute as dospread loops. Since
dospread loops cannot be nested, these types of dependences will
not need to be synchronized. If Δ = 0, then the dependence is
within an iteration. Since data stores and fetches are serial within
an iteration, these dependences are forced to be honored by the
hardware, and need not be synchronized.
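
The two filtering rules for building the DG (keep only dependences with a positive distance on the dospread loop, and keep only the minimum distance for each source and sink pair) can be sketched as follows. The function name build_dg, the tuple layout and the sample dependence list are assumptions made for illustration; the dependence tests that would produce such a list are not modeled.

```python
def build_dg(dependences):
    """Return {(source, sink): distance} for the arcs kept in the DG.

    dependences: iterable of (source, sink, distance) tuples, where distance
    is the dependence distance on the dospread loop. This is a hypothetical
    input format chosen for the sketch.
    """
    dg = {}
    for source, sink, distance in dependences:
        if distance <= 0:
            # Distance 0 is honored by serial execution within an iteration;
            # negative distances only matter for nested dospread loops.
            continue
        key = (source, sink)
        if key not in dg or distance < dg[key]:
            dg[key] = distance          # keep the minimum distance only
    return dg


sample = [
    ("S1", "S2", 0),    # within an iteration: dropped
    ("S2", "S1", 3),    # superseded by the shorter arc below
    ("S2", "S1", 2),
    ("S1", "S3", -1),   # dropped
]
print(build_dg(sample))
```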


All control dependences have Δ = 0, except for control
dependences caused by branches out of the loop, called exit-ifs.
If an exit-if occurs, the loop terminates. Therefore, in loops with
an exit-if, each statement of an iteration is dependent on the
execution of the exit-if in the previous iteration. Thus a control
dependence exists between the exit-if and every statement in the
graph. Every dependence arc but the one from the exit-if to the
first statement of the loop body is redundant and can be removed.
Therefore we only include this dependence arc in the DG. Which
exit-if dependence arcs are redundant is strongly dependent on
the control structure of the architecture, so adding only one
control arc for an exit-if is not generally valid. Figure 5a shows a
loop nest that will be used throughout this paper as an example,
and Figure 5b the DG for the loop. Note that a data dependence
with distance zero exists from S_4 to S_5 and is not included on the
DG.


Synchronization Instructions


Three synchronization instructions are needed to synchronize 
dospread loops. We first give the semantics of these instructions, 
and then show how they are used to synchronize loops. All three 
instructions operate on a fixed set of synchronization registers 
accessible to all processors. The synchronization instructions 
assume that the DO loops being synchronized are normalized [2], 
i.e. they run from 1 to some upper bound in increments of 1. This 
can be done automatically. 


The testset instruction has the syntax testset(r), where r is
a synchronization register. It performs two functions. The first is
to wait until the testset on r has executed in the previous
iteration. Having done so, it alters the value of r to signal later
iterations it has executed. Figure 6a gives the semantics of the
testset statement. One important feature of the testset instruction
is that all executions of a testset after the first, for some
synchronization register r, within an iteration, act as no-ops.


      DOSPREAD 50 I = 1, N
        S(I) = S(I) + 1.0                   S_1
        IF (S(I).GT.0.0) GOTO 60            S_2
        DO 10 J = 1, N                      L_1
          A(I,J) = B(I,J) + C(I,J)
   10   CONTINUE
        IF (D(I).GT.0.0) GOTO 20            S_3
        E(I) = F(I-1) + G(I)                S_4
        F(I) = E(I) + H(I)                  S_5
   20   CONTINUE                            S_6
        P(I) = 2.0*P(I)                     S_7
        DO 40 M = 1, 8                      L_2
          Q(I,M) = R(I,M) + A(I-1,M)
          R(I,M) = Q(I-1,M+2)*2.0
   40   CONTINUE
   50 CONTINUE
   60 CONTINUE


(a) A Loop Nest


[DG drawing: nodes S_1, S_2, L_1, S_3, S_4, S_5, S_6, S_7, L_2, with arcs labeled 1]


(b) DG for the Loop Nest of (a)


Figure 5. A Loop Nest and Its DG


l:  if r = i - 1 then
        r := i
    else if r < i - 1 then
        goto l


(a) Semantics of the testset Instruction


    if i - Δ > 0 then
l:      if r < i - Δ then goto l


(b) Semantics of the test Instruction


l:  if all iterations j, j < i, have completed then
        halt execution of all other processors executing the loop
    else
        goto l


(c) Semantics of the terminate Instruction


Figure 6. Semantics of the testset, test and terminate Instructions


The test instruction has the syntax test(r) Δ. The test
instruction in an iteration waits until a testset instruction has
executed Δ iterations previously.


The final synchronization instruction is the terminate 
instruction. Its syntax is simply terminate. Its purpose is to halt 
execution of a dospread loop when an iteration branches out of 
the loop during execution of an exit-if. The terminate waits until 
all previous iterations of the loop have finished executing, and 
then halts the execution of the loop. Figure 6c summarizes its 
semantics. 
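
The semantics of testset and test can be exercised with a small software simulation. The sketch below is an assumption-laden model, not the hardware described in this paper: the class SyncRegisters, the use of Python threads to stand in for processors, registers initialized to zero, and the initial values chosen for D(-1) and D(0) are all hypothetical. It runs a loop with a lexically backward dependence of distance two, A(I) = B(I) + D(I-2) followed by D(I) = E(I) + A(I), with a test before the sink and a testset after the source, and compares the result with serial execution.

```python
import threading


class SyncRegisters:
    """Software stand-in for the shared synchronization registers (all start at 0)."""

    def __init__(self, count):
        self.regs = [0] * (count + 1)        # index 0 unused; registers are 1..count
        self.cond = threading.Condition()

    def testset(self, r, i):
        # Wait for the testset of iteration i-1, then record iteration i.
        with self.cond:
            self.cond.wait_for(lambda: self.regs[r] >= i - 1)
            if self.regs[r] == i - 1:
                self.regs[r] = i
                self.cond.notify_all()
            # If regs[r] is already i, this is a repeated testset: a no-op.

    def test(self, r, i, delta):
        # Wait until the testset of iteration i - delta has executed.
        if i - delta > 0:
            with self.cond:
                self.cond.wait_for(lambda: self.regs[r] >= i - delta)


def serial_loop(n, b, e):
    a, d = {}, {-1: 0.0, 0: 0.0}
    for i in range(1, n + 1):
        a[i] = b[i] + d[i - 2]
        d[i] = e[i] + a[i]
    return d


def dospread_loop(n, b, e):
    a, d = {}, {-1: 0.0, 0: 0.0}
    sync = SyncRegisters(1)

    def iteration(i):
        sync.test(1, i, 2)          # the sink waits for iteration i-2
        a[i] = b[i] + d[i - 2]      # sink of the dependence
        d[i] = e[i] + a[i]          # source of the dependence
        sync.testset(1, i)          # signal that the source has executed

    # One thread per iteration, started in reverse order to stress the ordering.
    threads = [threading.Thread(target=iteration, args=(i,)) for i in range(n, 0, -1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return d


n = 32
b = {i: float(i) for i in range(1, n + 1)}
e = {i: 0.5 * i for i in range(1, n + 1)}
print(dospread_loop(n, b, e) == serial_loop(n, b, e))
```

The terminate instruction is left out of the sketch, since halting other processors has no direct counterpart in this thread model.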


This instruction set involves several tradeoffs. The primary
advantage of the instruction set, as discussed when building the
DG, is that dependence distances can safely be assumed to be any
distance less than or equal to the minimum distance of a
dependence. Therefore, one can always be assumed as the distance
of a dependence. This is important for two reasons. First,
it allows simpler dependence testing routines to be used.
Secondly, in cases where the dependence distance is unknowable at
compile time, as with subscripted subscripts, the dependence can
still be safely synchronized by assuming the dependence distance is
one, the minimum distance that the dependence can take.


One disadvantage is that the testset instruction partially 
serializes the loop since each testset must wait until a testset in 
the previous iteration has executed. Therefore no loop can exe- 
cute completely in parallel. This serializing effect, however, has 
an important advantage. It increases the number of dependence 
arcs that are redundant. In doing so it allows the aforementioned 
assumptions to be made about the distance of a dependence. 


Another disadvantage is the use of a fixed number of regis- 
ters. If a loop has more dependences to synchronize than there 
are registers available, we can be forced to resort to dependence 
folding [14] to reduce the number of dependence arcs present. 
Dependence folding usually leads to loops being more serial. Only 
having a fixed number of registers also causes problems with 
nested loops, which are discussed later in this section. The advan- 
tage of a fixed number of registers is that they can be accessed 
faster than global memory, and special hardware can be set up, if 
desired, for accessing these registers, thereby taking traffic off the 
interconnection network. 


We now show how to synchronize LFDs and LBDs using the 
above synchronization instructions when no nested loops are 
present in the body of the loop. With any dependence it is neces- 
sary that executing the source of the dependence notify the sink, 
in a later iteration, that it can execute. It is also necessary that 
the sink of a dependence wait until notified that its source has 
executed. With an LFD this is simple. If a testset is placed after 
the source of the dependence, and before the sink of the depen- 
dence, the dependence will be enforced. 


For example, consider the loop in Figure 7a, and its DG in
Figure 7b. The dependence from S_1 to S_2 extends over two
iterations. Figure 7c shows a testset placed between S_1 and S_2.
Before the sink of a dependence executes in iteration i + 2 the
testset preceding it must complete execution. The testset in
iteration i + 2 cannot execute before the testset in iteration i + 1,
which, in turn, cannot execute before the testset in iteration i.
The testset in iteration i cannot execute, however, until the
source of the dependence in iteration i has executed. Thus the
dependence is forced to be honored.


      DO 10 I = 1, N
        A(I) = B(I) + C(I)          S_1
        D(I) = E(I) + A(I-2)        S_2
   10 CONTINUE

(a) A Loop Nest

      S_1 --> S_2     (arc labeled 2)

(b) The DDG for (a)

      DO 10 I = 1, N
        A(I) = B(I) + C(I)          S_1
        testset(1)                  TS_1
        D(I) = E(I) + A(I-2)        S_2
   10 CONTINUE

(c) The Loop Nest of (a) Synchronized


Figure 7. Synchronizing a LFD


To synchronize a LBD, such as is seen in Figure 8a, a testset 
is placed after the source of the dependence to signal that the 
source has executed in this iteration. A test is then placed before 
the sink of the dependence. Thus the sink of the dependence 
cannot execute until after the source, since the test instruction 


will not finish executing until after the testset following the source 
has completed. Figure 8c shows the synchronized loop. 


      DO 10 I = 1, N
        A(I) = B(I) + D(I-2)        S_1
        D(I) = E(I) + A(I)          S_2
   10 CONTINUE


(a) A Loop Nest


      S_2 --> S_1     (arc labeled 2)

(b) The DDG for (a)


      DO 10 I = 1, N
        test(1) 2
        A(I) = B(I) + D(I-2)
        D(I) = E(I) + A(I)
        testset(1)
   10 CONTINUE


(c) The Loop Nest of (a) Synchronized


Figure 8. Synchronizing a LBD


If control branches are present in the dospread loop further
actions must be taken. In particular, we must ensure that a testset
be executed for every register used in the loop, on every iteration
of the loop. Consider again the loop nest of Figure 5. To
synchronize the LBD from S_5 to S_4 a testset would be placed
between S_5 and S_6. If the IF statement at S_3 takes the true
branch, then the testset would not be executed on that iteration.
The testset of the next iteration in which the false branch of the
IF is taken would then deadlock waiting for its synchronization
register to be updated. Therefore a testset should be placed along
the flow of control path through the loop containing the true
branch.


This problem is handled as follows. The loop is broken up
into basic blocks (BAS) [1], and a flow of control graph [1] is built
that represents the control structure of the program. This graph
can be represented by a matrix reach(i, j), such that
reach(i, j) = true if an arc from BAS_i to BAS_j is in the flow
of control graph. The (i, j) element of the transitive closure of
reach, reach*(i, j), is true if a path exists from BAS_i to BAS_j.


Each arc in the DG is handled in turn. A testset, TS_n, is
placed after the source of the dependence. At each branch, reach
can be checked to see if any target of the branch is the block
containing TS_n. If this is so, every other target BAS of the branch is
checked (using reach*) to determine if it can also reach the block
containing TS_n. If it cannot, a testset is placed in that BAS.
Thus every branch that can reach TS_n must either reach the
original testset or an added testset, and deadlock cannot occur.
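
A minimal sketch of this placement check is given below. The reach matrix is stored as a list of boolean rows and its transitive closure is computed with Warshall's algorithm; the function names, the 0-based block numbering, and the reading of "target of the branch is the block containing the testset" as "target is, or can reach, that block" are assumptions made for the sketch. The three-block example models the IF at S_3 of Figure 5: block 0 holds the IF, block 1 holds S_4, S_5 and the testset, and block 2 starts at the CONTINUE labeled 20.

```python
def transitive_closure(reach):
    """Warshall's algorithm on a square boolean matrix (list of lists)."""
    n = len(reach)
    closure = [row[:] for row in reach]
    for k in range(n):
        for i in range(n):
            if closure[i][k]:
                for j in range(n):
                    if closure[k][j]:
                        closure[i][j] = True
    return closure


def extra_testset_blocks(reach, ts_block):
    """Blocks that need an added testset so every path executes one."""
    closure = transitive_closure(reach)
    n = len(reach)

    def leads_to_ts(block):
        return block == ts_block or closure[block][ts_block]

    extra = set()
    for block in range(n):
        targets = [t for t in range(n) if reach[block][t]]
        # Only branches (more than one target) that can lead to the testset matter.
        if len(targets) > 1 and any(leads_to_ts(t) for t in targets):
            extra.update(t for t in targets if not leads_to_ts(t))
    return sorted(extra)


# Blocks: 0 = IF at S_3, 1 = S_4, S_5 and the testset, 2 = CONTINUE and the rest.
reach = [
    [False, True, True],
    [False, False, True],
    [False, False, False],
]
print(extra_testset_blocks(reach, ts_block=1))
```

In this example the only block returned is block 2, which corresponds to adding a second testset on the path taken by the true branch of the IF.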


Two other flow of control problems remain: the presence of 
nested DO loops within the dospread loop, and exit-ifs. The 
former case is handled in the next section. We deal with the 
latter case now. 


Consider the exit-if of statement S_2 in Figure 5. The control
arc from S_2 to S_1 forms a LBD that will cause test(1) 1 to be
placed before S_1, and testset(1) to be placed after S_2. The test
instruction will wait until the testset in the previous iteration has
executed before allowing S_1 to execute, and the testset will not
execute if the exit branch of the exit-if is taken. If the exit branch
is taken in iteration i, all iterations less than i must be allowed
to finish, and the deadlocked iterations greater than i must be
terminated.


To halt the loop, a block of code is added to the loop that
executes a terminate instruction. As will be seen, placement of
this block is unimportant but in this paper it will be added to the
end of the loop. For every exit branch to some label l, the block:


   l'   terminate
        GOTO l


is added to the loop, and all branches to l in the loop are changed
to branches to l'. The terminate instruction checks if all iterations
of the loop less than the current one have finished. If so, it halts
all processors executing the loop and the GOTO makes the exit
branch. To prevent other statements in the loop from falling
through to this added block, a GOTO to the CONTINUE at the
end of the dospread loop is added immediately before the
terminate instruction. Figure 9 shows the program with these
statements added.


      DOSPREAD 50 I = 1, N
        test(1) 1                           T_1
        S(I) = S(I) + 1.0                   S_1
        IF (S(I).GT.0.0) GOTO 61            S_2
        testset(1)                          TS_1
        DO 10 J = 1, N                      L_1
          A(I,J) = B(I,J) + C(I,J)
   10   CONTINUE
        IF (D(I).GT.0.0) GOTO 20            S_3
        E(I) = F(I-1) + G(I)                S_4
        F(I) = E(I) + H(I)                  S_5
   20   CONTINUE                            S_6
        P(I) = 2.0*P(I)                     S_7
        DO 40 M = 1, 8                      L_2
          Q(I,M) = R(I,M) + A(I-1,M)
          R(I,M) = Q(I-1,M+2)*2.0
   40   CONTINUE
        GOTO 50
   61   terminate
        GOTO 60
   50 CONTINUE
   60 CONTINUE


Figure 9. Loop Nest with Exit-if Synchronized




Nested Loops 


If the source and sink of a dependence are both within the
same nested loop, the dependence forms a self cycle on the DG
node representing the loop. L_2 in the DG of Figure 5b is an
example of this. This loop is reproduced in Figure 10a. If a test is
placed before the loop, and a testset is placed after the loop, the
loop forms a large chunk of serial code in the dospread loop. It is
therefore desirable that some parallelism be extracted from this
nested loop. Figure 10b gives the iteration space [17] for this loop
nest. Each node in the iteration space represents an iteration of
the loop, and arcs represent dependences from iteration to
iteration. Dependences on the M loop span two iterations of the M
loop, while dependences on the dospread loop span one iteration.
Our goal is to allow some iterations of L_2 in iteration i + 1 of the
dospread to execute before all iterations of L_2 execute in iteration
i by placing synchronization instructions so that dependences on
both the dospread and L_2 loops are satisfied.


To do this a transformation called loop splitting is performed.
Loop splitting breaks the inner loop into n different sub-loops,
and places synchronization instructions between the sub-loops.
Each sub-loop executes a contiguous set of iterations from
the original loop. Thus if the M loop of our example is broken
into four sub-loops, the first would run from 1 to 2, the second
from 3 to 4, the third from 5 to 6 and the fourth from 7 to 8.
Each sub-loop contains two iterations, therefore if the first
sub-loop of iteration i of the dospread loop waits until the second
sub-loop in the previous iteration of the dospread loop has
executed, dependences to it will be honored. That this is true can be
seen by examining the sources of dependences in the iteration
space diagram that have their sinks in the second sub-loop. The
second sub-loop must wait until the third sub-loop of the previous
iteration has finished, and the third sub-loop must
wait until the fourth sub-loop of the previous iteration has
finished. The fourth sub-loop contains no dependence sinks whose
sources are in other sub-loops, therefore it does not need to wait
on any other sub-loop. These dependences can be detected by
splitting the inner loop and recomputing the dependences between
the sub-loops. The synchronization of these dependences will give
the execution ordering specified above.
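
The recomputation of dependences between sub-loops can be illustrated with a short Python sketch for the M loop of the running example, where iteration M of one dospread iteration reads a value of Q written at M+2 in the previous dospread iteration. The helper names split_bounds and subloop_dependences are hypothetical, and the sketch simply enumerates iterations instead of using the analytic method of [14].

```python
def split_bounds(lower, upper, size):
    """Bounds of contiguous sub-loops of the given size covering lower..upper."""
    return [(lo, min(lo + size - 1, upper)) for lo in range(lower, upper + 1, size)]


def subloop_dependences(bounds, inner_offset):
    """Return sorted (source_subloop, sink_subloop) pairs, numbered from 1.

    The sink at inner index m depends on the value produced at index
    m + inner_offset in the previous iteration of the dospread loop.
    """
    def owner(m):
        for k, (lo, hi) in enumerate(bounds, start=1):
            if lo <= m <= hi:
                return k
        return None     # the source lies outside the loop bounds: no dependence

    arcs = set()
    for sink_k, (lo, hi) in enumerate(bounds, start=1):
        for m in range(lo, hi + 1):
            source_k = owner(m + inner_offset)
            if source_k is not None:
                arcs.add((source_k, sink_k))
    return sorted(arcs)


bounds = split_bounds(1, 8, 2)                    # four sub-loops of two iterations
print(bounds)
print(subloop_dependences(bounds, inner_offset=2))
```

For the example, the pairs produced are (2, 1), (3, 2) and (4, 3): the first sub-loop waits on the second, the second on the third, and the third on the fourth, while the fourth waits on none.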


      DOSPREAD 50 I = 1, N
        DO 10 J = 1, N                      L_1
          A(I,J) = B(I,J) + C(I,J)
   10   CONTINUE
        DO 40 M = 1, 8                      L_2
          Q(I,M) = R(I,M) + A(I-1,M)
          R(I,M) = Q(I-1,M+2)*2.0
   40   CONTINUE
   50 CONTINUE

(a) A Loop Nest

[Iteration space diagram: columns M = 1, ..., 8; rows I = 1, ..., 6]

(b) Iteration Space Diagram for (a)

      DOSPREAD 50 I = 1, N
        DO 10 J = 1, N                      L_1
          A(I,J) = B(I,J) + C(I,J)
   10   CONTINUE
        DO 401 M = 1, 2                     L_2
          Q(I,M) = R(I,M) + A(I-1,M)
          R(I,M) = Q(I-1,M+2)*2.0
  401   CONTINUE
        DO 402 M = 3, 4                     L_3
          Q(I,M) = R(I,M) + A(I-1,M)
          R(I,M) = Q(I-1,M+2)*2.0
  402   CONTINUE
        DO 403 M = 5, 6                     L_4
          Q(I,M) = R(I,M) + A(I-1,M)
          R(I,M) = Q(I-1,M+2)*2.0
  403   CONTINUE
        DO 404 M = 7, 8                     L_5
          Q(I,M) = R(I,M) + A(I-1,M)
          R(I,M) = Q(I-1,M+2)*2.0
  404   CONTINUE
   50 CONTINUE


(c) The Loop Nest of (a) after Splitting


(d) DG for the Split Loop


Figure 10. A Loop Splitting Example


When this synchronization has taken place, the execution of 
the third and first, and fourth and second sub-loops of adjacent 
iterations of the dospread loop can overlap, salvaging some of the 
parallelism present before loop spreading. 


The only other problem that remains is what to do with 
dependences whose sources are outside the loop being split, but 
whose sink is in the loop being split, or whose sink is outside the 
loop being split but whose source is the CONTINUE statement of 
the loop being split. The dependence should be replicated over 
every copy of the loop created by loop splitting. The dependence 
routines used to build the original DG should then be queried to 
see if the dependences exist to or from each sub-loop of the split
loop. If the dependence no longer exists to or from a particular 
sub-loop, it should be removed from the DG. In our example, the 
dependence does exist to every sub-loop. 


Figure 10c shows the M loop split, and Figure 10d shows the 
DG for the loop nest with dependences resulting from loop split- 
ting added. 


In [14] a method is given for determining analytically the 
placement of dependences for split loops as a function of the max- 
imum dependence distance on the split loop and the minimum 
number of iterations in a sub-loop. 


Dependence Arc Elimination


After nested loops have been taken care of, as in the previ- 
ous section, synchronization instructions can be inserted into the 
source program using the methods of the second section. Adding 
synchronization instructions now can result in more instructions 
being added than is necessary to insure the legal execution of the 
program. We now discuss a method of reducing the number of 
dependences needing synchronization in a DG. As the number of 
dependences needing synchronization is reduced, the amount of 
synchronization added to the loop is reduced, and consequently 
the overhead associated with synchronization is reduced. The goal 
of dependence arc elimination is to determine what dependences 
are redundant, i.e. what dependences are implicitly synchronized 
by a combination of the control structure of the architecture and 
the synchronization of other dependences. Related work in this 
area is discussed in [12]. 


We present our dependence arc elimination method with the 
instruction set defined in this paper in mind. The technique is 
usable, however, with a wide range of architectures. Details of 
this are given in [14]. 


A dependence arc can be eliminated if a combination of the
control structure of the target architecture, and synchronization
added for other dependences, force the dependence to always be
synchronized. To eliminate a dependence arc, a simple way of
representing the total control structure of a loop nest is required.
We use a controlled path graph, or CPG, to give this
representation. A CPG contains nodes, each of which represents a
statement instance. A statement instance is the occurrence of a
statement within an iteration of the dospread loop. At least a
source and sink of any dependence arc to be eliminated must be
included in the CPG. Thus, if Δ_max is the longest distance of any
dependence in the dospread loop, the CPG must contain at least
Δ_max + 1 iterations (columns). That no more columns than this
are needed is proven in [14]. Arcs in the CPG represent the
execution order of the statement instances: the statement instance at
the head of an arc will execute after the statement instance at the
tail of an arc.


Figure 11a gives a dospread loop without branches and with
synchronization instructions added. Figure 11b gives the DG for
the loop that led to this synchronization, and Figure 11c gives the
CPG for this loop. The dashed lines in the CPG represent control
exercised over the program execution by the architecture. As
stated before, statements within a single iteration execute serially
on a single processor. Each statement within an iteration must
execute after the previous statement in that iteration, and
therefore is at the head of an arc whose tail is the previous statement.
Next, we know a testset on a certain register must follow the
testset on that same register in the previous iteration. Therefore an
arc is drawn from each testset in the first iteration to the testset
in the second iteration using the same synchronization register.
Finally, we know that a test(r) Δ instruction must follow the
testset instruction Δ iterations back. Therefore we draw arcs from
each testset to its corresponding test, if one exists.


      DOSPREAD 10 I = 1, N
        A(I) = B(I) + C(I)          S_1
        testset(1)                  TS_1
        test(2) 2                   T_2
        B(I) = A(I) + E(I-2)        S_2
        E(I) = D(I) + F(I)          S_3
        testset(2)                  TS_2
        F(I) = A(I-2) + E(I)        S_4
   10 CONTINUE

(a) A DOSPREAD Loop

      S_1 --> S_4     (arc labeled 2)
      S_3 --> S_2     (arc labeled 2)

(b) The DG for (a)


(c) The CPG for (a)


Figure 11. Example of a CPG


Constructing the CPG is the architecturally dependent por- 
tion of this method. If it is desirable to do dependence arc elimi- 
nation with another architecture, all that is necessary is that the 
dashed arcs accurately reflect the control exercised by the target 
architecture, and that arcs are added to reflect the effects of syn- 
chronizing the loop. 


Deciding if an arc should be removed from the DG is trivial 
once the CPG has been built. We need only to find a path from 
the source to the sink of the dependence in the CPG without using 
any arcs in the CPG that result from synchronizing that depen- 
dence. It is possible to eliminate the dependence arc since if a 
path exists through the CPG from the source to the sink of the 
eliminated dependence, then the sink of the dependence must exe- 
cute after the source of the dependence without explicitly syn- 
chronizing the dependence. 


The dependence arc from S_1 to S_4 can be removed from the
DG for this loop. Starting at S_1 in iteration one, TS_2 in the same
iteration can be reached. Its arcs can be followed to TS_2 in
iteration 3, and from there the dashed lines can be followed to S_4 in
iteration 3. Therefore we can travel from the source to the sink of
the dependence without going through any arcs resulting from
TS_1, which synchronizes the dependence whose arc is being
eliminated.
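
The path search on the CPG can be sketched in Python for a branch-free loop such as the one in Figure 11a. The representation is an assumption made for illustration: nodes are (instruction, iteration) pairs, each synchronization arc is tagged with the register that produced it, and a dependence is identified by the register that synchronizes it. The function names build_cpg and is_redundant are hypothetical, and the sketch assumes one testset per register.

```python
from collections import deque


def build_cpg(body, sync, iterations):
    """Return CPG arcs as (tail, head, register); register is None for serial arcs.

    body: instruction names in program order.
    sync: maps an instruction name to ("testset", r) or ("test", r, delta).
    """
    arcs = []
    testset_of = {info[1]: name for name, info in sync.items() if info[0] == "testset"}
    for it in range(1, iterations + 1):
        # Statements of one iteration execute serially on one processor.
        for tail, head in zip(body, body[1:]):
            arcs.append(((tail, it), (head, it), None))
        for name, info in sync.items():
            if info[0] == "testset" and it > 1:
                # A testset follows the testset on the same register one iteration back.
                arcs.append(((name, it - 1), (name, it), info[1]))
            elif info[0] == "test" and it - info[2] >= 1:
                # test(r) delta follows the testset on r executed delta iterations back.
                arcs.append(((testset_of[info[1]], it - info[2]), (name, it), info[1]))
    return arcs


def is_redundant(arcs, source, sink, distance, register):
    """True if sink (iteration 1 + distance) is reachable from source (iteration 1)
    without using arcs produced by the given register."""
    start, goal = (source, 1), (sink, 1 + distance)
    successors = {}
    for tail, head, reg in arcs:
        if reg != register:
            successors.setdefault(tail, []).append(head)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


body = ["S1", "testset(1)", "test(2)", "S2", "S3", "testset(2)", "S4"]
sync = {"testset(1)": ("testset", 1), "test(2)": ("test", 2, 2), "testset(2)": ("testset", 2)}
arcs = build_cpg(body, sync, iterations=3)     # longest distance is 2, so 3 columns
print(is_redundant(arcs, "S1", "S4", 2, register=1))
print(is_redundant(arcs, "S3", "S2", 2, register=2))
```

With this model the arc from S_1 to S_4 is found to be redundant, while the arc from S_3 to S_2 is not, since no path leaves S_3 in iteration one except through arcs produced by register 2.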


After deciding that an arc can be removed from the DG, we 
do so. As well, all nodes in the CPG used to synchronize the arc 
are removed, and the arcs associated with them are removed. 
This insures that the dependence arc just removed is not used to 
eliminate any other dependence arc. 


We give another example that shows that only the dependence
arc with the minimum distance needs to be retained when
two or more dependence arcs with different distances exist
between two statements. Figure 12a gives a loop nest with both
dependences from S_2 to S_1 synchronized. Figure 12b gives the
CPG for this loop nest. The testset and test on register 2
synchronize the dependence of length two, while the testset and
test on register 3 synchronize the dependence of length three.
Starting at S_2 in iteration one, the dashed arcs can be traveled to
reach TS_2. The arc leaving TS_2 in iteration one, and traveling to
TS_2 in iteration two is taken, followed by the arc to T_2 in
iteration three. From there the dashed arcs can be taken to S_1 in
iteration three, showing that the dependence arc with distance
three can be eliminated.


      DO 10 I = 1, N
        test(2) 2                   T_2
        test(3) 3                   T_3
        A(I) = B(I-1) + B(I-3)      S_1
        B(I) = A(I) * 2.0           S_2
        testset(2)                  TS_2
        testset(3)                  TS_3
   10 CONTINUE


(a) A Loop Nest


(b) The CPG for (a)


Figure 12. Dependence Arc Elimination Example: Redundancy of Longer Dependence


Figure 13a shows the program of Figure 5a with synchroni- 
zation statements added. Figure 13b and Figure 13c show the 
four possible CPGs for this program. Four CPGs are needed to 
express the possible combinations of control paths taken on 


      DOSPREAD 50 I = 1, N
        test(1) 1                           T_1
        S(I) = S(I) + 1.0                   S_1
        IF (S(I).GT.0.0) GOTO 61            S_2
        testset(1)                          TS_1
        DO 10 J = 1, N                      L_1
          A(I,J) = B(I,J) + C(I,J)
   10   CONTINUE
        testset(2)                          TS_2
        testset(3)                          TS_3
        testset(4)                          TS_4
        testset(5)                          TS_5
        IF (D(I).GT.0.0) GOTO 20            S_3
        test(6) 1                           T_6
        E(I) = F(I-1) + G(I)                S_4
        F(I) = E(I) + H(I)                  S_5
        testset(6)                          TS_6
   20   CONTINUE                            S_6
        testset(6)                          TS_6
        P(I) = 2.0*P(I)                     S_7
        test(7) 1                           T_7
        DO 401 M = 1, 2                     L_2
        test(8) 1                           T_8
        DO 402 M = 3, 4                     L_3
        testset(7)                          TS_7
        test(9) 1                           T_9
        DO 403 M = 5, 6                     L_4
        testset(8)                          TS_8
        DO 404 M = 7, 8                     L_5
        testset(9)                          TS_9
        GOTO 50
   61   terminate
        GOTO 60
   50 CONTINUE
   60 CONTINUE


(a) Loop of Figure 5 Synchronized


Figure 13. Dependence Arc Elimination with Branches


different iterations. The first assumes that the false branch of the 
IF of S, is taken in both iterations. The second assumes that the 
false branch is taken in the first iteration, and the true branch is 
taken in the next iteration. The third assumes that the true 
branch is taken in the first iteration, and the false branch is taken 
in the second. Finally, the fourth assumes that the true branch is 
taken in both iterations. The CPGs are built the same way as when no 
flow of control branches are present, with two exceptions. First, when a 
statement is a branch, an arc is placed only from the branch to 
the statement that will execute after the branch, and not to every 
target of the branch. Second, any synchronization instruction that 
is not executed in the flow of control path for an iteration does 
not have any synchronization arcs coming in or going out of it. 


(b) CPGs for (a) 


Figure 13. Continued 


Eliminating a dependence arc is the same as with the CPG 
of Figure 11c except that it is necessary to find a path through 
every CPG constructed for the loop. This is necessary since the 
dependence arc must be redundant regardless of the flow of con- 
trol path taken through the graph. It is possible to eliminate the 
dependence arc since if a path exists through all the CPGs, then 
the sink of the dependence must execute after the source of the 
dependence, without synchronizing the dependence, regardless of 
what branches are taken. 
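
The test described above amounts to a reachability check repeated over every CPG. The following Python sketch illustrates it under stated assumptions: each CPG is represented as a mapping from a node to its successor nodes, and the arcs contributed by the dependence under test are supplied separately so that they can be ignored. The function names and the tiny example graphs are hypothetical and are not taken from the paper.

```python
from collections import deque

def reachable(cpg, source, sink, excluded_arcs):
    """Breadth-first search from source to sink that never uses an excluded arc."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for succ in cpg.get(node, ()):
            if (node, succ) in excluded_arcs or succ in seen:
                continue
            seen.add(succ)
            queue.append(succ)
    return False

def is_redundant(cpgs, source, sink, excluded_arcs):
    """The arc is redundant only if every CPG still orders source before sink."""
    return all(reachable(cpg, source, sink, excluded_arcs) for cpg in cpgs)

# Hypothetical CPGs; nodes are (statement, iteration) pairs.
cpg_a = {("S2", 1): [("TS", 1)], ("TS", 1): [("T", 2)], ("T", 2): [("S1", 2)]}
cpg_b = {("S2", 1): [("TS", 1)], ("TS", 1): [], ("T", 2): [("S1", 2)]}
```

With these graphs, `is_redundant([cpg_a], ("S2", 1), ("S1", 2), set())` holds, while adding `cpg_b` to the list makes the check fail because one control path offers no alternative ordering.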


Consider, as an example, the dependence from L, to be 
synchronized by the testset instruction, TS;. Using the arcs in 
the CPGs provided by TS,, and the architecturally dependent 
arcs, we can travel from the source to the sink of the dependence 
without using any arcs supplied by the L, to Ls dependence, in 
this case those arcs connecting TS,. Therefore we can eliminate 
this arc from the graph, since it is synchronized by other arcs. As 
each arc is eliminated in the DG, the corresponding 
synchronization instruction nodes in the CPG are eliminated so 
that they are not used when eliminating other dependence arcs. 
Likewise, the arcs representing dependences from L, to L, and L, 
to L, can be eliminated. 


(c) CPGs for (a) 


Figure 13. Continued 


The dependence arc in the DG from L, to js is eliminated 
by the backward dependence from S; to S,, and the arcs provided 
by its testset, TS,. After this arc is eliminated from the DG, no 
more arcs can be eliminated. Figure 14a shows the DG after 
dependence arc elimination, and Figure 14b shows the synchron- 
ized program after dependence arc elimination. 


(a) DG for the Loop Nest of Figure 5 


Figure 14. The Loop of Figure 5 After Dependence Arc Elimination 


DOSPREAD 50 I = 1, N 


test (1) 1 T, 
S(I) = S(I) + 1.0 S, 
IF (S(I).GT.0.0) GOTO 61 So 
testset (1) TS, 
DO 10 J =1,N lis 
A(I,J) = B(I,J) + C(I,J) 
10 CONTINUE 
IF (D(I).GT.0.0) GOTO 20 Sa 
test(2) 1 T, 
E(I) = F(I-1) + G(I) S, 
F(I) = E(I) + H(I) S. 
testset (2) TS. 
20 CONTINUE S. 
testset (2) TS, 
P(I) = 2.0*P(I) S, 
test(3) 1 T; 
DO 401 M = 1, 2 Ls 
test(4) 1 T, 
DO 402 M = 3, 4 Le 
testset (3) TS, 
test(5) 1 Ts 
DO 403 M = 5, 6 L3 
testset (4) TS, 
DO 404 M = 7,8 Le 
testset(5) TS, 
GOTO 50 
61 terminate 
GOTO 60 


50 CONTINUE 


60 CONTINUE 
(b) Loop of Figure 5 Synchronized 


Figure 14. Continued 


Finding a path through all the CPGs for a loop is necessary 
and sufficient for eliminating a dependence arc. This is proven in 
[14]. It is also shown in [14] that dependence arc elimination is 
NP-hard, and a polynomial time algorithm for dependence arc 
elimination in loops without control branches is given. An 
exponential algorithm is given for loops with control branches. 
The exponential cost of the algorithm is shown to be acceptable in 
practice because of the small number of branches and the 
short dependence distances present in most loops. 


If one is willing to not eliminate as many dependence arcs as 
possible, an easy but conservative approach is possible. By put- 
ting in arcs representing control exercised by the architecture only 
for non-branching statements, it will not be possible to go from 
the source to the sink of a dependence that is separated by a con- 
trol flow branch. This means only dependences whose sources 
and sinks lie within a basic block will be eliminated. Therefore, 
only one CPG can be built and the dependence elimination can be 
done in polynomial time. Using this method, the dependence arcs 
from L, to the L, sub-loops would not have been eliminated in the 
example of Figure 13. 


Conclusions 


We have shown that the automatic generation of synchroni- 
zation instructions is not only practical, but straightforward. 
Many of the techniques discussed in this paper have been imple- 
mented in a pass for the Parafrase program restructurer, demon- 
strating their practicality. We have also shown that a set of syn- 
chronization primitives that allows fairly primitive dependence 
testing to take place can be used effectively to synchronize DO 
loops. 


Several of the techniques discussed can be easily extended to 
other instruction sets. The algorithm used to place testset 
instructions can be used with any set instruction. The method of 
dependence arc elimination we have discussed can be immediately 


extended to most if not all architectures by modifying the place- 
ment of arcs in the CPG. Finally, loop splitting can be used 
effectively with any instruction set in which the number of syn- 
chronization registers is limited. 


Abstract -- In this paper we introduce a program graph model 
for sequential programs. We then present two algorithms which can 
assign any program graph to be executed on a pipeline configuration 
having the minimum number of processors of a reconfigurable 
multiprocessor. We show that the time complexity of these algorithms 
is polynomial for large and practical classes of program graphs that 
include r-ary trees and serial/parallel graphs. We conjecture that the 
problem of assignment is NP-complete. Consequently, the complexity 
of the algorithms is the best that one can hope to achieve. 


1. INTRODUCTION: 


One of the main reasons that concurrent and parallel systems fail to 
achieve their potential gain is that the architecture is not well 
matched to the application of interest, or vice versa [1]. In this paper 
an effort is made to match sequential programs to a pipeline 
configuration of a reconfigurable multiprocessor [2]. The 
reconfigurable multiprocessor can be reconfigured to connect 
processors into commonly used topologies such as pipeline, tree, 
mesh, loop, and their composites [3]. 


We move from the conventional pipeline architecture [4,5,6] to a 
new one, called a program pipeline, in which each stage is a 
processor or a group of processors. To execute a sequential program 
concurrently on a program pipeline, the program is divided into a 
sequence of blocks. Each block is executed by one stage of the pipeline. It is assumed 
that the program will be executed frequently, and hence a reduction 
of total execution time will result if compared to the total execution 
time of the same program on a uniprocessor. A compiler program is 
a practical example which can be divided and which can repeatedly 
receive compilation tasks [7,8]. 


2. PROGRAM GRAPH MODEL 

Definition 1: 


An acyclic directed graph G is called a program graph iff the 
following three conditions are satisfied: 


1) Exactly one node in G has no incoming edges; this node is called 
the entry node of G 


2) Every node in G is reachable by a directed path from the entry 
node 


3) Every node V in G is assigned a positive integer d(V) called the 
delay of node V 


*This work is supported in part by a grant from IBM Corporation 


Definition 2: 


Let G be a program graph. The delay of G, denoted d(G), is as 
follows: 


$d(G) = \max\,[\,d(V_1), d(V_2), \ldots, d(V_n)\,]$, where $V_1, \ldots, V_n$ are the nodes of G. 


Definition 3: 


A program graph is called linear iff it consists of only one 
directed path. 


Definition 4: 


A linear program graph is called a pipeline with delay K iff for 
every node V in the graph, d(V) = K. 


Definition 5: 


Let G be a program graph and P be a pipeline with delay K. An 
assignment f of G into P is a function that assigns each node V in G 
to a node f(V) in P such that the following two conditions are 
satisfied: 

1) For every two distinct nodes V and V' in G, if there is a directed 
path from V to V' then either f(V) and f(V') are the same node in 
P or there is a directed path from node f(V) to node f(V') in P. 

2) The maximum value in the set 

$\left\{ \sum_{i=1}^{r} d(V_i) \;:\; V_1, \ldots, V_r \text{ constitute a directed path in } G \text{ and are assigned to the same node in } P \right\}$ 

is at most K, the delay of P. 


The next lemma follows immediately from the above definitions. 

Lemma 1: 

If there is an assignment from a program graph G into a pipeline 
P with delay K, then $d(G) \le d(P)$, i.e., $d(G) \le K$. 


Problem: It is required to design an algorithm that takes any 
program graph G and any positive integer K, where $K \ge d(G)$, then 
defines a pipeline P with delay K and an assignment f of G into P 
such that the number of nodes in P is minimum. 


Let us comment on the fact that our program graph model does 
not allow cycles, while sequential programs do have cycles. If a 
sequential program has a cycle, as shown in Fig. 1a for 
example, then its execution can be pipelined only if there is an 
upper bound on the number of times the cycle is to be executed. In 
this case, we suggest two ways to represent a cycle in 
our program graph model: 


1) Collapse the cycle into one node in the program graph 
model. The execution time of the node equals $t \times s$, where t 
is the time to execute the cycle once, and s is the upper 
bound on the number of times the cycle is executed. 
Collapsing is shown in Fig. 1b. 


2) Unfold the cycle into a sequence of s nodes, each with 
execution time t, in the program graph model, where t and s 
are as defined above. Unfolding is shown in Fig. 1c. 


Notice that the first solution will produce a program graph with a 
smaller number of nodes but the execution time of such nodes is 
bigger than those in the second solution. 


An efficient assignment of linear program graphs into pipelines 
is given in the next section. 


Fig. 1. Cycle Problem Solution 


a) Cycle    b) Collapsing method    c) Unfolding method 


3. EFFICIENT ASSIGNMENT OF LINEAR PROGRAM 
GRAPHS INTO PIPELINES: 


In this section, we present an algorithm that takes any linear 
program graph G, and a positive integer K, where $d(G) \le K$, then 
computes: 

a) The minimum number p of nodes in a pipeline P with delay K such 
that there is an assignment of G into P, and 
b) The assignment function f: G → P. 


Algorithm 1: 


Input: G and K (as defined above) 
Output: An integer p, and an array f (as defined above) 
Variables: 

    sum: integer; 
    node: (1, ..., g, g+1), where g is the number of nodes in G 


p := 0; 
node := 1; 
while node <= g 
do  p := p + 1; 
    sum := 0; 
    while node <= g and sum + d(node) <= K 
    do  f[node] := p; 
        sum := sum + d(node); 
        node := node + 1 
    endwhile 
endwhile 
End of Algorithm 1 


The following theorem establishes the correctness of Algorithm 1. 

Theorem 1: 


Let G be any linear program graph, and K be any positive 
integer such that $K \ge d(G)$, and let p and f be the output of applying 
Algorithm 1 to G and K. Then: 


a) f is an assignment of G into a pipeline with delay K that has p 
nodes. 

b) If there is an assignment of G into a pipeline with delay K, then 
this pipeline must have at least p nodes. 
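
As an illustration, the greedy packing of Algorithm 1 can be sketched in Python. The sketch assumes the linear program graph is given simply as the list of node delays in path order; the function name `assign_linear` and the 0-based list layout are choices made here, not part of the paper.

```python
def assign_linear(delays, K):
    """Greedy packing of a linear program graph into pipeline stages.

    delays[i] is the delay d(V) of the i-th node on the path and K is the
    stage delay. Returns (p, f): the number of stages used and, for each
    node, the 1-based index of the stage it is assigned to.
    """
    if any(d > K for d in delays):
        raise ValueError("K must be at least d(G), the largest node delay")
    p = 0
    f = [0] * len(delays)
    node = 0
    while node < len(delays):
        p += 1          # open a new pipeline stage
        total = 0       # delay accumulated in the current stage
        # Keep adding consecutive nodes while the stage stays within K.
        while node < len(delays) and total + delays[node] <= K:
            f[node] = p
            total += delays[node]
            node += 1
    return p, f
```

For example, `assign_linear([2, 3, 1, 4, 2], 5)` packs the first two nodes into stage 1, the next two into stage 2, and the last node into stage 3.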


4. EFFICIENT ASSIGNMENT OF GENERAL 
PROGRAM GRAPHS INTO PIPELINES: 


We present in this section an algorithm that takes any (general) 
program graph G, and a positive integer K, where $d(G) \le K$, then 
computes: 

a) The minimum number p of nodes in a pipeline P with delay K such 
that there is an assignment of G into P, and 
b) An assignment f of G into P. 


Algorithm 2: 
Input: G and K (as defined above) 
Output: An integer p, and an array f (as defined above) 
Variables: 
    node: (1, ..., g) 
Steps: 
1. p := 0 
2. for node := 1, ..., g do f[node] := 0 endfor 
3. for each complete path T in G do 
       * Recall that T is in fact a linear program graph * 
       a) Apply Algorithm 1 to T and K, and let the 
          output be q and h respectively. 
          * By Theorem 1, q is the minimum number 
            of nodes in a pipeline Q with delay K 
            such that there is an assignment h of 
            T into Q. * 
       b) p := max(p, q) 
       c) for every node in T do 
              f[node] := max(f[node], h[node]) 
          endfor 
   endfor 


The following theorem establishes the correctness of Algorithm 2. 


Theorem 2: 


Let G be any program graph, and K be any positive 
integer such that $K \ge d(G)$, and let p and f be the output of applying 
Algorithm 2 to G and K. Then: 


a) f is an assignment of G into a pipeline with delay K that has p 
nodes. 

b) If there is an assignment of G into a pipeline with delay K, then 
this pipeline must have at least p nodes. 
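
The path-by-path procedure of Algorithm 2 can be sketched as follows. This is a minimal illustration under assumptions: the graph is a successor mapping, a "complete path" is taken to mean a path from the entry node to a node with no successors, and the helper names (`assign_linear`, `complete_paths`, `assign_general`) are hypothetical.

```python
def assign_linear(delays, K):
    """Algorithm 1: greedy packing of consecutive nodes into stages of delay K."""
    p, node, f = 0, 0, [0] * len(delays)
    while node < len(delays):
        p += 1
        total = 0
        while node < len(delays) and total + delays[node] <= K:
            f[node] = p
            total += delays[node]
            node += 1
    return p, f

def complete_paths(succ, entry):
    """Yield every path from the entry node to a node without successors."""
    stack = [[entry]]
    while stack:
        path = stack.pop()
        successors = succ.get(path[-1], [])
        if not successors:
            yield path
        for v in successors:
            stack.append(path + [v])

def assign_general(succ, delay, entry, K):
    """Algorithm 2: run Algorithm 1 on every complete path and keep the maxima."""
    if max(delay.values()) > K:
        raise ValueError("K must be at least d(G)")
    p = 0
    f = {v: 0 for v in delay}
    for path in complete_paths(succ, entry):
        q, h = assign_linear([delay[v] for v in path], K)
        p = max(p, q)
        for v, stage in zip(path, h):
            f[v] = max(f[v], stage)
    return p, f

# A small diamond-shaped program graph (illustrative data only).
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
delay = {"A": 2, "B": 3, "C": 1, "D": 2}
```

Enumerating paths explicitly makes the cost visible: the running time is proportional to the total length of all complete paths, which is what the complexity results in the next section measure.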


5. TIME COMPLEXITY: 


In this section we present some results concerning the time 
complexity of Algorithms 1 and 2. In particular we show that: 
1) Algorithm 1 has a linear time complexity. 
2) Algorithm 2 has a polynomial time complexity for large (and 
practical) classes of program graphs. 
3) Algorithm 2 has an exponential time complexity in general. 


Theorem 3: 


The time complexity of Algorithm 1 is O(n), where n is the number of 
nodes in the linear program graph G. 


Next we discuss the complexity of Algorithm 2 for some practical 
classes of program graphs, namely, r-ary trees and serial/parallel graphs. 


Definition 6: An r-ary tree is a directed rooted tree where each 
nonleaf node has at most r children. 


Definition 7: A serial/parallel graph is a program graph that has 
one entry node, one exit node, and a block in between as shown in 
Fig. 2, where a block is defined recursively as follows: 

1) One node as shown in Fig. 3a, or 
2) One node followed by a block as shown in Fig. 3b, or 
3) One node that branches to two blocks as shown in Fig. 3c. 


Fig. 2. Serial/parallel graph: an entry node, followed by a block, followed by an exit node 


Fig. 3. The recursive definition of a block 


Corollary 1: 


The time complexity of Algorithm 2 is 
$O\left(\frac{1}{r}\,[n(r-1)+1]\,\log_r [n(r-1)+1]\right)$, i.e., $O(n \log_r n)$, when applied to 
program graphs in the shape of an r-ary tree with n nodes. 


Corollary 2: 


The time complexity of Algorithm 2 is $O(n^2)$ when applied to a 
program graph in the shape of a serial/parallel graph with n 
nodes. 


Unfortunately, the time complexity of Algorithm 2 is not always 
polynomial, as demonstrated by the following theorem. 


Theorem 4: 


There is a class of program graphs, namely complete acyclic graphs, 
for which the time complexity of Algorithm 2 is exponential. 
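
To see why path enumeration can blow up, the following sketch counts entry-to-exit paths in a complete acyclic graph. It assumes that "complete acyclic" means nodes 1..n with an edge from i to j whenever i < j; the function name is hypothetical. Under that assumption the count doubles with each added node, so an algorithm that visits every complete path does exponential work.

```python
from functools import lru_cache

def count_complete_paths(n):
    """Count paths from node 1 to node n when every pair i < j is joined by an edge."""
    @lru_cache(maxsize=None)
    def paths_from(i):
        if i == n:
            return 1
        # A path leaving i may jump to any later node j.
        return sum(paths_from(j) for j in range(i + 1, n + 1))
    return paths_from(1)
```

Under the stated assumption, the count equals 2^(n-2) for n >= 2.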


6. CONCLUSION: 


The execution of sequential programs can be sped up by 
executing them on a reconfigurable architecture. Specifically, we have 
introduced a program graph to model sequential programs and 
presented two algorithms which can be used to assign any program 
graph to a pipeline having the minimum number of processors. We 
proved the correctness of both algorithms and discussed their time 
complexity. The complexity analysis showed that the complexity of 
Algorithm 1 is linear, and the complexity of Algorithm 2 is polynomial 
when applied to program graphs in the shape of r-ary trees or 
serial/parallel graphs. It is also shown that there exists a class of 
program graphs for which the complexity of Algorithm 2 is 
exponential. We suspect that the problem is NP-complete in 
general, and if it is so, then the time complexity of Algorithms 1 and 2 
is the best that one can hope to achieve. 


References 


[1] A.K. Jones and P. Schwartz, "Experience Using Multiprocessor 
Systems - A Status Report", Computing Survey, Vol. 12, No. 2, 
June 1980. 


[2] C. Wu, K. Mossaad, W. Lin, "Architecture of a Distributed 
Reconfigurable Multimicro-Computer Network", Proceedings of 
International Computer Symposium, 1982. 


[3] W. Lin and C. Wu, "Design of Configuration Algorithms for a 
Multiprocessor-STAR", Proceedings of the International 
Conference on Parallel Processing, 1985. 


[4] P. M. Kogge, "The Architecture of Pipelined Computers", 
Hemisphere Publishing Corp. and McGraw-Hill Book Company, 
1981. 


[5] C.V. Ramamoorthy and H. F. Li, "Pipeline Architecture", 
Computing Survey, Vol. 9, No. 1, March 1977. 


[6] K. Hwang, and F.A. Briggs, "Computer Architecture and Parallel 
Processing", McGraw-Hill, 1984. 


[7] P. El-Dessouki, W. Huen, and M. Evens, "Towards a Partitioning 
Compiler for a Distributed Computing System", Proceedings of 
the International Conference of Parallel Processing, 1979. 


[8] J. K. Ousterhout, "Medusa, a Distributed Operating System", 
UMI Research Press, 1981. 


SIMULTANEOUS BROADCASTING IN MULTIPROCESSOR 
NETWORKS 


K. N. Venkataraman 
George Cybenko 
David W. Krumme 


Department of Computer Science 
Tufts University 
Medford MA 02155 


Abstract 


The problem of simultaneous broadcasting in multi- 
processor networks is studied in this paper. This prob- 
lem, in which every processor has a token that must be 
broadcast to every other processor in the network, arises 
in the development of applications programs for multiprocessor 
networks. The main result of this paper is that the 
optimal algorithm for the token broadcast problem (under 
a restricted model of communication) for a complete 
graph network with N processors runs in $[1.44 \log_2 N + 4]$ 
steps. It is also shown that the same problem can be 
solved on a hypercube in $[2 \log_2 N]$ steps. 


1 Introduction 


In writing applications programs to run on the INTEL 
iPSC/d5 hypercube multiprocessor [2], we have found a 
number of uses for the following construct. Every pro- 
cessor has a token that must be broadcast to every other 
processor and these tokens can be combined so that the 
effective size of combined tokens is the same as the size 
of the original tokens. This assumption that tokens can 
be combined applies where communication or packet-size 
overhead is the dominating factor, or where tokens can be 
combined arithmetically or otherwise. This construction 
is useful for synchronizing the starting times of the actual 
applications code on all processors for accurate timing 
measurements. Another equally important use for such 
a construct is in the computation of a global sum where 
each processor is storing one of the summands and the 
sum must be computed and communicated to all proces- 
sors. This situation has arisen, for example, in sparse 
eigenvalue and linear equations solving techniques with 
which we have been experimenting [1]. In general, the 
communications requirements demanded by this problem 
are necessary and sufficient to evaluate any function of 
all the tokens at all of the nodes in the network, and the 
assumption that tokens can be combined yields a simple 
yet reasonable abstraction of the problem. 

Using realistic models for interprocessor communication 
in a local memory, message passing multiprocessor 
network, we have studied good and optimal algorithms 
for performing this task on hypercubes and other mul- 
tiprocessor networks. It appears that the minimal time 
parallel algorithm for solving this global token passing 
problem provides a useful criterion for evaluating a mul- 
tiprocessor network topology. It is quite distinct, as our 
examples show, from previously used characteristics such 
as diameter, girth, density and connectivity [7]. 


2 Token Broadcast Problem 


In this section we shall state our version of the token 
broadcast problem and the two different models of com- 
munication that we shall focus on in this paper. 

Definition: Given a network N, specified by the interconnection 
graph G=(V,E), and a token $t_i$ associated 
with each of the nodes $p_i$, $1 \le i \le n = |V|$, the token 
broadcast problem is that each node should broadcast its 
token to all the other nodes in G and as a result at the 
end of the broadcast each node should contain the tokens 
$t_1, \ldots, t_n$. 

This version of the broadcast problem is different from 
the routing problems that have been studied [3,4,8,10]. 
We require that all tokens be available at all of the nodes 
at the end of the broadcast. 

We are interested in efficient algorithms for solving 
the broadcast problem for different interconnection graphs 
that have been proposed [7,8,9]. In the current paper we 
shall focus our attention on the hypercube and the com- 
plete graph. We shall investigate the algorithms for the 
broadcast problem under the two models of communi- 
cation described below. In both the models we assume 
that it takes the same amount of time to transmit one 
token or a collection of tokens across the communication 
link from a node to its neighbour and we shall use that 
measure as our unit of time. This assumption is realis- 
tic to a large extent on the INTEL iPSC multiprocessor 
system due to packet-size effects and further, makes it 
easier to analyze the properties of optimal algorithms for 
the broadcast problem. 

Model A: In each step of the computation, messages 
can be transmitted on any links but only in one direction 
along each link in the network. In other words, any num- 
ber of edges can be active but each edge can be used only 
in one direction at a given step. This is a good model of 
certain architectures, including the iPSC, if one assumes 
that actual communication time dominates. 

Model B: In each step of the computation, a proces- 
sor can either receive a message from one of its neighbours 
or send a message to one of its neighbours. In effect, in 
a given step at most half the processors communicate 
with the other half. This model describes virtually any 
architecture under the assumption that the overhead or 
setup time at each node is great compared with the ac- 
tual communication time. On the iPSC, this is often the 
best model. 

For an architecture like that of the Ncube [5] with 
bidirectional channels, a bidirectional version of model A 
could apply, and in that case the graph-theoretic ques- 
tions that we ask have easy answers, while model B if 
appropriate would still most naturally use unidirectional 
communication. 


3 Optimal Algorithms for Complete 
Graphs 


In this section we shall study the token broadcast prob- 
lem for the complete graph interconnection pattern. We 
will present an algorithm for each of the communication 
models and we will also show that the algorithms are 
optimal in terms of number of steps. 

It is easy to see that we can solve the problem in two 
steps under model A. We pick a distinguished node p and 
in the first step, every node (except p) sends its token to p 
and in the second step p sends all the accumulated tokens 
to each of the other processors. It is also clear that this 
algorithm is optimal in terms of number of steps. 

For model B, we shall first establish a lower bound on 
the number of steps and then present an algorithm which 
runs at that speed. Let us assume that there are $N = 2^d$ 
nodes in the network. We shall introduce a measure of 
the amount of information at a given node and study the 
maximum rate of growth of the total information over 
all the nodes in the network. At first, one may try the 
number of tokens $v_i$ at processor $p_i$ as a measure of information 
content at that processor, with $\sum_i v_i$ being the 
total information over all the nodes in the network. A 
simple analysis shows that the maximum rate of growth 
of information is 2 and the resulting lower bound on the 
number of steps is d. Instead, let us use $v_i^k$ as the measure 
and find the k which will give us the best lower bound. 
A careful analysis shows that k = 2 is the best choice; 
given k = 2 a straightforward calculation establishes the 
following theorem. 


Theorem 1: For a network of N processors whose interconnection 
pattern is a complete graph, any algorithm 
that solves the token broadcast problem under model B 
takes at least 

$$\frac{\log_2 N}{\log_2\left(\frac{1+\sqrt{5}}{2}\right)}$$ 

steps. (This translates to roughly $1.44 \log_2 N$ steps.) 
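
The constant in Theorem 1 can be checked numerically. The short sketch below evaluates the bound for a given N; the function name `model_b_lower_bound` is an illustrative choice.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # the golden ratio, (1 + sqrt(5)) / 2

def model_b_lower_bound(num_processors):
    """Evaluate log2(N) / log2((1 + sqrt(5)) / 2), the bound of Theorem 1."""
    return math.log2(num_processors) / math.log2(PHI)
```

Since 1 / log2((1 + sqrt(5)) / 2) is approximately 1.4404, the bound is roughly 1.44 log2 N, in agreement with the parenthetical remark above.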

The following theorem shows that the lower bound is 
actually matched by the optimal algorithm to the broad- 
cast problem. 

Theorem 2: There exists an algorithm that solves 
the token broadcast problem under model B for a network 
of N processors whose interconnection pattern is a 
complete graph in $[1.44 \log_2 N + 4]$ steps. 

Proof: The general description of the actual algorithm 
which achieves this bound is as follows. After $k \ge 2$ 
steps of the algorithm half of the nodes have accumulated 
$2 \cdot \mathrm{fib}(k)$ tokens and the other half have accumulated 
$2 \cdot \mathrm{fib}(k-1)$ tokens, where fib(k) is the k-th 
element in the Fibonacci sequence with fib(0) = 0 and 
fib(1) = 1. At each step, the nodes with the larger number 
of tokens send their tokens to the other half of the nodes; 
care should be taken so that the tokens that a node receives 
are disjoint from the tokens that it already has. The 
actual protocol is given below. 

We assume N is even; if not, choose any one node 
and send its token to another site, solve the problem on 
all nodes but the one, and lastly send from some site 
to the chosen node. Let n = N/2 and label the nodes 
$p_0, p_1, \ldots, p_{n-1}, q_0, q_1, \ldots, q_{n-1}$. 

Step 0: Transmit from $p_i$ to $q_i$ for all $i$, $0 \le i \le n-1$. 

Step 1: Transmit from $q_i$ to $p_i$ for all $i$, $0 \le i \le n-1$. 

Step k, $k \ge 2$: If k is even, transmit from $p_i$ to 
$q_{i+\mathrm{fib}(k-1)}$ for all $i$, and if k is odd, transmit from $q_i$ 
to $p_{i+\mathrm{fib}(k-1)}$ for all $i$, where + denotes addition modulo 
n. 

It is easy to verify that this strategy has the following 
properties: after step k, for each $i$ either $p_i$ or $q_i$ has all 
tokens from all nodes $p_j$ and $q_j$ for $j = i, i-1, \ldots, i - \mathrm{fib}(k+1) + 1$, 
where − denotes subtraction modulo n; the 
other one ($p_i$ or $q_i$) has all tokens from nodes $p_j$ and $q_j$ 
for $j = i, i-1, \ldots, i - \mathrm{fib}(k) + 1$; after step k, nodes $p_i$ 
and $q_i$ (k even) or nodes $q_i$ and $p_i$ (k odd) have $2 \cdot \mathrm{fib}(k)$ 
and $2 \cdot \mathrm{fib}(k+1)$ tokens respectively; each step involves 
transmitting from the sites with the larger number of 
tokens to the sites with the smaller number. Then it 
can be seen that when $\mathrm{fib}(k) \ge n$, every node has all 
tokens and the broadcast is finished. It is well known 
that $\mathrm{fib}(k) \ge \left(\frac{1+\sqrt{5}}{2}\right)^{k-2}$, and from this inequality it is 
straightforward to count steps and arrive at the above 
bound. 

It is not obvious that this algorithm is optimal, and 
our only proof of its optimality is that it asymptotically 
takes exactly the same number of steps given by the lower 
bound. In fact this algorithm achieves at each step very 
close to the maximum rate of growth of information as we 
defined it in establishing the lower bound, and it solves 


the broadcast problem efficiently (usually within 1 step 
of the lower bound) even for small values of N. 

The optimal algorithm does not make use of all the 
links in the complete graph. The communications re- 
quired by this algorithm define a network whose intercon- 
nection pattern is a regular graph of valence O(log N). 
Further this algorithm would be optimal even if one were 
to charge for the transmission as a linear function of the 
size of the message. 


4 Broadcast Problem on a Hypercube 


The hypercube $H_d$ of dimension d has (0,1)-vectors of 
length d as nodes, with an edge between two nodes if they 
differ in exactly one coordinate [7]. It is easy to observe 
that any algorithm under either model of communication 
will take at least d steps to solve the broadcast problem 
since the diameter of the hypercube $H_d$ is d. 

It is not too difficult to find algorithms that under 
model A solve the broadcast problem in d+ 1 steps. The 
simplest was first presented in [6] and can be described 
in this way. In the even numbered steps, nodes labelled 
with even parity send their collection of tokens to their 
respective neighbours labelled with odd parity and during 
odd numbered steps, the nodes with odd parity send their 
collection of tokens to their respective neighbours with 
even parity. But we have recently discovered for d > 3 
a way to merge two different strategies of this sort into 
a combined strategy that solves the problem in d steps 
which is optimal. 
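
The simple (d+1)-step strategy under model A can be simulated as follows. The sketch assumes that "respective neighbours" means all d neighbours of a node, which is permitted under model A because every edge joins nodes of opposite parity and is therefore used in only one direction per step. The function name is hypothetical.

```python
def parity_broadcast(d):
    """Alternate even-parity and odd-parity senders on the d-cube; return steps used."""
    size = 1 << d
    tokens = [{v} for v in range(size)]
    step = 0
    while any(len(t) < size for t in tokens):
        sender_parity = step % 2          # even steps: even-parity nodes send
        snapshot = [set(t) for t in tokens]
        for v in range(size):
            if bin(v).count("1") % 2 == sender_parity:
                for bit in range(d):      # send along every incident edge
                    tokens[v ^ (1 << bit)] |= snapshot[v]
        step += 1
    return step
```

Each receiving step extends a node's knowledge by one unit of distance, and the two parity classes take turns receiving, which is why one extra step beyond the diameter is needed.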

Let us look at the broadcast problem for model B. Let 
$T_i^+$ ($T_i^-$) denote the computation step where each node 
whose i-th bit is 0 (1) transmits all its accumulated tokens 
to its neighbour whose corresponding bit is 1 (0). Now 
consider the sequence $\sigma = T_1^+, T_1^-, T_2^+, T_2^-, \ldots, T_d^+, T_d^-$. It can 
be shown that any permutation of $\sigma$ is a solution to the 
broadcast problem, and furthermore if any step is omitted 
from such a permutation then it is no longer a solution. 
Thus this algorithm runs in 2d steps but no faster. 
We have tried a number of different direct approaches, 
and they all resulted in algorithms that run in 2d steps. 
However, we have recently discovered an extremely com- 
plicated algorithm that solves the problem in 17 steps on 
the 9-cube, so we know that 2d is not optimal. But we 
believe there will be no simple algorithm that is better 
than 2d, and we also suspect that the optimum is closer 
to 2d than to 1.44d, so for practical purposes we consider 
the permutations of $\sigma$ to be the recommended algorithms 
for the broadcast problem under model B. 
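
The 2d-step schedule for model B is easy to check by simulation. In the sketch below a step is encoded as a pair (i, sender_bit): nodes whose i-th bit equals sender_bit send across dimension i. This encoding and the function names are assumptions made for illustration, with dimensions numbered from 0.

```python
from itertools import permutations

def run_schedule(d, schedule):
    """Apply the steps in order; return True if every node ends with all 2**d tokens."""
    size = 1 << d
    tokens = [{v} for v in range(size)]
    for i, sender_bit in schedule:
        for v in range(size):
            if ((v >> i) & 1) == sender_bit:
                # The receiver differs from the sender in bit i only.
                tokens[v ^ (1 << i)] |= tokens[v]
    return all(len(t) == size for t in tokens)

def sigma(d):
    """The 2d steps: for each dimension i, one step with sender bit 0 and one with sender bit 1."""
    return [(i, b) for i in range(d) for b in (0, 1)]
```

Running `run_schedule` over every permutation of `sigma(2)` confirms that each ordering completes the broadcast, while dropping any single step leaves some node without a token.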


5 Other Multiprocessor Networks 


Given an arbitrary interconnect network, it is interesting 
to consider the minimal number of steps needed by 
an algorithm to solve this token broadcast problem for 
either of the models we consider here. Especially in the 
case of Model B, the minimal number of steps appears 
to effectively capture the speed at which a network can 
perform a global synchronization. This measure is quite 
distinct from diameter or other common measures. 

Restricting our attention to Model B (which is ar- 
guably most realistic and in fact models the performance 
of the INTEL iPSC family), our results demonstrate that 
a complete graph (diameter of 1) is less than 50% more ef- 
ficient than a hypercube with the same number of nodes. 
Another way of viewing our result is that there exist in- 
terconnection networks with about 50% more commu- 
nications channels than a hypercube that are optimal 
among all networks for this token passing problem. Our 
understanding of these optimal networks is extremely 
rudimentary at present. 

A good example to illustrate this new measure is a 
star network with one central node and n spokes. Each 
of the n spokes is connected to the central node only. 
The diameter of this graph is 2 but the optimal algorithm 
under Model B requires 2n steps. This observation cap- 
tures precisely the common wisdom that a star network 
encounters too much contention at its central node. 

A linear array of nodes requires either $n$ or $n+1$ steps
for an optimal algorithm (depending on the parity of $n$)
and this does in fact correspond closely to the diameter
in this case. We have shown that for an
$n_1 \times n_2 \times \cdots \times n_k$ multidimensional grid,
the problem can be solved in $\sum_{i=1}^{k} p(n_i)$ steps,
where $p(2m) = 2m$ and $p(2m+1) = 2m+2$. However, the diameter
of such a grid is $\sum_{i=1}^{k} n_i - k$, and so this solution
is always within $2k$ of the diameter, which is a lower bound.
We have also shown that for a ring of $n$ nodes it can be solved
in $n/2 + \sqrt{2n}$ steps. In both cases, the solution is optimal
and the number of steps is close to the diameter. We do not know
what the optimum is for the hypercube.
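The grid bound can be checked numerically. The following is a minimal Python sketch, not part of the original analysis; the function names are illustrative. It only evaluates the step count $\sum p(n_i)$ and the diameter $\sum n_i - k$ quoted in the text and confirms that their difference does not exceed $2k$ for a few sample grids.

```python
def p(n):
    """Per-dimension step count from the text: p(2m) = 2m, p(2m + 1) = 2m + 2."""
    return n if n % 2 == 0 else n + 1


def grid_steps(dims):
    """Steps quoted for an n_1 x n_2 x ... x n_k grid: the sum of p(n_i)."""
    return sum(p(n) for n in dims)


def grid_diameter(dims):
    """Diameter of the same grid: the sum of (n_i - 1)."""
    return sum(dims) - len(dims)


# Sample grids chosen only for illustration.
for dims in [(4,), (5,), (4, 4), (3, 5, 8)]:
    steps = grid_steps(dims)
    diameter = grid_diameter(dims)
    print(dims, steps, diameter, steps - diameter, 2 * len(dims))
```

Each even dimension contributes a gap of 1 and each odd dimension a gap of 2, which is where the $2k$ bound comes from.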


6 Conclusions 


The main result of this paper is that the optimal algorithm
for the token broadcast problem (under a restricted model of
communication) for a complete graph network with $N$ processors
runs in $[1.44 \log_2 N + 4]$ steps. We observe that the hypercube
network is not very much slower, as the same problem can be solved
on a hypercube in $[2 \log_2 N]$ steps. There are two immediate
questions that one would like to answer. What is the optimal
algorithm for the hypercube network? Is there a structural
property of the interconnection graph that totally captures the
difficulty of the token broadcast problem?


References 


[1] G. Cybenko, Alva Couch, David Krumme, and K. N. 
Venkataraman. Heterogeneous processors on homo- 
geneous multiprocessors. To Appear in Proceedings 
of Second ARO Workshop on Scientific Computing 
and Medium-Scale Multiprocessors, SIAM, Philadel- 
phia, 1986. 


[2] INTEL Corporation. iPSC Data Sheet. 1985. 


[3] G. Lev, N. Pippenger, and L. G. Valiant. A fast par- 
allel algorithm for routing in permutation networks. 
IEEE Trans. on Computers, C-30(2):93-100, 1981.


[4] D. Nassimi and Sartaj Sahni. Data broadcasting 
in SIMD computers. IEEE Trans. on Computers,
C-30(2), February 1981.


[5] NCUBE Corporation. Product Announcement. Tempe 
Arizona, 1985. 


[6] Y. Saad and Martin Schultz. Data Communication
in Hypercubes. Technical Report, Research Report
YALEU/DCS/RR-428, October 1985.


[7] Leonard Uhr. Algorithm-Structured Computer Ar- 
rays and Networks. Academic Press, 1984. 


[8] J. D. Ullman. Computational Aspects of VLSI. 
Computer Science Press, 1984. 


[9] J.D. Ullman. Some Thoughts about Supercomputer 
Organization. Technical Report STAN-CS-83-987, 
Department of Computer Science, Stanford Univer- 
sity, October 1983. 


[10] L. G. Valiant. A scheme for fast parallel communi- 
cation. SIAM J. Computing, 11(2), 1982. 




VECTOR PROCESSING ON THE 
ALLIANT FX/8 MULTIPROCESSOR 


Walid Abu-Sufah and Allen D. Malony 


Center for Supercomputing Research and Development 
University of Illinois at Urbana-Champaign 
104 S. Wright St. 
Urbana, Ill. 61801-2987 


Abstract 


The Alliant FX/8 multiprocessor implements several high-speed 
computation ideas in software and hardware. Each of the 8 
computational elements (CEs) has vector capabilities and mul- 
tiprocessor support. Generally, the FX/8 delivers its highest 
processing rates when executing vector loops concurrently [5]. 
In this paper, we present extensive empirical performance results 
for vector processing on the FX/8. The vector kernels of the 
LANL BMK8a1 benchmark are used in the experiments. We
execute each kernel on 1 and 8 CEs and show the measured exe- 
cution rate (in MFLOPS) as a function of vector length. We 
assess the performance of 1 CE as a vector processor by finding
the vector lengths where vector performance exceeds scalar
performance and calculating Hockney's $n_{1/2}$. For 8 CEs, we give
upper/lower bounds on the achieved speedups and on the mul- 
tiprocessing overhead. We also show the speedup variation as 
the number of CEs increases from 2 to 8. Our results reveal 
some interesting phenomena. Vector processing performance in 
a machine with a multi-level memory hierarchy, such as the 
FX/8, depends significantly on where the referenced vectors 
reside. Execution from memory, rather than from cache, 
degrades performance by a factor up to 3.7. Although speedups 
around 7 can be achieved for most stride-1 kernels when exe- 
cuted on 8 CEs, the maximum execution rates occur only for a 
narrow range of vector lengths (O(1000)). Performance drops 
rapidly when the vector lengths deviate slightly from the 
optimal values. This phenomenon is not observed when executing
on a single CE; the peak performance is obtained when the vec- 
tors are 32 elements long and remains close to the maximum for 
longer vector lengths (O(1000)). The kernels do not gain any 
appreciable speedup when the number of CEs is increased 
beyond 4 for short (O(100)) or long (O(10,000)) vectors. Mul- 
tiprocessing of some indexed vector kernels results in almost no 
speedup due to the synchronization necessary to enforce output 
dependencies. 


1. Introduction 


The Alliant FX/8 is a shared memory multiprocessor system
with a maximum advertised performance of 94.4 million
floating point operations per second (MFLOPS) for single
precision computations [5]. Each of its 8 computational elements
(CEs) has vector processing capability with a peak advertised
execution rate of 11.8 MFLOPS. The FX/8 is one of several
machines announced in the last few years that use different
forms of parallelism to exceed the performance attainable from
the technology used in the implementation(a). The FX/8 combines
several interesting high-speed computation ideas in both software
and hardware [12], [20], [21], [19], [13], [17]. It has an
interactive optimizing Fortran compiler which transforms loops in
subroutines to execute in vector mode on a single CE,
vector-concurrent mode on multiple CEs, or scalar-concurrent mode
on multiple CEs [5]. The operating system, Concentrix, is a
multiprocessor Unix based on Berkeley 4.2 BSD. Multiprocessing is
realized by concurrency control hardware in each CE which is
accessed using special concurrency instructions. The 8 CEs of the
system are crossbar connected to a shared, direct-mapped cache.
The cache is connected to the shared memory via a bus. A more
detailed description of the FX/8 is presented in Section Two.


* This work was supported in part by the National Science
Foundation (DCR84-06916) and the US Department of Energy.


The performance assessment of a vector multiprocessor machine, 
like the FX/8, is important because of the great amount of effort 
that was spent in the last decade to develop vector algorithms 
for different applications and to enhance the capabilities of vec- 
torizing preprocessors to detect vector loops in dusty deck codes 
[19]. In addition, there is little empirical data in the literature 
modeling the behavior of vector multiprocessors [7], [11]. This 
paper presents empirical results on the vector performance of 
the Alliant FX/8 multiprocessor. The thirteen vector kernels of 
the Los Alamos National Laboratory benchmark BMK8a1 (for
double precision computations) were used in our experiments 
[14]. 


In Section Three we report and discuss the experimental results. 
We show the delivered performance for each kernel when exe- 
cuted on one CE. A single CE demonstrates the classical perfor- 
mance behavior of a vector processor where the maximum per- 
formance is sustained over a wide range of vector lengths. How- 
ever, as cache misses increase for longer vector lengths, the per- 
formance drops sharply to a rate where the cache hit ratio is at 
a minimum. To characterize the vector performance of 1 CE,
we determine the vector length where each kernel starts executing
faster in vector mode than in scalar mode and calculate $n_{1/2}$,
the vector length at which the CE is supposed to deliver half of
its peak performance ($r_\infty$), as described by Hockney in [16].
These results indicate that a single CE processes short vectors 
efficiently. 


(a) For a rather comprehensive survey of such machines and
brief descriptions see [9], [10].


For 8 CEs, we present the delivered performance, speedup, and
multiprocessing overhead for each kernel. We observe that the
execution rate increases as the vector length increases and then
drops significantly to a minimum rate. The results show that
the maximum performance for each kernel on 8 CEs is sustained
over a significantly smaller range of vector lengths than for 1
CE. This is reflected in the speedup and multiprocessing overhead
calculations. Speedup, $S_p(n)$, is defined as $t_1/t_p$ where
$t_1$ and $t_p$ are the execution times of a kernel for vector
length n executed on 1 and p CEs, respectively. We define the
machine efficiency, $E_p(n)$, as the maximum delivered execution
rate divided by the peak advertised execution rate of the machine
and multiprocessor efficiency, $E'_p(n)$, as $S_p(n)/p$.
Multiprocessing overhead, $OV_p(n)$, is equal to $1 - E'_p(n)$.
Our results indicate that only modest improvements in speedup are
achieved when processing short (<=100) and long (>10,000) vectors
on more than 4 CEs.
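The metrics defined in this paragraph are simple ratios, and a small sketch may make their relationship concrete. The Python below is illustrative only: the function names are ours, and the timings in the example are hypothetical values, not measurements from the FX/8. Only the 47.2 MFLOPS double precision peak is taken from the paper.

```python
def speedup(t1, tp):
    """S_p(n) = t_1 / t_p for one kernel at one vector length."""
    return t1 / tp


def multiprocessor_efficiency(t1, tp, p):
    """E'_p(n) = S_p(n) / p."""
    return speedup(t1, tp) / p


def multiprocessing_overhead(t1, tp, p):
    """OV_p(n) = 1 - E'_p(n), as a fraction (multiply by 100 for percent)."""
    return 1.0 - multiprocessor_efficiency(t1, tp, p)


def machine_efficiency(max_delivered_mflops, peak_advertised_mflops):
    """E_p(n): best delivered rate relative to the advertised peak rate."""
    return max_delivered_mflops / peak_advertised_mflops


# Hypothetical timings in seconds for 1 CE and 8 CEs (not from the paper).
t1, t8 = 8.0, 1.25
print(speedup(t1, t8))
print(multiprocessor_efficiency(t1, t8, 8))
print(multiprocessing_overhead(t1, t8, 8))

# Hypothetical delivered rate against the advertised 47.2 MFLOPS peak.
print(machine_efficiency(20.0, 47.2))
```

Note that machine efficiency compares against the advertised peak, while multiprocessor efficiency compares against the measured one-CE time, so the two can differ considerably for the same kernel.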


To determine the effect that the cache has on performance, we 
repeat the experiments for each kernel such that cache misses 
will be encountered whenever possible. Our measurements show 
that the performance decreases by a factor up to 3.7 when the 
vectors are referenced from memory instead of from the cache. 


The BMK8a1 benchmark contains kernels with subscripted vectors.
These vector kernels run slower than the stride-1 kernels.
In Section Three, using one of the indexed kernels, we briefly
discuss issues which affect the performance of such kernels. In
Section Four we make some concluding remarks.


2. The Experimental Environment 


We performed our experiments at the Center for Supercomput- 
ing Research and Development (CSRD) of the University of 
Illinois(b). The configuration of the FX/8 used for these experiments
is shown in Figure 1. The computational complex of the
FX/8 contains 8 CEs. When executing concurrency instructions, 
the CEs communicate via a concurrency control bus. Each CE 
has a computational clock period of 170 nsec with a peak execu- 
tion rate of 11.8 MFLOPS and 5.9 MFLOPS for single and dou- 
ble precision computation, respectively [5], [9]. With the 8 CEs 
working concurrently, the FX/8 advertised peak performance is 
47.2 MFLOPS for double precision computations. The CEs are 
connected by a crossbar switch to a direct-mapped, write-back, 
shared cache of 16K double precision words(c). The cache is
implemented in 4 quadrants with a peak interleaved bandwidth 
to the CEs of 47.125 MW/sec. It is connected to a 4 MW shared 
memory via a bus with a peak bandwidth of 23.5 MW/sec. The 
system also contains 6 interactive processors (IPs) connected to 
their own caches as shown in Figure 1. The IPs primarily per- 
form operating system related functions and I/O operations. 
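The peak figures quoted above are consistent with one another, as the following back-of-the-envelope Python sketch shows. All constants are taken from this section; the derived "results per clock" values are our own observation and are not stated in the source.

```python
CLOCK_PERIOD_NS = 170.0    # computational clock period of one CE
PEAK_DP_MFLOPS = 5.9       # per-CE double precision peak
PEAK_SP_MFLOPS = 11.8      # per-CE single precision peak
NUM_CES = 8
CACHE_BW_MWORDS = 47.125   # peak interleaved cache bandwidth to the CEs
BUS_BW_MWORDS = 23.5       # peak cache-to-memory bus bandwidth

# A 170 nsec period corresponds to roughly 5.88 million clocks per second.
clock_mhz = 1000.0 / CLOCK_PERIOD_NS

# Floating point results per clock implied by the advertised per-CE peaks.
dp_flops_per_clock = PEAK_DP_MFLOPS / clock_mhz
sp_flops_per_clock = PEAK_SP_MFLOPS / clock_mhz

# Aggregate peaks with all 8 CEs working concurrently.
machine_peak_dp = NUM_CES * PEAK_DP_MFLOPS
machine_peak_sp = NUM_CES * PEAK_SP_MFLOPS

# How much faster the cache can feed the CEs than memory can feed the cache.
cache_to_bus_ratio = CACHE_BW_MWORDS / BUS_BW_MWORDS

print(round(clock_mhz, 2), round(dp_flops_per_clock, 2), round(sp_flops_per_clock, 2))
print(round(machine_peak_dp, 1), round(machine_peak_sp, 1))
print(round(cache_to_bus_ratio, 2))
```

The aggregate values reproduce the advertised 47.2 and 94.4 MFLOPS, and the bandwidth ratio of about 2 indicates how much the memory bus can throttle kernels whose vectors do not fit in the cache.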


(b) At the time the work reported in this paper was performed,
the machine did not run production releases of the OS and the
compiler. However, we believe that our conclusions will not
change significantly when these releases are available.

(c) A double precision word is 64 bits wide.


A computational element has vector processing capabilities as
well as multiprocessing support. It has a rich set of arithmetic,
logical and comparison vector instructions plus vector move
instructions including scatter, gather and merge. There are eight
32-bit data registers, eight address registers, eight double
precision floating point registers, and eight 32-element double
precision vector registers in each CE. Operands of vector
instructions can come from vector registers, vector and floating
point registers, or vector registers and the cache. Chaining is
also supported for vector add-multiply and vector multiply-add
instructions. Multiprocessing is supported by concurrency
instructions which permit iterations of a loop to be executed
concurrently across multiple processors in the CE complex.


The Alliant FX/8 Fortran compiler provides automatic detection 
of vector and/or multiprocessed loops. It optimizes code for 
scalar, vector and concurrent execution. Based on data depen- 
dency analysis, loops are optimized to execute in one of four 
modes: vector, scalar-concurrent, vector-concurrent, or 
concurrent-outer/vector-inner [5]. The FX/8 operating system, 
Concentrix, extends Berkeley Unix 4.2 to provide support for 
multiple processors and a large virtual space. 


The FX/8 system maintains timing information for each 
program which is accessible through Fortran library routines 
and can be used for measurement purposes. Our experimenta- 
tion procedure attempted to remove any inconsistencies that 
might result in the performance measurements due to the resolu- 
tion of these timing tools by assuring a long running time rela- 
tive to the granularity of the timed event. This was achieved by 
enclosing each kernel in a serial timing loop which repeats the 
execution of the kernel as many times as needed to obtain reli- 
able timing data. All measurements were performed in stand- 
alone mode. Each vector kernel was executed five times for vec- 
tor lengths varying between 1 and 100,000. The repetition of 
the experiments was necessary due to significant variations in 
the execution rates from one run to another for certain regions 
of vector lengths. 
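The measurement procedure described above (a serial timing loop around each kernel, repeated over five runs) can be sketched as follows. This is a Python analogue for illustration only: the experiments used Fortran library timing routines on the FX/8, which are not reproduced here, and the names `time_kernel`, `measure`, and the `min_seconds` threshold are assumptions of this sketch.

```python
import time


def time_kernel(kernel, min_seconds=0.05, clock=time.perf_counter):
    """Return the average seconds per call of kernel().

    The kernel is repeated in a serial timing loop, doubling the
    repetition count until the timed interval is long relative to
    the resolution of the clock.
    """
    reps = 1
    while True:
        start = clock()
        for _ in range(reps):
            kernel()
        elapsed = clock() - start
        if elapsed >= min_seconds:
            return elapsed / reps
        reps *= 2


def measure(kernel, runs=5, min_seconds=0.05):
    """Repeat the whole measurement to expose run-to-run variation."""
    return [time_kernel(kernel, min_seconds) for _ in range(runs)]


# Hypothetical stand-in for the vts kernel (v1 = v2 * s) at vector length 1000.
v2 = [float(i) for i in range(1000)]
s = 2.5


def vts():
    return [x * s for x in v2]


samples = measure(vts, runs=5, min_seconds=0.01)
print(min(samples), max(samples))
```

Keeping all five samples, rather than only their mean, is what later allows lower and upper bounds on speedup to be formed.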


The Los Alamos National Laboratory benchmark BMK8a1 contains
thirteen vector kernels designed to reflect the vector statements
which are widely encountered in scientific codes [14].
Each kernel is a different combination of add and multiply
operations of vectors and scalars that stores the outcome into a
result vector. Some kernels use an additional vector to index an
operand or result vector. In our notation used to identify the
different kernels, v is a vector, s is a scalar, p denotes addition,
t denotes multiplication, and i denotes an indexing vector(d).
For instance, vts is the kernel v1 = v2 * s and vi=vtv is the
kernel v1(i+k) = v2 * v3 where i is the indexing vector and k
is a constant. The complete list of vector kernels is:


vps      vtv      vtspvts    vips
vts      vtvps    vtvpv      vi=vtv
vpv      vpvts    vtvpvtv    vpvtvi
vi=vipvtv


3. Experimental Results


The performance of a one CE system is measured in order to 
calculate speedup and other performance metrics for multiple 
CEs. Examining the one CE results reveals interesting charac- 
teristics of the behavior of a vector processor when accessing 
data in a multi-level memory. The 8 CEs results show the per- 
formance improvement obtainable from vector-concurrent opera- 
tion. The vector length region where maximum execution rate 
is achieved using 8 CEs is narrower than for one CE. However, 
the speedup in this region is around 7 for most stride-1 kernels. 
Comparing the performance for one and multiple CEs reveals
important observations on the number of CEs which can be
efficiently employed in a vector multiprocessor system. The
performance results for the indexed kernels provide a qualitative
measure of the difficulties encountered when attempting to
improve the performance of some codes using multiple vector
processors.


(d) vi denotes a vector, v, indexed by another vector, i.


3.1. One CE Performance 

Figure 2 shows the maximum measured execution rate as a function
of vector length for each kernel running on 1 CE. We
observe that the behavior of all the kernels is similar; the
computational rate increases as the vector lengths increase and
reaches a maximum at vector length $n_{peak}$. For each kernel,
the execution rate stays within a small percentage of the maximum
until the vector length reaches a value denoted by $n_{drop}$.
The computational rate then starts to fall until a vector
length denoted by $n_{min}$ is reached. For vectors longer than
$n_{min}$ the execution rate remains rather constant. These three
vector length points are shown for the vtspvts kernel in Figure
2. We identify four regions for each performance curve: the
cache rate region ($1 \le n < n_{peak}$), the maximum rate region
($n_{peak} \le n \le n_{drop}$), the falloff region
($n_{drop} < n < n_{min}$), and the minimum rate region
($n \ge n_{min}$). In the cache rate region, the size of the data
referenced by each kernel is small enough such that the cache hit
ratio is maximized. The performance in this region is
characteristic of vector processors where the execution rate
rapidly increases to a maximum point which is sustained as the
vector length increases. The wide range of vector lengths where
the execution rate stays within a small percentage of the maximum
identifies the maximum rate region. The falloff region begins when
the cache hit ratio starts decreasing. As the cache hit ratio
continues to decrease for longer vector lengths, the number of
references to the shared memory increases and the performance
drops until the cache hit ratio reaches its minimum at $n_{min}$.
In the minimum rate region, the size of the referenced data is so
large that a cache miss occurs whenever the kernel accesses the
first word of a cache block(e).


Several factors affect the delivered performance for a given ker- 
nel at a particular vector length. These factors include the 
number of memory references, the number of floating point 
operations performed, the types of floating point operations, and 
the degree of chaining in the kernel. The vps kernel runs faster
than the vts kernel because of operation type; vtvps runs faster
than vtv due to the number of floating point operations performed
and chaining; vtspvts runs faster than vtvpvtv due to
the difference in the number of memory references. Table 1
shows the maximum MFLOPS measured for each kernel on 1
CE. The maximum execution rate occurs at vector length 32 for
all kernels. This is expected since the vector registers are full at 
this vector length. Table 2 shows the execution rate for each 
kernel in the minimum rate region. Performance in the max- 
imum rate region can be two times greater than the performance 
in the minimum rate region. 


(e) A cache block contains four double precision words.


In order to determine how efficiently one CE processes short
vectors, we found the vector length where execution in vector mode
starts to be faster than in scalar mode. Vector mode is faster
for vector lengths > 2 for the stride-1 kernels and > 6 for the
indexed kernels. Hockney's ($n_{1/2}$, $r_\infty$) model can also
be used to characterize the vector performance for one CE [16].
Table 3 shows that all kernels have an $n_{1/2}$ < 4. This
indicates that short vectors will be processed efficiently on the
FX/8. However, most kernels deliver only around 1/3 of the
measured peak execution rate at $n_{1/2}$ when executed on 1 CE
instead of half the peak rate as expected by Hockney's model. It
can be shown that Hockney's two parameter model
($n_{1/2}$, $r_\infty$) is simplistic when used to model real
vector processors [4], [22].
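For readers unfamiliar with Hockney's model, it assumes a linear timing law $t(n) = (n + n_{1/2}) / r_\infty$, so that the delivered rate is $r(n) = r_\infty / (1 + n_{1/2}/n)$ and equals half of $r_\infty$ at $n = n_{1/2}$. The Python sketch below fits the two parameters by a linear least-squares fit of execution time against vector length, in the spirit of the note to Table 3. The data here are synthetic (generated from assumed values $n_{1/2} = 3$ and $r_\infty = 5$), not measurements from the paper, and the function names are ours.

```python
def fit_hockney(lengths, times):
    """Least-squares fit of t = intercept + slope * n.

    Under Hockney's model t(n) = (n + n_half) / r_inf, so
    r_inf = 1 / slope and n_half = intercept / slope.
    """
    count = len(lengths)
    mean_n = sum(lengths) / count
    mean_t = sum(times) / count
    s_nn = sum((n - mean_n) ** 2 for n in lengths)
    s_nt = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = s_nt / s_nn
    intercept = mean_t - slope * mean_n
    return intercept / slope, 1.0 / slope


def hockney_rate(n, n_half, r_inf):
    """Delivered rate predicted by the model at vector length n."""
    return r_inf / (1.0 + n_half / n)


# Synthetic timings in microseconds for vector lengths 1..256,
# generated from assumed parameters (not measured data).
TRUE_N_HALF, TRUE_R_INF = 3.0, 5.0
lengths = list(range(1, 257))
times = [(n + TRUE_N_HALF) / TRUE_R_INF for n in lengths]

n_half, r_inf = fit_hockney(lengths, times)
print(n_half, r_inf)
print(hockney_rate(TRUE_N_HALF, TRUE_N_HALF, TRUE_R_INF))
```

With times in microseconds and one floating point operation per element, $r_\infty$ comes out in MFLOPS. On real measurements the fitted line need not pass through the data at short lengths, which is one way a kernel can deliver well under half of the measured peak at the fitted $n_{1/2}$.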


3.2. Eight CEs Performance 

Delivered Execution Rate

Figure 3 shows the maximum performance results when execut- 
ing in vector-concurrent mode on 8 CEs. We observe that the 
execution rate rises more slowly and reaches the maximum rate 
region for much longer vectors than in the 1 CE case (O(1000) 
compared to 32). This is partially due to the fact that vector- 
concurrent execution partitions the vector operation equally 
among the 8 CEs and longer vectors are required before the vector
registers of each CE are maximally utilized(f). Multiprocessing
overhead associated with starting and sustaining
vector-concurrent operations accounts for the further increase needed
in vector length before the maximum rate is achieved. The per- 
formance of a kernel on 8 CEs is affected by the multi-level 
memory hierarchy for the same reasons as in the 1 CE case. In 
fact, we observed that the falloff region in the 8 CE performance 
curves coincides with the falloff region in the 1 CE performance 
curves for each kernel. However, the percentage drop in 
MFLOPS in the falloff region is greater with 8 CEs. Due to the 
initial slow performance increase to the maximum rate and the 
fixed falloff region, the maximum rate region spans a much 
smaller vector length interval than in the 1 CE case. This result 
has ramifications on how codes should be structured, with 
respect to vector length, so as to maximize the vector-concurrent 
performance when running on 8 CEs. In particular, we notice 
that if vector lengths deviate slightly from the maximum rate 
region, performance degrades rapidly. 


Table 1 shows the peak MFLOPS measured for each
kernel when executing on 8 CEs, the vector length where the
peak performance is delivered, and the machine efficiency (E)
at this vector length(g). The machine efficiency is less for 8 CEs
than for 1 CE due to multiprocessing overhead. All the kernels
achieve < 50% machine efficiency and 10 kernels are < 30%
efficient for 8 CEs. Table 2 is analogous to Table 1 except that
the data for the MFLOPS in the minimum rate region is presented.
It can be seen from the MFLOPS and the machine efficiency that a
low cache hit ratio significantly reduces the performance. This
subject is discussed in more detail in Section 3.3.


Speedup Results 

Speedup is defined as $S_p(n) = t_1/t_p$ where $t_1$ and $t_p$
are the execution times of a kernel for vector length n executed
on 1 and p CEs, respectively(h). Since there were variations in
measuring $t_1$ and $t_8$ over the five runs, we define the lower
bound on the speedup of a kernel with vector length n, $L_{S(n)}$,
to be the ratio of the smallest of the five $t_1$'s to the largest
$t_8$. The upper bound on the speedup, $U_{S(n)}$, is calculated
as the ratio of the largest measured $t_1$ to the smallest $t_8$.
Figure 4 shows the upper and lower bound speedup curve for the vtv
kernel(i). We observe that for all kernels the speedup upper
bound is less than 4 for vector lengths smaller than 500 and less
than 5 for vector lengths greater than 10,000. The maximum
upper bound speedups for all the kernels are shown in Table 4.
Eight of the 13 kernels have a maximum upper bound speedup
greater than 7. The speedup of 4 of the remaining kernels is
between 5 and 7. The indexed kernels have the smallest speedups;
vi=vtv has a maximum upper bound speedup of only 1.32.


(f) This analysis is supported by the observation that the
execution rate has a local maximum at vector length 256 for almost
all of the kernels. At vector length 256, the vector registers for
each CE are full.

(g) By definition, the maximum E occurs at this vector length.

(h) In the remainder of the paper, S(n) will be used to denote
$S_8(n)$.


By comparing the vector lengths in Tables 1 and 4 we observe 
that the vector lengths where the peak MFLOPS and maximum 
speedup occur are not necessarily the same for a given kernel. 
Table 5 shows the speedup upper bound at the vector lengths 
where each kernel runs at its peak execution rate. We notice 
from Figure 4 and Tables 4 and 5 that the lower and upper 
bounds on speedup can differ significantly (up to 57%). This 
variation occurs in the falloff region and is primarily a result of 
the nondeterministic behavior of the cache in this region from 
one run to another. However, this variation is insignificant for 
short and long vector lengths (n<500 and n>10,000). 
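The lower and upper speedup bounds used in this section follow directly from the repeated timings: the lower bound pairs the smallest one-CE time with the largest multi-CE time, and the upper bound does the opposite. The sketch below is illustrative Python with hypothetical timings. The only figures taken from the paper are the Table 5 bounds for vpv (4.17 and 9.68), and the reading of the quoted 57% as a gap relative to the upper bound is our assumption.

```python
def speedup_bounds(t1_runs, tp_runs):
    """Return (lower, upper) speedup bounds from repeated timings.

    lower = smallest 1-CE time / largest p-CE time
    upper = largest 1-CE time / smallest p-CE time
    """
    lower = min(t1_runs) / max(tp_runs)
    upper = max(t1_runs) / min(tp_runs)
    return lower, upper


def relative_gap(lower, upper):
    """Gap between the bounds as a fraction of the upper bound (assumed convention)."""
    return (upper - lower) / upper


# Hypothetical times in seconds over five runs (not measured data).
t1_runs = [8.0, 8.2, 8.1, 8.4, 8.0]
t8_runs = [1.25, 1.6, 1.3, 2.0, 1.28]
lower, upper = speedup_bounds(t1_runs, t8_runs)
print(lower, upper)

# Bounds listed for the vpv kernel in Table 5.
print(round(100 * relative_gap(4.17, 9.68)))
```

Under that convention the vpv entry reproduces the 57% figure, which suggests that the largest spread in Table 5 is the one being quoted.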


Figure 5 shows the speedup as a function of the number of CEs 
for stride-1 kernels. The envelopes in the figure enclose the 
speedup curves for all kernels at vector lengths 100, 1K and 
100K. The speedup curve for vtvps is shown as a dashed line 
and roughly represents the median speedup within each speedup 
envelope. We observe that for both short and long vectors the 
speedup gained by increasing the number of processors beyond 4 
is modest for most kernels. As the number of CEs increases from
4 to 8, the speedups approach 4.5 and 4 for vector lengths of 100
and 100K, respectively. For vector lengths of 1K, speedups are
close to being linear in the number of CEs. These speedup
results could be very useful to designers of multi-million dollar
multiprocessor vector machines (e.g., the new Cray multiprocessors)
in light of the mean vector lengths encountered in application
codes (<468 in the Lawrence Livermore National Lab workload [24]).


Multiprocessing Overhead 

The percentage multiprocessing overhead of a machine with p
processors when executing a kernel with vector length n,
$OV_p(n)$, is given by $(1 - S_p(n)/p) \times 100\%$; $OV(n)$
denotes the overhead when p=8. We let $U_{OV(n)}$ denote the upper
bound on the overhead and $L_{OV(n)}$ denote the lower bound.
Figure 6 shows the upper and lower bound overhead curve for the
vpvts kernel(j). For all kernels, $U_{OV(n)}$ is greater than 50%
for short (n < 100) and long (n > 10,000) vectors. For six of the
nine stride-1 kernels, $U_{OV(n)}$ is less than 25% for vector
lengths in the region 1000 < n < 2000. The overheads of the indexed
vector kernels are greater than 30% for all vector lengths.


Table 6 identifies the vector lengths where the minimum
$U_{OV(n)}$ occurs for all the kernels. We note that for the 9
stride-1 kernels the minimum $U_{OV(n)}$ is between 10-20%. For
the indexed kernels, the minimum $U_{OV(n)}$ is close to 30% for 1
kernel, 40% for 2, and 85% for the vi=vtv kernel.
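Since the overhead is defined as $(1 - S_p(n)/p) \times 100\%$, bounds on overhead follow from bounds on speedup: a larger speedup means a smaller overhead. The Python sketch below makes this explicit. The pairing of the upper speedup bound with the lower overhead bound is our inference from the formula, and the speedup values are those quoted elsewhere in the paper (1.32 and 5.86 from Section 3.4, 5.73 / 7.22 for vtv from Table 5).

```python
def overhead_percent(speedup, p=8):
    """Percentage multiprocessing overhead: (1 - S_p(n) / p) * 100."""
    return (1.0 - speedup / p) * 100.0


def overhead_bounds(speedup_lower, speedup_upper, p=8):
    """Return (lower, upper) overhead bounds from speedup bounds.

    The largest speedup gives the smallest overhead and vice versa.
    """
    return overhead_percent(speedup_upper, p), overhead_percent(speedup_lower, p)


# Maximum upper bound speedups quoted for two indexed kernels.
print(overhead_percent(1.32))   # vi=vtv
print(overhead_percent(5.86))   # vi=vipvtv

# Table 5 speedup bounds for the stride-1 kernel vtv.
print(overhead_bounds(5.73, 7.22))
```

The two indexed-kernel values come out near 83.5% and 26.8%, which is in line with the overheads of roughly 85% and 30% reported for the indexed kernels.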


3.3. Execution from Memory 

In order to measure the vector-concurrent execution rate for a
kernel with vector length n (1<n<100,000) such that the
maximum number of cache misses occurs, we reference the vectors
as columns of two-dimensional arrays. The kernel is executed
for the first column (of length n), then the second column,
and so on. Since every column is distinct, new vectors are
always being referenced. Also, the two-dimensional array sizes
are declared such that when a column is referenced again by the
timing loop, none of the data from the previous reference of that
column will be present in the cache. We refer to the running of
a kernel in this fashion as execution from memory. When a
kernel is not restricted to execute in this manner and is able to
take full advantage of the cache, we say that the kernel is
executing from cache.


(i) The speedup curves for the other kernels are found in [3].

(j) The overhead curves for the other kernels are found in [3].


Figure 7 shows the execution rate from memory and from cache 
for the vtspvts kernel using 8 CEs. When the kernel executes
from memory, the execution rate rises slowly as the vector 
length increases and reaches a maximum that coincides with the 
rate obtained in the minimum rate region when the kernel exe- 
cutes from cache. The same behavior is observed for the other 
kernels [3]. Table 7 shows the maximum performance degrada- 
tion factors when kernels execute from memory instead of from 
the cache. This factor ranges between 1.35 and 2.26 when run- 
ning on 1 CE. When executing on 8 CEs and using stride-1 ker- 
nels, the degradation factor is larger and ranges between 3.03 
and 3.67. For indexed kernels, the degradation factor is between 
1.49 and 2.08. 


It is obvious that if the memory speed were increased, the execu- 
tion performance from memory would increase. It is also 
expected that by increasing the size of the cache, the maximum 
rate region of the execution from cache curve would be 
extended. A current limitation to increasing the performance in 
the minimum rate region of the two curves is the bandwidth 
available on the Data Memory Bus [3]. Improving the data 
memory bus bandwidth would have the effect of shifting the 
minimum rate region of the two curves upward. 
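A rough estimate illustrates why the memory bus bounds the minimum rate region. The Python sketch below is a back-of-the-envelope model and not an analysis from the paper: it assumes every vector element crosses the 23.5 MW/sec bus exactly once (operands loaded, results stored), that scalars stay in registers, and that no other traffic competes for the bus. The per-kernel operation and word counts are read off the kernel notation.

```python
BUS_BW_MWORDS_PER_SEC = 23.5   # peak cache-to-memory bus bandwidth (Section 2)


def bandwidth_bound_mflops(flops_per_element, words_per_element,
                           bus_bw=BUS_BW_MWORDS_PER_SEC):
    """Upper bound on MFLOPS when every word must cross the memory bus."""
    return bus_bw * flops_per_element / words_per_element


# (flops per element, words moved per element) under the stated assumptions.
KERNELS = {
    "vps": (1, 2),       # v = v + s: one load, one store
    "vtv": (1, 3),       # v = v * v: two loads, one store
    "vtspvts": (3, 3),   # v = v*s + v*s: two loads, one store
    "vtvpvtv": (3, 5),   # v = v*v + v*v: four loads, one store
}

for name, (flops, words) in KERNELS.items():
    print(name, round(bandwidth_bound_mflops(flops, words), 2))
```

Under these assumptions, kernels with more arithmetic per word moved have a higher ceiling, and raising the bus bandwidth raises every ceiling proportionally, which matches the upward shift described above.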


3.4. The Behavior of Indexed Kernels 

When running on a single CE, the execution rate of a kernel will 
drop if one or more of the referenced vectors are addressed 
indirectly. This is mainly due to an increase in the number of 
vector instructions generated by the compiler for an indexed 
kernel (scatter/gather instructions, etc.). Moreover, the memory 
access pattern of the indexed vector elements might result in an 
appreciable decrease in the utilization of the bandwidth of the 
interleaved memory modules. 


The execution rate of a multiprocessed vector kernel could
degrade significantly if the result vector is addressed indirectly.
This can be seen clearly in Figure 3 when comparing the execution
rate curves for kernels vtv and vi=vtv. Examining the
assembly code generated for the two kernels reveals that while
the vtv kernel is executed as a single concurrent vector loop, the
concurrent loop of the vi=vtv kernel encloses two vector loops.
Each CE first executes the vector statement temp = vtv, where
temp is a temporary vector. While $CE_0$ continues by executing
a second vector loop to scatter the contents of temp to the
specified elements of the result vector, each $CE_i$ (i=1,...,7)
waits for a synchronization signal from $CE_{i-1}$ indicating that
it has finished scattering its results. In this fashion, any
output dependence relationships between different instances of the
original statement vi=vtv will be preserved. Figure 8 shows the
execution scheme of this kernel using 8 CEs. This explains the
lack of any speedup when this kernel is executed concurrently.
Synchronization is not required in kernels with no potential
output dependencies. Kernels with output dependencies will gain
speedup if the time spent evaluating the right hand side
expression of the kernel is significantly larger than the time
spent in the scattering loop by each CE. This is the case for the
kernel vi=vipvtv where it is possible to attain a maximum speedup
of 5.86 compared to 1.32 for the kernel vi=vtv.
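The need for ordered scattering can be seen in a small example. The Python sketch below models the two-phase scheme for vi=vtv: each CE computes a temporary product vector for its share of the iterations, and the temporaries are then scattered one CE after another. It is a simplified model under stated assumptions: iterations are split into contiguous chunks (the actual partitioning on the FX/8 is not described here), the constant offset k is omitted, and the function names are ours. When the index vector contains a repeated entry, scattering the chunks out of order changes the result.

```python
def vtv(v2, v3):
    """Stride-1 kernel vtv: elementwise product of two vectors."""
    return [a * b for a, b in zip(v2, v3)]


def vi_eq_vtv(result_len, index, v2, v3, n_ces=1, order=None):
    """Indexed kernel vi=vtv: result[index[j]] = v2[j] * v3[j].

    Iterations are split into n_ces contiguous chunks (an assumed
    partitioning). Each chunk first fills a temporary vector, then
    the chunks scatter their temporaries in the given order.
    """
    n = len(index)
    chunk = max(1, -(-n // n_ces))  # ceiling division, at least 1
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    # Phase 1: every CE computes temp = v * v for its own chunk.
    temps = [vtv(v2[lo:hi], v3[lo:hi]) for lo, hi in bounds]
    # Phase 2: scatter; the chunk order models the synchronization.
    if order is None:
        order = range(len(bounds))
    result = [0.0] * result_len
    for c in order:
        lo, _ = bounds[c]
        for j, value in enumerate(temps[c]):
            result[index[lo + j]] = value
    return result


index = [0, 1, 1, 2]           # element 1 is written twice
v2 = [1.0, 2.0, 3.0, 4.0]
v3 = [1.0, 1.0, 1.0, 1.0]

serial = vi_eq_vtv(3, index, v2, v3)
in_order = vi_eq_vtv(3, index, v2, v3, n_ces=2)
out_of_order = vi_eq_vtv(3, index, v2, v3, n_ces=2, order=[1, 0])
print(serial, in_order, out_of_order)
```

Only the in-order scatter reproduces the serial result, so the scatter phase is effectively serialized across CEs. Speedup is then available only from the first phase, which is consistent with vi=vipvtv (more right hand side work) gaining more than vi=vtv.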


4. Conclusion


Using the vector kernels of the LANL BMK8a1, this paper
assesses the performance of the Alliant FX/8 multiprocessor for 
vector processing. One CE of the FX/8 shows the classical vec- 
tor processor behavior where the performance increases as the 
vector length increases to a maximum which is maintained for 
larger vector lengths. However, because of the memory hierar- 
chy, the vector performance of 1 CE falls to a rate dependent on 
the shared memory access speed when the cache size is not large 
enough for the referenced vectors to be cache resident. 
Although vector processing performance on 8 CEs shows the 
same rise and fall as vector length increases, parallel execution 
accentuates this behavior. First, the increase in execution rate 
is slower for short vectors due to the partitioning of the vectors 
across multiple processors and the multiprocessing overhead. 
Second, although speedups greater than 7 are achieved for most 
stride-1 kernels for vector lengths of O(1000), the region of max- 
imum performance is narrower than for 1 CE. Third, the 
speedup gain by increasing the number of computational ele- 
ments beyond four is small for short (O(100)) and long 
(O(10,000)) vectors. The multiprocessing overhead exceeds 50% 
for short and long vectors for all kernels. It is less than 25% for 
most stride-1 kernels for vector lengths of O(1000). Lastly, the 
performance of the FX/8 on stride-1 kernels can drop by a fac- 
tor greater than 3 if the vectors are referenced from the shared 
memory. 


The lower performance of the machine for kernels with indexed
vectors is partially due to the increase in the number of vector
instructions needed to execute the kernel. However, interprocessor
synchronization to satisfy potential output dependencies is
the major reason for performance degradation of indexed kernels.


The successful use of an 8-CE FX/8 for vector processing
depends on observing the principle of locality of reference [8]. 
This implies that the programmer (or optimizing compiler) 
should structure the code such that once certain sections of the 
program’s data are resident in the cache, as much computation 
as possible is performed using this data before processing other 
sets of data [1], [2], [6], [18]. Some of the other factors that 
enhance the delivered performance are reducing the number of 
memory references in the vector statement, increasing the 
number of floating point operations performed using the same 
operands, and taking advantage of the chaining capabilities of 
the processors. 
The Configuration of the Alliant FX/8 used in the Experiments

Figure 2-b The Maximum Delivered Performance for a 1 CE System (continued)

Figure 3-a The Maximum Delivered Performance for an 8 CE System

Execution Rates at Vector Length 100K of Elementary Vector Operations
(columns: 1 Computational Processor, 8 Computational Processors)

Table 6

Table 5
Lower/Upper Bound on Speedups at Vector Lengths where Peak
Execution Rates Occur

Kernel      Bounds (lower / upper)
vps         5.13 / 6.67
vts         5.88 / 7.36
vpv         4.17 / 9.68
vtv         5.73 / 7.22
vtvps       5.60 / 7.08
vpvts       5.15 / 6.99
vtspvts     6.04 / 7.21
vtvpv       6.14 / 7.07
vtvpvtv     5.42 / 8.20
vips        4.15 / 4.83
vi=vtv      1.08 / 1.24
vpvtvi      4.32 / 4.58

Minimum Upper Bound Multiprocessing Overheads

Table 3
n_1/2 and the Percentage of Peak Execution Rate at n_1/2
(kernels: vps, vts, vpv, vtv, vtvps, vpvts, vtspvts, vtvpv, vtvpvtv)

* n_1/2 is calculated using a linear least-squares approximation
of the vector execution times for vector lengths between 1 and 256.

Table 7
Degradation of the Peak Execution Rates when Executing from Memory
(columns: Benchmark, Degradation Factor *, Degradation Factor *;
kernels: vps, vts, vpv, vtv, vtvps, vpvts, vtspvts, vtvpv, vtvpvtv,
vips, vi=vtv)

* The degradation factor is calculated as the peak execution rate
divided by the execution rate at 100K.
Abstract -- Supercomputer architectures for
numerical applications can be based on either of 
two major principles, SIMD vector machines or 
MIMD multicomputer systems. In the paper both 
solutions are compared in terms of performance, 
cost-effectiveness, and software problems in- 
volved, thus providing the rationale for the de- 
cision to develop a MIMD/SIMD multiprocessor su- 
percomputer, configurable to up to 1024 nodes, 
where each node contains a floating-point vector 
processor. The crucial issues of such an archi- 
tecture are the need for a bottleneck-free inter- 
connection structure on the hardware side and for 
the appropriate program development environment 
on the software side. The solutions envisioned 
for the SUPRENUM MIMD/SIMD supercomputer present- 
ly under development are presented. 


The Choice of Supercomputer Architecture 


In this paper the term "supercomputer" refers to
very high performance machines for numerical
applications, though our considerations to some
extent hold true as well for non-numerical
supercomputers. Supercomputers obtain their
performance from two contributing factors. Firstly, they
operate at the highest possible speed technology 
can provide. Secondly, additional performance is 
gained through a large amount of parallel pro- 
cessing. 


An algorithm may be generally defined as a par- 
tial order of operations, the partial order being 
determined by the data dependencies between the 
operations. We call the parallelism of an algo- 
rithm explicit if the data dependencies are well- 
defined by the nature of the data types to be 
processed and, consequently, are known a priori. 
We call the parallelism implicit if it is not
known a priori but must be determined through data
dependence analysis.


What makes the dataflow principle so attractive is
its ability to let the machine perform the
data dependence analysis at run time and, thus,
to provide a convenient way of exploiting implicit
parallelism. Since implicit parallelism includes
explicit parallelism, dataflow is a most
general operational principle for parallel
processing. The forte of the dataflow solution,
however, comes at the price of considerable overhead
and, therefore, is not cost-effective in parallel
processing applications where the simpler and
more efficient SIMD or MIMD control scheme for
handling explicit parallelism suffices.
Figure 1 presents a taxonomy relating the nature
of parallelism to the appropriate control structures,
processor structures, and communication structures.


Figure 1 Taxonomy of parallel computer architectures
(diagram labels: algorithms with implicit parallelism, dataflow;
algorithms with explicit parallelism, MIMD/SIMD; pipeline processing,
array, multi-pipeline system, multiprocessor system; bus, memory,
interconnection structure, network; strongly coupled, loosely coupled)


Figure 2 lists some of the major trade-offs be- 
tween SIMD pipeline architectures and MIMD multi- 
processor architectures. In the supercomputer do- 
main, only the SIMD vector machine has found a 
larger market penetration. Recently, the first
MIMD multiprocessor systems have become available
in product form, although at present they are used
more as experimental systems than as
"production machines". However, two decades of
research in innovative computer architecture have
produced enough insight into all four forms of
parallel architectures listed in Figure 1 to make
it safe to venture the following statements.


1. SIMD machines consisting of an array of pro- 
cessing elements cannot compete in cost-effec- 
tiveness with SIMD pipeline machines, since 
they do not provide the parallel processing 
gain of a multi-stage pipeline processor. 


2. In MIMD machines for numerical applications,
the cost-effectiveness can be increased by as
much as an order of magnitude by performing
the floating point operations in each node
in vector mode rather than scalar mode.


3. In the realm of numerical supercomputers, pure
dataflow architectures have no chance of becoming
a competitive solution, for two reasons.
Firstly, since the parallelism in large
numerical (array processing) problems is
predominantly explicit and, thus, can be handled
by the extremely efficient SIMD control scheme,
the ability of the dataflow machine to handle
implicit parallelism, which leads to its high
control overhead, does not pay. Secondly, the
pure dataflow scheme is based upon the notion
that the firing of an operation consumes the
tokens that are the "carriers" of the operands.
Consequently, a value can be used as an operand
only once. This is counter-productive in all
numerical algorithms in which values occur as
operands of a number of operations. Both the
high control overhead and the inability of the
pure dataflow scheme to handle data structure
objects result in an unfavorable MIPS/MFLOPS
rate. By special measures (e.g., the introduction
of the 'structure memory' in the Manchester
machine (Gurd, 1985)), this rate can be
much improved. However, it should not be
overlooked that such measures in effect constitute
a deviation from the pure dataflow scheme by
introducing beneath the dataflow control level
an SIMD control level for handling data structure
objects. The performance gain obtained by
such measures is therefore to be attributed to
the SIMD execution of complex operations on
data structure objects, while the dataflow control
now synchronizes only the complex procedures
and not the operations inside the procedures
and, thus, becomes tolerable.


As a conclusion of the discussion above, we dare
to state the following CONVERGENCE THEOREM:
Numerical supercomputers of the future will be
predominantly MIMD/SIMD machines.


The convergence will happen from the side of the 
vector machines by adding more and more pipelines; 
from the side of the MIMD architectures by pipe- 
lining the floating point coprocessor in the node. 
The SUPRENUM architecture is of the latter type. 
SIMD PIPELINE ARCHITECTURE

PROCESSOR PERFORMANCE:
    SIMD GAIN
    PARALLEL PROCESSING GAIN OF MULTISTAGE
    PIPELINE FOR LARGE, ORDERED SETS OF DATA
    "SCALAR GAP"
PERFORMANCE LIMITATION DETERMINED BY:
    EFFECTIVE DATA MEMORY BANDWIDTH
POTENTIAL FOR PARALLEL PROCESSING:
    LOW TO MEDIUM
COST-EFFECTIVENESS:
    HIGHER
PROGRAMMING STYLE:
    PREDOMINANTLY FUNCTIONAL
CONFIGURABILITY AND FLEXIBILITY OF APPLICATION:
    LOW


MIMD MULTIPROCESSOR ARCHITECTURE

PROCESSOR PERFORMANCE:
    NO SIMD GAIN
    NO OR LITTLE PARALLEL OPERATION
    NO "SCALAR GAP"
PERFORMANCE LIMITATION DETERMINED BY:
    NODE INTERCONNECTION BANDWIDTH
POTENTIAL FOR PARALLEL PROCESSING:
    HIGH TO VERY HIGH
COST-EFFECTIVENESS:
    LOWER
PROGRAMMING STYLE:
    COOPERATING PROCESSES OR SINGLE ASSIGNMENT
CONFIGURABILITY AND FLEXIBILITY OF APPLICATION:
    HIGH


Figure 2 Trade-offs between SIMD and MIMD
architectures


The Choice of Technology 


For the sake of discussion, two main technologies 
called "microcomputer technology" (MCT) and "main 
frame technology" (MFT) shall be characterized. 


Microcomputer Technology: 
• TTL or MOS, highly integrated low-power logic
and memory (some et to 10° transistors per 
chip); 


• predominant use of standard ("off-the-shelf")
components, therefore lower design cost;


• low cost of packaging and cooling (forced air
only);


• lower operating speed.


Main Frame Technology: 
• ECL, less highly integrated high-power logic
(some 10^3 transistors per chip); ECL or CMOS
memory;


• predominant use of custom gate array logic,
therefore higher design cost;


• higher cost of packaging and cooling;


• higher operating speed.


Of course, the cost-effectiveness of a computer 
depends not only on the technology but also on the 
architecture. Two examples shall illustrate that 
fact. 


(A) A microcomputer with floating-point coprocessor
has about the same cost per FLOPS as the
MFT-based vector machine. Hence, the conclusion
can be drawn that the higher cost-effectiveness
of MCT compensates for the lower
efficiency of the conventional architecture.

(B) Comparing MCT vector machines with MFT vector
machines, the former exhibit a much higher
cost-effectiveness. The reason why MFT is
nevertheless the dominating supercomputer
technology is the much higher absolute performance
MFT may provide.


This discussion can be summarized by stating that 
a supercomputer development based on MCT is less 
costly and risky. However, if one wants to reach 
the same performance as with MFT, a higher degree 
of parallel processing is required to make up for 
the lower operating speed of MCT. Whether the MCT 
solution is more cost-effective than its MFT 
counterpart is a question that can be answered 
affirmatively only if the vector machine architec- 
ture is chosen. In the case of a MIMD machine one 
will be quite satisfied if the cost-effectiveness 
of a MFT-based vector machine can be met. In this
case the MIMD machine is the better choice, for it
is more flexible and does not exhibit the "scalar
gap".


The relationship between the two technologies, MCT 
and MFT, and the two architectural forms, SIMD and 
MIMD, can be summarized as follows: 


MIMD multiprocessor systems with a larger 
number of nodes (as needed to achieve super- 
computer performance) can be realized only in 
MCT. 


SIMD vector machines may be realized in both, 
MCT and MFT. 


To conclude this section, we shall take a look at 
the absolute performance obtainable in either 
technology by contrasting the two cases, SIMD 
vector machine and MIMD multiprocessor system, 
respectively. 


SIMD Vector Machine - MFT 


As a paradigm for the maximal performance obtainable in
ECL technology (MFT), we take the vector machines
of CRAY Research Inc. (CRAY-1, CRAY X-MP, CRAY-2).
The following table presents the major parameters
of these machines.


Model        Clock time T    Max. number     Max.           Year
                             of pipelines    performance

CRAY-1       12.5 ns         1               160 MFLOPS     1977
CRAY X-MP    9.5 ns          2               400 MFLOPS     1983
CRAY-2       approx. 5 ns    4               1000 MFLOPS    1986


SIMD Vector Machine - MCT 


Highly integrated dynamic memory allows a stream 
of up to 16 million words per second to be moved 
to and from the memory. Static CMOS memory has be- 
come available with an access time of down to 15 - 
25 nanoseconds. The paradigm of a highly inte- 
grated floating-point pipeline processor is the 
Weitek WT 2264/2265 chip set (Weitek, 1986), which 
is designed to satisfy the following performance 
specifications: 


- IEEE single precision operations 
(ADD, SUBTRACT, MULTIPLY): 20.0 MFLOPS 


- IEEE double precision operations 
(ADD, SUBTRACT, MULTIPLY): 15.5 MFLOPS 


The Weitek processors form a 7-stage pipeline. 
This, in connection with the pipelining of the 
data moves, results in a pipeline gain of about 
10, a gain that is lost in the case of scalar 
operations. Chaining multiplication and addition 
(accumulation), e.g., for the inner vector pro- 
duct, doubles the vector mode performance. Several 
such pipeline processors could be cascaded to ob- 
tain a "macro pipeline" with a multiple of the 
performance of one processor set, provided the 
memory bandwidth is not already exhausted by one
single pipeline.


MIMD Multiprocessor Machine - MCT 


The maximal performance of an MIMD multiprocessor
system with N nodes, obtained under the most favorable
conditions, is

$$P_{\max} = q \cdot N \cdot P_n$$

where $P_n$ is the performance of a node and $q$ is an
overhead factor, $q < 1$.

The conditions under which $P_{\max}$ can be reached are:

1. The algorithms executed by the machine must 
have a sufficiently high inherent parallelism 
in order to allow for the linear performance 
increase. 


2. There exists no communication bottleneck in 
the system. 


The first condition must be satisfied by the ap- 
plication; the second condition must be satisfied 
by the architecture. Violation of the conditions 
may result in a dramatic performance decrease. 
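
As a rough numerical illustration of the bound above, the following sketch evaluates P_max = q · N · P_n for a few machine sizes. The node rate of 15.5 MFLOPS is borrowed from the Weitek double-precision figure quoted earlier; the overhead factor q = 0.8 and the function name are assumptions made only for this example, not values taken from the paper.

```python
def max_performance(q: float, n_nodes: int, node_mflops: float) -> float:
    """Upper bound P_max = q * N * P_n in MFLOPS.

    q           overhead factor, must satisfy 0 < q < 1 (assumed value)
    n_nodes     number of nodes N
    node_mflops performance P_n of a single node in MFLOPS
    """
    if not 0.0 < q < 1.0:
        raise ValueError("overhead factor q must satisfy 0 < q < 1")
    if n_nodes < 1:
        raise ValueError("the system needs at least one node")
    return q * n_nodes * node_mflops


if __name__ == "__main__":
    # q = 0.8 is an illustrative assumption, not a measured overhead.
    for n in (16, 64, 256):
        print(n, "nodes:", max_performance(0.8, n, 15.5), "MFLOPS")
```

The bound grows linearly in N only while both conditions hold; the sketch says nothing about how q degrades when either is violated.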


It was pointed out above that SIMD machines pro- 
vide more MFLOPS per cost unit, while MIMD ma- 
chines provide a higher flexibility of use by 
virtue of the fact that parallel processing is not 
restricted to vectors. Hence, the decision between 
the two architectural principles involves a trade- 
off between the higher cost-effectiveness of SIMD 
and the higher flexibility of MIMD. However, the
MIMD flexibility exists only if it is not unduly
constrained by the system's communication
structure. For example, a communication structure that
allows each node to communicate only with its 
nearest neighbors would be such a constraint and, 
thus, handicap the MIMD machine's competitiveness 
in comparison to the SIMD machine. Consequently, 
the communication structure of a general purpose 
MIMD multiprocessor architecture should provide 
total internode connectivity. 


MIMD/SIMD Machine - MCT 


Going after the order-of-magnitude gain in node
performance by "vectorizing" the node operations
is a temptation hard to resist. The result is
an MIMD/SIMD architecture, i.e., a MIMD
multiprocessor architecture in which each node contains
a floating point pipeline processor. To preserve
the flexibility of the MIMD approach, the nodes
of such a machine must be connected through a
communication structure with a very high
communication bandwidth. That is, a certain proportion must
be maintained between node performance (in MFLOPS)
and communication bandwidth (in MBytes), the
factor of proportionality being application dependent.
If this condition is satisfied, then vectorizing
the nodes pays even in non-vector applications
such as multigrid PDE solvers. In this case, even
the small sets of data that must be interchanged
between grid points can be treated as (small)
vectors and, consequently, lead to a vector gain.


It must be mentioned that by attempting to combine 
the advantages of SIMD and MIMD, one also combines 
the software complexity of both approaches, namely 
the need for either a vectorizing compiler (SIMD) 
or a vector language (e.g. FORTRAN 8X). 


Communication Structures in Parallel Computers 


SIMD Vector Machines 


SIMD vector machines process operand data streams 
flowing from a storage to the pipeline processor, 
thereby producing result data streams flowing back 
to the storage. Thus their performance is deter- 
mined either by the memory bandwidth or the pro- 
cessor bandwidth limitation, whichever comes first, 
and no further communication bottleneck exists. 

The same holds true for multipipeline machines in
which the tasks running on the different processors
are only "loosely coupled" (i.e., in which
there is little data dependency between the tasks).


MIMD Multiprocessor Systems 


To overcome the memory bandwidth limitation prob- 
lem, each processor of a MIMD multiprocessor sys- 
tem must have its private memory. Such a processor- 
memory combination usually is called a "node". 

This approach puts the emphasis on the problem of 
providing an adequate interconnection structure 
(IS) to handle the internode communication. 


An adequate IS should satisfy the following condi- 
tions: 


(1) It must provide total connectivity to main- 
tain the potential flexibility of a MIMD 
machine. 


(2) It must exhibit a sufficiently high bandwidth 
to avoid communication bottlenecks. 


(3) It must be technically and economically 
feasible. 


(4) It must be highly reliable.


The Problem With Interconnection Networks


Interconnection networks (IN) that are capable of 
connecting a number of source nodes with a number 
of destination nodes are considered by many as the 
solution to the interconnection problem in large 
scale SIMD multiprocessor systems. Many papers
have been published dealing with such structures,
their complexity, and their interconnection
properties. Hardly any of these papers addresses
the technical feasibility and the interconnection
bandwidth obtainable in view of such
mundane parameters as pin limitation, packaging
problems, driving power limitation, cost, etc.


INs come in two major categories: single-stage 
(permutation) and multi-stage networks. In single- 
stage networks, a data packet may have to travel 
through the network several times in order to reach 
a given destination from a given source. In con- 
trast, in a multi-stage network the data can flow 
directly from the source to the destination. An- 
other distinction is that between circuit switching, 
where a physical connection is provided from source 
to destination, and packet switching, where a logi- 
cal connection is provided for a packet travelling 
through the network. Furthermore, control may be 
centralized or decentralized. 


If one takes a closer look, the seemingly large 
variety of INs proposed can be identified as vari- 
ants of one of the 4 basic classes listed in the 
following table (Ermel, 1985). 


CLASS   NETWORK TYPE                                     COMPLEXITY

A       CROSSBAR SWITCH                                  O(n^2)
B       BENES NETWORK                                    O(n log n + n/2)
C       N-CUBE, BASELINE, BANYAN, OMEGA, FLIP, DELTA     O((n/2) log n)
D       DATA MANIPULATOR, INVERSE DATA MANIPULATOR       O(n + n log n)
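
To get a feeling for how far apart these classes are, the sketch below simply evaluates the four complexity expressions from the table for a given number of ports n. It assumes base-2 logarithms and treats the expressions as if the hidden constants were 1, so the numbers are only indicative of relative growth, not of real switch counts.

```python
import math


def network_complexity(n: int) -> dict:
    """Evaluate the table's complexity expressions for n ports.

    Assumes log base 2 and unit constants (both are assumptions of
    this sketch; the table only gives orders of growth).
    """
    if n < 2:
        raise ValueError("need at least two ports")
    log_n = math.log2(n)
    return {
        "A": n ** 2,               # crossbar switch
        "B": n * log_n + n / 2,    # Benes network
        "C": (n / 2) * log_n,      # n-cube, baseline, banyan, omega, ...
        "D": n + n * log_n,        # data manipulator and its inverse
    }


if __name__ == "__main__":
    for n in (32, 64, 256, 1024):
        print(n, network_complexity(n))
```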
Networks with (N·log N) complexity are not
suitable for circuit switching, since they exhibit
at any time a large incidence of mutual blockages
of data paths. This leaves the crossbar switch, 
which provides total point-to-point connectivity, 
as the only viable solution for a circuit switch- 
ing network. Packet switching networks, on the 
other hand, can readily deal with blockages, 
since the nodes have the capability of storing 
and keeping a packet for the duration of a 
blockage. 


The switching elements of a circuit switching 
crossbar network are extremely simple (simple bus 
connection by tri-state logic). In addition, some 
hardware is needed to arbitrate the access to the 
crossbar busses. The switching elements of packet 
switching networks, on the other hand, must be 
intelligent communication processors that are 
orders of magnitude more complex than the simple 
switching elements of a circuit switching
network. By the same token, they are also much
slower, as they have to execute a complex
microprogram to handle a packet. Therefore, if one
compares only the switching complexity of differ- 
ent networks, one may be comparing apples with 
oranges. 


A more detailed feasibility study conducted by
Ermel (1986) has shown that the size of a
crossbar network should not exceed that of a 32 x 32
switching matrix in order to be packageable in a
reasonable manner, and 64 x 64 would be the
absolute technical limit. This means that the direct
interconnection of, say, 256 or more nodes through 
a crossbar switching network is technically not 
feasible. Packet switching networks for the inter- 
connection of an equally large number of nodes, 
even if still feasible in terms of pin limitation 
and packaging constraints, would be too slow for 
supercomputers. 


The way out of this dilemma is a two-stage
approach: forming clusters of nodes and connecting
these clusters either via a crossbar network or
an equally fast packet switching network. The
interconnection of the nodes of a cluster may be
performed through either a very high-speed
cluster bus or, again, a crossbar network.


Software Problems 


Historically the parallel processing user has been 
a rather knowledgeable scientist or engineer who 
is willing to assume the burden of creating ap- 
plication programs using rudimentary environments. 
Detailed architectural and operating system knowl- 
edge as well as the intricate ability to manually 
map parallel algorithms onto a virtual parallel 
architecture are some of the hurdles such users
had to overcome.


Programming of Single-Pipeline Vector Machines 


The programming of single pipeline vector machines 
usually is carried out in FORTRAN. A number of 
vectorizing compilers have been developed which 
map the inner loops of conventional, sequential 


FORTRAN programs onto the vector operations of the 
machine. The conditions under which such a mapping 
is possible and the techniques involved have be- 
come a well-understood topic. 


Programming of Multi-Pipeline Vector Machines 


Presently, multi-pipeline vector machines are 
programmed in a multitasking manner. Typically, a 
task is a part of a program that can be run in 
parallel with some other parts of the program. 
The work to be done is partitioned into at least
as many tasks (processes) as there are pipelines.
The system then maintains a task queue, to which 
an unoccupied pipeline can go in order to find a 
task to execute. This "task attraction" scheme 
works only in strongly coupled systems, i.e. 
systems with shared logical memory. 
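
The "task attraction" scheme described above can be sketched in a few lines: the work is split into tasks, the tasks are placed in one shared queue, and every idle worker fetches the next task from it. In this illustration Python threads stand in for pipelines and a shared in-process queue stands in for the shared memory; the function name and the task tuple layout are assumptions of the sketch, not part of any vendor's multitasking library.

```python
import queue
import threading


def run_task_queue(tasks, n_pipelines):
    """Run (task_id, func, arg) tuples on n_pipelines worker threads.

    Each worker repeatedly takes the next task from the shared queue
    until the queue is empty, mimicking idle pipelines attracting work.
    """
    work = queue.Queue()
    for task in tasks:
        work.put(task)

    results = {}
    results_lock = threading.Lock()  # critical section around shared results

    def pipeline_worker():
        while True:
            try:
                task_id, func, arg = work.get_nowait()
            except queue.Empty:
                return  # no work left for this pipeline
            value = func(arg)
            with results_lock:
                results[task_id] = value

    workers = [threading.Thread(target=pipeline_worker)
               for _ in range(n_pipelines)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # task synchronization: wait for all pipelines
    return results


if __name__ == "__main__":
    demo = [(i, lambda x: x * x, i) for i in range(10)]
    print(run_task_queue(demo, n_pipelines=4))
```

Because all workers draw from a single queue, the scheme presupposes that every pipeline can reach the same task list, which is why it is tied to strongly coupled systems.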


The primary tools for multitasking can be
classified into three categories: task creation,
critical section monitoring, and task
synchronization. The most advanced tools can be found
in the Denelcor HEP system, whereas, for example, the
Cray computers offer only low-level constructs
such as LOCK and EVENT variables. The following
little example demonstrates some of the problems:
SERIAL PROGRAM:          DO 1 I=1, NPIPES
                       1 CALL COMPUTES (I)


PARALLEL PROGRAM:        DO 1 I=1, NPIPES
                       1 CALL TASKSTART (ID, COMPUTES, I)

The two programs do not produce the same results.
The reason is that FORTRAN passes arguments by
reference, and the main task with the I-loop and the
other tasks run asynchronously, so a started task
may read I only after the loop has already changed it.
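
The effect can be reproduced deterministically in a small sketch. Here a mutable cell `Ref` plays the role of the FORTRAN variable I passed by reference, and the started tasks are simply run after the loop has finished, which models one possible (worst-case) interleaving; with truly asynchronous tasks the values seen would be unpredictable. All names are hypothetical and chosen only for this illustration.

```python
class Ref:
    """A mutable cell standing in for a variable passed by reference."""

    def __init__(self, value=0):
        self.value = value


def serial(n_pipes):
    """Serial loop: each call reads I immediately."""
    seen = []
    i = Ref()
    for k in range(1, n_pipes + 1):
        i.value = k
        seen.append(i.value)
    return seen


def deferred_by_reference(n_pipes):
    """Tasks share the cell I and run only after the loop has ended."""
    seen = []
    i = Ref()
    pending = []
    for k in range(1, n_pipes + 1):
        i.value = k
        pending.append(lambda: seen.append(i.value))  # reads I later
    for task in pending:
        task()
    return seen


def deferred_by_value(n_pipes):
    """Each task receives its own copy of I at start time."""
    seen = []
    i = Ref()
    pending = []
    for k in range(1, n_pipes + 1):
        i.value = k
        pending.append(lambda v=i.value: seen.append(v))  # copy now
    for task in pending:
        task()
    return seen


if __name__ == "__main__":
    print(serial(4), deferred_by_reference(4), deferred_by_value(4))
```

Copying the loop index into task-private storage before starting the task, as in `deferred_by_value`, is the usual remedy.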


Despite this and other classical "multitasking" 
problems the simple "task partitioning" approach 
works on the presently existing vector machines 
with a small number of pipelines. The application
programmer is assumed to be a knowledgeable system
programmer capable of using the low-level
multitasking tools.


This approach will not be acceptable when the
number of pipelines becomes larger. The
above-mentioned low-level multitasking tools are far
too error-prone for problems of higher complexity.


Experiences with MIMD Multi-Processor Systems 


The experience gained so far with multitasking
programming comes from system programmers and can
be summarized as follows:


(1) The behavior of even short parallel programs 
may be astonishingly complex. 


(2) Tracking down a bug in a parallel program
can be exceedingly difficult. This results
from the combination of logical complexity,
nonrepeatable behavior, and the lack of
existing tools.


At best, only partial solutions can be expected. 
The following postulates can be given: 


(1) Whenever possible, rewrite at least the 
"skeleton" of the parallel program from 
scratch. 


(2) Given the great difficulty of finding bugs, 
much greater emphasis must be placed on 
writing code which is correct from the be- 
ginning. 


(3) Use high-level synchronization policies 
wherever possible and hide (encapsulate) 
them as much as possible. 


Programming of MIMD Multiprocessor Systems 


The problem of programming a MIMD multiprocessor
system with a very large number of nodes has been
attacked only recently. Realistically, we see four
different ways in which applications software may be
written during the next 5 - 10 years. Each of
them currently has some severe drawbacks that
limit its usefulness. McGraw/Axelrod (1984)
discuss the following four approaches:


(1) Extend existing languages, like FORTRAN, 
with new operations that allow users to 
express concurrency and synchronization. 


(2) Extend existing compilers to identify
concurrent operations wherever they can and
insert the necessary synchronization
(automatic parallelization).


(3) Add a new "language layer" on top of an 
existing language that describes the multi- 
tasking and the desired concurrency, while 
allowing the basic applications program to 
remain "relatively" unaltered (metalanguage 
approach). 


(4) Integrate new languages and appropriate
compilers which incorporate the concepts
of concurrency and synchronization (e.g.,
a very high-level object-oriented and
process-oriented procedural language or a functional
language).


These four ways do not represent totally ortho- 
gonal approaches, and therefore there can be no 
hard lines drawn between them. All of them in- 
volve some amount of language and compiler alter- 
ation. 


SUPRENUM-1: A MIMD Supercomputer 


Rationale for SUPRENUM-1 


SUPRENUM-1 is a MIMD/SIMD supercomputer develop- 
ment project funded by the Ministry of Research 
and Technology (BMFT) of the German Federal Gov- 
ernment. To carry out the project a task force 
has been formed consisting of several research 
institutions and companies. Specifically, the
Gesellschaft für Mathematik und Datenverarbeitung
(GMD) is involved in the following tasks:


• development of a first prototype hardware
and system software (in close cooperation
with the participating industry)


• development of specific program development
environments


• development of application software, e.g.,
multigrid partial differential equation
solvers (Trottenberg, 1984).


The rationale for the decision to build a MIMD/ 
SIMD multiprocessor system rather than a super 
fast SIMD vector machine is multifold: 


(1) The market for SIMD vector machines is 
highly competitive, whereas the market 
for MIMD/SIMD supercomputers is just be- 
ginning to evolve. 


(2) The higher flexibility of the MIMD
principle allows for a broader application
spectrum.


(3) The "scalar gap" problem of SIMD vector 
machines is mitigated in MIMD/SIMD multi- 
processor systems. 


(4) It is expected that the MCT-based MIMD/SIMD 
multiprocessor system will eventually be- 
come more cost-effective than the MFT-based 
vector machine of comparable performance. 


(5) Given the appropriate hardware solution,
the development cost of the MCT-based
multiprocessor system is lower than would be the
cost of developing a MFT-based vector
machine.


(6) The MIMD multiprocessor system development 
will produce technological spin-offs that 
will benefit a whole spectrum of products 
(such a general spin-off effect could not 
be expected from the more specialized pipe- 
line machine development). 


The SUPRENUM-1 research and development project 
is product-oriented. This means that, in contrast 
to pure research, additional market-oriented re- 
quirements must be satisfied such as: 


- the desired absolute performance must be ob- 
tained at competitive cost-effectiveness; 


- the program development environment provided 
must find user acceptance; 


- the machine must be manufacturable, testable, 
and maintainable; 


- there exists a time window during which the 
research and development project must lead 
to a production model. 


In order to meet the time window requirement, the
architectural design of SUPRENUM-1 will be based
upon already proven yet highly innovative
concepts and solutions. In this sense,
the SUPRENUM-1 development is influenced in many
ways by the solutions and experiences gained by
the preceding UPPER-project (Behr/Giloi, 1984).


SUPRENUM-1 Hardware Architecture 


In order to obtain a manageable, bottleneck-free
interconnection structure, SUPRENUM-1 has a
hierarchical hardware structure consisting of nodes,
clusters, and hyperclusters.


As indicated in Figure 3, the basic processing 
node of SUPRENUM-1 is a single board computer con- 
sisting of the following major components: 


• powerful 32-bit ('front end') microprocessor,
to function as program execution machine as
well as scalar processor, in connection with
8 MByte of high-speed dynamic memory and an
object-oriented memory management unit;


• powerful floating-point vector processor
(MFLOPS, IEEE standard single and double
precision), performing a variety of complex
numerical operations under microprogram control;


• microprogram controlled dedicated
communication coprocessor, to support the
object-oriented message passing and to allow the
objects of application-specific data types to be
copied at high speed from node to node.


Each node has a local operating system whose task 
is to boot-up the node on power-on, manage and 
schedule the processes in the node, exchange com- 
munication objects with other nodes, manage the 
local memory, and perform comprehensive self- 
testing and fault diagnosis routines. 


Figure 3 Structure of a node
(block labels: scalar 'front end' processor, floating-point vector
coprocessor, memory management, communication coprocessor)


Figure 4 depicts a block diagram of the cluster. 
The cluster contains up to 16 nodes plus a disk 
controller for local disks, the diagnosis node, 
and the communication unit for intercluster
communication. A cluster is accommodated in one 19"
rack. The processors of the cluster communicate
via an ultra-fast, duplicated parallel backplane
bus, which allows several communication partners
to exchange messages simultaneously. The overall
communication bandwidth of the cluster bus is 256
MByte/s, a value that can hardly ever be
exhausted.


Smaller systems may consist of one hypercluster 
ring, comprising up to 4 clusters (64 nodes) which 
are interconnected by one slotted ring bus (modi- 
fied UPPERBUS, Zuber 1984) with a transmission 
bandwidth of 560 MBit/s. 
Figure 4 Structure of a cluster


Larger systems are structured in the form of a 
matrix of clusters whose rows and columns, respec- 
tively, are hypercluster rings. That is, each 
cluster belongs to two hyperclusters, row and col- 
umn. The hypercluster bus is not fault-tolerant 
itself; rather, fault tolerance is achieved by al- 
ternate routing in the matrix. The hypercluster 
bus controller provides a gateway between the row 
and column each cluster belongs to, and it is
intelligent enough to handle the alternate routing
task arising in the case of bus transmission faults.


The bus controller takes care of the protocol hier- 
archy that regulates the exchange of packets or 
larger logical entities, formed by a number of 
packets, via the slotted ring bus. 
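
The alternate routing idea in the cluster matrix can be illustrated with a minimal Python sketch. It is only a model under stated assumptions: a cluster is identified by its (row, column) position, a ring by a hypothetical tag such as `("row", 2)`, and only routes that use at most two rings are considered. The names `candidate_routes` and `pick_route` are invented for this example; the source does not describe the bus controller's actual routing algorithm.

```python
def candidate_routes(src, dst):
    """Return candidate routes between two clusters as lists of rings.

    Clusters are (row, col) tuples. A ring is ("row", r) or ("col", c).
    Assumption: only routes crossing at most two rings are listed.
    """
    (sr, sc), (dr, dc) = src, dst
    if src == dst:
        return [[]]                     # same cluster, no ring needed
    if sr == dr:
        return [[("row", sr)]]          # both clusters sit on one row ring
    if sc == dc:
        return [[("col", sc)]]          # both clusters sit on one column ring
    # Two ways around the rectangle: row ring first, or column ring first.
    return [
        [("row", sr), ("col", dc)],
        [("col", sc), ("row", dr)],
    ]


def pick_route(src, dst, faulty_rings):
    """Pick the first candidate route that avoids every faulty ring."""
    for route in candidate_routes(src, dst):
        if not any(ring in faulty_rings for ring in route):
            return route
    return None                         # no two-ring route survives


if __name__ == "__main__":
    print(pick_route((0, 0), (2, 3), set()))
    print(pick_route((0, 0), (2, 3), {("row", 0)}))
```

In this simplified model a fault on the source row ring is bypassed by leaving through the column ring instead, which is the kind of gateway decision the text attributes to the hypercluster bus controller.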


The SUPRENUM-1 system includes a separate oper- 
ating system machine, whose tasks are to manage 
the global system resources such as global disk 
storage, take care of the workload distribution 
and system initialization, and control the recov- 
ery procedures required in the case of fault de- 
tection. The operating system machine, however, is 
not involved in the actual program execution. Pro- 
gram execution is strictly handled by the collec- 
tive of local node operating systems. In addition 
to the operating system machine, SUPRENUM-1 con- 
tains a dedicated diagnosis and maintenance ma- 
chine. Moreover, additional 'peripheral computers' 
can be added such as: one or more program develop- 
ment machines and special graphic processors for 
graphical representation of results. Note that 
these additional machines can be arbitrarily in- 
serted into any of the rings of the orthogonal net- 
work of ring busses and that it is specifically 
the (ultra high bandwidth) ring bus structure that 
allows for the wide-range configurability of the 
system. 


Figure 5 illustrates the matrix structure of a 
SUPRENUM-1 configuration with 256 nodes and a 
maximum processing power of 4 GFLOPS.


Figure 5 SUPRENUM System With 16 Clusters
(256 Nodes)


SUPRENUM-1 Software Architecture


The SUPRENUM Software Environment


The SUPRENUM software system forms a hierarchy 
consisting of: 


• 'firmware' (PROMed software) and software
for node monitoring, process management, and
interprocess communication

• operating system machine software providing
the language interface as well as the central
database management

• appropriate parallel processing languages

• program constructor

• application constructor.


Every layer provides services to the upper layers. 
The language layer will be based upon the services 
of an abstract SUPRENUM machine. 


The abstract SUPRENUM machine specifies the run- 
time environment the (distributed) 'global oper- 
ating system' (collective of node operating sys- 
tems) will support. Its basic features can be 
characterized as follows: 


• The logical entity of computation is the
process.

• The logical entity of communication is the
communication object.

• Only communication objects can be shared by
several processes; all other objects are
local to the process that owns them.

• A structure is provided for aggregating
processes with communication objects. This
structure is called a task.

• The task provides the scope of protection.

• Multiple users may be given partitioned
access to the machine.


The following basic features of the abstract 
SUPRENUM machine will be reflected in the pro- 
gramming languages: 


• dynamic creation of explicit processes;
• asynchronous interprocess communication (IPC);
• messages received at a single process entry
point.


Asynchronous communication through messages
received at multiple entry points has been explored
by Behr/Giloi (1984) and Mühlenbein/Warhaut (1985).
The approach is reflected in appropriate
extensions of PASCAL (Hanisch, 1984) and MODULA 2
(Warhaut, 1986). In the SUPRENUM system, the languages
extended for program decomposition into cooperating
processes with data-driven synchronization
will be FORTRAN and MODULA 2.


The SUPRENUM Program Constructor 


The program constructor consists of language-spe- 
cific syntactic editors and interpreters. The con- 
structor is based on the programming system gener- 
ator "PSG" developed by Bahlke/Snelting (1985). 
The constructor can deal with incompletely
specified program templates, called fragments, and
comprises a 'hybrid' editor, which accepts textual
input and/or prompted input from an abstract
syntax tree.


Program fragments called program templates will be 
implemented for specific types of generic inter- 
process communication. One example is a program 
template for a ring process structure, where every 
process sends messages only to its left and right 
neighbor. Another example is the 2D-mesh, where 
each process can send messages to its four neigh- 
bors, etc. 
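
The two communication patterns named above can be sketched in a few lines of Python. This is an illustration only: the SUPRENUM templates target FORTRAN and MODULA 2, and the function names, the 0-based numbering, and the choice of a mesh without wrap-around edges are assumptions of this example.

```python
def ring_neighbors(rank, size):
    """Left and right neighbor of process `rank` in a ring of `size` processes."""
    return ((rank - 1) % size, (rank + 1) % size)


def mesh_neighbors(row, col, rows, cols):
    """Neighbors (north, south, west, east) of a process in a 2D mesh.

    Assumption: the mesh has no wrap-around, so border processes
    have fewer than four neighbors.
    """
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


if __name__ == "__main__":
    print(ring_neighbors(0, 8))        # process 0 talks to 7 and 1
    print(mesh_neighbors(1, 1, 4, 4))  # interior process, four neighbors
```

A template of this kind fixes the communication partners once, so the application programmer only fills in the computation performed between message exchanges.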


Other components of the programming environments 
are run-time debuggers of different flavors. For 
example, language interpreters may be used for de- 
bugging small modules, simulators may be used for 
debugging larger programs, and a performance moni- 
tor may be used for performance-related debugging 
on the physical SUPRENUM machine. 


The SUPRENUM Application Constructor 


The SUPRENUM Application Constructor will allow
users who do not want to do low-level programming
to implement applications by using existing
application software packages.


The development of the SUPRENUM Application Con- 
structor is a very research-oriented topic. The 
constructor will have to contain specialized ap- 
plication languages for restricted application 
domains, as well as a knowledge base for guiding 
the user in the task of generating applications 
by some heuristics. 


First examples will be a very high level language
for specifying the solution of partial differential
equations and an expert system for
creating the appropriate multigrid programs.


ABSTRACT


Reliability and availability are two common meas- 
ures used to evaluate the dependability of computer 
systems. In this paper we analyze the reliability and 
Computation-Communication Availability of multi- 
computer networks for multiple faults with and 
without repair. Simulation models are developed based 
on task requirements, graceful degradation and compu- 
tation and communication capability of the system. 
The effect of component failure rate and repair rate on
the dependability of the multicomputers is also
presented. The model accepts the adjacency matrix of
a multicomputer graph as the input and hence is suit- 
able for all types of networks. Typical results are 
presented for 16-node loop, complete connection, 
hypercube, mesh and tree structures on a comparative 
basis. 


1. INTRODUCTION 


Based on interconnections, parallel and distri- 
buted computer architectures can be broadly divided 
into two categories, namely: multiprocessors and multi- 
computers[1]. A multiprocessor consists of a large 
number of processors connected to a number of 
memory modules through a circuit switched intercon- 
nection network. In a multicomputer, however, each 
processor has its own private memory and
message/packet switching is used for communication
between the processors (nodes). Reliability issues of
multiprocessor systems were presented in [2, 3]. This
paper is concerned with those issues of multicomputer
systems. Multicomputers are also sometimes referred to
as computer networks. Normally the former is used in
conjunction with centrally located systems while the
latter is used in case of geographically distributed
computers. We will, however, use these two terms
without distinction. We will also use processors and
nodes interchangeably.


Several structures such as loops, trees, full connec- 
tion and hypercubes, etc. [4-6] have been proposed to 
interconnect a network of computers in a 
message/packet switching environment. Each structure 
possesses some unique advantages and disadvantages 
compared to another. For example, a bi-directional
single loop (ring) structure has only two I/O ports per
node, but the diameter (maximum number of hops
between any two nodes along the shortest path) is
$\lfloor N/2 \rfloor$ in a system with N nodes. Any two
non-adjacent faulty nodes or links disconnect the loop.
On the other hand, a completely connected structure has
(N-1) I/O ports per node, but the distance between any
two nodes is unity. The structure is highly fault
tolerant, but due to its high cost the structure is
unsuitable for systems with a large number of nodes.
The cost and performance of other structures such as
chordal ring, tree and hypercube, etc. lie between
these two [4-7]. The
performance of the multicomputers is usually measured 
in terms of the average distance between the nodes and 
the average traffic density on a link. These measures 
usually decide the average queueing delay or the 
saturating load of a computer network [5-7]. 


All these performance evaluations assume that the 
components of the systems are fault free. In a real 
situation, however, the nodes and links will fail
randomly depending on their failure rates. Hence, certain
dependability measures like reliability, availability,
etc. [8, 9] of these multicomputer systems are also
important. These dependability measures can also be
quantified based on the performance requirement at
the user level or system level [11-13]. For a degradable
multicomputer one measure is the task based
dependability. In this case the failures are tolerated as long as
the minimum number of nodes for the execution of a 
task are available and the reliability and availability of 
a system are computed based on this task requirement. 
The following example clarifies this idea. 


Example-1 


Consider a bidirectional single loop structure with
16 nodes where a node and a link have failed in (0, t).
Assuming a task requires 12 nodes for computation,
the structure in Fig. 1.a is good enough, but the
structure in Fig. 1.b is not. The former has a linear array of
13 nodes sufficient for computation while the latter has
two arrays of 8 and 7 nodes, both of which are
insufficient for the task.
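
The segment sizes quoted in Example-1 can be checked with a short Python sketch. The link numbering is an assumption: link k is taken to join node k and node k+1 (link 16 joins nodes 16 and 1), which is the numbering that reproduces the segment sizes stated in the example. The function names are hypothetical.

```python
def ring_segments(n, failed_nodes, failed_links):
    """Sizes of the connected segments of an n-node ring after failures.

    Nodes are numbered 1..n. Assumption: link k joins node k and node k+1,
    and link n joins node n and node 1.
    """
    alive = [v for v in range(1, n + 1) if v not in failed_nodes]
    neighbors = {v: set() for v in alive}
    for k in range(1, n + 1):
        a, b = k, k % n + 1
        if k not in failed_links and a in neighbors and b in neighbors:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, sizes = set(), []
    for start in alive:
        if start in seen:
            continue
        # Depth-first walk over one connected segment.
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in neighbors[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def task_can_run(segment_sizes, required_nodes):
    """A task runs if some connected segment has enough nodes."""
    return any(size >= required_nodes for size in segment_sizes)


if __name__ == "__main__":
    fig_1a = ring_segments(16, {4}, {6})
    fig_1b = ring_segments(16, {4}, {11})
    print(fig_1a, task_can_run(fig_1a, 12))
    print(fig_1b, task_can_run(fig_1b, 12))
```

With a 12-node task, the first configuration keeps a 13-node array and remains usable, while the second splits into arrays of 8 and 7 nodes and does not.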


The reliability at time t is defined as the
probability that the system is operational during (0, t) [10].


The aim of this paper is to compare the reliabilities of
different multicomputer structures under similar task
requirements. From example 1, it is clear that the
reliability not only depends on the number of faults during
(0, t) but it heavily depends on the exact location of
the faults which is random in nature. The situation 
becomes more and more complicated with progress of 
time, not to mention other difficult structures such as 
trees, meshes and hypercubes, etc. Hence it is unlikely 
that an analytical solution to compute the reliability 
exists; we therefore resort to simulation. 


The reliability, as discussed above, does not reveal
the performance degradation due to failure of various
components in (0, t). The computation availability
(CA) of a gracefully degrading multicomputer is
defined as the expected value of available computation
at time t [9]. This is directly proportional to the
number of processors (nodes) operational at time t.
Many other similar measures have been suggested by
different authors to capture the dependability of
degradable computer systems [13, 14]. We develop an
availability model for computation and communication
as described in the next section.


We will consider only hard failures of the nodes
and links and compare the fault-tolerant properties of
various multicomputers for both repairable and
non-repairable conditions. By non-repairable we mean that
there is no on-line maintenance facility available to
repair/replace the failed components. The failed
elements are detected and isolated by the maintenance
processor and are repaired only when the system goes
to the failed mode. A more optimistic situation is to
consider on-line repair, which means a component is
repaired/replaced whenever it fails. In this case the
system dependability is greatly influenced by the repair
rate. The effect of the failure rate and repair rate on
the dependability of the systems is also analyzed.


Fig. 1.a - A 16 node ring network after node 4
and link 6 have failed.


Fig. 1.b - A 16 node ring network after node 4
and link 11 have failed.

2. COMPUTATION AND COMMUNICATION 


The most critical issues in the design of a distri- 
buted system are concerned with interprocess and 
interprocessor communication. It has been widely 
recognized that communication is at least as important 
as computation [15]. Hence, the communication
capability of the degraded multicomputer must be
considered with an equal weight taking into account both
the node and link failures. It is also important to note
that the component failures may bring in a structural
change of the system. The loop structure of example 1,
illustrated before, reduces to linear arrays after a
node and a link failure. This structural change will
result in a significant degradation in the
communication ability of the network. Therefore, the previous
availability models [9] do not represent the system
performance adequately. Cherkassy et al. [16] have
considered both node and link failures in a tree
network. However, because of the analytical complexity,
they restrict their studies to single failures. Recently
Tsuchiya [17] has presented a mathematical model for
the availability analysis of distributed networks. But
he does not consider performance degradation due to
node and link failures. In this paper, we will study
performance degradation with multiple failures, and
for various types of networks. Naturally, simulation
seems to be the only way to accomplish this
formidable task.


First, we have to determine a suitable measure for
the computation and communication in a multicomputer
system. The computation capability of a network
is directly proportional to the number of operational
nodes. The communication capability is usually
measured in terms of the average delay of a message
between a source and a destination. Under the usual
assumptions of uniform traffic generation, fixed
routing, etc. [5, 7], the delay is proportional to
$\frac{d}{1-\rho}$ [5], where d is the average number of hops a
message passes through on its route and $\rho$ is the
utilization factor. Here $\rho = \frac{\gamma d}{\mu C}$, where $\gamma$ is the
throughput in messages/sec, $1/\mu$ is the average message
length in bits/message and C is the total capacity of
the network in bits/sec. When $\rho$ approaches 1, the delay
approaches infinity, corresponding to a saturation traffic
of $\gamma_{sat}$. Then $\gamma_{sat} = \frac{\mu C}{d}$ truly represents the
communication capacity of a network. Here, $\mu$ being a
constant and C being directly proportional to the number
of links (L), $\gamma_{sat} = \frac{k_1 L}{d}$ for some constant $k_1$.
Although this is a good measure for the communication
capability of the links, the number of nodes may be
insufficient to handle that amount of traffic. In a
network with N nodes, $k_2 N$ messages can be generated or
processed in unit time for some constant $k_2$. Hence, the
actual communication capability is $\mathrm{Min}(\frac{k_1 L}{d}, k_2 N)$.
With unit constants, we can safely assume $\mathrm{Min}(N, \frac{L}{d})$
as a figure of merit. A similar measure was considered for
the performance comparison of various multicomputers
[11]. For simplicity (we guess), only the number of links
(L) was chosen as the performance metric in [16].


The computation capability being proportional to
N, we can define the computation-communication
capability of a network as $N \cdot \mathrm{Min}(N, \frac{L}{d})$. The cost
of a multicomputer system includes the cost of
processors (nodes), links and I/O ports. If we start with the
same number of nodes for all the structures we can
assume that the cost is proportional to the number of
links (L). Taking performance and cost into account,
we define the computation-communication availability
of the system ($CCA_s(t)$) as:

$$CCA_s(t) = \frac{1}{L} \sum_{i=1}^{x} N_i \, \mathrm{Min}\left(N_i, \frac{L_i}{d_i}\right) \qquad (1)$$

where at time t, the network contains x disjoint
segments with the ith segment having $N_i$ nodes, $L_i$ links
and average distance $d_i$. L is the total number of links
of the initial configuration. We will consider disjoint
segments that have at least two connected nodes
[11]. Hence,

$$\sum_{i=1}^{x} N_i \le N \quad \text{and} \quad \sum_{i=1}^{x} L_i \le L$$

for $N_i \ge 2$. Note from the above $CCA_s(t)$ expression
that a completely connected structure will be node
deficient whereas a loop structure will be link
deficient.


Example-2


Consider the situation depicted in Fig. 1. The
$CCA_s(t)$ for Fig. 1.a is 1.63 and the $CCA_s(t)$ for Fig.
1.b is 1.38.


Computation-Communication Availability (CCA)
as defined above can be interpreted differently for
different applications. If the multicomputer only
executes tasks that need a minimum of I nodes, the CCA
is obtained by summing over the disjoint sets that
satisfy this minimum requirement. When I = 2, the
availability is the same as that denoted by the equation
above. On the other hand, for batch processing
environments, where the multicomputer executes one
task at a time, the CCA of the system will be the
maximum of the available computation-communications
over all the disjoint multicomputer
sets. The task can then be executed on any set that
has at least I connected nodes. If none of the disjoint
sets has I nodes, the CCA of the system is assumed to
be zero, because it is of no use. These availability
models are called task based CCA models in this paper
in order to distinguish them from the absolute CCA
whose equation was derived earlier. The model is
capable of computing both the absolute and task based
CCAs of any multicomputer network.
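
The absolute and task based CCA variants described above can be expressed compactly. The following Python sketch is a reading of equation (1) under stated assumptions: each segment is given as a hypothetical tuple `(N_i, L_i, d_i)`, the batch variant is normalized by the same initial link count L as the summed variant (the text does not say how it is normalized), and the function name `cca` is invented here.

```python
def cca(segments, total_links, min_nodes=2, batch=False):
    """Computation-communication availability of a degraded network.

    segments    : list of (N_i, L_i, d_i) tuples, one per disjoint segment
    total_links : L, the number of links in the initial configuration
    min_nodes   : task requirement I; segments smaller than this are ignored
    batch       : if True, use the best single segment instead of the sum
                  (assumption: same 1/L normalization as the summed form)
    """
    terms = [n * min(n, l / d) for n, l, d in segments if n >= min_nodes]
    if not terms:
        return 0.0                      # no segment can host the task
    value = max(terms) if batch else sum(terms)
    return value / total_links


if __name__ == "__main__":
    # Fault-free 16-node completely connected network: 120 links, d = 1.
    print(cca([(16, 120, 1.0)], total_links=120))
    # A toy degraded network with two segments.
    print(cca([(4, 3, 1.5), (2, 1, 1.0)], total_links=4))
```

Setting `min_nodes=2` gives the absolute CCA, while a larger `min_nodes` or `batch=True` gives the task based variants.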


3. SIMULATION TECHNIQUES 


In this section we present the simulation tech- 
niques for the reliability and the CCA evaluation of 
various multicomputer networks. We compute the task 
based reliability as used by Ingle and Siewiorek [12]. It
is defined as follows:


"If a task needs at least I processors for its
execution, then the system remains operational as
long as these minimum resources are available on
the system. Otherwise the system goes to the
failed state."


It is to be noted that this technique is a
generalization of the previous methods used for reliability
computation. For example, when I = N, i.e. when a task
needs all the processors and does not allow any
graceful degradation, we get the good old reliability which is
the series reliability of all the required components.
For I = 2, the model is similar to the one adopted in
[11] and for I = 1 it is equivalent to Beaudry's model
[9].


Failure and repair assumptions 


It is assumed that the failure times of the nodes and the
links are exponentially distributed. A nonexponential
assumption will only mean a change to the
random selection of failing time and can be easily
incorporated into the simulation. We assume
homogeneity of all the nodes as well as of all the links to
consider identical failure characteristics. Thus we
define $\lambda_n$ and $\lambda_l$ as the failure rates of a node and a
link respectively. The reliabilities of a node and a link
are then given by $R_n(t) = e^{-\lambda_n t}$ and $R_l(t) = e^{-\lambda_l t}$.
The node and link failure rates of a system are then
given by $X\lambda_n$ and $Y\lambda_l$, where X and Y are the
number of active nodes and links taking part in
computation/communication at any time t. A node is
considered active if it is a member of a set of
connected nodes and a link is active if it is used for
communication between two active nodes.
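
One way to draw the next failure under these assumptions is sketched below in Python. It relies on a standard property of independent exponential lifetimes: the time to the first failure among X nodes and Y links is exponential with rate $X\lambda_n + Y\lambda_l$, and the failing element is a node with probability $X\lambda_n / (X\lambda_n + Y\lambda_l)$. The function name and signature are assumptions of this example, not taken from the authors' program.

```python
import random


def next_failure(active_nodes, active_links, lam_n, lam_l, rng):
    """Sample the time to the next failure and whether a node or link fails.

    active_nodes, active_links : X and Y, counts of active elements
    lam_n, lam_l               : per-element failure rates (per hour)
    rng                        : a random.Random instance
    """
    node_rate = active_nodes * lam_n
    link_rate = active_links * lam_l
    total_rate = node_rate + link_rate
    dt = rng.expovariate(total_rate)            # exponential inter-failure time
    kind = "node" if rng.random() < node_rate / total_rate else "link"
    return dt, kind


if __name__ == "__main__":
    rng = random.Random(0)
    # 16-node ring: 16 active nodes and 16 active links.
    for _ in range(3):
        print(next_failure(16, 16, 1e-4, 2e-5, rng))
```

A nonexponential assumption would only change the call that samples `dt`, which matches the remark in the text that other distributions are easy to incorporate.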


We consider on-line repair of the nodes and links
with exponentially distributed repair times of mean
$\mu_n^{-1}$ and $\mu_l^{-1}$ respectively. We assume a small repair
time of a few hours that represents the actual repair
duration of a component. However, we may not need
such an instant repair facility because of the
redundancy provided by the system or due to operating
characteristics. So a variation of the above assumption
is to stretch the MTTR (mean time to repair)
duration. A practical implication of this policy is the
unavailability of the repairman to attend the fault. If the
unavailability period is very large the actual repair
time is negligible and the repair rate is mostly
determined by the mean unavailable time of the repairman.
The repair is carried out until the system is in the
working (up) state so that the system dependability
can be computed.

System representation 


The representation of any multicomputer topology
is given by an adjacency matrix A using graph
theoretic notations. The matrix elements A[i,j] and
A[j,i] are '1' if there is a link between node i and
node j. Otherwise A[i,j] = A[j,i] = 0. A nonfailed
node i, 1 ≤ i ≤ N, is represented by making the
diagonal element A[i,i] = 1. As an example the
initial adjacency matrix for a 16 node loop network is
given in Fig. 2.


1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1


Fig. 2 Initial adjacency matrix for a 16 node
ring network.
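
As a small illustration, the ring matrix of Fig. 2 can be generated rather than typed in. The Python sketch below follows the convention stated in the text (symmetric link entries, diagonal set to 1 for a nonfailed node). It uses 0-based indices, so node i of the paper is index i-1 here; the function name is hypothetical.

```python
def ring_adjacency(n):
    """Initial adjacency matrix of an n-node ring, with 1s on the diagonal.

    Indices are 0-based; entry a[i][j] = 1 marks a link between i and j,
    and a[i][i] = 1 marks node i as nonfailed.
    """
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1                 # node i is up
        j = (i + 1) % n             # successor on the ring
        a[i][j] = 1
        a[j][i] = 1                 # links are bidirectional
    return a


if __name__ == "__main__":
    for row in ring_adjacency(16):
        print(" ".join(str(x) for x in row))
```

Other topologies only need a different rule for which off-diagonal entries are set, which is why the simulation can accept any network as input.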


We use a reachability matrix R [3] to show the
graph connectivity. The matrix specifies whether or
not there exists a path between two nodes which are
not necessarily adjacent. The connectivity of any
arbitrary graph can be obtained from its adjacency matrix
using any standard algorithm [18]. It is evident that
the initial reachability matrix R of any connected
network has all its elements as '1', indicating that there is
a path from any node to any other node in the
network.
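
The text leaves the choice of algorithm open. As one possibility, the Python sketch below uses Warshall's transitive closure, which is a substitution made for this example and not necessarily the algorithm of [18]. It also groups nodes with identical reachability rows into disjoint sets. Indices are 0-based and the function names are hypothetical.

```python
def reachability(a):
    """Reachability matrix R from adjacency matrix A (Warshall's algorithm).

    A failed node has an all-zero row and column, so it reaches nothing.
    """
    n = len(a)
    r = [row[:] for row in a]
    for k in range(n):
        for i in range(n):
            if r[i][k]:
                for j in range(n):
                    if r[k][j]:
                        r[i][j] = 1
    return r


def disjoint_sets(r):
    """Group nonfailed nodes that share the same row of R."""
    groups = {}
    for i, row in enumerate(r):
        if row[i]:                      # skip failed nodes (zero diagonal)
            groups.setdefault(tuple(row), []).append(i)
    return list(groups.values())


if __name__ == "__main__":
    # Three nodes in a line: 0 - 1 - 2.
    line = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
    print(reachability(line))
    print(disjoint_sets(reachability(line)))
```

For a connected network every row of R is all ones, so `disjoint_sets` returns a single group containing all nodes.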


Whenever a node i fails, all the entries of the ith
row as well as of the ith column of the A matrix are
made '0'. Similarly, if the random failure of a link
destroys the connection between nodes i and j, then the
A[i,j] and A[j,i] elements of the A matrix are
made '0'. So the A matrix is modified depending on
the two types of faults. This modification of the A
matrix is then reflected in the reachability matrix R,
which can be divided into various disjoint submatrices
having all the elements as '1'. For example the R
matrix for Fig. 1.b is given in Fig. 3. It can be
observed that the failure of node 4 and link 11 has
resulted in two disjoint sets. In one of the sets the
connected nodes are {1, 2, 3, 12, 13, 14, 15, 16} and the
second connected set is {5, 6, 7, 8, 9, 10, 11}. These
two sets are obtained from the R matrix of Fig. 3.


tivoooaoooogooj1iiii 
l11100000000,11111 
tiloooo0oo0oo0o1n111141 
9000000000000 0550 
9000MWLILILitogadoogOg 
O0001M1lILI1IL1IYOOOOO 
00001 LILIILYyOOOOO 
9O0O001LLILILUDODQODO 
09000lLLILIIL-UAgOOOO 
O00011L111i1trooood 
90900111111 H00000 
JlLivovovodod0nriryr 
llvoqoooogoootii), 4 
L1iyooo0o0o0o0o0o0'n i111 1 
Lllyoooooooolii1 yy 
Liyoooooooododtiiyz, 3 


Fig. 3 Reachability matrix for Fig. 1.b. 


As we will analyze the variation of reliability with 
time, we divide the selected time frame into uniform 
intervals. Whenever a node and/or link failure becomes 
due, the program goes through the following steps: 


1) An element is selected at random from the exist- 
ing active elements. 


2) If the fault is covered [19] then go to step 3, else
increment the failed count of the system and start
a new experiment again from time t = 0 with the
initial configuration of the system.

3) The number of active elements is decremented by 
one. 

4) The A matrix is modified depending on the type 
of the fault. 

5) The next fail time of the node/link is computed. 
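
Steps 1 to 4 of this list can be sketched as a single Python function operating on the A matrix. The sketch makes several simplifying assumptions that are not in the source: every nonfailed node and every remaining link is treated as an active candidate, the victim is chosen uniformly among candidates of the due fault type, and coverage is modeled as a single probability. Step 3 is implicit in the updated matrix and step 5 is left to the caller. The name `apply_fault` is hypothetical.

```python
import random


def apply_fault(a, kind, coverage, rng):
    """Apply one random fault of type `kind` ("node" or "link") to A in place.

    Returns True if the fault was covered (or nothing was left to fail),
    and False if it was uncovered, in which case A is left unchanged and
    the caller should count a system failure and restart the experiment.
    """
    n = len(a)
    if kind == "node":
        candidates = [i for i in range(n) if a[i][i] == 1]
    else:
        candidates = [(i, j) for i in range(n)
                      for j in range(i + 1, n) if a[i][j] == 1]
    if not candidates:
        return True
    victim = rng.choice(candidates)             # step 1: pick an element
    if rng.random() >= coverage:                # step 2: coverage check
        return False
    if kind == "node":                          # step 4: update A
        for k in range(n):
            a[victim][k] = 0
            a[k][victim] = 0
    else:
        i, j = victim
        a[i][j] = 0
        a[j][i] = 0
    return True


if __name__ == "__main__":
    a = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(apply_fault(a, "node", 1.0, random.Random(0)), a)
```

With `coverage=1.0`, as in the reported experiments, every fault is covered and the function always modifies the matrix.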


With the instant repair policy we use a single 
repairman service facility. The sequence of the steps 
for repair are: 

1) Whenever a component fails, its repair duration is
computed if the repairman is readily available.
Else the component is kept in a FCFS repair
queue till the service is available.

2) The node/link is returned to the system at the
completion of repair and the A matrix is
modified.

3) The next failed element from the queue is taken
up for repair.
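
The timing behavior of a single repairman with an FCFS queue can be sketched as follows. This Python example only computes when each repair completes. The input format, a list of hypothetical `(fail_time, repair_duration)` pairs already sorted by failure time, is an assumption of the example, and restoring the A matrix at each completion time is left out.

```python
def fcfs_repair_completions(failures):
    """Completion times of repairs served by one repairman in FCFS order.

    failures : list of (fail_time, repair_duration), sorted by fail_time
    returns  : list of completion times, in the same order
    """
    completions = []
    free_at = 0.0                       # time at which the repairman is idle
    for fail_time, duration in failures:
        start = max(fail_time, free_at) # wait in the queue if he is busy
        free_at = start + duration
        completions.append(free_at)
    return completions


if __name__ == "__main__":
    # Second failure arrives while the first repair is still in progress.
    print(fcfs_repair_completions([(0.0, 10.0), (3.0, 2.0), (20.0, 5.0)]))
```

In a full simulation the durations would be drawn from the exponential repair distributions, and each completion would return the node or link to the A matrix.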

The initial A matrix is thus modified with a random
number of '0' entries due to the failure/repair of
nodes and links with the passage of time. We compute 
the corresponding R matrix at the specified time inter- 
vals. The R matrix gives the connectivity information 
of each node which is employed to find the system con- 
dition. 

The R matrix, obtained above, is searched
exhaustively to obtain all the available valid
submatrices. A submatrix is valid if it has all 1's so that it
can represent a set of connected nodes (subgraph). The
communication performance of a subgraph is then
obtained from $\mathrm{Min}(N_i, \frac{L_i}{d_i})$, where $N_i$ is the
number of active nodes in subgraph i, $L_i$ is the
number of links that are used for communication and
$d_i$ is the average distance a message has to travel in
the ith subgraph. $N_i$ is obtained by counting the
number of nodes in subgraph i. $L_i$ is obtained by
counting all nonzero A[j,k] such that j, k ∈ {subgraph i}
and k ≥ j+1. The average distance for an arbitrary
network is defined as:

$$d_i = \frac{\sum_{j=1}^{N_i} \sum_{k=1}^{N_i} d_{jk}}{N_i (N_i - 1)}$$

where $d_{jk}$ is the shortest distance from node j to node
k. So we use a shortest path algorithm [18] to find $d_{jk}$
from any node to any other node in the subgraph and
finally compute $d_i$. The adjacency matrix A is used to
find the shortest path between nodes of a subgraph. So
the sequence of steps the program goes through to find
the $R_s(t)$ and $CCA_s(t)$ at some specified time t are:


1) Compute the R matrix from the A matrix and
find the various disjoint groups from the R
matrix.

2) Select the number of groups 'x' that satisfy the
minimum resource requirements I, if any.

3) If x = 0, then the system can not satisfy the
resource requirements. Increment the failed count
and start the experiment again from the initial
configuration of the system. Reset the system
time.

4) Otherwise for each subgroup $g_i$, find the $N_i$, $L_i$
and $d_i$ from the A matrix as explained earlier.
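
The per-subgraph quantities used in the last step can be computed as in the following Python sketch. Because every link counts as one hop, breadth-first search is used here to obtain the shortest distances; that choice, the 0-based indexing, and the function name are assumptions of this example rather than details taken from [18] or from the authors' program.

```python
from collections import deque


def subgraph_metrics(a, nodes):
    """Return (N_i, L_i, d_i) for a connected set of nodes of matrix A.

    N_i : number of nodes in the subgraph
    L_i : number of links, counting each A[j][k] = 1 with k > j once
    d_i : average shortest distance over all ordered node pairs
    """
    nodes = sorted(nodes)
    n_i = len(nodes)
    l_i = sum(a[j][k] for idx, j in enumerate(nodes) for k in nodes[idx + 1:])

    total = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:                            # breadth-first search
            v = queue.popleft()
            for w in nodes:
                if w != v and a[v][w] and w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    d_i = total / (n_i * (n_i - 1)) if n_i > 1 else 0.0
    return n_i, l_i, d_i


if __name__ == "__main__":
    # Linear array of four nodes: 0 - 1 - 2 - 3.
    path = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 1]]
    print(subgraph_metrics(path, [0, 1, 2, 3]))
```

The function expects `nodes` to form one connected group, as produced from the R matrix in step 1.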


We use the multiple independent repetition technique
to determine the $R_s(t)$ and $CCA_s(t)$. The failed count
is used to obtain the system reliability and
$N_i$, $L_i$ and $d_i$ are used to find the $CCA_s(t)$ from
equation 1.
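
A minimal Python sketch of the independent repetition idea, restricted to reliability without repair, is given below. It is a simplification and not the authors' program: every node and link is given an independent exponential lifetime, coverage is taken as 1.0, and a trial counts as a success if some connected segment still has at least I nodes at time t. Without repair a network can only degrade, so checking the state at time t is enough. All names are hypothetical.

```python
import random


def largest_segment(n, links, node_up, link_up):
    """Size of the largest connected segment among surviving nodes."""
    adj = {v: [] for v in range(n) if node_up[v]}
    for idx, (a, b) in enumerate(links):
        if link_up[idx] and a in adj and b in adj:
            adj[a].append(b)
            adj[b].append(a)
    best, seen = 0, set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def estimate_reliability(n, links, required, t, lam_n, lam_l, trials, seed=0):
    """Fraction of independent trials in which a task needing `required`
    connected nodes can still run at time t (no repair, coverage 1.0)."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        node_up = [rng.expovariate(lam_n) > t for _ in range(n)]
        link_up = [rng.expovariate(lam_l) > t for _ in links]
        if largest_segment(n, links, node_up, link_up) >= required:
            survived += 1
    return survived / trials


if __name__ == "__main__":
    ring = [(i, (i + 1) % 16) for i in range(16)]
    for need in (8, 12, 16):
        print(need, estimate_reliability(16, ring, need, 1000.0, 1e-4, 2e-5, 2000))
```

Any topology can be passed in through `links`, and lowering `required` can only raise the estimate, in line with the graceful degradation argument in the text.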


4. SIMULATION RESULTS 


The simulation model, discussed in section 3, can
be used to analyze any arbitrary network topology. It
accepts the initial A matrix and the task requirement
I as the input parameters and gives the reliability
$R_s(t)$ and computation-communication availability
$CCA_s(t)$ variation with time. We present here the
fault-tolerant capability of five types of multicomputer
networks. They are: Full connection (FC), mesh,
Hypercube (HC), binary tree and ring networks. The
topologies are shown in Fig. 4. Performance and cost
parameters for these networks are reported in [5-7].


We have taken an initial configuration of 16 nodes
(N = 16) for all the networks. Most of our results are
based on a node failure rate $\lambda_n$ = 100 in $10^6$ hours and
link failure rate $\lambda_l$ = 20 in $10^6$ hours. The processor
failure rate is taken close to that of an LSI-11
processor [12]. The link failure rate is taken as one fifth of the
processor failure rate and does include the interfaces at
both ends of the link. The effect of varying the node
and link failure rates is also discussed. We would
like to emphasize that the model accepts these failure
rates as inputs and hence is suitable for any other
failure rates that can be determined for another
implementation based on a component count and
technology.
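
To make these rates concrete, they can be converted to per-hour values and inserted into the exponential reliability expressions of Section 3. The 1000-hour horizon used below is only an illustrative choice.

```latex
\lambda_n = \frac{100}{10^{6}\,\mathrm{h}} = 10^{-4}\,\mathrm{h}^{-1},
\qquad
\lambda_l = \frac{20}{10^{6}\,\mathrm{h}} = 2 \times 10^{-5}\,\mathrm{h}^{-1}

R_n(1000) = e^{-10^{-4} \cdot 1000} = e^{-0.1} \approx 0.905,
\qquad
R_l(1000) = e^{-2 \times 10^{-5} \cdot 1000} = e^{-0.02} \approx 0.980
```

So a single node survives 1000 hours with a probability of roughly 0.905, and a single link with a probability of roughly 0.980.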

Fig. 5 shows the variation of reliability with time
for all the above networks when the task needs a
minimum of 12 processors and with no repair facility.
The results indicate that the FC has the maximum
reliability as expected. The reliability of the HC is very
close to that of the FC even though its number of links is
only 32 compared to 120 in the FC. This suggests that
with a typical link failure rate of 20 in $10^6$ hours four
alternate paths from each node are sufficient to provide
reliability close to that of the complete connection.


The (N-1) links from each node in a FC are mostly
required for better communication capability and do
not increase the reliability linearly. The mesh
connection is in the middle range of the reliability bounds.
The tree and ring connections have similar poor
reliability property. Table I compares the reliability of the
five structures for I = 8, I = 12 and I = 16. Naturally,
as we decrease the value of I (allow more graceful
degradation to the system) the reliability of the
multicomputers increases.


Fig. 4 [Panels: COMPLETELY CONNECTED (FC), HYPER-CUBE (HC),
MESH, BINARY TREE]


Fig. 5 Reliability variation with time for a
task requiring 12 processors and without
repair. $\lambda_n$ = 0.0001, $\lambda_l$ = 0.00002, coverage = 1.0.

TABLE I
Reliability variation with time for a task
requiring I processors and without repair.
$\lambda_n$ = 0.0001, $\lambda_l$ = 0.00002,
Coverage = 1.0.

[Table body illegible in the source; the one legible
column reads 1.00, 0.64, 0.40, 0.26, 0.16, 0.10, 0.06,
0.04, 0.02, 0.01, 0.00.]


Now, let the system level requirements be that the
system has to run for a 1000 hour mission time
(unmaintained) and must have $R_s(1000) > 0.9$. Fig. 6
shows the effect of node failure rate on the system
reliabilities at t = 1000 hours for I = 8. It is observed that
the FC has the minimum threshold node reliability,
$R_{nT}(1000) = 0.62$. The HC and mesh topologies are next
in order with individual $R_{nT}(1000)$ being 0.65 and
0.75. So based on the task requirement and operating
conditions the above three networks can be designed
with less reliable processors. $R_{nT}(1000) = 0.92$ for the
tree and ring structures. If we increase the task
requirement to 12 nodes it is obvious that the
threshold $R_{nT}$ for all the networks will also increase. For
I = 16 the system reliability curve for all the networks
will lie below the $R_n$ line and $R_{nT}(1000)$ must be
> 0.9 so that the required system reliability is
maintained. A practical implication of these observations is
that the FC, HC and mesh are considered more costly
compared to star and ring topologies. This is based on
the assumption that we use the same processors for all
types of networks. But if we can design a
FC/HC/mesh with less reliable processors and still can
assure the system reliability, then why should we
spend more by using the same reliable processors as we
need for the star or ring architecture?


Fig. 6 Effect of node failure rate on system
reliability without repair at t = 1000


Fig. 7 shows the variation of $CCA_s(t)$ with time
for the five topologies and with no repair. We use I = 2
to compute the absolute CCA. It is seen that the FC
behaves worst in this case. Even though the FC has
the maximum communication performance (d = 1),
its large number of links brings the performance/cost
ratio down. The tree and the ring connections lie in
the middle of the graph suggesting that they have a
fair performance and low cost. They are suitable for
applications where communication requirements are
not stringent. The HC connection gives the best
performance/cost behavior. This is because it has an
optimum number of links to provide reliability and
CCA close to those of the FC. 
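The link-count argument can be made concrete with the
standard closed forms for link count and diameter. The
sketch below is illustrative only: it assumes n is both a
power of two and a perfect square (for example 16), a
square mesh without wraparound links, and a tree with
n - 1 links whose diameter is left out because it depends
on the tree shape. The function name is hypothetical.

```python
import math


def topology_costs(n):
    """Return {name: (links, diameter)} for an n-node network.

    Assumes n is a power of two and a perfect square, e.g. 16.
    The tree diameter is None because it depends on the tree shape.
    """
    dim = n.bit_length() - 1   # hypercube dimension, log2(n)
    side = math.isqrt(n)       # side length of the square mesh
    return {
        "FC": (n * (n - 1) // 2, 1),
        "HC": (n * dim // 2, dim),
        "mesh": (2 * side * (side - 1), 2 * (side - 1)),
        "ring": (n, n // 2),
        "tree": (n - 1, None),
    }


if __name__ == "__main__":
    for name, (links, diameter) in topology_costs(16).items():
        print(f"{name:5s} links={links:4d} diameter={diameter}")
```

For 16 nodes the FC needs 120 links against 32 for the HC,
while the HC diameter is only 4.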
TABLE II
Task-based CCA_s(t) without repair, λ_n = 0.0001,
λ_l = 0.00002, Coverage = 1.0.
Table II shows the task-based CCA comparison of the
five topologies. It can be observed that the CCA_s(t) of
the FC shows the least degradation with time. This is
because of its very good fault-tolerance and
communication attributes. The mesh connection has the
second best CCA. Comparing the results of Fig. 7 with
those of Table II for I = 8, we can observe that the sum
Σ_i CCA_i(t) over all the subgraphs with a minimum of two
connected nodes gives less CCA degradation with time for
all the structures except for the FC. This implies that
the FC has a very low probability of having a subgraph
with fewer than 8 nodes in 3000 hours. But the CCA of
the ring and tree structures increases dramatically for
I = 2.


The task-based R_s(t) with repair is given in Fig. 8
for I = 12, μ_n^-1 = 10 hours and μ_l^-1 = 2 hours. The
effect of repair can be observed by comparing the results
with the tabulated values given in Table I. However, the
reliability of the tree topology does not increase as
much as that of the other structures because of its
asymmetrical nature. Repair has no effect on the
reliabilities of the five structures for I = 16 (results
are not shown in the figure).


The effect of varying the node repair interval on the
normalized CCA_s(1000) is plotted in Fig. 9 for I = 8.
Here we have assumed separate repair facilities for the
nodes and links. The results show that different
topologies can have different repair intervals to
maintain a specific system degradation.


5. CONCLUSIONS


Reliability and computation-communication availability
(CCA) evaluations of multicomputer networks, with and
without repair, are presented in this paper. Simulation
is adopted because of the analytical intractability of
the problem. The simulation model is quite general and
is applicable to all multicomputer graphs. Typical
results obtained in section 4 indicate that the hypercube
structure performs the best from the reliability and
availability standpoints. Its cost and performance were
earlier shown [6] to be a reasonable balance between loop
and completely connected structures.


The effects of node failure rate and repair rate on 
the system dependability are quite interesting. It is 
observed that in order to maintain a specified system 
reliability the node reliability of different networks 
differs. This is because of the difference in degree of a 
node of different graphs. Similarly the required fre- 
quency of repair varies for various networks to keep 
the CCA,(t) within a specified degradable range. 
These observations lead to another interesting prob- 
lem: Generally we express the cost of a network in 
terms of the number of links. This does not take into 
account the processor reliability at the design phase 
and the repair cost at the maintenance phase. But we 
have seen that the threshold node reliability and the 
node repair rate are lower for structures like FC/HC 
compared to those of ring/star networks for maintain- 
ing a static reliability or availability. Hence, a more 
appropriate cost model of a network is necessary tak- 
ing into account node reliability, link cost and repair 
cost, as is done for software life cycle models. This will 
allow us to compare the cost of different networks over 
a specified mission interval. 
Abstract

This paper describes the maintenance architecture and
LSI implementation of the SIGMA-1, an ultra-high-speed
data-flow computer for numerical computations in
scientific and technological applications. The multi-PE
version of the SIGMA-1 adopts the semi-custom LSI
implementation approach. LSI implementation is closely
related to testability and maintainability. In a
parallel processing system with a large number of
processors, the architecture for resource management,
debugging and maintenance is as important as the
architecture for program execution. The SIGMA-1 uses a
combined SIMD/MIMD maintenance architecture. Design
policy, implementation method, and improved
architecture, including maintenance architecture, are
discussed.


1 Introduction 


Data-flow architecture is identified as the most suitable 
architecture for parallel processing systems. The main 
reasons are: The hardware construction of a large-scale 
parallel processing system using data-flow architecture is 
much easier than von Neumann architecture; and data- 
flow computing exploits all the parallelism in a program at 
the architecture level. Much research has been conducted 
[1-6] and several prototype data-flow processors have al- 
ready been built [7-10]. 

The SIGMA-1 is an instruction-level data-flow com- 
puter developed at the Electrotechnical Laboratory. The 
main objective of the SIGMA-1 is to build an ultra-high 
speed computing environment, to execute numerical com- 
putations in scientific and technological applications such 
as Monte Carlo simulations and particle simulations that 
cannot be executed efficiently on a conventional vector 
processor. 

The prototype SIGMA-1 processor, with a processing
element (PE) and a structure element (SE), is now in full
operation [11]. The performance of this prototype is
approximately equal to the performance of a von Neumann
processor constructed using the same technology [12]. The
estimated performance of the SIGMA-1 with 200 processing
elements exceeds 100 MFLOPS.

This SIGMA-1 prototype has brought out several problems
in constructing a data-flow computer system with a large
number of processors. The prototype SIGMA-1 processor is
constructed with conventional TTL MSI technology. Size,
power consumption and hardware reliability will be
serious problems if the same implementation technique is
used in the construction of the final SIGMA-1 system.
Therefore, LSI implementation of the SIGMA-1 is necessary
to construct a system with two hundred processing
elements.

In a parallel processing system with a very large
number of processors, the architecture for resource
management, debugging and maintenance (maintenance
architecture) is as important as the program execution
architecture. LSI implementation of data-flow
architecture makes it more difficult to perform the
operations needed in resource management, testing,
maintenance, and debugging.


This paper describes the LSI implementation and
maintenance architecture of the SIGMA-1 and discusses the
problems in the construction of a data-flow computer with
a large number of processors. Section 2 outlines the
architecture of the SIGMA-1, the problems in constructing
a data-flow computing system with a very large number of
processors, and the way they are avoided in the SIGMA-1.
Section 3 discusses the LSI implementation approach for
constructing a data-flow processor, including the
interconnection network. Section 4 describes the
maintenance architecture of the SIGMA-1, which performs
resource management, maintenance, and debugging.


2 SIGMA-1 Architecture 


2.1 Architecture 


Figure 1 shows the global architecture of the SIGMA-1.
Four processing elements (PEs) and four structure
processing elements (SEs) connected by a local network
(LNET) form a group. A process or a subroutine is
assigned to one of these groups. This group acts as a
high-performance processor in the execution of the
assigned task. Therefore, interconnection within a group
must have high speed and high capacity; otherwise the
performance of sequential computing segments of programs
decreases. The SIGMA-1's local network, a 10 by 10
crossbar packet-switching network, eliminates this
problem.

The global network, a two-stage omega network that
connects up to 64 local networks, is used for
communication between processes and for accessing
structure processing elements. The SIGMA-1 adopts a new
load distribution method [14] implemented in the
interconnection network. Computer simulation [14] shows
that load distribution by this method is very close to
the ideal distribution.
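As an illustration of how a packet finds its way through
a two-stage omega network with 64 ports, the sketch below
uses textbook destination-tag routing. It assumes 8 by 8
switching elements (a radix of 8, inferred only from two
stages serving 64 ports) and does not model the load
distribution method of [14]. The function name and its
parameters are hypothetical.

```python
def omega_route(src, dst, radix=8, stages=2):
    """Destination-tag routing in an omega network of radix**stages ports.

    Returns the port address seen by the packet after each stage.
    Assumption: each stage is a perfect shuffle followed by
    radix x radix switches that select the output port given by
    one digit of the destination address, most significant first.
    """
    ports = radix ** stages
    if not (0 <= src < ports and 0 <= dst < ports):
        raise ValueError("port address out of range")
    addr, path = src, [src]
    for stage in range(stages):
        # Perfect shuffle: rotate the base-radix digits left by one.
        addr = (addr * radix) % ports + (addr * radix) // ports
        # Switch: replace the lowest digit with the destination digit.
        digit = (dst // radix ** (stages - 1 - stage)) % radix
        addr = addr - (addr % radix) + digit
        path.append(addr)
    return path


if __name__ == "__main__":
    print(omega_route(5, 42))
```

Whatever the source port, the packet reaches the
destination after exactly two switch traversals, so no
routing tables are needed in the switches.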

Figure 2 shows block diagrams of a PE and an SE. A 
PE executes all the instructions and data processing op- 
erations except structure operations, and an SE executes 
structure operations such as read, write, allocate and deal- 
locate. 

A PE consists of a FIFO buffer (B), an instruction fetch 
unit (F), a matching unit (M), an execution unit (E) in- 
cluding a floating point arithmetic unit, and a distribution 
unit (D). Five units in the PE are arranged to operate in a 
two-stage pipeline. The B unit stores 8K packets, where a 
packet consists of 89 bits. The M unit stores 64K packets 
and fires two-input operations. Single bank chained hash- 
ing is used for performing matching. The F unit reads the 
instruction memory according to the address information 
on the input packet and also reads immediate operands 
from the instruction memory. The E unit consists of an 
integer ALU, a floating point ALU, a structure address 
generator, an input data-type checker and a sequencer. 
The D unit consists of a PE address generator, a loop 
count controller, and an output buffer. 
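The role of the matching unit can be sketched as a small
chained hash table. This is a simplified software
analogue, not the SIGMA-1 hardware: the tag format (here
any hashable value, such as an instruction address paired
with an iteration identifier), the bucket count and the
class name are assumptions, and the 89-bit packet layout
is not modelled.

```python
class MatchingStore:
    """Waiting-matching store for two-input instructions.

    The first operand to arrive is inserted into a hash chain.
    When its partner arrives, the stored operand is deleted
    and the instruction fires with both values.
    """

    def __init__(self, buckets=1024):
        self.buckets = [[] for _ in range(buckets)]

    def _chain(self, tag):
        return self.buckets[hash(tag) % len(self.buckets)]

    def arrive(self, tag, port, value):
        """Return (left, right) if the instruction fires, else None.

        port is 0 for the left operand and 1 for the right operand.
        """
        chain = self._chain(tag)
        for i, (t, p, v) in enumerate(chain):
            if t == tag and p != port:
                del chain[i]  # the partner leaves the store
                return (v, value) if p == 0 else (value, v)
        chain.append((tag, port, value))  # wait for the partner
        return None


if __name__ == "__main__":
    store = MatchingStore()
    print(store.arrive(("add", 1), 0, 3.0))  # None: waits for partner
    print(store.arrive(("add", 1), 1, 4.0))  # (3.0, 4.0): fires
```

Each firing costs one insertion and one deletion in the
store in addition to the lookups, which is why the
matching memory dominates the memory traffic of a
two-input instruction.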

An SE consists of a FIFO buffer (SB) and a structure
memory unit (S), which includes a structure memory,
hierarchical flags, and waiting queue control circuits.
Detailed construction is described in [15].

Experience with the single-processor SIGMA-1 shows
several problems in constructing a data-flow computer
system with a large number of processors. Ways to avoid
the two main problems, synchronization and performance on
sequential computing segments of programs, are discussed
below.


2.2 Synchronization


Synchronization of concurrent processes and
synchronization between processing elements and memory
systems are included in the first problem. A data-flow
architecture solves the synchronization problems based on
data dependency in a parallel processing system, since
the arrival of input data triggers instruction execution.
However, in a data-flow architecture it is very difficult
to solve synchronization problems that are independent of
data dependency, since the execution of such instructions
is only partially decided by the data dependency of the
program. This kind of synchronization is necessary for
managing activities on processors, detecting program
termination, mutual exclusion, and the sequentialization
associated with I/O operations.

The SIGMA-1 solves this synchronization problem using
operations based on a static data-flow model and the
maintenance architecture discussed in section 4. To
realize such operations, the SIGMA-1 utilizes the
order-preserving property within processing elements and
networks, and instruction firing and execution without a
matching memory. Detailed synchronization based on static
operations is discussed in [13].


2.3 Performance on Sequential Computing Segments in Programs


Performance of the sequential computations in a program
is important in most matrix computations, such as linear
equations and eigenvalue problems. Figure 3 shows a
sequential computing segment of matrix computations.
Here, the results from operations on all matrix elements
(elimination step or rotation) are accumulated into a
scalar variable, which is then distributed to all
elements or all vectors; pivoting in Gaussian elimination
and the convergence test in iterative programs are
performed in this way. In a circular pipelined data-flow
architecture, all data manipulated by instructions
circulate through all processor pipeline stages, whereas
in a von Neumann processor the data circulate only in the
execution stages. This is shown in Figure 4. In general,
a circular pipelined data-flow architecture decreases the
performance of the sequential computing segments of a
program, even in matrix computations. To avoid this
problem, the SIGMA-1 uses a 2-stage short pipeline
control architecture and an advanced packet sending
mechanism. As a result of this improvement, single-input
instructions are executed every 2 clock cycles and
two-input instructions every 3 clock cycles in a purely
sequential program.
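The accumulate-then-distribute pattern can be seen in one
step of Gaussian elimination with partial pivoting. The
sketch below is ordinary sequential Python, written only
to mark where the scalar reduction and the independent
element updates occur; it is not SIGMA-1 code, and the
function name is hypothetical.

```python
def eliminate_column(a, k):
    """One elimination step on column k with partial pivoting (in place).

    Returns the pivot value, the scalar that every row update needs.
    """
    n = len(a)
    # Sequential segment: reduce over all candidate rows to one scalar,
    # the index of the row with the largest entry in column k.
    p = max(range(k, n), key=lambda i: abs(a[i][k]))
    a[k], a[p] = a[p], a[k]
    pivot = a[k][k]
    # Parallel segment: the scalar is distributed to every remaining
    # row; the row updates do not depend on each other.
    for i in range(k + 1, n):
        factor = a[i][k] / pivot
        for j in range(k, len(a[i])):
            a[i][j] -= factor * a[k][j]
    return pivot


if __name__ == "__main__":
    m = [[1.0, 2.0], [4.0, 2.0]]
    print(eliminate_column(m, 0), m)
```

No row update can start before the pivot is known, so the
time taken by the reduction bounds the speed of the whole
step however many processors are available.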


3 LSI implementation 


The prototype PE consists of eight printed circuit
boards with approximately 1900 TTL MSI and MOS memory
chips. The number of boards, size, power consumption and
reliability prohibit the use of prototype PEs in the
construction of the final SIGMA-1 system, which has 200
processors. Thus LSI implementation is necessary for
constructing the final system.

Data-flow architecture has the following characteristics:


1. A data-flow processor does not have registers or his- 
tory sensitive memories. 


2. Pipeline stages in a data-flow processor do not
interfere with each other except for the input and output
connections to the neighbouring stages. The independence
of each unit enables simple unit design.


3. Communication between processing elements is per- 
formed as packet communication. This simplifies in- 
terfacing between processing elements. 


4. Five memory accesses are necessary to execute a
two-input instruction: two read and two write (insert and
delete) accesses to a matching memory, and a read access
to an instruction memory. Figure 5 shows typical memory
accesses for an instruction. This is 2.5 times the
memory access count of a von Neumann processor, which
performs one read access for instruction fetch and one
read/write access for data.


5. Since the concurrency in a program is determined by 
data dependency, a large buffer that stores excessive 
parallelism during execution is necessary. 


Characteristics (1) to (3) support LSI implementation, 
while (4) and (5) imply that the data-flow processor re- 
quires wider connection between a processing element and 
memories. 

Several approaches to implementing data-flow
architecture are considered in the following paragraphs.
The first is a single-chip approach. There are
approximately 60,000 gates in a SIGMA-1 PE. Therefore, it
would be possible to design a single-chip PE with current
VLSI technology by either adding a sufficient number of
pins to the chip or installing buffer memories and
matching memories in the chip. However, these conditions
cannot be satisfied by the current VLSI technology.
Hence, memory interface signals must be multiplexed on a
limited number of pins on a chip. Since a data-flow
processor requires wider connection between a PE and
memories, data-flow architecture is much better than von
Neumann architecture if implemented in a VLSI chip.

The second approach is to construct a processing ele- 
ment using ready-made LSIs such as AMD2901 and a mi- 
croprocessor. Ready made LSIs for fixed-point and floating- 
point ALUs are most suitable for constructing an experi- 
mental processor. However, no ready-made LSIs are avail- 
able for the most important units, the matching unit, the 
instruction fetch unit, and the distribution unit in a data- 
flow processor since they are not common in von Neumann 
processors. Therefore, the performance of a data-flow pro- 
cessor is reduced if constructed by ready made LSIs. 

The single chip approaches are inappropriate for
constructing an experimental data-flow processor. We have
chosen the third approach, the semi-custom LSI approach,
for constructing the final SIGMA-1 processing elements.
In commercial computers, the semi-custom LSI approach has
been widely accepted. Design and manufacturing costs for
semi-custom design are comparatively high. In addition,
modifications to the manufactured LSI are impossible.
This is an obstacle for LSI implementation of an
experimental processor.

Table 1 shows the hardware of the prototype and the
final processing element. The final PE consists of
semi-custom LSIs, gate arrays and standard-cell LSIs,
CMOS memories, and a number of SSIs for drivers and
receivers. A PE has eight types of gate arrays and a
standard cell LSI, and contains 28 LSIs. About 42% of ICs
in a PE are in the matching unit (see Table 1). This
indicates that the matching unit is the most important
unit in the SIGMA-1.

[Fig. 3. Sequential Computing Segment of Matrix
Computation: operations on matrix elements,
A[i, j] = f(B[i, j], [i], s); calculation of scalar
variable, s = g(A[i, j]).]

[Fig. 4]

These LSIs and memories are assembled on a printed
circuit board. The total number of gates in a PE,
excluding memories, is approximately 81,000. This
includes gates for test and maintenance circuits, which
will be explained in the next section. The total memory
capacity in a PE is approximately 1 MB. The net number of
gates, estimated at 54,000, is almost the same as that of
a pipelined von Neumann processor employing the same
technology. Since the component count is 1570 and there
are six printed circuit boards per PE, both the number of
components and the number of printed circuit boards are
reduced to one fifth to one sixth of those in the
prototype hardware.


4 Maintenance Architecture 


4.1 Maintenance architecture for a data-flow processor


Computer system time for a user includes program
execution time, hardware and software initialization,
program loading and unloading, execution of an operating
system including resource management, program debugging,
and hardware testing and maintenance operations. It is
necessary to decrease all of this computer time to
construct a high-speed computing system. So far, resource
management, maintenance, and debugging operations on a
parallel processor system have been performed either
mutually by many element processors or by a single
service processor [16]. If these operations are performed
mutually by many processors, it is difficult to control
them when overflow or deadlock in hardware resources
occurs. If these operations are performed by a single
service processor, resource management, maintenance, and
debugging can be controlled even if overflow or deadlock
occurs. However, the time required for these operations
increases linearly with the number of element processors
in a system. Hence, the single service processor approach
would not be fruitful in a parallel processing system
with a very large number of processors. In order to
achieve a tolerable time for these operations, a
special-purpose parallel architecture for resource
management, maintenance, and debugging is necessary. This
architecture is the “maintenance architecture” of a
processor, while the “program architecture” executes the
program.

Maintenance architecture is more important in a
parallel processing environment with a very large number
of processors than in a uni-processing environment. The
performance of program architecture does not increase
linearly with the increasing number of processors in a
system. In general, the number of maintenance operations
increases slightly more than linearly as the number of
processors increases and interconnection hardware is
added to element processors. Consequently, the ratio of
program execution time to non-program-execution system
time becomes lower with the increasing number of
processors in a system.

When the program architecture is data-flow, a
non-data-flow maintenance architecture is essential
because a data-flow architecture alone cannot perform
certain kinds of testing and maintenance operations.
History-sensitive operations or computations utilizing
side effects are essentially necessary for resource
management, maintenance and debugging. For example,
neither the arrival of a data token at a certain node nor
data tokens sent to an incorrect position can be detected
by the data dependency relationship alone. Basic hardware
tests, such as the connectivity test between components,
cannot be performed on data-flow architecture. A von
Neumann processor can perform these tests since it can
handle history-sensitive or side-effect operations.
Therefore, either von Neumann architecture or another
non-data-flow architecture is required for maintenance
architecture in a data-flow processor. A correct program
is successfully executed, but the behaviour of incorrect
programs or faulty hardware cannot be analyzed by the
data-flow architecture itself, since the execution of a
data-flow processor is solely determined by data
availability. Hence, a processor solely based on
data-flow architecture cannot exist as a standalone
computer system, and it is necessary to combine it with a
non-data-flow maintenance architecture.


4.2 SIGMA-1 Maintenance Architecture 


The main objective of the SIGMA-1 maintenance
architecture is to perform resource management,
maintenance, and debugging in the 200-PE SIGMA-1 system
as fast as in a single von Neumann type processor. For
this purpose, the parallelism in the maintenance
architecture must be greater than that in the program
architecture, because the performance of the program
architecture does not increase linearly with the number
of processors, whereas resource management and
maintenance operations increase slightly faster than the
number of processors in the program architecture.
Resource management, maintenance and debugging require
global control and global data exchanges. Therefore,
these operations cannot be executed in independent
service processors attached to processor groups in the
system.

In order to fulfil these requirements, the SIGMA-1
maintenance architecture adopts a combined SIMD/MIMD
parallel architecture. Figure 7 is a block diagram of the
maintenance architecture in the SIGMA-1. The first and
second layers are special-purpose SIMD maintenance
processors and perform basic operations on hardware
components. The third and fourth layers operate as von
Neumann MIMD processors and perform high-level and
abstract resource management, maintenance, and debugging.

The first layer of the SIGMA-1 maintenance architec- 
ture consists of maintenance circuits, including conven- 
tional scan-in and scan-out circuits in PEs and SEs. This 
layer accesses all memories, registers and important con- 
trol signals inside and outside LSIs. All memory cells, 
registers, and important logic signals are accessed accord- 
ing to M (maintenance) commands sent through the M 
Bus from T units. All the maintenance items in the first 
layer are uniquely addressed in the system. Data transfer 
rate by the first layer is limited, since the data transfer 
width of the M Bus is limited to four bits and accesses to 
memories and registers are limited to one bit at a time. 
However, all the maintenance circuits in the first layer op- 
erate concurrently in SIMD fashion. About 33% of logic 
gates are used in the maintenance circuits. This layer also 
controls several hardware resources such as the number of 
packets in a FIFO buffer (B) in a PE, the load factor of 
a matching unit, execution count of a certain instruction, 
and deadlock status. This information is passed to higher 
layers as attention signals and used in the run time control 
program in the third and fourth layer processors. 

The second layer consists of maintenance units (T
units), installed one per PE, SE and network board. A T
unit is a microprogram-controlled special-purpose
processor that consists of one gate array (2600 gates)
and a microprogram ROM. The objective of this layer is to
combine bitwise operations in the first layer into
wordwise logical operations. The microprogram in the T
unit translates a logical maintenance command into a
sequence of M commands. These commands are sent to the
first layer and gather data from it. The second layer
watches for attention signals and controls the system
clock signal if necessary. The T unit receives commands
from, and exchanges maintenance data with, the
maintenance processor through the T Bus.
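The translation performed by the second layer can be
pictured with the following sketch, in which a word-wide
read is expanded into single-bit accesses and the
returned bits are reassembled. Everything specific here
is a hypothetical stand-in: the command name READ_BIT,
the use of consecutive bit addresses, the LSB-first
ordering and the scan_bit callback are assumptions made
for illustration, not the SIGMA-1 M command set.

```python
def expand_read(base_address, width):
    """Expand a word-wide read into one single-bit command per bit.

    READ_BIT and the consecutive addressing are hypothetical.
    """
    return [("READ_BIT", base_address + bit) for bit in range(width)]


def assemble(bits):
    """Combine bit values (least significant first) into one word."""
    word = 0
    for position, bit in enumerate(bits):
        word |= (bit & 1) << position
    return word


def read_word(scan_bit, base_address, width):
    """Word-wise read built from bit-wise first-layer accesses.

    scan_bit(address) stands in for the first-layer maintenance
    circuit that returns the bit stored at that address.
    """
    commands = expand_read(base_address, width)
    return assemble(scan_bit(address) for _, address in commands)


if __name__ == "__main__":
    # A fake first layer holding the byte 0xA5 at bit addresses 100..107.
    cells = {100 + i: (0xA5 >> i) & 1 for i in range(8)}
    print(hex(read_word(cells.__getitem__, 100, 8)))
```

Because every T unit runs the same expansion on its own
board, a single logical command from the third layer
turns into many bit accesses performed concurrently.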

The third layer consists of maintenance processors, one 
per group comprising four PEs, four SEs, and a local or 
global network. A maintenance processor is a commercial 
microprocessor (8086) with no secondary storage device. 
This layer checks data from T units, generates mainte- 
nance commands to T units, and extracts error or debug 
information in the data from T units. The fourth layer is 
a service processor. The service processor, VAX 11/750, is 
the host processor of the SIGMA-1. This layer is in charge 
of global human interface control. 

The final SIGMA-1 system consists of one service pro- 
cessor, 40 maintenance processors, 360 T units, and over 
10,000 maintenance circuits in LSIs. 

The maintenance architecture in the SIGMA-1 supports
fault-tolerant operation. A breakdown in a processing
element is reported to the service processor immediately.
The system clock signal to the faulty hardware is then
stopped, and the configuration registers in networks and
processing elements are modified accordingly. Thereafter
the system operates normally. In this case, maintenance
operations can be performed even if the system clock is
not applied to processing elements, and handshake signals
in the SIGMA-1 hardware are masked.


5 Conclusion 


Several problems in constructing a large-scale data-flow 
computing system with a very large number of processors 
have been discussed in this paper. Data-flow architec- 
ture can solve only some of the problems encountered in a 
parallel processing system. In the SIGMA-1, most of the 
other problems are solved by architectural improvements 
in processing element, network and maintenance architec- 
ture. 

Semi-custom LSI implementation of the SIGMA-1 indicates
that a high-performance data-flow processor with about
1 MB of memory can be installed on a printed circuit
board. To construct a parallel processing system with a
very large number of processors, the data-flow
architecture is more suitable than the von Neumann
architecture, since the connections to the network are
much easier in data-flow architecture.

From the hardware point of view, the merit of data-flow
architecture is its easy expandability to a large system.
Therefore, single-chip VLSI implementation is
indispensable. However, there are several problems to
solve in achieving a single-chip high-performance
data-flow parallel processor. Some of them are:


1. Design of an efficient, reduced and refined
instruction set to reduce the instruction set and the
number of instructions executed in a program,


2. Design of a more efficient matching unit for compact
installation of a PE, and


3. Efficient hardware resources management in a PE 
and an SE. 


The preliminary version of the SIGMA-1 PE and SE has
been in operation since the latter half of 1984. The
final version of the SIGMA-1 was designed and fabricated
at the end of 1985. A single group with four PEs, four
SEs and a local network will be in operation at the end
of 1986. The final system, consisting of 200 processing
elements with a total predicted performance of 100
MFLOPS, will be in operation by the end of 1987.

Table 1 Hardware of the SIGMA-1
+---------------+---------------+------------------------------------+
|               |   Prototype   |            LSI Version             |
+---------------+---------------+------------------------------------+
| Technology    | Advanced STTL | CMOS gate array                    |
|               |               | CMOS standard cell LSI             |
|               | CMOS SRAM     | CMOS SRAM, DRAM                    |
+---------------+-------+-------+-----+-----+----+---------+---------+
| Unit          |     N |     L |   N |   L |  M |       G |      NG |
+---------------+-------+-------+-----+-----+----+---------+---------+
| B Unit        |   178 |   166 |  23 |  11 |  6 |   14016 |    9252 |
| F Unit        |   237 |   180 |  80 |  23 |  6 |   13251 |    7903 |
| M Unit        |   385 |   274 | 144 |  28 |  7 |   16835 |   10233 |
| E Unit        |   267 |   256 |  45 |  57 |  2 |   13298 |   9726* |
| E Unit (FALU) |   406 |   406 |  12 |   9 |  1 |  7500** |  7500** |
| D Unit        |    97 |    97 |  39 |  39 |  6 |   13968 |    9630 |
| T Unit        |     0 |     0 |  11 |  10 |  1 |    2367 |       0 |
+---------------+-------+-------+-----+-----+----+---------+---------+
| PE Total      |  1570 |  1379 | 354 | 157 | 29 |   81235 |   54244 |
+---------------+-------+-------+-----+-----+----+---------+---------+
| SB Unit       |    78 |    30 |  23 |  11 |  6 |   14016 |    9252 |
| S Unit        |   242 |   176 | 319 | 212 |  1 |    8000 |    7000 |
| T Unit        |     0 |     0 |  11 |  11 |  1 |    2367 |       0 |
+---------------+-------+-------+-----+-----+----+---------+---------+
| SE Total      |   320 |   206 | 353 | 234 |  8 |   24383 |   16252 |
| Grand Total   |  1890 |  1585 | 707 | 391 | 37 |  105618 |   70496 |
+---------------+-------+-------+-----+-----+----+---------+---------+
| PCB Count     |             8 |                                  2 |
+---------------+---------------+-----------------+------------------+
|               |             N |               M |                G |
| Local Network |            32 |              16 |           128000 |
| Global Network|           512 |             256 |          2048000 |
+---------------+---------------+-----------------+------------------+

N   Number of IC chips
L   Number of Logic ICs (not including RAM and ROM)
M   Number of LSIs
G   Number of gates (translated to 2 input nand gate count)
NG  Number of gates excluding maintenance circuits, where drivers
    are not counted.

*   Excluding floating-point ALU
**  Floating-point ALU (hard-wired vs. AMD29325)
Abstract: The problem of array handling is a major issue
in the design of data-flow multiprocessor systems. We
demonstrate here two solutions (namely the I-structures
and the token relabeling approach) and apply them to two
well-known numerical algorithms. The Fast Fourier
Transform and the LU decomposition algorithm have been
chosen since they are representative of a large class of
algorithms and provide a good benchmark for the
evaluation of the performance of data-flow systems. We
base our research upon the MIT tagged data-flow
architecture. The results of a deterministic simulation
of Arvind’s data-flow machine are presented and show the
contrast between the two structure representation
methods.


I. INTRODUCTION 


Data-flow systems have been proposed as an alterna- 
tive to the cumbersome programming methods offered by con- 
ventional von Neumann computers. The traditional model of 
computation implies a central control (the Program Counter) 
as well as a global state (the memory system represented by 
the variables in the high-level programming language). On
the other hand, in a data-flow system, the executability of an 
instruction is decided by the availability of its operands, 
thereby implementing a distributed control very amenable to 
parallel implementation. For more details, the interested 
reader may refer to excellent surveys of data-flow principles 
by Treleaven (1982), Srini (1986) or to a special issue of IEEE 
Computer magazine (1982). 


While these principles describe the interaction of
scalar elements, careful attention must be given when dealing
with larger structures. In this paper, we study in more detail
the I-structures (Arvind and Thomas, 1980) and the token
relabeling approach (Gaudiot, 1985, and 1986). In section II, we
describe these two methods. The two algorithms used for
illustration are described in section III, where their data-flow
graph representations are also explained. The simulated
machine and the results from the simulation are shown in
section IV. Concluding remarks are made in section V.


II. ARRAY HANDLING


The basic data-flow principles as described above apply
mostly to scalar operations. When large data structures
are involved, a different mode of processing must be entered.
Indeed, the underlying rule of data-flow systems is the single
assignment principle. It implies that no variable can be
assigned a value more than once in the course of the program.
A scalar token (a value) cannot be modified, but a new token
can be created after the appropriate processing. Large data
structures can likewise be created after “modification”, but
the copying operation can impose a large overhead if only a
single element of a large array is to be modified. For this
reason, several schemes have been developed in the past. We
will now describe some of the most representative ones.


*This material is based upon work supported in part by the


2.1. Heaps 


This scheme was originally described by Dennis
(1974). It involves representing an array as a tree of pointers.
When a single array element must be modified, only a
sequence of pointer modification steps is involved. For an
n-element array, the cost of modifying a heap is proportional to
log n, which compares favorably with a cost proportional to
n when the original array is entirely copied. Several
disadvantages have been associated with the heaps
(Arvind and Gostelow, 1982). These include a sequentialization
of some array operations, a centralization of array
accesses, storage overhead, etc.
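
The logarithmic update cost can be illustrated with a small sketch. This is not Dennis's implementation; it only assumes that the array is held as a balanced binary tree of pointers (nested Python tuples here), that its length is a power of two, and that the stored values are not themselves tuples. The names `build`, `read` and `update` are hypothetical.

```python
# Hypothetical sketch: an array stored as a balanced binary tree of pointers.
# Leaves hold values; an update copies only the nodes on one root-to-leaf path.

def build(values):
    """Build a tree over `values` (length assumed to be a power of two)."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return (build(values[:mid]), build(values[mid:]))

def read(tree, index, size):
    """Follow pointers from the root down to element `index`."""
    while size > 1:
        size //= 2
        if index < size:
            tree = tree[0]
        else:
            tree, index = tree[1], index - size
    return tree

def update(tree, index, value, size):
    """Return (new_root, nodes_copied); the old tree is left untouched."""
    if size == 1:
        return value, 0
    half = size // 2
    left, right = tree
    if index < half:
        new_left, copied = update(left, index, value, half)
        return (new_left, right), copied + 1
    new_right, copied = update(right, index - half, value, half)
    return (left, new_right), copied + 1

old = build(list(range(8)))
new, copied = update(old, 5, 99, 8)
```

For the 8-element array above, `copied` is 3 (that is, log2 8), and the untouched half of the tree is shared between `old` and `new`, which is consistent with the single assignment principle.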


2.2. I-structures 


The I-structures were introduced by Arvind and Thomas
(1980) in order to permit pipelining between the producer
and the consumer of structures. In other words, an array
element should be readable before the entire array has been
produced. This scheme can be implemented by the addition of
"presence bits" which can be associated with every cell of
storage. When a request is made to a "full" cell, the data can
be forwarded to the requestor. When the cell is found to be
empty, a tag is left, indicating the forwarding address of the
original requestor. When the data is ready, it can be directly
forwarded to its intended destination.


This method obviously reduces the latency between 
the creation of a data element and its consumption and allows 
a more flexible scheduling of operations. However, it intro- 
duces additional overhead at execution time and also requires 
the design of a complex I-structure handling hardware 
mechanism. 
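
The presence-bit mechanism can be sketched as follows. This is a minimal software model, not the hardware controller of the tagged data-flow machine: the class name `IStructure`, the `delivered` list standing in for tokens sent back to requestors, and the method names (borrowed from the SELECT and APPEND actors mentioned later in this paper) are assumptions made for illustration.

```python
class IStructure:
    """Toy model of an I-structure: one presence bit per storage cell."""

    def __init__(self, size):
        self.present = [False] * size               # presence bits
        self.value = [None] * size                  # cell contents
        self.deferred = [[] for _ in range(size)]   # tags left by early readers
        self.delivered = []                         # (requestor, value) pairs sent out

    def select(self, index, requestor):
        """Read request: answer at once if the cell is full, else defer it."""
        if self.present[index]:
            self.delivered.append((requestor, self.value[index]))
        else:
            self.deferred[index].append(requestor)

    def append(self, index, value):
        """Write request: fill the cell once and serve all deferred readers."""
        if self.present[index]:
            raise ValueError("single assignment violated")
        self.present[index] = True
        self.value[index] = value
        for requestor in self.deferred[index]:
            self.delivered.append((requestor, value))
        self.deferred[index].clear()

cells = IStructure(4)
cells.select(2, "consumer-A")      # arrives before the data: deferred
cells.append(2, 3.5)               # producer writes: consumer-A is served
cells.select(2, "consumer-B")      # arrives after the data: served directly
```

The deferred-read list is what lets a consumer start before the producer has finished, at the price of the extra bookkeeping noted above.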


2.3. Token relabeling 


This scheme was described by Gaudiot (1985) for a
data-flow system which uses the U-interpretation principles.
When a producer and a consumer of an array can be readily
identified, the notion of structure can be entirely ignored at
the low level. Instead, the tag associated with each token
under the rules of the U-interpretation is used as
identification of the index of the array element of the
high-level language. In other words, it can be simply said that,
when an array A is created, its A(i) element is tagged with
"i" (note that the tag is in fact a more complex structure, but
this simplified description is sufficient in the context of this
paper). This approach can be applied very easily to some
operations such as the Fibonacci numbers (Fig. 1), where R1
and R2 are actors which respectively add 1 and 2 to the input
iteration tags in order to allow the result produced by the
"add" operator during iteration "i" to be re-used during
iterations "i+1" and "i+2".


Fig. 1. Token relabeling
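
The Fibonacci graph of Fig. 1 can be imitated in software. The sketch below is only an illustration under stated assumptions: tokens are modeled as (tag, value) pairs, a dictionary stands in for the matching store, and the seed tokens and the function name `fibonacci_by_relabeling` are hypothetical choices that do not come from the paper.

```python
from collections import defaultdict

def fibonacci_by_relabeling(last_iteration):
    """Run the 'add' actor with relabeling actors R1 (tag+1) and R2 (tag+2)."""
    waiting = defaultdict(list)        # matching store: tag -> operand values
    results = {0: 0, 1: 1}             # assumed initial values f(0), f(1)
    # Seed tokens: f(0) relabeled by R2, and f(1) relabeled by R1 and by R2.
    queue = [(2, 0), (2, 1), (3, 1)]
    while queue:
        tag, value = queue.pop(0)
        if tag > last_iteration:
            continue                   # tokens past the last iteration are dropped
        waiting[tag].append(value)
        if len(waiting[tag]) == 2:     # both operands present: 'add' fires
            total = sum(waiting.pop(tag))
            results[tag] = total
            queue.append((tag + 1, total))   # R1: reuse in iteration i+1
            queue.append((tag + 2, total))   # R2: reuse in iteration i+2
    return [results[i] for i in range(last_iteration + 1)]

sequence = fibonacci_by_relabeling(10)
```

No array is ever stored: each result exists only as two relabeled tokens until the iterations that need it have fired.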


More complex program structures are involved in order
to apply the scheme to scatter and gather operations.
These constructs, such as the gather situation illustrated in
Fig. 2, would not be easy to implement (Gaudiot, 1985):


      DO 1 I=1,100
    1 C(I) = B(I) + A(F(I))


Fig. 2. Gather


An inverse function F^{-1}, unknown at compile time,
would be needed to perform the relabeling of data-flow
tokens. It has been shown that such a calculation is not
necessary. Instead, we introduce in the calling program a sequence
generator which produces the F(i)'s, tagged by i (Fig. 3). A
relabeling actor (called γ) exchanges the tag and the data
value. Both the A elements and the i (tagged F(i)) are input
to a special actor δ. Without any inverse function, the element
A has been matched with the proper F(i); we have found
A(F(i)) still attached to its original F(i) iteration label.
The special δ actor is a relabeling actor which takes the A
input and relabels it with the content of its other input.
In other words, it outputs A(F(i)), with a label i.


Fig. 3. The δ actor


This approach has many advantages: better run-time 
memory management as no intermediary storage is needed, no 
need for array operations, smaller hardware and execution 
overhead. However, it requires the complete consumption of a 
data structure lest some tokens remain orphaned for the rest 
of the execution of the program. 
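
The gather construct C(i) = B(i) + A(F(i)) built from the γ and δ actors can be sketched as below. This is a simplified model, not the graph of Fig. 3: tokens are (tag, value) pairs, a dictionary plays the role of the matching store, and F is assumed to be a permutation of the indices so that every A token is consumed exactly once (the complete-consumption requirement just mentioned). The function names are hypothetical.

```python
def gamma(token):
    """Relabeling actor: exchange the tag and the data value."""
    tag, value = token
    return (value, tag)

def delta(data_token, label_token):
    """Relabel data_token with the value carried by label_token.

    Both inputs must carry the same tag, as they would after matching.
    """
    assert data_token[0] == label_token[0]
    return (label_token[1], data_token[1])

def gather_add(A, B, F):
    """Compute C(i) = B(i) + A(F(i)) without ever inverting F."""
    a_tokens = {m: (m, A[m]) for m in range(len(A))}   # A(m) tagged m
    c = {}
    for i in range(len(B)):
        index_token = gamma((i, F[i]))           # F(i) tagged i -> i tagged F(i)
        a_token = a_tokens.pop(index_token[0])   # match on tag F(i); token consumed
        tag, a_value = delta(a_token, index_token)   # A(F(i)) now tagged i
        c[tag] = B[tag] + a_value
    return [c[i] for i in range(len(B))]

C = gather_add([10, 20, 30, 40], [1, 2, 3, 4], [2, 0, 3, 1])
```

If F skipped some index, the corresponding A token would stay in `a_tokens` forever, which is the orphaned-token problem described above.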


III. TWO APPLICATIONS


In this section, we present several data-flow graphs for 
two parallel algorithms. The crucial consideration in the 
design of these graphs has been to successfully exploit all the 
parallelism inherent to the algorithms. Since we are interest- 
ed in contrasting the performance of the token relabeling 
scheme with that of the I-structure implementation, we con- 
struct two graphs for each algorithm. In each, we apply one 
of the two different structure handling schemes under study, 
namely the Token Relabeling approach and the I-structure 
method. The applications examined in this section have been 
run on a deterministic simulation program and the results will 
be shown in the next section. 


3.1. The FFT application 


3.1.1. The algorithm: Fig. 4 shows the eight-point
decimation-in-frequency FFT computation. In the general
case, the FFT computation contains only butterfly constructs
with a regular pattern of connections among these butterflies
and different multiplicative constants W. The butterfly
construct (Fig. 5) obeys the following equations (Oppenheim and
Schafer, 1975):

\[
\begin{aligned}
x(j+1,p) &= x(j,p) + x(j,q) \\
x(j+1,q) &= \bigl(x(j,p) - x(j,q)\bigr)\,W^{r}
\end{aligned}
\tag{1}
\]

where x(j,i) denotes the data coming into stage j, at point
i. There are N inputs, and n = log N stages (logarithms are
taken to base 2), with the following constraints:
0 ≤ i ≤ N-1 and 0 ≤ j ≤ log N - 1.
The relation between p and q in a particular stage for a
given size of the problem uniquely specifies the connections of
butterflies between two successive stages. This means that
the routing of data tokens is determined by p and q. In
addition to the data routing issue, the parameter r is another
factor which distinguishes the computations of butterflies
located in various places. At a particular stage j, the
parameters in the equations are derived as follows:


\[
\begin{aligned}
p &= i \wedge \overline{2^{\,n-j-1}} \\
q &= i \vee 2^{\,n-j-1} \\
r &= \bigl(i \mid 2^{\,n-j-1}\bigr)\cdot 2^{\,j}
\end{aligned}
\tag{2}
\]

for all 0 ≤ i ≤ N-1, where '∧' is a bitwise AND
function, '∨' is a bitwise OR function, '|' is a modulo
function, and $\overline{k}$ is the one's complement of k.


Assuming that $k = 2^{\,n-j-1}$, then the butterfly
becomes:

\[
\begin{aligned}
x(j+1,\, i \wedge \overline{k}) &= x(j,\, i \wedge \overline{k}) + x(j,\, i \vee k) \\
x(j+1,\, i \vee k) &= \bigl(x(j,\, i \wedge \overline{k}) - x(j,\, i \vee k)\bigr)\, e^{-j\pi (i \mid k)/k}
\end{aligned}
\tag{3}
\]

for all i such that 0 ≤ i ≤ N-1. (In the exponent, j stands
for √-1 rather than for the stage index, so that the factor
equals W^r with W = e^{-j2π/N}.)


Fig. 4. Eight-point decimation-in-frequency FFT
Fig. 5. Butterfly construct
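
The index relations can be checked numerically. The sketch below is an ordinary sequential program, not a data-flow graph: it assumes W = exp(-2πi/N), N a power of two, natural-order input, and a final bit-reversal of the output order; the names `butterfly_indices` and `dif_fft` are hypothetical.

```python
import cmath

def butterfly_indices(i, j, n):
    """Return (p, q, r) for point i at stage j of an N = 2**n point FFT."""
    k = 1 << (n - j - 1)
    p = i & ~k            # clear bit k
    q = i | k             # set bit k
    r = (i % k) << j      # twiddle exponent: (i mod k) * 2**j
    return p, q, r

def dif_fft(x):
    """Decimation-in-frequency FFT built only from the butterfly equations."""
    N = len(x)
    n = N.bit_length() - 1
    W = cmath.exp(-2j * cmath.pi / N)
    data = list(x)
    for j in range(n):
        nxt = [0j] * N
        for i in range(N):
            p, q, r = butterfly_indices(i, j, n)
            if i == p:    # evaluate each butterfly once, from its upper input
                nxt[p] = data[p] + data[q]
                nxt[q] = (data[p] - data[q]) * W ** r
        data = nxt
    # Results emerge in bit-reversed order; restore natural order.
    return [data[int(format(m, f"0{n}b")[::-1], 2)] for m in range(N)]

spectrum = dif_fft([1, 2, 3, 4, 0, -1, -2, 5])
```

For N = 8 the first stage pairs points at distance 4, the second at distance 2 and the third at distance 1, matching k = 4, 2, 1.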
3.1.2. The Token Relabeling implementation of FFT: 


As the access pattern of the FFT computation is rather
regular, the relabeling function is relatively simple. The
butterfly construct is implemented by a corresponding
data-flow graph which accepts input tokens from two terminals
associated with index numbers j and i in the iteration portion
of its tag.


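In order to express the FFT butterfly of equations (1)
and (2) as a data-flow graph, the equations are rewritten as:

\[
\begin{aligned}
x(j+1,\, i \wedge \overline{k}) &= x(j,\, i \wedge \overline{k}) + x\bigl(j,\, F_1(i \wedge \overline{k})\bigr) \\
x(j+1,\, i \vee k) &= \Bigl(x\bigl(j,\, F_2(i \vee k)\bigr) - x(j,\, i \vee k)\Bigr)\, e^{-j\pi (i \mid k)/k}
\end{aligned}
\tag{4}
\]

It can be shown that the functions $F_1$ and $F_2$ can be
expressed as follows:

\[
F_1(p) = p \oplus k, \qquad F_2(q) = q \oplus k
\]

where ⊕ is an exclusive OR.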
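
A short justification of the exclusive-OR form may help; it is a sketch that uses only the definitions of p, q and k given earlier, with $k = 2^{\,n-j-1}$ having a single bit set. Because $i \wedge \overline{k}$ has a 0 in that bit position, OR and XOR with k coincide there, and because $i \vee k$ has a 1 in that position, XOR with k simply clears it:

```latex
\[
\begin{aligned}
F_1(p) &= (i \wedge \overline{k}) \oplus k = (i \wedge \overline{k}) \vee k = i \vee k = q, \\
F_2(q) &= (i \vee k) \oplus k = (i \vee k) \wedge \overline{k} = i \wedge \overline{k} = p .
\end{aligned}
\]
```

For example, with N = 8 and j = 0 we have k = 4, so p = 1 (binary 001) is paired with q = 1 ⊕ 4 = 5 (binary 101), and 5 ⊕ 4 returns 1.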


The butterfly is redrawn as shown in Fig. 6 where two
index modification blocks $F_1^{-1}$ and $F_2^{-1}$ have been
inserted.


The relabeling function $F^{-1}$ is generally not available.
This is because it cannot be calculated at compile time.
Gaudiot (1985) demonstrated a method which entirely avoids the
calculation of $F^{-1}$. This was described in a previous section
of this paper. Here, two primitive δ actors are
defined for enabling easy operations on the iteration number
portion of the tag. The function of the first (tag detection)
δ actor is to extract the iteration number from the input
token, while that of the second (tag modification) δ actor is
to set the iteration field of the tag of one of the
input tokens with the data field of another input token
(Fig. 7). Note that the tag modification actor corresponds to
the δ actor as previously defined by Gaudiot (1985).


Fig. 6. Modified butterfly construct


Fig. 7. δ actors


In general, tag modification can be done by specially
designed actors. However, the set of tag modification actors
should be kept as small as possible so as not to increase
processor complexity. A general solution for the realization of
the relabeling function is shown in Fig. 8. In this diagram,
the following notation is used: data[i] indicates a data element
with an iteration field tag i. The tag detection δ actor
extracts the iteration number from the token x2[i], which
x1[F(i)] will be matched to. The resulting i[i] is processed by
function F. i[i] is then relabeled by F(i)[i] such that i, now
tagged F(i), is able to match with x1 and relabel x1. Finally
x1[i] is obtained. The graph containing one tag detection δ
actor, two tag modification δ actors and the function F
performs $F^{-1}$ for x2. The function
F is known from the algorithm and is implemented by the
usual arithmetic and logic actors. This general method does
not need other specially designed tag modification actors. In
the butterfly, $F_1^{-1}$ and $F_2^{-1}$ are realized as shown in Fig. 9.
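
The sequence of relabeling steps can be traced in a few lines. The sketch below is a software model under assumptions: tokens are (tag, value) pairs, a dictionary replaces the matching store, F is assumed to map distinct tags to distinct tags, and the names `detect`, `modify` and `relabel` are hypothetical stand-ins for the tag detection actor, the tag modification actor and the whole graph of Fig. 8.

```python
def detect(token):
    """Tag detection: copy the iteration tag into the data field, i.e. i[i]."""
    tag, _ = token
    return (tag, tag)

def modify(data_token, label_token):
    """Tag modification: keep data_token's value, take label_token's value as tag.

    Both inputs carry the same tag, as they would after matching.
    """
    assert data_token[0] == label_token[0]
    return (label_token[1], data_token[1])

def relabel(x1_tokens, x2_tokens, F):
    """Move every x1 token from tag F(i) to tag i using only F, never F inverse."""
    store = {tag: (tag, value) for tag, value in x1_tokens}
    relabeled = []
    for x2 in x2_tokens:
        i_i = detect(x2)                     # i[i]
        f_i = (i_i[0], F(i_i[1]))            # F(i)[i], computed by ordinary actors
        i_f = modify(i_i, f_i)               # i[F(i)]
        x1 = store.pop(i_f[0])               # matches x1[F(i)]
        relabeled.append(modify(x1, i_f))    # x1[i]
    return relabeled

# FFT case with k = 4: F(i) = i XOR k pairs points 0..3 with points 4..7.
lower = [(4, "a"), (5, "b"), (6, "c"), (7, "d")]
upper = [(0, "w"), (1, "x"), (2, "y"), (3, "z")]
paired = relabel(lower, upper, lambda i: i ^ 4)
```

After the call, each `lower` token carries the same tag as its `upper` partner, so the two can meet in the matching store of the butterfly.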


Fig. 10 is the block diagram of the FFT computation.
A stream of one-dimensional data tokens first passes through
the input adaptor, which prepares extra context levels and
proper iteration numbers for each datum to identify its
location in the FFT graph. The set of input data x(j,i)
is separated by the distributor into two groups which are
later forwarded to the two terminals of the butterfly. The
butterfly block processes paired tokens and generates one pair
of new tokens for the next stage. The loop control block then
checks whether the last stage of computation has been
reached. If not, the iteration number which stands for the
FFT stage is incremented by one and the token is looped
back. Otherwise, the token is routed to the output module.
A set of constants "m" is produced by the constant and
k-generator and is consumed by the loop control block in order
to check for the end of loop. As tokens are produced by the
loop, they are postprocessed by the output block. The order
is bit-reversed so that a correctly ordered set of data is
obtained. Both the token relabeling function F and the function
W in the butterfly are functions of k and of the indices of the
token.


Fig. 8. The relabeling function


Fig. 9. Realization of $F^{-1}$


3.1.3. Block diagram of FFT I-Structure graph: The
data-flow graph which corresponds to the "conventional"
I-structure organization is represented in block diagram form
in Fig. 11. Instead of entirely unraveling the FFT graph at
runtime, an array (using the I-structure representation) is
used to buffer the data between stages of the FFT
computation. There are four I-structure buffers in the computation
of an 8-point FFT since there are 3 stages. Note that since the
I-structure definition allows the consumption of an array
when it has been partially produced, pipelining is allowed
between the stages without waiting for the entire production
of the intermediary data. At least from the scheduling point of
view, the I-structure representation is similar to the Token
Relabeling method. However, as is apparent on the graph of
Fig. 11, several additional actors (namely SELECT and
APPEND) are required to handle the I-structures. These are not
necessary in the Token Relabeling method.


Fig. 11. Block diagram of FFT IS graph


3.2. The LU decomposition


The LU decomposition of a matrix is another problem
adopted for the evaluation of the behavior of the Token
Relabeling and I-structure techniques. There are several reasons
for the choice of this algorithm:


• It is a two-dimensional problem.


• The computation of LU-Decomposition (also referred to
simply as LU-D) requires two broadcast operations
which can be difficult to implement efficiently in a
data-flow environment.


We will only describe the outline of the algorithm to
explain the block diagrams of the Token Relabeling and
I-structure data-flow graphs. Note that the details of this
algorithm have been presented by Hwang and Cheng (1982). In
Fig. 12, it is shown how the matrix is partitioned into three
parts. Part I contains the components of the matrix in the first
row. Part II contains the components of the matrix in the
first column except the one in the first row. The rest of the
matrix belongs to part III. The overall outline of the
algorithm is presented below:


loop size_of_problem > 0
    output I and vertically broadcast I to II and III
    update II
    output II and horizontally broadcast II to III
    update III and set III as new problem
end loop
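
The outline can be written as a conventional sequential program to make the three parts concrete. The sketch below is not the data-flow graph of this paper: it assumes a square matrix with non-zero pivots (no pivoting is performed), a unit diagonal for L, and the specific update rules of Doolittle-style elimination; the function name `lu_by_partition` is hypothetical.

```python
def lu_by_partition(matrix):
    """LU decomposition following the I / II / III partition, without pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]      # working copy: the shrinking problem
    L = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for s in range(n):                  # loop while the remaining problem is non-empty
        # Part I: the first row of the current problem is output as a row of U
        # and is (conceptually) broadcast down to parts II and III.
        for c in range(s, n):
            U[s][c] = a[s][c]
        # Part II: the first column below the pivot is updated with the pivot
        # from part I and output as a column of L.
        for r in range(s + 1, n):
            L[r][s] = a[r][s] / U[s][s]
        # Part III: each remaining element combines its own value with one
        # element from part I and one from part II; the result is the new problem.
        for r in range(s + 1, n):
            for c in range(s + 1, n):
                a[r][c] -= L[r][s] * U[s][c]
    return L, U

L, U = lu_by_partition([[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]])
```

Each element of part III needs exactly three operands (its own value, one from the row broadcast and one from the column broadcast), which mirrors the two broadcast operations listed above.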


Fig. 12. Matrix partitioning for LU-D algorithm


A new problem with smaller size is created and fed to 
the next iteration after the current iteration has been com- 
pleted. Each iteration generates part of the components of 
the U matrix and of the L matrix. 


3.2.1. LU-D Token Relabeling graph: Fig. 13 shows the
block diagram of the Token Relabeling data-flow graph for
LU decomposition. The distributor separates tokens into
three streams (I, II, and III). The tokens in stream I are sent
out as the partial result of the U matrix. In the meantime,
the first broadcast block broadcasts tokens in I to II and III.
Receiving tokens from the broadcast block, the second
distributor separates tokens into two new streams (II and III). In
the new II block, paired tokens in II are processed and a new
version of the II components is generated. These new II
components are sent out as the partial results of the L matrix,
and they are also broadcast to III. The new block III processes
three tokens which come from the two broadcast blocks and
the distributor and produces a new version of the III components.
This new version is later sent back as a new smaller problem
to be solved. No loop is needed since the size of "new III"
will eventually be zero and no more actors will be fired.


Since the token relabeling function is no longer limited
to the actors D and D^{-1}, any function is allowed
to modify the tag. As a result, any token in the graph can be
arbitrarily chosen and passed to any iteration at will, through
the use of the tag detection and tag modification primitives.
The broadcast of a token to other iterations of computation is
a significant example of the advantages that can be obtained
from this method. Systematic mapping of tokens onto the
proper iteration is required in order to apply this technique.
In this application, the distributor and broadcast blocks
indeed realize this mapping.


3.2.2. LU-D I-structure graph: As in the FFT
algorithm, the difference between the block diagrams of the Token
Relabeling and I-structure graphs is in the data accessing
modes. The I-structure graph (Fig. 14) uses array accessing
actors while the Token Relabeling graph properly labels
tokens and uses distributors to route tokens. Both of them
share the same major functional blocks and follow the same
algorithm. Since the I-structures provide synchronization
mechanisms among tokens, the graph is constructed from
several functional blocks without interconnections among
them. However, pointer generation is complicated by the fact
that sets of array names and indices must be generated for the
selections and appends performed in the various iterations.


Fig. 14. Block diagram of LU-D IS graph


IV. SIMULATION AND ANALYSIS 


A simulation approach was taken to compare the
performance of the Token Relabeling scheme (TR) and of the
I-structure scheme (IS). The two applications described in
section III were simulated under both array representations.


4.1. The architecture model 


The architecture model of the Arvind/MIT tagged
data-flow machine (Arvind, Kathail and Pingali, 1980) was
adopted for the machine model of the simulator. It simulates
64 Processing Elements (PEs) interconnected by a
packet-switching 6-dimensional hypercube network. Each PE can be
viewed as three units, complemented by an I-structure
controller:


• The Associative Memory Unit, in which incoming tokens
are associatively compared with previously arrived
tokens. Matched tokens are then sent, along with the
op. code, to the next unit.


• The Processing Unit (PU), which receives the ready
instruction packets and processes them according to the
op. code of the template.


• The Token Formatting Unit, which receives results from
the PU and forwards them to the associative memory unit
or to another PE through the message passing network.


• The I-structure controller, where array access operations
are handled. Such actors as SELECT or APPEND are
executed by this unit.
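
To make the interconnection concrete, the sketch below models the addressing of a 6-dimensional hypercube of 64 PEs. The paper does not state which routing algorithm the simulator uses; dimension-order routing is assumed here purely for illustration, and the function names are hypothetical.

```python
def neighbors(pe, dimensions=6):
    """PEs directly linked to `pe`: addresses differing in exactly one bit."""
    return [pe ^ (1 << d) for d in range(dimensions)]

def hops(src, dst):
    """Minimum number of links between two PEs (Hamming distance)."""
    return bin(src ^ dst).count("1")

def route(src, dst, dimensions=6):
    """Assumed dimension-order route: fix differing address bits low to high."""
    path, node = [src], src
    for d in range(dimensions):
        if (node ^ dst) & (1 << d):
            node ^= 1 << d
            path.append(node)
    return path

example_path = route(5, 6)
```

With 64 PEs no token travels more than 6 links, and each PE has exactly 6 neighbors.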


4.2. Simulation results 


As in the tagged data-flow architecture, the simulated 
graphs are executed under the U-interpretation principles. In 
addition to gathering statistics, this simulator actually com- 
putes the results from the simulated data-flow graph and pro- 
vides correct results. We now describe the results from the 
simulation. The measures of performance we concentrated on 
are listed below: 


4.2.1. The graph complexity: Simply counting the
number of actors in the two graphs would not give an
accurate measure of the complexity of the graphs. Instead, a
dynamic measure must be introduced by counting the number
of actors which are actually executed. A convenient method
to obtain this figure is to simulate the graphs on a single
processor. Simulation shows that for the FFT algorithm, the
design of the IS graph has higher complexity than that of the
TR, while for the LUD algorithm, the TR graph is better.


4.2.2. Speed up: Fig. 15 shows the speed-up of two 
graphs for an FFT computation. The TR graph exhibits a 
much better speed-up behavior than the IS graph does. All 
the components of the arrays can be distributed to various 
processors and processed truly independently. With the com- 
munication overhead of structure accessing, the IS graph 
turns out to have poor speed-up behavior. 


4.2.3. Execution time: Fig. 16 shows the execution
times of the FFT-TR and FFT-IS graphs in a multiprocessor
environment. For both graphs, tasks are distributed to
processors according to the same mapping function.


For a problem with a size smaller than twice the
number of processors in the system, the number of processors
used for solving the problem is half of the size of the problem.
Therefore, in Fig. 16, the curve should follow a logarithmic
function (log N) instead of an N·log N function in the ideal
case (where N is the problem size). The TR graph is close to this
prediction since its speed-up shown earlier is almost linear.
The IS graph appears to follow the N·log N function because
the rate of speed increase becomes slower as the number of
processors increases. This effect was shown in Fig. 15.
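
The ideal-case argument can be spelled out with a back-of-the-envelope calculation. It is only a sketch under assumptions that are not stated in the text: every butterfly takes one unit of time $t_b$, all communication and matching overhead is ignored, and P denotes the number of processors in use. An N-point FFT has $\log_2 N$ stages of $N/2$ butterflies each, so

```latex
\[
T_1 = \frac{N}{2}\,\log_2 N \; t_b ,
\qquad
T_P = \frac{T_1}{P} = \log_2 N \; t_b \quad \text{for } P = \frac{N}{2},
\qquad
\frac{T_1}{T_P} = \frac{N}{2} = P .
\]
```

Under these assumptions the execution time grows only as log N while the speed-up stays linear in the number of processors; any overhead that grows with P pulls the curve back toward the N·log N behavior.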
Fig. 16. Execution time


4.2.4. Number of active tokens: In Fig. 17, it is shown
that the peak number of active tokens in the system for the
IS graph is more than twice that for the TR graph in every
case. One reason for this is that the requests for structure
accessing in the IS computation are always issued before the
data are ready. Another reason is that sets of structure
names and indices are always generated before the data is
computed. Fig. 18 shows that the computation of the IS
graph has more than twice the average number of active
tokens in the system than the computation of the TR graph
has. The reasons can be found in the following:


1) structure names and indices are waiting for data arrival at
APPEND actors

2) the requests for data issued by SELECT actors are
waiting in the I-structure controller.


Fig. 17. Peak number of active tokens
Fig. 18. Average number of active tokens


4.2.5. Network load: Fig. 19 shows that the computa- 
tion of the IS graph causes much heavier load for the network 
than the computation of the TR graph does. The higher 
average number of active tokens in the system introduces 
more load on the network. Therefore, the figure shows a two 
to three times increase in network load for IS computation 
over that for the TR computation. 


4.3. Analysis 


In our performance analysis of the data-flow
multiprocessor, many parameters can be observed. We have kept
those most closely related to the evaluation of array handling
schemes.


Fig. 19. Network load


4.3.1. Speed: Because both the TR scheme and the IS
scheme implement graphs based on the same algorithm, the
time complexity of the two graphs is similar. The comparison
of their execution times clearly displays the difference in
time overhead respectively incurred by the TR and IS
schemes. It should be noted that:


1. The execution speed of the TR graph is noticeably
higher than that of the IS graph. This is because the TR
scheme implements the concept of direct data forwarding:
a token generated by an actor carries the information
about its destination actor and is transmitted
without going through any intermediate structure
storage. On the other hand, a token in the IS graph
format is sent to a designated structure storage and
then passed to a destination actor upon request. The
overhead is twofold: communication overhead and
I-structure handling.


2. The peak number of tokens in the system for the IS
graph is larger than that for the TR graph since the
request for a data element often exists long before the
data is finally computed. Another measure is the average
number of tokens in the system. Again, the average
number of tokens for the IS graph computation is higher
than that for the TR graph. The reason for the higher
average number of tokens can be found in the large amount
of structure accessing activities.


3. The speed-up curve for the TR graph is better than
that for the IS graph. The reason is that the I-structure
scheme requires requests to go through complex I-structure
controllers.


4.3.2. Network load: The IS implementation increases
the load on the network more than the TR does because of 1)
indirect data movement, and 2) structure names and indices.


4.3.3. Complexity of the architecture: For the TR
scheme, structured data is converted into scalar elements
which are stored in the matching store. This is already
supported by the basic data-flow principles. On the other hand,
the I-structure scheme requires its own controller.


4.3.4. Design and generation of data-flow graphs: A
TR graph may be more difficult to design or to generate
because the token relabeling function may not be simple and
may not be easy to find. The relationship between the high-level
description of an algorithm and the relabeling graph is not
always clear, which may make the transformation difficult.


However, an IS graph contains concepts close to
conventional sequential programming techniques. Therefore, the
mapping from high-level description to graph is simpler.
Since the I-structure provides synchronization among
dependent stages, loop control is often unnecessary. The body of
the graph is simple, but the index generator is complex.


V. CONCLUSIONS 


We have studied in this paper the implementation of 
two array representation methods. The underlying model of 
execution was chosen to be the U-interpreter as executed on 
the MIT tagged data-flow machine. The I-structure method 
(Arvind and Thomas, 1980) was compared to the token rela- 
beling method (Gaudiot, 1985, and 1986). 


Using a deterministic simulation approach, two numer- 
ical algorithms were simulated with the two array representa- 
tion methods. The results of the simulation above pointed to 
a net speed-up of the Token Relabeling over the I-structure in 
both cases. Indeed, the TR method presents the following ad- 
vantages over the I-structure implementation: 


• Simplification of the accessing pattern since the tokens
need not transit through a special structure controller.


• The data-flow principles of execution can be retained for
array handling. This improves schedulability.


• Automatic garbage collection.


However, while the token relabeling method can be ap- 
plied successfully in many cases, there are several drawbacks 
which often will render the I-structure more applicable: 


• The token relabeling assumes a complete consumption
of the input stream. Therefore, program constructs
which do not completely consume the input stream
could require an I-structure implementation.


• Due to the relabeling actors, the graph for the TR
implementation is more complex than that of the IS. This
implies that for multiple function token relabeling, the
overhead might overshadow the advantages.


ABSTRACT


Motivated primarily by their potential VLSI
implementation, systolic and wavefront arrays have recently
attracted significant research interest. A critical research
topic is to formalize and systematize the design of such
arrays directly from algorithm descriptions. Signal Flow
Graphs (SFGs) provide a popular description for recursive
parallel algorithms used in digital signal processing. In
this paper, a Data Flow Graph (DFG) is used as an
abstract model for wavefront array processors. We
first address the issue of transforming a SFG into a DFG
by an equivalence transformation. Then we discuss the
timing analysis for generalized (cyclic or acyclic) DFG
networks. Finally, we outline the algorithms which assign
the minimal number of queues required on all the edges of the
DFG and yet achieve the best possible throughput rate.
As a side-product of the timing analysis theorem, the
deadlock problem associated with the DFG is also resolved
naturally.


1. Introduction 


Systolic arrays¹ and wavefront arrays² have recently been
introduced as efficient VLSI parallel processors. Since then, numerous
researchers have attempted to formalize the design of these arrays.³ In
previous publications,⁴,⁵ we have proposed methods of mapping
algorithmic descriptions onto Signal Flow Graphs (SFGs). These SFGs
then serve as a notational tool for a particular space-time parallel
implementation. This notation facilitates the understanding of
recursive algorithms in terms of the computations required for each
recursion and provides a convenient tool for mapping algorithms onto
parallel processing arrays. A systolic realization of an algorithm
expressed as a SFG is easily derived via a cut-set based systolization
procedure. To derive a data-driven wavefront realization of a SFG,
we introduced a theorem demonstrating the functional equivalence of
each SFG to a particular Data Flow Graph (DFG). A DFG serves as
an abstract model of a wavefront architecture, so this equivalence
relationship provides a correct wavefront implementation for any
SFG. Note also that in our treatment, a DFG is a cyclic graph
instead of the usual acyclic assumption used by many other authors.


Optimization techniques for synchronous systolic realizations of
SFGs have been studied at length.⁵,⁷ Corresponding techniques for
wavefront realizations currently do not exist, due to the additional
complexity of analyzing asynchronous parallel behavior. In this paper
we address the general issue of transforming a SFG to a DFG which
is correct, deadlock-free, and, if desired, optimal in terms of
throughput and/or hardware. In the following sections, we first review
the SFG model of computation. We then introduce the DFG model
as an abstraction of a network of wavefront array processors, and
explore some of its properties. The timing analysis of an arbitrary
DFG is treated in great detail and constitutes the core of this paper.
In addition, several theorems are presented, and algorithms which
transform SFGs into optimal DFGs are discussed.


* This research was supported in part by the National Science Foundation under
Grant ECS-82-13358 and by the Innovative Science and Technology Office of the
Strategic Defense Initiative Organization and was administered through the
Office of Naval Research under Contract No. N00014-85-K-0469 and
N00014-85-K-0599.

† On leave from the Los Alamos National Laboratory, Los Alamos, New Mexico


1.1. Signal Flow Graphs


A SFG is a directed graph G = (V, E, D(e)), where V is the set
of nodes (vertices) and E is the set of edges. Nodes model the
computation and edges model the communication in a parallel algorithm.
Each edge e has a non-negative integer weight, denoted as D(e), which
represents the number of delays (D's) on the edge. Directed loops of
edges with zero delays are disallowed. It is assumed that the
computations in nodes and communication between nodes take zero
time. A recursion is defined as the computation inside a SFG for a
single set of input data. So, once a data set is input to a SFG in a
recursion, it is assumed to go through the nodes instantly, except
where blocked by delays (D's) on edges. If the data is blocked by a
delay, it will stay in the delay (D) until the next recursion. Therefore
the delays in the SFG have the function of separating two consecutive
recursions, and keeping the state of the system. To start, all data in
D's are assigned according to the initial conditions of the algorithm.
With these observations, it is not difficult to see that the SFG
actually displays the activities in one recursion of the algorithm. This
simplifies the understanding of the complicated space-time activities
associated with parallel processing.


For illustration, an example SFG for the multiplication of
two matrices, C = A x B, is shown in Figure 1(a). In each recursion,
a new column of matrix A and a new row of matrix B are sent into
the array. Each node multiplies the two data items arriving from above
and from the left, and adds the product to the partial sum, which is fed
back to the node itself through a delay edge. The operation of a node is
shown in Figure 1(b).

Figure 1 (a): SFG Array for Matrix Multiplication
(b): Node Operations
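
The recursion-by-recursion behavior of this array can be sketched in a few
lines of Python. This is only an illustrative sketch, not part of the original
design: the function name sfg_matmul and the list-of-lists matrix layout are
assumptions made here, and the accumulator c[i][j] stands in for the delay (D)
on the feedback edge of node (i, j). Each pass of the outer loop corresponds to
one recursion.

```python
def sfg_matmul(a, b):
    """Simulate the SFG array for C = A x B, one recursion at a time."""
    n, m, p = len(a), len(b), len(b[0])
    # One delay element per node, set by the initial conditions (all zero).
    c = [[0] * p for _ in range(n)]
    for k in range(m):                       # recursion k
        col_a = [a[i][k] for i in range(n)]  # new column of A enters from the left
        row_b = b[k]                         # new row of B enters from the top
        for i in range(n):
            for j in range(p):
                # Node (i, j): multiply the two inputs and add the product
                # to the partial sum held in its delay.
                c[i][j] += col_a[i] * row_b[j]
    return c


print(sfg_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Within one recursion the order in which the nodes are visited does not matter,
which mirrors the zero-time assumption for computation and communication.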

1.2. Systolic Arrays


The transformation of such a SFG description to a systolic
array can often be accomplished automatically, via e.g. a cut-set
retiming procedure.48 Systolic arrays are very amenable to VLSI
implementation and they have the important advantages of modular- 
ity, regularity, local interconnection, highly pipelined multiprocessing, 
and continuous flow of data between the processing elements (PEs). 
They are especially suitable to certain classes of computation-bound 
algorithms and have a good number of digital signal processing appli- 
cations. In fact, they are often derived as a direct mapping of a com- 
putational algorithm onto a processor array.° 


The disadvantages of systolic arrays lie in the fact that the data
movements in a systolic array are controlled by global timing-reference
"beats". From a hardware perspective, global synchronization
incurs problems of clock skew, fault-tolerance, and peak power.
The burden of having to synchronize the entire computing network
will be intolerable for ultra-large-scale arrays. From a software
perspective, in order to synchronize the activities in a systolic array, an
exact number of additional delays is required. A simple solution to
these problems is to take advantage of the control-flow locality, as
well as the data-flow locality, inherent in most DSP algorithms. This
permits a data-driven, self-timed approach to array processing.
Conceptually, this approach replaces the requirement of correct "timing"
by correct "sequencing". This concept can be realized in dataflow
computers and wavefront arrays.


1.3. Wavefront Arrays 


Interconnection and memory conflict problems remain very 
expensive in a general-purpose dataflow multiprocessor. Such prob- 
lems can be greatly alleviated if modularity and locality are incor- 
porated into dataflow multiprocessors. This motivates the concept of 
Wavefront Array Processors (WAPs).* Moreover, there are many one-
or two-dimensional digital filters and parallel matrix algorithms
which are generally expressed in SFG form. Therefore, it is desirable
to explore the relationship between the SFG forms and the
corresponding wavefront arrays.


Conventionally, the approach to deriving wavefront arrays is to 
trace the computational wavefronts and pipeline the fronts on the 
processor array. In this paper, however, we introduce a new approach 
based on converting a SFG array into a Data Flow Graph (DFG) 
array. The DFG is then converted into a wavefront array by properly 
imposing several key elements in data flow computing. To achieve 
this, we introduce a general methodology for the timing analysis of 
DFGs, and use it to optimize the conversions in terms of throughput 
and queue requirements. 


1.4. Data Flow Graphs 


Data Flow Graphs (DFGs) are used extensively in the area of 
computer architecture, perhaps most commonly in data flow 
research.? Our purpose is to use a DFG as a formal abstraction of a 
network of wavefront array PEs. To suit our needs, we use the fol- 
lowing definition of a DFG. 


A basic DFG is a directed graph G = (V, E, D(e), Q(e)), in
which nodes in V model computation, and directed edges in E model
asynchronous communication. Each edge e has a queue capacity,
represented by a positive integer weight Q(e). Each edge e is also
associated with a non-negative integer weight D(e), representing the
number of initial data tokens on the edge (initial state). The state of
a DFG is represented by the distribution of tokens on its edges. Each
edge may contain a non-negative number of tokens that is less than or
equal to its queue capacity. These tokens may be thought of as filling
a part of the edge queue, leaving the remainder of the queue
empty. Each empty queue slot is called a "space". Therefore, each
edge e is also associated with a non-negative integer weight S(e),
representing the number of initial spaces on the edge. Obviously, the
total edge queue capacity Q(e) is equal to D(e) + S(e), the sum of the
initial tokens and spaces on an edge. Hence, the state of a DFG
may be represented by the distribution of either tokens or spaces
on its edges.


A node is enabled when all input edges contain a positive
number of tokens and all output edges contain a positive number of
spaces. The state of a DFG is altered by the firing of enabled nodes.
The new state is determined by subtracting one token from (and adding
one space to) each input edge of the fired node, and adding one
token to (and subtracting one space from) each output edge.


DFGs have the following properties: 


(1) Persistence - Once a node is enabled, it remains enabled until it 
is fired. 


(2) Conservation - In any nondirected loop, the sum of the total 
number of tokens on edges in one direction and the total 
number of spaces on edges in the opposite direction is 
unchanged by state transition. A special case of this is the 
directed loop, in which the total number of tokens (and the total 
number of spaces) remains constant. 
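
The firing rule and the conservation property can be illustrated with a small
untimed model. The sketch below is a hedged illustration only: the class name
DFG, its method names, and the tuple layout (source, sink, initial_tokens,
queue_capacity) are assumptions introduced here, not notation from the paper.

```python
class DFG:
    """Untimed basic DFG: each edge carries tokens D(e) and spaces S(e) = Q(e) - D(e)."""

    def __init__(self, edges):
        # edges: list of (source, sink, initial_tokens, queue_capacity)
        self.src = [e[0] for e in edges]
        self.dst = [e[1] for e in edges]
        self.tokens = [e[2] for e in edges]
        self.spaces = [e[3] - e[2] for e in edges]

    def enabled(self, v):
        """A node is enabled when every input has a token and every output a space."""
        ins = [i for i, d in enumerate(self.dst) if d == v]
        outs = [i for i, s in enumerate(self.src) if s == v]
        return (all(self.tokens[i] > 0 for i in ins)
                and all(self.spaces[i] > 0 for i in outs))

    def fire(self, v):
        if not self.enabled(v):
            raise ValueError(f"node {v!r} is not enabled")
        for i, d in enumerate(self.dst):
            if d == v:            # input edge: one token leaves, one space appears
                self.tokens[i] -= 1
                self.spaces[i] += 1
        for i, s in enumerate(self.src):
            if s == v:            # output edge: one token arrives, one space is used
                self.tokens[i] += 1
                self.spaces[i] -= 1


# A directed loop a -> b -> c -> a with a single token and capacity 2 per edge.
g = DFG([("a", "b", 0, 2), ("b", "c", 0, 2), ("c", "a", 1, 2)])
for v in ["a", "b", "c", "a"]:
    g.fire(v)
print(sum(g.tokens), sum(g.spaces))  # 1 5
```

The totals printed at the end are the same as in the initial state, as the
conservation property requires for a directed loop.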


2. Equivalence Transformation Theorem 


2.1. The Equivalence Relation between SFG and DFG 


Theorem 1 : (Equivalence Transformation between SFGs and
DFGs) Barring deadlock situations, the computation of any SFG can
be equivalently executed by a self-timed, data-driven machine with a
topologically identical DFG. The number of initial tokens assigned on
each DFG edge is equal to the number of delays in the corresponding
SFG edge. ||


Proof: What needs to be verified is that the global timing in the
SFG can be (comfortably) replaced by the corresponding sequencing
of the data tokens in the DFG. Note that the transfer of the data
tokens is now "timed" by the processing node. This ensures that the
"relative time" between data tokens received at the node is the same
as it was in the SFG, as far as that individual node is concerned. By
induction, this can be extended to show the correctness of the
sequencing in the entire network. ||


Remark: In the above transformation, any queue capacities 
greater than or equal to the numbers of initial tokens on the edges are 
acceptable, from the point of view of functional correctness. The dis- 
cussion on throughput of any DFG equivalent of a SFG is presented 
in the next section. 
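
As a minimal sketch of the transformation in Theorem 1 and the Remark, the
function below maps each SFG edge to a DFG edge with the same topology. The
function name sfg_to_dfg, the tuple layouts, and the parameter extra_spaces
are hypothetical; the paper itself does not prescribe how many spaces to add
at this stage, only that the capacity must be at least the number of initial
tokens.

```python
def sfg_to_dfg(sfg_edges, extra_spaces=1):
    """Map SFG edges (source, sink, delays) to DFG edges
    (source, sink, initial_tokens, queue_capacity)."""
    dfg_edges = []
    for source, sink, delays in sfg_edges:
        tokens = delays  # Theorem 1: one initial token per delay on the SFG edge
        # Any capacity >= tokens is functionally acceptable; Q(e) must be positive.
        capacity = max(1, tokens + extra_spaces)
        dfg_edges.append((source, sink, tokens, capacity))
    return dfg_edges


# A forward edge without delay and a feedback edge with one delay.
print(sfg_to_dfg([("u", "v", 0), ("v", "v", 1)]))
```

The choice of extra_spaces does not affect functional correctness, but it can
affect deadlock and throughput, which is why it is left as a parameter here.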


For convenience, we shall term this transformation the
SFG/DFG Equivalence Transformation. The SFG/DFG
equivalence transformation helps establish a theoretical footing for the
wavefront array and provides insight into programming techniques.
The transformation implies that all regular SFGs can be
easily converted into wavefront arrays, making modularly designed
wavefront processing elements very attractive to use. Based on the
equivalence relationship, the correctness of the DFG is assured.


The D’s in the SFG locate the proper setting of the initial condi- 
tions in the corresponding wavefront array. We stress that the initial 
data token distribution plays a very important role in assuring the 
correct sequencing in a data-driven computing network. The initial 
state assignment is straightforward: for each delay in the SFG there is 
an initial data token (regarded as an initial value) assigned to the 
corresponding DFG edge. We also note that a queue of arbitrary length
may be inserted on any edge without affecting the validity of
the equivalence transformation. It may, however, affect the deadlock
situation, and may significantly influence the throughput rate, as
discussed in Sections 3-5.


2.2. Example: Linear Phase Filter Design 


To illustrate the role of the initial states and the correctness of 
data sequencing, as guaranteed by the equivalence relationship, let us 
discuss the SFG/DFG equivalence transformation via a linear phase 
filter example. Linear phase filters have two key features: they have 
a symmetrical impulse response function, i.e., h(n) = h(N-1-n), and 
they do not add phase distortion to the signal. Figure 2(a) shows a 
SFG which takes advantage of the symmetry property, and reduces 
the amount of multiplier hardware by one half. By the SFG/DFG 
equivalence transformation, the dataflow graph is derived as in Figure 
2(b). In order to ensure the correct sequencing of data, the W data 
should propagate twice as slowly as the Y data does. Note that the 
queues play the role of ensuring such a correct sequencing. 


Let us now explain the initial conditions shown in Figure 2(b).
Note that one initial zero-valued token is placed on each Y-data edge,
and two initial zero-valued tokens are placed on each W-data edge.
The first zero-valued token of the Y-data edge, when requested by the
Y-summing node, will be passed to meet the V-data token arriving
from the upper node. When the operation is done, the way is cleared
for sending the next Y-data token from the right-hand PE. The
situation is similar for the W-summing node, but only one zero-valued
token is "used" and the W-data is still one token away from meeting
an X-data in the summing node. It will have to wait until the Y-data
and the second "0" meet in the lower summing node. This explains
why the propagation of W is slower than Y. (This is just what is
needed to ensure a correct sequencing of data transfers.)


† These can be easily demonstrated by examining a simple Petri Net 10 model of
a DFG.


Notation for the DFG: 
An empty buffer is denoted as a bar on an edge, 
and a full buffer, a bar with a dot on it. 


Figure 2(a): Linear Phase Filter SFG 
(b): Linear Phase Filter DFG 


3. Throughput Considerations and Examples 


Once the correctness of the DFG is assured, the question of
pipelining speed (i.e. throughput rate) should be explored. To facilitate
this analysis, we shall incorporate the notion of time into our
DFG model:


Each node is assigned a deterministic computation time (positive
number) and the node fires after it has been enabled for its computation
time. Now a complete specification of a DFG consists of a 5-tuple,
G = (V, E, D(e), Q(e), T(v)), where each node v is assigned a
positive number T(v), representing the computation time of the node.
The persistence property of DFGs results in a deterministic behavior
of state transitions over time.
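
The timed model can be explored with a small event-driven simulation. The
following is a sketch under stated assumptions rather than the authors'
method: the function name simulate_dfg, its arguments, and the edge layout
(source, sink, initial_tokens, queue_capacity) are hypothetical, time starts
at 0, and a node with no input edges is treated as a source that is limited
only by the spaces on its output edges.

```python
import heapq


def simulate_dfg(times, edges, num_firings):
    """Return {node: [firing times]} for a timed DFG.

    times: {node: computation time T(v)}
    edges: list of (source, sink, initial_tokens, queue_capacity)
    """
    tokens = [e[2] for e in edges]
    spaces = [e[3] - e[2] for e in edges]
    ins = {v: [i for i, e in enumerate(edges) if e[1] == v] for v in times}
    outs = {v: [i for i, e in enumerate(edges) if e[0] == v] for v in times}
    fired = {v: [] for v in times}
    pending = set()   # nodes that are enabled and waiting out T(v)
    events = []       # heap of (firing_time, node)

    def schedule(now):
        for v in sorted(times):
            if (v not in pending
                    and all(tokens[i] > 0 for i in ins[v])
                    and all(spaces[i] > 0 for i in outs[v])):
                pending.add(v)
                heapq.heappush(events, (now + times[v], v))

    schedule(0)
    for _ in range(num_firings):
        if not events:            # nothing is enabled: the graph is deadlocked
            break
        now, v = heapq.heappop(events)
        pending.discard(v)
        for i in ins[v]:          # consume one token from each input edge
            tokens[i] -= 1
            spaces[i] += 1
        for i in outs[v]:         # produce one token on each output edge
            tokens[i] += 1
            spaces[i] -= 1
        fired[v].append(now)
        schedule(now)
    return fired


times = {1: 2, 2: 3}
print(simulate_dfg(times, [(1, 2, 0, 1)], 10)[2])  # [5, 10, 15, 20, 25]
print(simulate_dfg(times, [(1, 2, 0, 2)], 10)[2])  # [5, 8, 11, 14]
```

In this two-node example the sink fires every 5 time units when the edge has a
single queue slot, and every 3 time units when it has two. Because of the
persistence property, a node that has been scheduled never needs to be
unscheduled, which keeps the simulation simple.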


3.1. An Example Showing the Effect of Queues on Throughput


Figure 3(a) shows a SFG which has two directed paths departing
at one end and merging at the other. Two
corresponding DFGs are shown in Figures 3(b) and 3(c). The operation
times of the nodes are indicated by the numbers inside the nodes.
With only one space on the lower branch in Figure 3(b), the DFG can
accept input tokens at times 1, 6, 11, .... However, in Figure 3(c), an
equivalent DFG with two spaces on the lower path can accept input
tokens at times 1, 2, 6, 7, 11, 12, .... In Figure 3(d), we show a similar
DFG (not corresponding to the SFG in Figure 3(a)), in which there
is one space and one token on the lower path. This DFG can
accept input only at times 1, 6, 11, ....


It is clear from this example that the queue capacities and token 
distribution on the edges play an important role in determining the 
pipelining period of the DFG. 


‡ The conclusion that the correctness of the computation is independent of the
node computation times also follows from results derived by Karp and Miller.11


Notation for the DFG:
The number in a node is the operation time for that node.


Figure 3(a): A SFG (b): First DFG corresponding to (a) 
(c): Second DFG corresponding to (a) (d): Another DFG 


4. Timing Analysis for General DFG Networks 


The previous example has demonstrated that the pipelining
period depends heavily on the queue capacities of the DFG edges. In
this section, we shall provide a complete timing analysis for general
(i.e. acyclic and cyclic) DFG networks. Our objectives are
twofold: (1) Given a DFG, with defined initial tokens and spaces
on the edges, "is it deadlock free?" and "what is the average pipelining
period?" (2) Given a desired pipelining period, "how should
minimal queues be assigned on the edges of a DFG to achieve the maximal
speed?" Both questions may be answered by a unified approach
based on a duality concept of "token" and "space".


4.1. Duality of Token and Space 


Figure 4(a) shows a directed loop with many spaces, but only
one token. Since the number of tokens in a directed loop is
conserved, there will always be only one token in this loop. The token
can traverse the loop repeatedly by a series of firings of the nodes in
the loop, and each trip around the loop takes $T_L$ time, which is the
sum of all node operation times in the loop. The pipelining period is
$T_L$. However, if we put one more token in the loop, then by the time
one token traverses around the loop, the other one also finishes its
own trip around the loop. So, the period is only half, $T_L/2$.


Suppose we keep putting more tokens into the loop. We can
observe that the period decreases for a while, and then it increases.
Let us look at a contrasting case in Figure 4(b). Here, we have many
tokens, but only one space. Since the firing of a node requires at least
one space on all output edges, we see that only one node can fire at
any time. So, effectively, this space will traverse (in the reverse direction)
around the loop in $T_L$ time also. The pipelining period of this loop is
again $T_L$. By the same analogy, if we put one more space in the
loop, the period will become $T_L/2$. Recall that the number of spaces
in a directed loop is conserved.




Figure 4(a): A directed loop with only one token. 
(b): A directed loop with only one space 


From this example we see clearly that tokens and spaces play a
very similar role in firing nodes. They are both the "resources" to be
used by the nodes in order to fire. They display a duality relationship
which is quite analogous to the duality of "electrons" and "holes" in
semiconductors. More precisely, a space in one direction plays the
same role as a token in the reverse direction. There have been some
observations on the roles of tokens and spaces in a DFG,!% 18 but not
to the extent treated here.
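
The way the period of a single directed loop first falls and then rises as
tokens are added can be tabulated directly. The sketch below uses the loop
terms of Theorem 2 (stated in Section 4.3) for one directed loop and is an
illustration only: the function name loop_period and its arguments are
assumptions, node times are taken to be integers, and the per-edge term for
single-slot queues is ignored, i.e. each individual queue is assumed long
enough not to be the bottleneck.

```python
from fractions import Fraction


def loop_period(node_times, total_capacity, tokens):
    """Period of one directed loop holding `tokens` tokens in total."""
    spaces = total_capacity - tokens
    if tokens <= 0 or spaces <= 0:
        return None  # no token or no space can circulate: the loop is deadlocked
    t_loop = sum(node_times)
    return max(Fraction(max(node_times)),   # no node can fire faster than T(v)
               Fraction(t_loop, tokens),    # tokens circulating forwards
               Fraction(t_loop, spaces))    # spaces circulating backwards


# Four unit-time nodes, total queue capacity 32 around the loop.
for k in (1, 2, 4, 16, 30, 31):
    print(k, loop_period([1, 1, 1, 1], 32, k))
```

For these values the period is 4, 2, 1, 1, 2, 4: it is limited by tokens at
the low end, by the slowest node in the middle, and by spaces at the high end.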


4.2. Augmented Flow Graph (AFG) 


Based on this key observation, we can simplify our discussion by 
adopting an augmented graph of the original DFG, which is termed 
the Augmented Flow Graph (AFG). The augmented graph is con- 
structed by starting with the DFG and adding a reverse edge e’ for 
every existing edge e. We leave the number of tokens on e 
unchanged. To e’, we assign a number of tokens equal to the number 
of spaces on e. This conversion is shown in Figure 5(a). In effect, we 
treat spaces as tokens in the AFG. 


Recall that, in a DFG, a node can fire only when there is at
least one token on all the input edges and one space on all output
edges. However, in an AFG, spaces are represented by dual tokens.
Therefore, the firing rule of a node in an AFG is also modified to: a
node is enabled when and only when there exists at least one token on
every input edge. (It then fires after it has been enabled for its
computation time.) It can then be shown that the operation of this AFG
is the same as the original DFG. In particular, the throughput
analysis will remain the same.‡
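
The construction of the AFG and its simplified firing rule can be sketched as
follows. The function names dfg_to_afg and afg_enabled and the tuple layouts
are assumptions made for illustration.

```python
def dfg_to_afg(dfg_edges):
    """Return AFG edges (source, sink, tokens) for DFG edges
    (source, sink, initial_tokens, queue_capacity)."""
    afg_edges = []
    for source, sink, tokens, capacity in dfg_edges:
        afg_edges.append((source, sink, tokens))             # e keeps its tokens
        afg_edges.append((sink, source, capacity - tokens))  # e' holds the spaces as dual tokens
    return afg_edges


def afg_enabled(node, afg_edges):
    """AFG firing rule: at least one token on every input edge."""
    return all(k > 0 for _, sink, k in afg_edges if sink == node)


afg = dfg_to_afg([("a", "b", 0, 1)])
print(afg)                    # [('a', 'b', 0), ('b', 'a', 1)]
print(afg_enabled("a", afg))  # True: the space on e appears as a token on e'
print(afg_enabled("b", afg))  # False: no data token on e yet
```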


The DFG properties can be restated in terms of the AFG: 


(1) Persistence - Once a node is enabled, it remains enabled until it
is fired.


(2) Conservation - The number of tokens in a directed loop remains
constant.


As an example of this conversion, we show in Figure 5(b) a 
DFG, which is a directed loop, and in 5(c), its AFG. Note that 
because of augmenting all edges in the original loop, we obtain two 
directed loops. The inner loop contains only tokens and the outer 
loop contains only spaces. We also show in Figure 5(d) a DFG, which 
is an undirected loop, and in (e), the AFG of (d). Notice that in this 
case the two directed loops in (e) contain both spaces and tokens. 
The tokens and spaces here play exactly the same role in the timing 
analysis, which will be explained in the following theorem. 


4.3. Pipelining Period 


Theorem 2 : Given a DFG with preassigned initial data
tokens and spaces, the pipelining period $\alpha$ of the DFG is:

$$\alpha = \mathrm{Max} \left\{ T_M,\ T_P,\ \frac{T_L}{D_{LC} + S_{LCC}},\ \frac{T_L}{D_{LCC} + S_{LC}} \right\} \qquad (1a)$$

where L is any undirected loop in the DFG, $T_L$ is the sum of all node
operation times in loop L, $D_{LC}$ is the total number of tokens on edges
in the clockwise direction in L, and $S_{LC}$ is the total number of spaces
on edges in the clockwise direction in L. $D_{LCC}$ and $S_{LCC}$ are
similarly defined for the counterclockwise direction in L. $T_M$ is the
maximum of the operation times of all nodes in the DFG. Lastly, $T_P$ is
the maximum of any sum of the operation times of a pair of nodes
which are connected by an edge with only one queue on it. ||


The above theorem has an equivalent but somewhat simpler 
statement in terms of the more convenient AFG notation. 


Theorem 2 (Rephrased Version): Given an AFG derived
from a DFG with preassigned initial data tokens and spaces, the
pipelining period $\alpha$ of the AFG (or DFG) is:

$$\alpha = \mathrm{Max} \left\{ T_M,\ \frac{T_{DL}}{K_{DL}} \right\} \qquad (1b)$$

where DL is any directed loop in the AFG, $T_{DL}$ is the sum of all node
operation times in loop DL, $K_{DL}$ is the total number of tokens
(including dual tokens) in DL, and $T_M$ is the maximum of the operation times
of all nodes in the AFG. ||


‡ With this definition, AFGs form a subclass of marked Petri Net Graphs,
augmented with node times.

† Similar results have been reported by Reiter!* in the scheduling context and by
Ramamoorthy and Ho! for Petri nets. Ramamoorthy and Ho’s analysis, dealing
with a class of Petri nets that they define as "decision-free, consistent, extended
timed Petri nets", is the closest to our own. However, their treatment neglects
the case where 


In addition, their proof, while formally presented, utilizes some non-rigorous
techniques in the manipulation of infinite terms.


Notation for the AFG:
Tokens in the AFG are denoted as dots on the edges.


Figure 5(a): Rule for converting a DFG to an AFG
(b): a directed loop DFG (c): its AFG
(d): an undirected loop DFG (e): its AFG


Note that if DL is a clockwise directed loop in the AFG
generated by the undirected DFG loop L, then $K_{DL} = D_{LC} + S_{LCC}$.
Likewise, for counterclockwise DL, $K_{DL} = S_{LC} + D_{LCC}$. Therefore,
the above two statements are the same in content and differ only in
notation. For convenience, our proof will be given only in terms of
the AFG notation. For this, let us first establish several useful
lemmas.


Lemma 1: The pipelining period of an AFG (or DFG) must be 
greater than or equal to every node operation time. || 


Proof: This is obvious, since no internal pipelining is assumed 
inside a node. || 


Lemma 2: Consider a directed loop DL in the AFG. The
pipelining period $\alpha_{DL}$ of this loop, when it is operating independently of
the remaining part of the AFG, is:

$$\alpha_{DL} = \frac{T_{DL}}{K_{DL}} \qquad (2)$$


Proof: To show this, first recall that, from the conservation
property of an AFG, the number of data tokens in a directed loop DL is
constant. If we mark one token which begins its trip around the loop
at one particular node, it is clear that there will be $K_{DL}$ tokens fired
at that particular node during the period when this marked token
completes its loop trip. This marked token will need $T_{DL}$ in time to
return to the starting node and this firing pattern will repeat at a
period $T_{DL}$, assuming that this loop operates independently of the
rest of the AFG. So, we obtain the average period for that loop DL,
which is $T_{DL}/K_{DL}$. Note also that all nodes in this loop must operate
at the same pipelining period. ||


Lemma 3: All nodes in the AFG operate at a uniform period
in the steady state. ||


Proof: While the nodes in the DFG are in general only weakly
connected, the nodes in the AFG are always strongly connected, i.e.
there exist directed paths in both directions between any pair of nodes
in the graph. Consequently, all nodes in the AFG have to operate at
a uniform period in the steady state. Were this not the case, some
edges of the AFG would eventually have an unbounded number of
tokens. Recall that there are only finitely many initial tokens in the AFG,
and all tokens are contained in some directed loops. By the conservation
of tokens in the directed loops, the number of tokens in the AFG
remains finite. ||


Proof of Theorem 2: The proof of Theorem 2 has two parts:
"necessity" and "sufficiency".


Necessity: To prove the necessity part, we need to show that
the DFG cannot operate with a pipelining period less than the $\alpha$ in
Eq. (1). Since the AFG is strongly connected, according to Lemma 3,
all the loops will operate at the same period. Obviously, the slowest
loop will prevail; therefore, the pipelining period for the total system
will be at least the maximum of all periods associated with the
individual directed loops. Thus, the necessity part is proved.


Sufficiency: Here we want to prove that the DFG can operate
at the period $\alpha$ in Eq. (1).


Consider two directed loops A and B. Assume that $\alpha_A = \alpha_B$,
and loops A and B are coupled at some nodes. Then the pipelining
period of both loops A and B together will be $\alpha_A$, except for some
"phase difference" between A and B when both loops start computing.
After some finite amount of time, both A and B will be synchronized
at the period of $\alpha_A$.


Now assume that $\alpha_A > \alpha_B$. We first modify the loop B into B’
by adding extra operation time into nodes in loop B that are not also
in loop A, such that the new period for loop B’ becomes $\alpha_A$. This is
always possible, since the number of tokens in a directed loop is
constant, and we can increase the period by prolonging the total node
time around a loop. By the equal-period case above, the two loops A
and B’ can now operate at the period of $\alpha_A$.


Comparing loops B and B’, it is clear that any node time in B’
is greater than or equal to the corresponding node time in B.
It is not hard to envision that the loop B can always operate at the
speed of loop B’, since the nodes of B are as fast as or faster than their
B’ counterparts. Since loop B’ can operate at the period of $\alpha_A$, it is
straightforward to see that loop B can operate at this speed also. So,
the resulting loops A and B can operate together at the period of $\alpha_A$.


By induction, the above argument can be extended to any
number of directed loops in the AFG. Namely, if the slowest period
of all directed loops in the AFG is $\alpha$, all nodes in the AFG will
eventually operate at a period of $\alpha$, and the time needed for the
synchronization of all loops will be finite, since there are only
finitely many loops in the AFG. Thus, the sufficiency part is proved. Q.E.D. ||


A special case of this formula concerns the effect of putting only
one queue on an edge in a DFG. In Figure 6(a), we show a simple
DFG with two nodes, and only one queue on the edge between the
two nodes. In Figure 6(b), the corresponding AFG is shown. It is
clear that an edge between two nodes in a DFG induces a small loop
in its AFG counterpart. This results in a special effect for the case
when there is only one queue on the edge. According to Eq. (1b), the
pipelining period must be greater than or equal to T(1) + T(2). This
implies that these two nodes can only operate sequentially, i.e., only
one can operate at any instant. This is the reason that the term $T_P$
appears in Eq. (1a).


Figure 6(a): a DFG with one buffer on the edge 
(b): the equivalent AFG of (a) 


Another observation is that while there are many directed loops
in the AFG, only some of them are simple directed loops, i.e. loops
which visit nodes only once, except for the node where the loop
"begins" and "ends". In Figure 7, there are two simple directed loops,
denoted as $(n_1\ e_1\ n_2\ e_3\ n_1)$ and $(n_2\ e_2\ n_3\ e_4\ n_2)$. An example of a
non-simple directed loop is $(n_1\ e_1\ n_2\ e_2\ n_3\ e_4\ n_2\ e_3\ n_1)$. Fortunately,
we need to consider only the simple directed loops for the pipelining
period computation. This claim is proved in the following lemma:


Figure 7: AFG Example


Lemma 4: The pipelining period of a non-simple directed loop
DL in an AFG is less than or equal to the maximum of the periods of
all simple directed loops contained in DL. ||


Proof: Without loss of generality, we can use Figure 7 as the
AFG. We want to show that

$$\frac{T_1 + T_2}{K_1 + K_2} \le \mathrm{Max} \left\{ \frac{T_1}{K_1},\ \frac{T_2}{K_2} \right\}$$

where

$$T_1 = T(n_1) + T(n_2), \qquad K_1 = K(e_1) + K(e_3),$$
$$T_2 = T(n_2) + T(n_3), \qquad K_2 = K(e_2) + K(e_4).$$

Assuming that $T_1/K_1 \ge T_2/K_2$, i.e. $T_1 K_2 \ge T_2 K_1$, it is easy to
see that

$$\frac{T_1}{K_1} \ge \frac{T_1 + T_2}{K_1 + K_2},$$

since $T_1 (K_1 + K_2) = T_1 K_1 + T_1 K_2 \ge T_1 K_1 + T_2 K_1 = K_1 (T_1 + T_2)$.
Other cases can be proved similarly. ||

It can be deduced from this lemma that we can consider only 
the simple directed loops for the pipelining period computation. So, 
in Eq. (1b), we can assume that DL is a simple directed loop. 
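
For small graphs, Eq. (1b) can be evaluated by enumerating the simple directed
loops of the AFG. The sketch below is a brute-force illustration, not an
algorithm from the paper: the function name pipelining_period and the edge
layout (source, sink, tokens) are assumptions, node times are taken to be
integers, and the enumeration is exponential in the worst case.

```python
from fractions import Fraction


def pipelining_period(times, afg_edges):
    """Evaluate Eq. (1b): Max of T_M and T_DL / K_DL over simple directed loops.

    times: {node: T(v)}; afg_edges: list of (source, sink, tokens).
    Returns None if some directed loop holds no tokens (no finite period).
    """
    nodes = sorted(times)
    rank = {v: i for i, v in enumerate(nodes)}
    out = {v: [] for v in nodes}
    for source, sink, k in afg_edges:
        out[source].append((sink, k))

    best = Fraction(max(times.values()))      # the T_M term
    empty_loop = False

    def walk(start, v, visited, t_sum, k_sum):
        nonlocal best, empty_loop
        for w, k in out[v]:
            if w == start:                    # closed a simple directed loop
                if k_sum + k == 0:
                    empty_loop = True
                else:
                    best = max(best, Fraction(t_sum, k_sum + k))
            elif rank[w] > rank[start] and w not in visited:
                # Only extend through nodes ranked above the start node, so
                # each simple loop is found once, from its lowest-ranked node.
                walk(start, w, visited | {w}, t_sum + times[w], k_sum + k)

    for s in nodes:
        walk(s, s, {s}, times[s], 0)
    return None if empty_loop else best


# Two nodes with T(1) = 2, T(2) = 3 joined by one edge with a single queue slot.
print(pipelining_period({1: 2, 2: 3}, [(1, 2, 0), (2, 1, 1)]))  # 5
# The same edge with two queue slots.
print(pipelining_period({1: 2, 2: 3}, [(1, 2, 0), (2, 1, 2)]))  # 3
```

The first result equals T(1) + T(2), which is the one-queue effect described
with Figure 6; with two slots the slowest node sets the period.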


4.4. Deadlock Analysis 


The issue of deadlock can be regarded as a special application of 
the above Theorem. The result is stated in the following Corollary: 


Corollary: The DFG is deadlock free if and only if there exists
a finite solution $\alpha$ for Eq. (1). A finite solution for $\alpha$ exists if and only
if the AFG contains no empty directed loops. ||


Proof: It follows directly from Theorem 2. ||


An example of deadlock is shown in Figure 8.


Figure 8(a): A deadlocked DFG (b): Its corresponding AFG
Note that there are no tokens in the loop indicated by dashed edges.
According to Eq. (1b), there exists no finite solution for $\alpha$.
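
The condition in the Corollary can be checked without enumerating loops.
Since token counts are non-negative, a directed loop is empty exactly when
every edge on it is empty, so it suffices to test whether the subgraph of
token-free AFG edges contains a cycle. The sketch below does this with a
topological sort (Kahn's algorithm); this reformulation and the name
is_deadlock_free are introduced here for illustration and are not taken from
the paper.

```python
def is_deadlock_free(nodes, afg_edges):
    """True if no directed loop of the AFG is empty of tokens.

    nodes: iterable of node names; afg_edges: list of (source, sink, tokens).
    """
    nodes = list(nodes)
    # Keep only token-free edges; an empty loop must consist of such edges.
    succ = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for source, sink, k in afg_edges:
        if k == 0:
            succ[source].append(sink)
            indeg[sink] += 1
    # Kahn's algorithm: the token-free subgraph is acyclic if and only if
    # every node can be removed in topological order.
    ready = [v for v in nodes if indeg[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return removed == len(nodes)


print(is_deadlock_free([1, 2], [(1, 2, 0), (2, 1, 1)]))  # True
print(is_deadlock_free([1, 2], [(1, 2, 0), (2, 1, 0)]))  # False
```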


5. Optimization of Throughput 


Theorem 2 gives the pipelining period of a DFG when the initial
token assignments and number of queues on edges are given. The
dual problem of determining the assignment of minimum queue
capacities on the edges in a DFG to achieve the minimum (or optimal)
pipelining period $\alpha^*$ is treated in the following theorem:


Theorem 3 : Given a DFG with initial token assignment, (i)
the minimum (or optimal) pipelining period and (ii) the minimum
queue capacities on the edges to achieve that pipelining rate can be
determined as follows:


(i) The minimum pipelining period $\alpha^*$ is

$$\alpha^* = \mathrm{Max} \left\{ T_M,\ \frac{T_{DL}}{D_{DL}} \right\} \qquad (3)$$

where DL is a simple directed loop in the DFG, $T_{DL}$ is the total node
time in DL, and $D_{DL}$ is the total number of tokens in DL. $T_M$ is the
maximum of the operation times of all nodes in the DFG.


(ii) The assignment of the minimum queue capacities on the
edges of the DFG requires that the queue capacities must be the
minimum numbers satisfying the condition in Eq. (1a) in Theorem 2:

$$S_{LC} \ge \frac{T_L}{\alpha^*} - D_{LCC} \qquad (4a)$$

$$S_{LCC} \ge \frac{T_L}{\alpha^*} - D_{LC} \qquad (4b)$$

for any simple undirected loop L in the DFG, and

$$Q(e) \ge \frac{T(\mathrm{source}) + T(\mathrm{sink})}{\alpha^*} \qquad (4c)$$

$$Q(e) \ge D(e) \qquad (4d)$$

for any edge e in the DFG. ||


Note that $T_L$, $D_{LC}$, $D_{LCC}$, $S_{LC}$, and $S_{LCC}$ are defined
as in Theorem 2, with the period $\alpha$ there set to $\alpha^*$. Here Q(e) is
the queue capacity of an edge e, and T(source)
and T(sink) are the operation times of the source and sink nodes of
the edge e.


Proof: Note that the optimal pipelining period may be obtained
by putting $\infty$ spaces on all edges in the DFG. Theoretically, it is
clearly the best one can do in order to maximize the pipelining rate.
In this case, for all DFG undirected loops with both clockwise and
counterclockwise edges (all but the directed loops),
$S_{LC} = S_{LCC} = \infty$. As a result their individual pipelining periods
will be 0 (according to Eq. (2)), which can never adversely affect the
overall pipelining period. Thus, such undirected loops may be
excluded for the purpose of applying Eq. (1) for determining $\alpha^*$.
Therefore, only those directed loops in the DFG remain to be
considered. In this case, Eq. (1) becomes Eq. (3) and Step (i) is thus
verified. The validity of Step (ii) follows directly from Theorem 2 and
Eq. (1). (Eq. (4d) is needed to ensure spaces for the initial tokens.) ||


In fact, the formula Eq. (4) in step (ii) of this theorem is also
valid for any given desired pipelining period which is greater than or
equal to $\alpha^*$. In general, the solution to this assignment is not
unique. Among several possible approaches to deriving a feasible
solution, we propose in the next section a simple cut-set procedure * 8
to tackle this assignment problem.
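
The per-edge conditions (4c) and (4d) are easy to evaluate on their own. The
sketch below computes the smallest integer capacity they allow for a single
edge; it is an illustration under stated assumptions, since the loop
conditions (4a) and (4b) must still be checked separately, and the function
name edge_queue_lower_bound and its parameters are hypothetical.

```python
import math
from fractions import Fraction


def edge_queue_lower_bound(t_source, t_sink, initial_tokens, alpha):
    """Smallest integer Q(e) allowed by the per-edge conditions (4c) and (4d)."""
    # (4c): Q(e) >= (T(source) + T(sink)) / alpha, rounded up to an integer.
    by_rate = math.ceil(Fraction(t_source + t_sink) / Fraction(alpha))
    # (4d): Q(e) >= D(e), so that the initial tokens fit in the queue.
    return max(by_rate, initial_tokens)


print(edge_queue_lower_bound(2, 3, 0, 3))  # 2
print(edge_queue_lower_bound(2, 3, 0, 5))  # 1
print(edge_queue_lower_bound(2, 3, 3, 3))  # 3
```

For the two-node case with T(source) = 2 and T(sink) = 3, a period of 3
requires at least two queue slots, whereas a period of 5 can be met with one.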


6. Linear Queue Assignment Algorithms 


6.1. Comparison with Retiming for Synchronous Systems 


In this section, we will present a cut-set retiming scheme to
analyze the timing of an asynchronous system, i.e. a DFG. This
cut-set procedure basically extends the earlier timing analysis research on
synchronous circuits.® 7 In order to deal with generalized asynchronous
data flow arrays, the DFG modeling is required to abstract an
asynchronous circuit with initial conditions. Note that initial
conditions affect the optimal pipelining period $\alpha^*$.


Moreover, compared with the systolic array, the wavefront array 
represents an effective means to deal with multi-rate processing ele- 
ments. In systolic arrays, a common procedure is to adopt a uniform 
clock unit based on the slowest node, which is not a good scheme in 
terms of pipelining speed. It is also noted that the DFG modeling,
with queues accommodating initial tokens, has an important
advantage: it allows us to avoid the tedious (and sometimes impossible)
procedure of initial token reassignment as required in systolic design.


6.2. Cut-Set Procedure for Asynchronous Systems 


A direct approach to compute the optimal queue assignment
would involve tracing all the simple loops - a very time consuming
task. To avoid this, we propose a cut-set retiming procedure (to be
detailed in the Appendix), to assign queues on the edges in the DFG.
In short, a cut-set retiming procedure consists of two basic
transformations of a computing network which preserve the functional
correctness of the network. One transformation is time scaling, i.e.
slowing down the clock rate. The other is the transferring of time
delays along a cut-set of the computing network. The objective of
cut-set retiming is to transform the SFG, in which nodes have
zero delays, into a retimed DFG in which the operation time of each
node is assigned to its output edges. With these two basic operations,
the cut-set procedure determines the optimal pipelining period of the
DFG, $\alpha^*$, and also assigns to each edge e of the DFG an appropriate
time delay, denoted as t(e), needed to achieve this optimal period. This
procedure is stated in the following theorem:


Theorem 4 : After the cut-set procedure, the queue capacity
Q(e), needed for the edge e in a DFG for optimal pipelining period
$\alpha^*$, can be computed as:

$$Q(e) = \mathrm{Max} \left\{ D(e),\ \left\lceil \frac{t(e) + T(\mathrm{sink})}{\alpha^*} \right\rceil \right\} \qquad (5)$$

where the "ceiling function" $\lceil x \rceil$ denotes the smallest integer greater
than or equal to x. Also, t(e) and $\alpha^*$ are defined as above (and
derived in the Appendix). D(e) is the number of initial tokens on edge
e, and T(sink) is the operation time of the sink node corresponding to the
edge e. ||


Proof: We want to show that the queue capacities obtained from
Eq.(5) satisfy Eq.(4). According to the rule used in the cut-set
retiming procedure, t(e) must be greater than or equal to the
operation time of its source node. (Note that in our retimed graph the
node operation time has to be absorbed by its output edges.) Now,
consider any undirected loop L in the retimed DFG. For the purpose of
illustration, an undirected loop and its retimed DFG are shown in
Figure 9.




Figure 9(a): An undirected loop in the DFG. 
(b): The retimed DFG. 


Let us first assume that there are no tokens in this loop. We
want to assign queue capacities to the counter-clockwise (CC) edges
first. From the cut-set procedure, the total time assigned to the
CC edges is the same as the total time assigned to the clockwise (C)
edges. Note that the total time assigned to the C edges (or the CC
edges) is greater than or equal to the sum of the node times of nodes
A, B, and D, since they are the source nodes of the C edges.
Therefore, if we add the total time of the CC edges and the total
operation times of the sink nodes of the CC edges (in this case, nodes
C and E), as implied by Eq.(5), the sum is greater than or equal to
the loop time T_L. So the queue assignment of the CC edges satisfies
Eq.(4b), i.e.


$$S_{LCC} \ge \frac{T_L}{a^{*}} - D_{LC}$$


The queue assignment for the C edges satisfies Eq.(4a) by the same
argument, with the roles of the C edges and the CC edges exchanged.


Now let us consider the case when there are D_LC tokens on
some C edges in the undirected loop L. By the cut-set retiming
procedure, for each token on a C edge a time delay a* is assigned to
that edge initially. As a result, after the cut-set retiming, the
total time of the CC edges is greater than or equal to (total source
node times of the C edges) - a* D_LC. If we add to this total time the
times of all the sink nodes of the CC edges, the sum is greater than
or equal to T_L - a* D_LC. Therefore, Eq.(4b) is again satisfied. Of
course, the counter-clockwise case can be proved similarly. Moreover,
if the loop under consideration happens to be a directed loop, it can
simply be regarded as a special case, and no different treatment is
needed.


Next, we want to show that Eq. (4c) is also satisfied for all 
edges. This is true since t(e) is always greater than or equal to its 
source node time, T(source), and therefore t(e) + T(sink) is greater 
than or equal to T(source) + T(sink) of edge e. It is then clear that
Eq.(4c) is satisfied for all edges in the retimed DFG. Eq.(4d) is also 
satisfied because D(e) is included in the maximum operator in Eq.(5). 


Finally, we note that since the queue capacities have to be
integers, the quotient in Eq.(5) is rounded up by the ceiling
operation. The queue capacity Q(e) must also be greater than or equal
to D(e) in order to accommodate all the initial data tokens on edge e.
Therefore, the maximum of these two integers is taken for the
queue assignment. Q.E.D.


6.3. Linear Programming Formulation 


The cut-set queue assignment above is not minimal, in large
part because the retiming is not unique. In general, there are many
SFG retimings that will meet the constraint that each edge delay is
greater than or equal to its source node computation time. To
minimize the queues, we would like the "minimal retiming", i.e. the
retiming that minimizes the overall sum of the edge delays.


To specify a particular retiming, we can use notation similar to the
notation introduced for the synchronous retiming case [6]. This notation
is equivalent to describing a complete retiming in terms of cut-sets 
around each individual node. A retiming is specified by a mapping R 
from graph nodes to numbers. The number assigned to each node 
corresponds to the delay added to the outgoing edges and subtracted 
from the incoming edges. The delay transferred to an edge by a 
retiming is R(source) - R(sink). Using this notation, we can reformu- 
late the retiming as a linear programming problem as follows: 


The cost function is the sum of the edge delays after retiming.
With an initial edge delay of a* D(e), the retimed edge delay is
a* D(e) + R(source) - R(sink). Since a* D(e) is a constant, we
need to minimize:


$$\sum_{\mathrm{edges}} \bigl[ R(\mathrm{source}) - R(\mathrm{sink}) \bigr] \qquad (6a)$$


under the constraint, for all edges:


$$R(\mathrm{source}) - R(\mathrm{sink}) \ge T(\mathrm{source}) - a^{*} D(e) \qquad (6b)$$


Once the minimal retiming is derived, queues are assigned
by Eq.(5).
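The formulation can be exercised on a small case. The sketch below
does not solve the linear program; it only evaluates the cost (6a),
checks the constraint (6b), and computes the retimed edge delays for a
given mapping R, so that a candidate retiming (for example one returned
by an LP solver) can be verified. The three-node loop, its node times,
and all function names are hypothetical.

```python
from fractions import Fraction


def satisfies_6b(edges, R, T, a_star):
    """True if R(source) - R(sink) >= T(source) - a* D(e) on every edge."""
    return all(R[u] - R[v] >= T[u] - a_star * D for u, v, D in edges)


def cost_6a(edges, R):
    """Sum over all edges of R(source) - R(sink)."""
    return sum(R[u] - R[v] for u, v, _ in edges)


def retimed_delays(edges, R, a_star):
    """Edge delay after retiming: a* D(e) + R(source) - R(sink)."""
    return [a_star * D + R[u] - R[v] for u, v, D in edges]


# Hypothetical loop A -> B -> C -> A; each edge is (source, sink, D(e)).
T = {"A": 3, "B": 4, "C": 4}
edges = [("A", "B", 0), ("B", "C", 1), ("C", "A", 1)]
a_star = Fraction(11, 2)
R = {"A": Fraction(3), "B": Fraction(0), "C": Fraction(3, 2)}

print(satisfies_6b(edges, R, T, a_star))   # feasibility of this retiming
print(retimed_delays(edges, R, a_star))    # delay on each edge after retiming
print(cost_6a(edges, R))                   # value of the cost function
```

In this example every retimed edge delay equals the operation time of
its source node, so no delay is wasted. On a single loop the cost (6a)
always sums to zero; it only discriminates between retimings when the
graph has edges outside the loop.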


7. Integer Programming Algorithms 


Since queues must have integer lengths, the "linearized"
methods of the previous section do not necessarily yield the absolute 
minimal queue assignment. To eliminate this small quantization 
error, we can deal directly with the integer quantities, expressing the 
problem in an integer programming formulation. 


7.1. Simple Minimal Queue Assignment 


The assignment of the minimum queue capacities on the edges
of the DFG requires that the queue capacities be the minimum
numbers satisfying the conditions in Eq.(7a,b,c,d), i.e.


minimize $\sum_{e:\,\mathrm{edges}} Q(e)$, under the linear constraints:


$$S_{LC} \ge \frac{T_L}{a^{*}} - D_{LCC} \qquad (7a)$$

$$S_{LCC} \ge \frac{T_L}{a^{*}} - D_{LC} \qquad (7b)$$

for any simple undirected loop L in the DFG, and

$$S(e) \ge \frac{T(\mathrm{source}) + T(\mathrm{sink})}{a^{*}} - D(e) \qquad (7c)$$

$$Q(e) \ge D(e) \qquad (7d)$$

for any edge e in the DFG.


The validity of this formulation follows directly from Theorem 3.
Note that here the variables are all integers, thus giving an integer
linear programming formulation.

7.2. Minimal Queue Assignment with Initial Token Redistribution


In some instances we find that part of the queue capacity of an 
edge, needed to accommodate initial tokens, is never used in the rest 
of the computation after the initialization stage. In these cases a 
smaller edge queue capacity can support the computation after initial- 
ization. Use of this smaller capacity necessitates the redistribution of 
initial tokens so that all can fit into the reduced edge capacities. It 
turns out that this redistribution can be formulated along the same 
lines as we proposed for the integer programming formulation and can 
be integrated into a larger size integer programming problem to solve 
the real minimal queue assignment problem. Space limitations pre- 
clude us from including the details. 


8. Applications of the Timing Analysis 


8.1. Improving Processor Utilization 


It is clear from our timing analysis that any node in the DFG
can at best operate at the optimal period, i.e. once every a* time
units. If a* is large compared to the node computation times, the
utilization of nodes is quite low. There are two ways to improve this
situation. One way is to use the technique of processor sharing, i.e.
we use one real PE to simulate several nodes, so as to improve the
utilization of PE's. For example, if the average node utilization in a
DFG is only 1/3, i.e. all nodes compute for only one third of the time
on average, it may be possible to group 3 nodes into one real PE to
obtain full utilization of the PE's. It is not trivial to partition
the nodes of an arbitrary DFG and assign them to real PE's to achieve
maximal PE utilization.
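The arithmetic behind the 1/3 example can be sketched as follows. The
code takes node utilization to be T(node)/a*, and derives a lower bound
on the number of real PE's as the total work per period divided by a*.
This bound is an assumption added here for illustration; it ignores the
partitioning difficulty noted above, so it may not be attainable for a
given DFG. All names and numbers are hypothetical.

```python
import math
from fractions import Fraction


def utilizations(node_time, a_star):
    """Fraction of each period a* that every node spends computing."""
    return {n: Fraction(t) / Fraction(a_star) for n, t in node_time.items()}


def min_processors(node_time, a_star):
    """Lower bound on the number of PE's: total work per period over a*."""
    return math.ceil(sum(utilizations(node_time, a_star).values()))


# Three nodes, each busy one third of the time: one PE could in principle do.
print(min_processors({"n1": 1, "n2": 1, "n3": 1}, a_star=3))

# Node times 3, 4, 4 with a* = 5.5: total utilization is exactly 2.
print(min_processors({"A": 3, "B": 4, "C": 4}, a_star=Fraction(11, 2)))
```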


Another way to get around this problem of low node utilization
is to change a* by changing the node operation times in a DFG. Note
that the operation time of a node can be changed by using different
hardware realizations. It is then possible to reassign the node times
of a DFG to obtain a better a*, and thus better utilization.


8.2. A Lattice Filter Example 


The SFG of a Lattice Filter is shown in Figure 10(a), and one 
equivalent DFG of the SFG is shown in Figure 10(b). The initial data 
tokens, queue capacities, and node operation times are also displayed. 


We first apply the main theorem to determine the pipelining
period of the DFG in Figure 10(b). The augmented flow graph, AFG,
is shown in Figure 10(c), and the loop which has the maximum
period a is shown in Figure 10(d). From this loop, we obtain
a = 11.


Figure 10(a): a Lattice Filter SFG (b): one equivalent DFG of (a)
(c): the AFG of (b) (d): the slowest loop in the AFG


To assign minimum queues on the edges in order to achieve the
optimal pipelining period, we first derive the retimed DFG of the
Lattice Filter by the cut-set procedure. The retimed DFG of the
Lattice Filter is shown in Figure 11(a). The optimal pipelining period
a* obtained by the cut-set procedure is 5.5. Applying Eq.(5) to all
edges in the DFG gives the assignment of minimum queues shown in
Figure 11(b).




Figure 11(a): the retimed DFG 
(b): queue assignment for optimal period 


9. Conclusion 


We have presented a systematic approach to converting a parallel
algorithm described by a Signal Flow Graph to a functionally
equivalent Data Flow Graph, which sets the stage for its wavefront
array implementation. Based on the notion of token and space
(and their intriguing duality property), pipeline timing analysis and 
optimal queue design of generalized dataflow arrays are investigated. 
Of the four algorithms we proposed for the optimal queue assignment, 
the cut-set procedure is computationally easiest; however, it does not 
give an optimal or minimal assignment. The linear programming for- 
mulation gives closer approximation of the minimal solution at the 
expense of more computation. Finally, both integer programming for- 
mulations with or without initial token redistribution yield true 
minimal assignments at the cost of even more computation. It is 
expected that dataflow, wavefront, and fault-tolerant architectures 
will play a central role in the future VLSI or wafer-scale computing 
systems. It is our belief that, for these applications, the theorems and 
the timing analyses proposed in this paper will prove to be very use- 
ful. 


Appendix : Transforming a SFG to a Retimed DFG 


Let us for the time being assume that each node in the SFG 
actually takes some known amount of time (though each node may 
take different times). To analyze the queue assignment, we first intro- 
duce a retimed DFG, in which the operation time of the node is 
assigned to its output edges. This is illustrated in Figure 12. 


Figure 12: Assignment of node times to output 
edges in the retimed DFG. 


In order to synchronize the operations among nodes in the 
retimed DFG, edges will also be assigned extra time delays to serve 
the synchronization purpose. This assignment is based on the cut-set 
retiming procedure. 


A Cut-set Retiming Procedure 


We first define a cut-set for an SFG: a cut-set in an SFG is a
minimal set of edges which partitions the SFG into two parts. The
retiming procedure is based on two simple rules:


(i) Time-Rescaling: All delays D may be scaled, i.e., D → aD,
by a single positive number a. Correspondingly, the input and
output rates also have to be scaled down by a factor a. The
time-rescaling factor (or, equivalently, the slow-down factor) a
is determined by the slowest loop in the SFG array.

(ii) Delay-Transfer: Given any cut-set of the SFG, we can group 
the edges of the cut-set into in-bound edges and out-bound 
edges, depending upon the directions assigned to the edges. The 
Rule (ii) allows advancing k time-units on all the out-bound 
edges and delaying k time-units on the in-bound edges, and vice 
versa. It is clear that, for a (time-invariant) SFG, the general 
system behavior is not affected because the effects of lags and 
advances cancel each other in the overall timing. Note that the 
input-input and input-output timing relationships will also 
remain exactly the same only if they are located on the same 
side. Otherwise, they should be adjusted by a lag of +k time- 
units or an advance of -k time-units. 


The optimal a* for the retimed DFG is decided by the slowest
loop in the SFG. In other words, a* should be the minimum
number such that after time scaling, there are enough delays to be
distributed over all loops. Note that when there are no directed loops
in the SFG, the optimal time interval between two input data,
namely a*, is the maximum of all node operation times. This is
required because a pipelined system cannot operate faster than the
time needed for the slowest section in the pipeline.


Procedure for Transforming a SFG to a Retimed DFG Array


A retimed DFG array is derived from the original SFG by the 
following procedure: 


(1) Set initial a as the maximum of all node operation times. Scale 
all delays. 


(2) For each target edge (an edge that has less delay than required
by the node it is incident from), find a cut-set containing the
target edge such that, after suitable delay-transfer along the
cut-set, the target edge has exactly the amount of delay it needs
and, at the same time, no new target edges are generated by this
delay-transfer.

(3) If such a cut-set cannot be found, a directed loop L which
contains the target edge will be found. Rescale a, i.e.
a = T_L / D_L.

(4) Repeat the above steps until there are no more target edges.

The final a is the optimal pipelining period a*.

The detailed algorithm has been previously published [8].
Only the important ideas have been outlined here.
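The value that the procedure converges to can be cross-checked on small
graphs by brute force. The sketch below is not the cut-set procedure
itself: it enumerates every simple directed loop L, which is exactly
the expensive step the cut-set procedure is designed to avoid, and
returns the larger of the maximum node time and the maximum of
T_L / D_L. It assumes T_L is the sum of the node operation times
around the loop and D_L is the number of delays on its edges; the
function names and the example graph are hypothetical.

```python
from fractions import Fraction


def simple_cycles(edges):
    """Yield each simple directed loop once, as a list of edge indices."""
    out = {}
    for i, (u, _, _) in enumerate(edges):
        out.setdefault(u, []).append(i)
    nodes = sorted({u for u, _, _ in edges} | {v for _, v, _ in edges})

    def dfs(start, node, path, visited):
        for i in out.get(node, []):
            nxt = edges[i][1]
            if nxt == start:
                yield path + [i]
            # Only visit nodes ordered after `start`, so that every loop
            # is reported once, from its smallest node.
            elif nxt not in visited and nxt > start:
                yield from dfs(start, nxt, path + [i], visited | {nxt})

    for s in nodes:
        yield from dfs(s, s, [], {s})


def optimal_period(node_time, edges):
    """max(max node time, max over directed loops of T_L / D_L)."""
    a = Fraction(max(node_time.values()))
    for cycle in simple_cycles(edges):
        T_L = sum(node_time[edges[i][0]] for i in cycle)
        D_L = sum(edges[i][2] for i in cycle)
        if D_L == 0:
            raise ValueError("a directed loop without delays cannot be scheduled")
        a = max(a, Fraction(T_L) / D_L)
    return a


# Hypothetical loop A -> B -> C -> A; each edge is (source, sink, delays).
node_time = {"A": 3, "B": 4, "C": 4}
edges = [("A", "B", 0), ("B", "C", 1), ("C", "A", 1)]
print(optimal_period(node_time, edges))  # T_L = 11, D_L = 2
```

With no directed loops the function returns the maximum node operation
time, matching the remark above about loop-free SFGs.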
Abstract -- This paper presents preliminary work on the
analysis of dataflow algorithms. The focus is on 
developing techniques for modeling dataflow computa- 
tion to obtain both theoretical and architecture- 
dependent performance estimates. Operator nets pro- 
vide the basic model of a computation. The dataflow 
language used is Lucid. An example from the field of 
digital signal processing is used to demonstrate a 
theoretical data dependency analysis, analysis on an 
operator net and Lucid program, and performance fig- 
ures derivable in terms of dataflow architecture models. 


Introduction 


Dataflow is emerging as one of the important 
models of parallel processing, providing the potential to 
exploit the maximum concurrency inherent in an algo- 
rithm without requiring explicit statement of the paral- 
lelism. However, because the execution sequence is not 
expressed explicitly in the dataflow program, conven- 
tional algorithm analysis techniques for predicting the 
performance of an algorithm are not immediately appli- 
cable. In this paper, we use examples of analyses on a 
digital signal processing algorithm to demonstrate some 
analysis methods that provide both theoretical and 
architecture-dependent performance estimates. 


Algorithm, Language, and Architecture Models 


Our model of a computation will be an operator net 
[2], which is a directed graph whose nodes represent 
operations or functions to be performed on the data 
items (datons) which arrive at the node via the incom- 
ing edges. An operator net is built up from subgraphs 
which specify subcomputations. The dataflow language 
used is Lucid [7], which allows a direct linear 
representation of operator nets. Lucid is a functional, 
definitional language in which the values of variables 
are infinite time-varying sequences. The structure of 
Lucid programs resembles the mathematical statement 
of a function definition. The model of a Lucid program 
is a filter which processes a time-varying infinite input 
stream, with the output of the program being the out- 
put stream which results from application of the filter 
to the input. Lucid programs therefore have a "built-in"
time index. The model is in many ways well
matched with operations in digital signal processing. 


We will consider dynamic or tagged dataflow 
models [1, 7], in which each daton is associated with a 
tag, and the datons which are to be used together as the 
arguments of an operator or function are identified by 
matching their tags. The basic model of execution in
such architectures is a ring. Data-driven architectures
(e.g., the Manchester Dataflow Computer [3, 7]) are 
modeled by a single ring; demand-driven machines (e.g., 
the Eduction Engine [4]) are modeled by two intersect- 
ing rings, one for handling demands and one for apply- 
ing operators to the results of demands; the hybrid 
Eazyflow Architecture [5] is modeled by intersecting 
data, demand, and application rings. The Eduction 
Engine and Eazyflow Architecture have been designed 
to perform direct execution of operator nets, expressed 
in linear form as Lucid programs. 


Analyses 


Three types of analyses are demonstrated. A data 
dependency analysis based on the theoretical statement 
of the computation exposes the available parallelism. 
Analysis of the operator net and/or Lucid code provides 
a means of determining the parallelism that can be 
extracted from the program. From these analyses, 
requirements on the architecture can be derived. 


The digital filtering computation


$$y_n = \sum_{k=0}^{p} a_k x_{n-k} + \sum_{k=1}^{q} b_k y_{n-k}, \qquad n \ge 0 \qquad (1)$$


will be used to illustrate the analyses. We assume that
the data arrival rate is S (in signal processing terms, a
sampling rate of S), the sample period is δ, and that
the data output rate matches the data input rate. (A
faster data output rate is not possible; a slower rate
implies eventual infinite buffering of inputs.) Except
where specified otherwise, we will take the multiply-
accumulate, denoted mac, (sum = sum + w*v, where
sum, w, and v are scalars) as the basic operation
performed. In (1), x_n = y_n = 0 for n < 0.
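For reference, Eqn. (1) can be evaluated directly in a few lines of
sequential code. The sketch below is in Python rather than Lucid and is
intended only as a baseline for checking results; it assumes the
recursive sum enters with a plus sign, and the function name and
argument layout are illustrative. Each output sample costs at most
P + Q = (p+1) + q multiply-accumulates.

```python
def filter_direct(x, a, b):
    """Evaluate y_n = sum_{k=0..p} a_k x_{n-k} + sum_{k=1..q} b_k y_{n-k}.

    x -- input samples x_0, x_1, ...
    a -- coefficients [a_0, ..., a_p]
    b -- coefficients [b_1, ..., b_q]
    Samples with a negative index are taken to be zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for k, a_k in enumerate(a):            # k = 0 .. p
            if n - k >= 0:
                acc += a_k * x[n - k]          # one multiply-accumulate
        for k, b_k in enumerate(b, start=1):   # k = 1 .. q
            if n - k >= 0:
                acc += b_k * y[n - k]          # one multiply-accumulate
        y.append(acc)
    return y


# Unit impulse through y_n = x_n + 2 y_{n-1}: the output doubles each step.
print(filter_direct([1, 0, 0, 0], a=[1], b=[2]))
```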


Inherent Parallelism 


Conventional analysis of the data dependencies in 
the computation defines the maximum degree of paral- 
lelism attainable. Graphically, this can be represented 
by structures such as Petri nets. In many cases, this 
analysis can be performed directly on the mathematical 
statement of the problem. For example, in the digital 
filtering case, expansion of Eqn. (1) shows that each x_n
will be used in P = p+1 terms and that each y_n will be
used in Q = q terms. In any time interval δ, then, it will
be possible to initiate P+Q mac operations. Because
of the associativity of addition the mac's can be
performed in any order, so the only constraint on the
computation is that imposed by the availability of the x_n's
and y_n's. The maximum degree of parallelism (the
number of mac's that can be in progress simultaneously)
will therefore be α(P+Q), where α = t_mac/δ and
t_mac is the total time to perform one multiply-
accumulate. Depending on the architecture, t_mac may
represent only the actual time for the arithmetic
operation or it may, in fact, include the complete cycle of
instruction-fetch/decode/execute or token-match/
instruction-fetch/execute.
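The multiplier on (P+Q) can be recovered by a short counting argument
(a sketch added here, assuming steady state and that every mac takes
the same time t_mac): P+Q mac's are initiated in each interval δ, and
each remains in progress for t_mac, so the number simultaneously in
progress is

$$\frac{P+Q}{\delta}\, t_{mac} = \alpha\,(P+Q), \qquad \alpha = \frac{t_{mac}}{\delta}.$$

For instance, with P+Q = 7 and t_mac = 3δ, up to 21 mac's can be in
progress at once.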


Operator Net and Lucid Parallelism 


The operator net and Lucid program provide 
equivalent representations of the computation. In addi- 
tion to representing the data dependencies, these forms 
also incorporate temporal dependencies introduced by 
the language constructs used. In particular, the effect 
of the program strategy and data structures used can be 
derived from the operator-net/Lucid representation. 
Since Lucid programs implicitly operate on time-varying
sequences, the notion of "Lucid-time" is contained in
operators such as first (which accesses the first element
of the sequence), next (which yields the remainder of
the sequence), is current (which temporarily freezes
time with respect to a specified variable), asa (which
selects a value as soon as a condition is met), and fby
(which allows the construction of the time-varying
sequences), and by built-in variables such as index
(which is automatically incremented by one for each tick
of Lucid-time). The operator net reveals the temporal
relation between portions of the computation. Incor- 
poration of timing constraints introduced by the use of 
Lucid-time constructs allows analysis of individual code 
blocks. Combining these analyses with information 
about data input arrival rates allows derivation of the 
potential parallelism in the Lucid program. 


Fig. 1 shows two different Lucid code segments to
compute the steady-state output of Eqn. (1). Given the
notion of data streams (and lacking the notion of
stored, randomly accessible vectors) and the capabilities
of Lucid, it is easier to access "future" data items in the
input or output streams than to access "past" items, so
the computation has been reformulated as


Implementation Using Recursive Function Calls


// input = x; zpad_x = x with P-1 zeros prepended
// output = y; zpad_y = y with Q zeros prepended


y where y = xsum(c,zpad_x,P) + ysum(d,zpad_y,Q)
    where
      xsum(coeff,x,P) =
          f(coeff,x,first P) fby xsum(coeff,next x,first P);
      ysum(coeff,x,Q) =
          f(coeff,x,first Q) fby ysum(coeff,next x,first Q);
      f(w,v,N) = sum asa index eq N
          where sum = 0 fby sum + w * v; end;
    end;
end


Non-Recursive Implementation


// A from I is <A_I, A_{I+1}, A_{I+2}, ...>
y where
    y = g(c*X,P) + g(d*Y,Q)
        where
          I is current index;
          X = zpad_x from I;
          Y = zpad_y from I;
        end;
    g(s,N) = sum asa index eq N
        where sum = 0 fby sum + s; end;
end


Fig. 1. Portions of two digital filter algorithms.

where c_i = a_{p-i} for 0 ≤ i ≤ p and d_i = b_{q-i} for 0 ≤ i < q.
Zero-padding of x and y ensures correct alignment of
the two sums (see Fig. 1).


Fig. 2 shows the operator net for the xsum and
ysum functions from the recursive version, where f can
be specified by a subgraph representing the operator net
for the function f. The operator net reveals the
temporal relation between the two major portions of the
computation. In terms of the dataflow model, the
computation of f can begin immediately. After a delay of
one time unit δ (corresponding to advancing one time
unit in the x or y data stream) the xsum (or ysum)
function will be enabled, allowing another f
computation to begin. The operation of f is constrained by the
sequential character of the input and by the fby
construct to proceed linearly. (Thus, although the
associativity of addition allowed accumulating the terms in any
order in the theoretical analysis, the Lucid model
imposes a linear order on the computation.) In general,
then, t_f = (N-1)δ + t_mac. Except for the case of N
small and t_mac >> δ, t_f = λNδ for some λ. Combining
the analysis of f with the enabling pattern of the xsum
or ysum functions yields Fig. 3, where the length of the
rectangles is λNδ. Once steady state has been achieved,
taking a horizontal cross section across the graph shows
the number of instances of xsum (or ysum) that can
proceed simultaneously. Including the addition that
combines xsum and ysum gives an approximate steady
state potential parallelism of λ(P+Q)+1.


In the non-recursive implementation, the fact that
the filter moves across the signal is contained
completely in the implicit advancing of Lucid-time. At each
time step I, time is frozen and an output y value is
computed using X = <zpad_x_I, zpad_x_{I+1}, ...> and
Y = <zpad_y_I, zpad_y_{I+1}, ...>. As in the recursive
implementation, a new computation can begin upon
arrival of a new x value or completion of a new y
value, i.e., at each interval δ. In function g, the fby
forces linear operation, with execution time proportional
to P (for g(c*X,P)) or Q (for g(d*Y,Q)). The
analysis is similar to that for f above. Combining the
implicit advancing of the computations with the
analysis of g yields the same execution characteristics
as shown in Fig. 3 for the implementation using
recursive function calls. In the non-recursive case, the basic
operations are additions and multiplications rather than
multiply-accumulates. The asymptotic behavior will be
the same for the two cases; specific differences will
depend on factors such as how the recursion and how
the tagging requirements arising from the is current
construct are implemented.


Fig. 2. Operator net for function xsum.


Fig. 3. Time sequence of xsum computations,
shown for λN = 5, in time units of δ.


Architecture Analysis 


We consider analyses relating the dataflow archi- 
tecture to the Lucid algorithm. The characteristics of 
the algorithm are used to derive requirements on the 
architecture in order to achieve the desired performance. 
This entails combining the constraints imposed by the 
input arrival rate, desired output rate, and the compu- 
tational complexity characteristics obtained via the 
analysis of the Lucid program. The resulting model of 
the progression of the algorithm’s execution can be com- 
bined with the architecture model to derive the process- 
ing rates required at each component of the architec- 
ture. For simplicity, we use a data-driven model, shown 
in Fig. 4. The computation can proceed essentially as 
outlined above. Assuming equal data arrival and out- 
put rates of S samples per second, the minimum 
required processing rates shown in Table 1 can be 
derived. The "S(P+Q+1)" terms are due to the
computation steps; the "S" terms account for the input.
The fundamental operation is taken to be a 3-operand 
multiply-accumulate. If more basic arithmetic opera- 
tions and 2-operand matches are used, the entries due to 
computation approximately double. 


Table 1. Processing Rates for Dataflow Ring (sec^-1)


Component            Rate            Unit
Switch               +S              tokens
Token Queue          +S              tokens
Matching Store       S(P+Q+1)+S      matches
Instruction Store    S(P+Q+1)        accesses
Processors           S(P+Q+1)        operations

The operations count in the processors implies that
in time δ, P+Q+1 operations must be performed. The
parallelism analysis has shown that this may be done in
as many as P+Q+1 processors. Depending on δ and
t_mac, fewer processors may suffice. The more stringent
requirement is on the operation of the matching store.
If P, Q, and S are such that the matching store
cannot function at the required rate, multiple submachines,
such as those proposed for the Eazyflow Architecture [5]
and the next generation Manchester system [3], could
provide the needed capability.


[Block diagram. Labels: input, output, SWITCH, MATCHING STORE,
INSTRUCTION STORE, PROCESSORS.]

Fig. 4. Ring of a data-driven architecture [7].


Summary


This paper addresses the problem of analyzing
signal processing algorithms for dataflow architectures.
The dataflow language Lucid differs significantly from
traditional computer languages, just as the dataflow
model of computation differs significantly from the von
Neumann serial computer. Development of efficient
signal processing algorithms for Lucid/dataflow execution
therefore requires new algorithm approaches,
programming strategies, and analysis techniques. We present
some analysis tools which allow us to obtain design
parameters for dataflow architectures targeted for signal
processing applications.
Abstract -- First, a model of a static data flow
computer and a model of a data flow graph are
proposed; then a system model is presented for
calculating the practical parallelism degree, with the
overhead of instruction execution on data flow
computers as its parameter. From the computation, we
can conclude that the maximum practical parallelism
degree of a program running on a static data flow
computer is determined by MP/OH (MP is the mean
parallelism degree of the program, OH is the
overhead of instruction execution on the computer).
Therefore the overhead has great influence on the
performance of a data flow computer.


Introduction 


The main purpose of research on data flow
computers is to exploit parallelism in programs and
speed up their execution. Instructions on data
flow computers are driven by data, so data
flow computers have great potential for
exploiting parallelism. However, progress in the study
of data flow computers has not been great. In recent
years, some people have criticized the data flow
approach. They point out that one of its
problems is the large overhead, such as communication
overhead in packet switched networks, extra memory
accesses, etc. Other people do not
think the overhead is a problem. They consider a
data flow computer as an asynchronous pipelined
structure, and believe that if there are enough
instructions available, each part of the computer
can run efficiently even though the overhead is
great.

In this paper, an approach is presented for
studying the overhead. First, a model is given
for a class of static data flow computers; then
the relationship between the overhead of
instruction execution and the parallelism degree on such
a computer is established using the theory of
stochastic processes. The computed results show
that for a user program, the practical maximum
parallelism degree is limited only by the overhead
of instruction execution.


1. MODEL OF STATIC DATA FLOW COMPUTERS 
AND MODEL OF DATA FLOW GRAPH 


A typical structure of static data flow
computers is shown in fig. 1. This is in fact a macro
pipeline consisting of several units performing
different operations. In such a system, many
auxiliary operations are introduced, such as
fetching instructions, sending them to processors
through a switched network, constructing result
packets, sending packets back through the network,
storing packets into memory, and arbitrating
whether instructions can be fired, etc. The so-called
EFFECTIVE OPERATION is the operation which
directly does what the opcode of an instruction
indicates. For example, the effective operations
of arithmetic instructions are to perform
arithmetic operations. The OVERHEAD of an instruction
execution is all of its auxiliary operations.

In this paper, we suppose the effective operation
of each instruction takes one unit of time. The
overhead of instruction execution is measured in
units of time. In general, the overhead of
instruction execution depends on what the
instruction is and what environment the instruction is
executed in. Here, the overhead refers in practice
to the average value of the overheads of all
instructions. In the following, the overhead of
instruction execution is briefly called overhead,
denoted by OH.

The model of the pipelined structure for studying
overhead is shown in fig. 2. The N processors in
the model, which represent N processors in a
system, execute only the effective operations of
instructions, each in one unit of time. The n pipelines
correspond to the delays of auxiliary operations. The
number of stages in each pipeline represents the
corresponding value of overhead. Suppose the
delay of one stage is one unit of time; then the
number of stages of a pipeline is OH,
OH ∈ {1, 2, 3, ...}. The memory in fig. 2 is an idealized
unit with unlimited size and access time 0. Its
delay is considered as some of the stages of the
pipelines.

A data flow graph is a directed cyclic graph.
In fig. 3, a data flow graph for computing N! is
depicted. In order to describe the dynamic
behaviour of a program, the data flow graph of the
program should be unfolded to become an acyclic
graph. That is, each function call should be
replaced by its data flow graph and each loop
should be unfolded. Therefore, no cycle exists
in the graph obtained (the graph is infinite
if the program does not terminate). In
such a graph, each node represents an instruction.

A pseudo-node, named the start node and
denoted by sn, which emits arcs to all nodes
that input tokens from outside the graph, can
be added to an unfolded graph, as shown in fig. 5.
Such a graph is called an extended data flow graph.
In the following, we will only discuss this kind of
graph and call them data flow graphs (DFG) for
brevity. Suppose the length of each arc in such
graphs is one.

DEF.1.1 Level of a node. The level of the 
start node is zero, the level of any other node 
is the length of the longest path to it from the 
start node. 

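DEF.1.2 Parallelism degree of level j of a
program (j is a positive integer), $PD_j$, is the number
of nodes at level j in its extended data flow
graph. The parallelism degree of a program, PP, is

$$
PP = \begin{cases}
\dfrac{1}{m}\sum_{j=1}^{m} PD_j & \text{if the graph is finite, with } m \text{ levels} \\[2ex]
\lim_{m\to\infty} \dfrac{1}{m}\sum_{j=1}^{m} PD_j & \text{if the graph is infinite}
\end{cases}
$$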
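
DEF.1.1 and DEF.1.2 can be tried out on a small finite graph. The following is a minimal sketch, not part of the original model: the function names and the example arcs are hypothetical, and the graph is assumed to be acyclic with every node reachable from the start node sn.

```python
from collections import defaultdict

def node_levels(arcs, start="sn"):
    """Level of a node = length of the longest path to it from the start node."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {start}
    for u, v in arcs:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    level = {node: 0 for node in nodes}
    ready = [node for node in nodes if indeg[node] == 0]
    while ready:                       # topological sweep over the acyclic graph
        u = ready.pop()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return level

def parallelism_degrees(level):
    """PD_j = number of nodes at level j, for j >= 1."""
    pd = defaultdict(int)
    for lvl in level.values():
        if lvl >= 1:
            pd[lvl] += 1
    return dict(pd)

def program_parallelism(pd):
    """PP for a finite graph: sum of PD_j divided by the number of levels m."""
    return sum(pd.values()) / max(pd)

# Hypothetical example graph: sn feeds a and b; d waits for both b and c.
arcs = [("sn", "a"), ("sn", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
level = node_levels(arcs)
pd = parallelism_degrees(level)
pp = program_parallelism(pd)
```

In this example node d sits at level 3 because its longest path runs through a and c, even though a shorter path through b exists.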


DEF.1.3 Balanced data flow graph. A data flow
graph is balanced if, whenever there are two or
more paths from the start node sn to a node,
their lengths are the same; otherwise it is an
unbalanced one.

When a balanced data flow graph is used to
process a set of data in a pipelined way, high
efficiency can be achieved. Unfortunately, most data
flow graphs are unbalanced. In order to
transform them into balanced ones, some identity
instructions (I), which only delay tokens by one
unit of time, should be inserted in some short
paths, thus increasing their lengths. An
unbalanced DFG is shown in fig. 5. We insert an identity
instruction into it, so that it becomes a balanced
one, as shown in fig. 6.
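
One simple way to balance a graph is to pad every arc that spans more than one level with identity instructions. The sketch below assumes node levels are already known; the function names, the identity node naming scheme and the example graph are hypothetical. It is one possible padding strategy, not necessarily the one used for fig. 6.

```python
def identities_to_insert(arcs, level):
    """For each arc (u, v), how many identity instructions it needs."""
    padding = {}
    for u, v in arcs:
        gap = level[v] - level[u] - 1
        if gap > 0:
            padding[(u, v)] = gap
    return padding

def balance(arcs, level):
    """Return a new arc list in which every arc spans exactly one level."""
    balanced = []
    for u, v in arcs:
        prev = u
        for step in range(level[v] - level[u] - 1):
            ident = f"I_{u}_{v}_{step}"      # hypothetical name for an identity node
            balanced.append((prev, ident))
            prev = ident
        balanced.append((prev, v))
    return balanced

# Hypothetical unbalanced graph: b is reached from sn directly and via a.
arcs = [("sn", "a"), ("sn", "b"), ("a", "b")]
level = {"sn": 0, "a": 1, "b": 2}
padding = identities_to_insert(arcs, level)
balanced_arcs = balance(arcs, level)
```

Once every arc spans exactly one level, every path from sn to a node has length equal to that node's level, so the graph is balanced in the sense of DEF.1.3.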

On a balanced DFG, the nodes at level i receive
data tokens only from level i-1. In other words,
instructions at level i are only driven by tokens
from level i-1. So, we propose an assumption.

ASSUMPTION ON DRIVING LEVEL BY LEVEL. On an
extended DFG, instructions at level i are only driven by
tokens from level i-1.

For any unbalanced DFG, it is possible that an
instruction at level i receives tokens not only
from level i-1, but also from other levels. Because
the token from level i-1 goes through the longest
path from the start node sn to the instruction, it
is most likely that the instruction is fired
by this token. Therefore, this assumption is
reasonable in general.

Let $Object\_No_{j-1}$ denote the total number of
result packets generated by all instructions at
level j-1, and $Operand\_No_j$ the total number of
operands required by all instructions at level j.
According to the previous assumption, we have

$$
Object\_No_{j-1} = Operand\_No_j, \qquad j \in \{2,3,4,\ldots,m\}
$$


A stochastic process $PD_j$, $j \in \{1,2,3,\ldots\}$, is
used for describing the dynamic behaviour of
the parallelism degree of a program. Suppose that
all $PD_j$ are independent, identically distributed
random variables with distribution $N(MP, \sigma^2)$, where
MP and $\sigma^2$ are the mean and the variance of $PD_j$
respectively, and $PD_j$ is the parallelism degree of
level j.


2. MODEL OF SYSTEM FOR COMPUTING 
PARALLELISM DEGREE 


First, we make some assumptions on this system.

Assumption 2.1. The system has n processors.
In other words, the maximum parallelism degree of
the system is n. Let $N = \{0,1,2,\ldots,n\}$ be the set
of all possible practical parallelism degrees of
the system.

Assumption 2.2. The maximum parallelism degree
of a program is Pmax, $Pmax = MAX(P_1, P_2, \ldots, P_m)$,
where $P_j$ is the parallelism degree of level j and m is the
maximum level. Let $P = \{1,2,3,\ldots,Pmax\}$ be the set
of all possible parallelism degrees of the levels.

Assumption 2.3. The numbers of result packets
produced by instructions at each level are
independent, identically distributed variables with
revised Poisson distribution.

Suppose the mean number of result packets
produced by an instruction at level j is $\lambda_j$; the
revised Poisson distribution $P'(\lambda_j)$ is

$$
P'(\lambda_j) = \frac{\lambda_j^{\,x-1}}{(x-1)!}\, e^{-\lambda_j}, \qquad x = 1,2,3,\ldots
$$

$$
\lambda_j \in \{1,2,3,\ldots,\lambda_{max}\}, \qquad \lambda_{max} = MAX(\lambda_1, \lambda_2, \ldots, \lambda_m)
$$
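
Reading the revised Poisson distribution as an ordinary Poisson distribution shifted so that its support starts at x = 1 (an assumption about the printed formula), its probability mass function can be sketched as follows. The function name is hypothetical.

```python
import math

def revised_poisson_pmf(x, lam):
    """P(X = x) = lam**(x-1) * exp(-lam) / (x-1)!  for x = 1, 2, 3, ..."""
    if x < 1:
        return 0.0
    return lam ** (x - 1) * math.exp(-lam) / math.factorial(x - 1)

# The probabilities over x = 1, 2, 3, ... should sum to 1 (truncated here).
total = sum(revised_poisson_pmf(x, 2.0) for x in range(1, 60))
```

Under this reading every instruction produces at least one result packet, which matches a support that starts at 1 rather than 0.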
Assumption 2.4. Any result packet is transmitted
to each instruction at the next level with
identical probability.

Assumption 2.5. The numbers of operands needed
by instructions at each level of a program are
independent, identically distributed variables
with revised Poisson distribution $P'(\beta_j)$, where
$\beta_j$ is the mean number of operands required by an
instruction at level j.

DEF.2.1. A description vector of level j of a
program is a k-tuple $M_j = (a_{j1}, a_{j2}, \ldots, a_{jk})$,
$a_{ji} \in P$, $1 \le i \le k$, $k = \lambda_{max} * MP$, where $a_{ji}$
is the number of instructions requiring i operands.
At the beginning, $PD_j = \sum_{i=1}^{k} a_{ji}$.

According to assumption 2.5, let

$$
a_{ji} = PD_j \, \frac{\beta_j^{\,i-1}}{(i-1)!}\, e^{-\beta_j}
$$

$$
M_j \in P \times P \times \cdots \times P = P^k
$$

Let $M = P^k$, called the set of description
vectors.

When a result packet is produced and passed to
the next level, the probability with which the
result is sent to the set of instructions requiring i
operands is $a_{ji} / \sum_{l=1}^{k} a_{jl}$, according to assumption
2.4. After that, the number of instructions
requiring i operands decrements, whereas the number
of instructions requiring (i-1) operands
increments. The description vector $M_j$ is transformed
to $M_j'$:

$$
M_j' = \begin{cases}
(a_{j1}, a_{j2}, \ldots, a_{j,i-1}+1, a_{ji}-1, \ldots, a_{jk}) & i > 1 \\
(a_{j1}-1, a_{j2}, a_{j3}, \ldots, a_{jk}) & i = 1
\end{cases}
$$

When the result packet is transmitted to the set of
instructions requiring only one operand (i=1), one
instruction in the set can be fired. The number
of instructions $a_{j1}$ decrements ($a_{j1}-1$), the total
number of instructions at level j also decrements,
and the number of instructions in the buffer
increments. In other cases, the number of instructions at
level j does not decrease. Obviously, when $\sum_{i=1}^{k} i \cdot a_{ji}$
result packets are received at level j, all
instructions at this level can be fired.

From the assumption on driving level by level,
$PD_{j-1} * \lambda_{j-1} = PD_j * \beta_j$, then $\beta_j = \lambda_{j-1} * PD_{j-1} / PD_j$.

DEF.2.2. The state of processors is an integer
PR, $0 \le PR \le n$, where n is the number of processors in the
system.

DEF.2.3. The state of the pipeline is a vector
PI with OH elements, $PI = (n_1, n_2, \ldots, n_{OH})$,
$n_i \in N$, $1 \le i \le OH$, $PI \in N \times N \times \cdots \times N = N^{OH}$.

DEF.2.4. The state of the buffer is an
integer BU, $BU \in \{0,1,2,\ldots,m*Pmax\}$.

DEF.2.5. A descriptor of a program with m
levels is an m-tuple $PG_m = (M_1, M_2, \ldots, M_m)$,
$M_j \in M$, $1 \le j \le m$, $PG_m \in M \times M \times \cdots \times M = M^m$.

The model of system for computing parallelism
degree is shown in fig.7, in which PR is the state
of processors, PI the state of the pipeline, and BU the
state of the buffer. The value of BU represents
the number of instructions which can be fired
but are waiting for free processors. The memory
of the system is composed of two parts. One is a
half-infinite linear storage with locations 1, 2, 3,
... Each location can store a description vector:
location 1 stores $M_1$, location 2 stores $M_2$, ...,
location m stores $M_m$; that is, the m description
vectors are stored in order from location 1 to m. The
second part of the memory is a pointer, Point, which
points to the location that can currently be read or
written. Point can stay at its current place
or move one location to the right. Its initial
position is the first location. Except for the m locations
storing $PG_m$, the locations in the memory all store
the blank vector B, B = (b,b,b,...,b), where b is a blank
character.

DEF.2.6. The state of the system s for computing
parallelism degree is a triplet (PR, PI, BU), in
which PR, PI, and BU are the state of processors,
the state of the pipeline, and the state of the
buffer respectively (as in def. 2.2, 2.3, 2.4).

Let PS be the set of system states,

$$
PS = N \times N^{OH} \times \{0,1,2,\ldots,m*Pmax\}, \qquad s \in PS
$$

DEF.2.7. The model of system for computing
parallelism degree is a 5-tuple $(PS, s_0, M, B, \delta)$,
in which PS is the set of system states, $s_0$ is the
initial system state, $s_0 = (0, (0,\ldots,0), 0)$, M is
the set of description vectors of a program, B is
the blank vector, and $\delta$ is a map

$$
\delta : PS \times M \longrightarrow PS \times M \times \{L, A\}
$$

L means that the Point does not move
A means that the Point advances one location
to the right

$\delta$ is called the transform rule of the system
state. We shall describe it with the algorithm
TRANS as follows.

Supposition 1. Poisson($\lambda$) is a random number
generator with revised Poisson distribution; $\lambda$ is
the mean of the random numbers.

Supposition 2. U($M_j$) is also a random number
generator, which gives a positive integer i, $1 \le i \le k$,
according to assumption 2.4.


Algorithm TRANS
if Mj = B then stop else
if Mj = (0,0,...,0) then j := j+1 else
begin
   RN := PR; PR := PI[1];
   for i := 1 to OH-1 do PI[i] := PI[i+1];
   if BU >= n then [PI[OH] := n; BU := BU-n]
              else [PI[OH] := BU; BU := 0];
   Packet := 0;
   while RN > 0 do [Packet := Packet + Poisson(λ[j-1]); RN := RN-1];
   while (Packet > 0 and Mj ≠ (0,0,...,0)) do
   begin
      i := U(Mj); a[j,i] := a[j,i] - 1;
      if i > 1 then a[j,i-1] := a[j,i-1] + 1 else
         BU := BU+1;
      Packet := Packet - 1
   end
end;
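
One step of TRANS can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' implementation: the function names are hypothetical, the revised Poisson generator is assumed to be a Poisson variate shifted by one, U(Mj) is assumed to pick class i with probability proportional to a[j,i], and each delivered packet is assumed to decrement Packet by one.

```python
import math
import random

def revised_poisson(lam, rng):
    """Draw 1 + Poisson(lam): a Poisson variate shifted to start at 1."""
    limit, prod, count = math.exp(-lam), rng.random(), 0
    while prod > limit:              # Knuth's multiplication method
        prod *= rng.random()
        count += 1
    return count + 1

def trans_step(pr, pi, bu, a, n, lam_prev, rng):
    """One time unit: returns the new (PR, PI, BU, Mj).

    pr: busy processors, pi: pipeline stages (list of length OH),
    bu: enabled instructions waiting in the buffer,
    a:  description vector, a[i-1] = instructions still needing i operands.
    """
    rn = pr                          # instructions finishing in this step
    pr = pi[0]                       # first pipeline stage enters the processors
    fetched = min(bu, n)             # at most n instructions leave the buffer
    pi = pi[1:] + [fetched]          # shift the pipeline, fill the last stage
    bu -= fetched
    packet = sum(revised_poisson(lam_prev, rng) for _ in range(rn))
    a = list(a)
    while packet > 0 and any(a):
        i = rng.choices(range(1, len(a) + 1), weights=a)[0]
        a[i - 1] -= 1
        if i > 1:
            a[i - 2] += 1            # that instruction now needs i-1 operands
        else:
            bu += 1                  # last operand arrived: instruction is enabled
        packet -= 1
    return pr, pi, bu, a

rng = random.Random(0)
idle = trans_step(0, [2, 1], 5, [1, 1], 3, 1.0, rng)
pr2, pi2, bu2, a2 = trans_step(2, [0, 0], 0, [3, 2], 3, 1.0, rng)
```

Each delivered packet lowers the total number of outstanding operands by exactly one, and an instruction moves to the buffer only when its last operand arrives, so the number of instructions at the level plus those newly enabled stays constant.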


Suppose the system is executing instructions at
level (j-1) in a data flow graph. After a unit of
time, the instructions in the n processors of the system
are all finished, each of which produces some
result packets randomly. The total number of these
result packets is stored into a variable Packet.
At the same time, the instructions at the first stage
of the pipeline PI are sent into the processors PR,
and some instructions are fetched into the last
stage of the pipeline PI from the buffer BU. The
result packets in Packet are passed to the
instructions at level j with identical probability. The
random number generator U($M_j$) produces a number i;
then one result packet is transmitted into one of
the instructions at level j which needs i operands.
If i>1, then the description vector becomes
$M_j' = (a_{j1}, a_{j2}, \ldots, a_{j,i-1}+1, a_{ji}-1, \ldots, a_{jk})$.
If i=1, then one of the
instructions can be fired, and is passed to the buffer
BU. After all the result packets (in Packet) are
passed to the memory one by one, $M_j$ is changed to
$M_j'$, BU to BU', and the system goes into a new state.

When the model is running, the values of PR
and the number of steps the system has proceeded
are accumulated, and then the practical parallelism
degree can be calculated as $\sum PR / steps$.

Before running the model, the mean of the
parallelism degree of a program, its variance, and the
overhead of instruction execution are set as
parameters. When the system finishes, the
relation among the overhead, parallelism degree, and
the number of processors can be determined.


3. RESULTS COMPUTED AND DISCUSSION 


According to the model, some results computed 
for various parameters have been obtained. And 
two sets of curves are depicted in fig.8,9. 

In fig.8, we suppose the communication
overhead is in proportion to log N, so $OH = C_1 + C_2 \log N$,
where $C_1, C_2$ are integers ($C_1, C_2 > 0$). Since packets
pass through the networks two times, $C_2$ equals
two.

Each curve in fig.8 has a maximum value. After
the maximum value is reached, the practical
parallelism degree falls and the efficiency of processors
decreases, although the number of processors N in
the system increases.

Table 1 shows the relationship between
the maximum practical parallelism degree and other
variables.

In fig.9, a set of curves of relations between
practical parallelism degrees and parallelism
degrees of systems, with OH as parameter, is shown;
on each curve there is also a maximum value
of the practical parallelism degree.

From the model of the system, we can see that
in order to obtain high efficiency, the pipeline
representing overhead must be filled up with OH*N
enabled instructions. If MP<OH*N, the
instructions at one level are not enough for filling the
pipeline. As a result, the practical parallelism
degree decreases.

When MP of a program and OH of a system are
given, the maximum practical parallelism degree
is equal to or less than MP/OH. In the following,
we give the explanation of this.

(1) Suppose MP < OH*N.

Since MP instructions are executed during OH
steps, the practical parallelism degree is MP/OH.

(2) Suppose MP >= OH*N.

Obviously, the practical parallelism degree is
the possible maximum parallelism degree of the
system, N. Since MP >= OH*N, then N <= MP/OH.

Therefore, the practical parallelism degree is
not greater than MP/OH. Consequently, when the
parallelism degree of a program is a constant, its
maximum practical parallelism degree is only
determined by the average overhead of instruction
execution on a system, and has nothing to do with the
number of processors in the system. If we hope
to speed up the running of a program on a system
only by means of increasing the number of
processors, even at the cost of increasing the overhead,
then the result may be contrary to our hope. The
increase of the overhead may cause the practical
parallelism to decrease, and the running time
of programs to increase.
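
The two cases of the argument can be combined into a small calculation. The sketch below is illustrative only: the function names are hypothetical, the values MP = 64 and C1 = 2 are made up, and the logarithm in OH = C1 + C2 log N is assumed to be base 2.

```python
import math

def practical_parallelism_bound(mp, oh, n):
    """Upper bound from the two cases: MP/OH if MP < OH*N, otherwise N."""
    if mp < oh * n:
        return mp / oh
    return float(n)

def overhead(n, c1, c2=2):
    """OH = C1 + C2*log N, with the base of the logarithm assumed to be 2."""
    return c1 + c2 * math.log2(n)

# Hypothetical program with MP = 64 on systems of 4, 8, 16 and 32 processors.
bounds = {n: practical_parallelism_bound(64, overhead(n, c1=2), n)
          for n in (4, 8, 16, 32)}
best_n = max(bounds, key=bounds.get)
```

With these made-up values the bound rises up to 8 processors and then falls, because beyond that point the growing overhead outweighs the added processors.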

The practical parallelism degrees computed from
the above model are much less than MP/OH, because
the execution of enabled instructions on data flow
computers is in random order, which is not suitable
for pipelines.

From the above computation, we can see that
overhead has great influence on the performance
of a data flow computer. If we want to increase
the parallelism degree of data flow computers, we
must be sure that the overhead does not increase
quickly.

Obviously, although the conclusion is obtained
from static data flow computers, it is also useful
for other types of multi-processor systems.


USING FACTS FOR IMPROVING THE PARALLEL EXECUTION
OF FUNCTIONAL PROGRAMS


Alberto Pettorossi

Abstract -- We address the problem of improving
the efficiency of the parallel evaluation of func- 
tional programs. We assume that those programs are 
evaluated by a set of concurrent agents which 
communicate with each other and cooperate together 
while the computation progresses. Communications 
among agents improve the performance, because prop- 
erties or facts about the functions to be computed 
are exploited. In particular we show that those 
communications may avoid redundant computations 
of intermediate results. 


1. Introduction 


Functional languages have been advocated as a
formalism for expressing algorithms which allows one to
overcome the limitations imposed by the von Neumann
computer architecture [1]. Applicative expressions,
in fact, can be evaluated in a parallel way by a
set of concurrent computing agents, because there
is a unique value associated with any
subexpression, independently of the context in which it
occurs (by the referential transparency property).
That context independence property allows us to
compute the various subexpressions in a parallel
way by assigning each of them to an individual
computing agent.

In order to fix our ideas and to introduce
our Hope-like notations [3], let us consider the
following simple program for computing binomial
coefficients:


      bin(n,0) = 1
B0:   bin(n,n) = 1
      bin(n,m) = 0                         if n<m
      bin(n+1,m+1) = bin(n,m)+bin(n,m+1)   if n>m
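
The four equations can be transcribed directly into a sequential Python function as a sanity check. This is only a sketch: the function name is hypothetical, and the last equation is applied with n and m shifted down by one.

```python
import math

def binomial(n, m):
    """Sequential transcription of the recursive equations for bin(n, m)."""
    if m == 0:
        return 1
    if n == m:
        return 1
    if n < m:
        return 0
    # bin(n+1, m+1) = bin(n, m) + bin(n, m+1), shifted by one
    return binomial(n - 1, m - 1) + binomial(n - 1, m)

value = binomial(3, 2)
```

Each call with n > m ≥ 1 makes two recursive calls; these are the calls that the agents 0 and 1 evaluate in the parallel reading of the program.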

The computation of the function bin(n,m) can
be informally described as a rewriting of sets of
agents. (The formal definition of a generic
computation will be given later.) For the time being, an
agent can be thought of as a pair consisting of a
string, i.e., the name of the agent, and an
expression, i.e., the expression which the agent has to
evaluate.

We may assume that initially there is the
agent ε::bin(n,m), which can be rewritten into the
set of agents:

{ε:: .0+.1, 0::bin(n-1,m-1), 1::bin(n-1,m)}

where n>m≥1.
In the rewriting we see that: i) new agents are
generated by the recursive calls of the program
equations, ii) the left and right sons of the initial
agent ε have names 0 and 1 respectively, and iii)
.k denotes in any given expression the result of
the computation of the agent whose name is k for
k=0 or 1.

We assume that the convention for giving
names to the agents is the following: the initial
agent has name ε, and if the agent x has k+1 sons,
i.e., it uses for the rewriting a recursive
equation with k+1 recursive calls, the names of its
sons are x0, x1, ..., xk, respectively, according
to the left-to-right order of those recursive calls.

For our binomial function, in the second step
of the computation the agents 0 and 1 generated by
the agent ε can be rewritten independently of
each other, so that the computation of bin(n,m)
can be more precisely viewed as a nondeterministic
and parallel rewriting of sets of agents into new
sets of agents (see figure 1). The nondeterminism
consists in choosing one (or more) agents to be
rewritten, and the parallelism consists in rewriting all
agents which have been chosen, according to the
recursive equations of the program.


                    {ε::bin(n,m)}
                         |
      {ε::.0+.1, 0::bin(n-1,m-1), 1::bin(n-1,m)}
            /                    |
{ε::.0+.1,                 {ε::.0+.1,
 0::.00+.01,                0::.00+.01, 1::.10+.11,
 1::bin(n-1,m),             00::bin(n-2,m-2),
 00::bin(n-2,m-2),          01::bin(n-2,m-1),
 01::bin(n-2,m-1)}          10::bin(n-2,m-1),
      |                     11::bin(n-2,m)}


figure 1. Possible rewritings of the parallel
          computation of bin(n,m) when n>m≥2.


Notice that, according to the naming
convention, each set of agents can be viewed as a
tree-structured set, and therefore we may say that the
computation progresses as a nondeterministic and
parallel rewriting of trees of agents.

In this paper we will be concerned with the
problem of improving the performance of the trees of
agents for the parallel evaluation of functional
programs.

We will propose a method based on the
discovery or knowledge of some "facts" concerning the
functions to be evaluated. Those facts will be
implemented as suitable communications among
computing agents, and the occurrence of those
communications will realize the expected efficiency
improvements.

Suppose, for instance, that we know the
following fact about the binomial coefficients
function: if n>m≥2 then the left son of the right son
of bin(n,m) is equal to the right son of the left
son of bin(n,m). In the language of facts which we
will formally introduce later, we write:

      bin(n,m)⌋01 = bin(n,m)⌋10
(see figure 2).


                         bin(n,m)
                     0 /          \ 1
           bin(n-1,m-1)            bin(n-1,m)
            0 /     \ 1             0 /     \ 1
bin(n-2,m-2)   bin(n-2,m-1) = bin(n-2,m-1)   bin(n-2,m)


figure 2. A fact for the binomial coefficients
          function: bin(n,m)⌋01 = bin(n,m)⌋10
          where n>m≥2.

In the rewriting of the tree-structured set 
of agents we may then avoid the expansion of the 
agent 10 and allow the expansion of the agent 01 
only. Once the value of bin(n-2,m-1) has been ob- 
tained, the agent 01 may send it to the agent 10, 
and save its computation. We will see in the se- 
quel how that communication takes place and how it 
can be implemented by associating a local memory 
with each agent. 

Avoiding the expansion of the agent 10
reduces the number of agents we need for performing
the required computation of bin(n,m), and in most
cases that saving is crucial, because in practice the
number of available computing agents cannot be
considered as unbounded.
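
The size of the saving can be estimated by counting agents. The sketch below is an idealization, not the communication mechanism itself: it compares the full tree of recursive calls with the case in which every repeated call bin(i,j) is evaluated by a single agent, which is what applying the fact at every node would achieve in the best case. The function names are hypothetical.

```python
def agents_without_sharing(n, m):
    """Number of agents in the full tree of recursive calls for bin(n, m)."""
    if m == 0 or n <= m:
        return 1                                   # base cases need no sons
    return (1 + agents_without_sharing(n - 1, m - 1)
              + agents_without_sharing(n - 1, m))

def agents_with_sharing(n, m):
    """Number of distinct calls bin(i, j) reachable from bin(n, m)."""
    seen = set()

    def visit(i, j):
        if (i, j) in seen:
            return                                 # value obtained from a companion agent
        seen.add((i, j))
        if j == 0 or i <= j:
            return
        visit(i - 1, j - 1)
        visit(i - 1, j)

    visit(n, m)
    return len(seen)

full_tree = agents_without_sharing(4, 2)
shared = agents_with_sharing(4, 2)
```

For bin(4,2) the full tree needs 11 agents while 8 distinct calls suffice; the gap widens quickly as n and m grow.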

There is another reason why the knowledge of
relevant facts about recursive functions may
drastically improve the efficiency. Suppose that during
the computation of a given function we have used
all the agents at our disposal. At that point we
have to backtrack by deallocating the tasks from
some agents which are leaves of the current tree
of agents. The deallocation will make some agents
available for the rewriting of other leaf agents,
and then more computation steps can be performed.
It is desirable that backtracking steps are done
in the most effective way, so as to avoid, if
possible, subsequent backtracking steps and at the
same time allow the maximum utilization of the
agents at our disposal. Unfortunately the
deallocation-allocation procedure in general cannot be
optimized if we do not know enough properties of
the dynamic expansion of the trees of agents
during the computation of the particular function
we have to evaluate. In the binomial coefficients
case, for instance, when computing bin(n,m) it may
be better to expand the tree of agents so that we
activate as soon as possible bin(m,m) or bin(n-m,0),
because no extra agents are needed when computing
those function calls.

In Section 2 we will present a method for
implementing facts about recursive functions and
we will see how they can be translated into
communications among agents.

In the subsequent Sections we will give a more
formal presentation of the ideas introduced in
Sections 1 and 2, and we will study some properties of
the communications among agents.


2. The general scenario and a preliminary 
example 


We consider functional programs written as
a set of recursive equations in a language L0 (to
be defined later), and we assume that the
programmer discovers or knows some useful facts about the
functions to be computed. Those facts, written in
the language of facts LF, are first submitted to a
calculus which may or may not accept them. The role
of the calculus is twofold: i) it makes sure that
the facts are correct with respect to the given
programs, and ii) it validates the efficiency
improvements they determine, once implemented as
communications among agents. Those two aspects of the
calculus are very important, because otherwise we
may derive, as we will see, erroneous or
inefficient programs.

Facts which are accepted by the calculus are 
then translated by a translation algorithm, which 
produces a new set of recursive equations, written 
in a language L1, where it is possible to denote 
communications among function calls. 

The general scenario of our approach is the 
one depicted in figure 3. 


[figure 3, diagram. Boxes: Knowledge; Recursive Equations
Programs (written in the language L0); Recursive Equations
Programs + Facts (written in the language LF); Calculus C;
Recursive Equations Programs + Facts accepted by the
Calculus C; Facts Translation Algorithm tr; Recursive
Equations Programs + Communications (written in the
language L1); Semantic Translations Sem0 and Sem1;
"Parallel Programs" (Rewriting Rules for Communicating
Agents).]


figure 3. The general scenario for improving
          the parallel execution of functional
          programs.

The scenario we presented can be considered
as an extension of the ideas of the transformation
system of Burstall and Darlington [2]. However, in
our approach the knowledge the programmer has about
the functions to be computed is expressed in a
language LF of facts (not as "eureka steps" [2]).
Moreover, the presence of the calculus, which deals
with the facts and ensures their correctness and
their usefulness for improving efficiency, avoids
the problems inherent in the Burstall-Darlington
approach, where it may be that during the
transformation process correctness is not preserved [8]
and efficiency is not improved.

Notice that we explicitly deal with functional
programs executed in a parallel way, so that the
facts discovered by the programmer may refer to the
behaviour of the associated sets of computing
agents, but we allow only facts which can be written
in a high level language LF. In this language it is
possible to describe properties of sequences of
expressions generated by the standard rewriting
technique, so that the programmer is not involved in
all the peculiarities connected with the behaviour of the
computing agents. The system can accept facts if
they are "true" facts and "efficiency improving"
facts. Accepted facts about functional programs
are used for increasing the efficiency of their
parallel execution.

The semantics Sem0 and Sem1 are identified
as two translations. Given a program P in L0 (or L1),
Sem0 (or Sem1) produces a corresponding parallel
program P'. Indeed, Sem0 (or Sem1) defines a set of
agents for evaluating the functions defined in P,
and a set of rewriting rules which specify the
concurrent behaviour of those agents.


The meaning of P' is given by a transition
relation which defines, for any given finite set of
computing agents, i.e., a configuration, all
possible future configurations. That transition
relation is able to capture the nondeterministic and
parallel execution of the program P' by
constructing a tree of configurations as indicated in
figure 1.

We will show that the translation Sem0 from
a program P to P' (or the translation Sem1∘tr from
P with some additional facts to P') can be done
automatically for some classes of programs and
facts.

The system is able to improve efficiency by 
automatically introducing suitable communications 
among computing agents. Those communications are 
derived on the basis of the given facts only. 

It will be shown that the introduction of the
intermediate language L1, in which functional
programs with communications are written, allows us
to have a simple calculus for checking the
correctness of the communications among computing agents.

The language of parallel programs should be
considered as a lower level language, which is
understood by the system which realizes the parallel
execution of functional programs via a set of
computing agents.

The following example (given at schema level
and taken from [4]) will clarify the main ideas we
have presented so far.

Let us consider the program schema P:

P:   f(x) = a(x)                       if p(x)=true
     f(x) = b(x,f(c(x)),f(d(x)))       otherwise

Let us suppose that we discovered that for
any x s.t. p(x)=false we have f(x)⌋01 = f(x)⌋10.
That fact can be expressed as

     f(c(d(x))) = f(d(c(x)))
and it means that the left-son call of the
right-son call of f(x) is equal to the right-son call of
the left-son call of f(x). That fact is then
submitted to the calculus (whose definition will be
given later). If it is accepted as a "true" fact
and an "efficiency improving" fact, it will be
incorporated into the program P by the translation
algorithm which produces the following program P1:


P1:  f(x) = a(x)                             if p(x)=true
     f(x) = b(x, f(c(x))(1 comm l),
                 f(d(x))(0 comm l)) decl l   otherwise

The informal meaning of the expression "e decl l"
is that a temporary memory location l is kept
during the evaluation of the expression e. That
location is discarded when the evaluation of e is
completed (see figure 4).


                          f(x)

              f(c(x))                f(d(x))

f(c(c(x)))   f(c(d(x)))   f(d(c(x)))   f(d(d(x)))


figure 4. The program for f(x) with
          communications via the location l.


The informal meaning of f(c(x))(1 comm l) (and
analogously for f(d(x))(0 comm l)) is that the
value of the right son of f(c(x)), i.e., f(c(d(x))),
is stored in the location l. To be more precise,
f(c(x))(1 comm l) means that i) during the
evaluation of f(x), the agent which has to compute
f(c(d(x))) may look at the content of the location
l to get the result of its computation, and ii) at
the end of the computation of f(c(d(x))), that
value is stored in l. If a computing agent looks at
the content of a location l and there is no value
stored in it, its computation continues by
activation of recursive calls.

From the informal description we have given,
it is clear that efficiency improvements may take
place, because before the end of the computation
of f(d(c(x))), the corresponding agent may look at
the location l and get the value it has to compute.
In particular, that agent may simply wait for
the result of its computation to be written in l
by the companion agent, which has to compute
f(c(d(x))). If it does so, the generation of
redundant agents is avoided, but we have to make sure
that the agents do not wait for each other and no
deadlock occurs.

Notice that the identification of a
descendant call of a function f(x) is done via a string
in {0,1,...,k}*. f(x)⌋ε denotes the function call
f(x) itself, and f(x)⌋sj recursively denotes the
j-th son call of f(x)⌋s, for s∈{0,1,...,k}* and
0≤j≤k.

The calculus used in our approach is based on
the symbolic evaluation of functions and it makes
sure that for the given functions c and d, we have
indeed: f(c(d(x))) = f(d(c(x))). The efficiency
improvements due to the realization of the facts
are guaranteed by the property that when a
communication occurs, it is like making many rewritings
of trees of agents in one step only.


The following new fact, called periodic
redundancy in [4], can be discovered about the function
f(x):

     f(x)⌋00^n = f(x)⌋11^m   for some m and n≥0,

where j^n denotes the string of n j's.

The above fact can be implemented as new
additional communications between computing agents,
and we get the following program P2:


P2:  f(x) = a(x)                                  if p(x)=true
     f(x) = b(x, f(c(x))(1 comm l1)(0^n comm l2),
                 f(d(x))(0 comm l1)(1^m comm l2))
            decl l1,l2                            otherwise


Notice that the programmer, when he discovers
new facts, has only to suggest them to the system,
and their translation into new recursive equations
with communications is done automatically by the
translation algorithm (at least for the facts we
considered here, i.e., the ones which can be
expressed as equality of function calls). Indeed it
is not difficult to see that one can derive the
communication expressions of the form:

     s comm l   and   decl l

starting from the strings denoting function calls
in the language of facts.

The goal of our project is to build a system 
for recognizing different classes of facts con- 
cerning functional programs, and exploiting those 
facts for improving efficiency. 


3. Parallel programs 


In this section we introduce the notion of 
parallel programs [8-10]. 

AExp, MExp, Exp are sets of terms constructed
in the standard way from constants, variables and
function symbols. (The sets of constants, variables
and function symbols are fixed in our
considerations.) The elements of the sets AExp, MExp, Exp
are called agent-name expressions, message
expressions and expressions, respectively. If T is a set
of terms then by CT we denote the subset of T s.t.
t∈CT iff t is a term without variables.

An agent expression is a triple

     <agn,msg>::e

where agn∈AExp, msg∈MExp, e∈Exp.

A computing agent (or simply an agent) is an 
agent expression without variables. 

Configurations are finite sets of agents. By
CON we denote the set of all configurations.

A rule (or recursive equation) is an expression
of the form:

     lh <= rh if cond

where lh, rh are finite sets of agent expressions
and cond is a boolean expression.

By r(x1,...,xk) we denote the rule r in which
x1,...,xk are the only variable occurrences. If
a1,...,ak are constants (of suitable types) then
by r(a1,...,ak) (or r or lh <= rh if cond) we
denote a concrete instance of r which we derive by
substituting a1,...,ak for x1,...,xk in r.

A parallel program is any non-empty finite set
of rules.

Let c,c'∈CON and let r be a concrete instance
of the rule r and let us assume that:

     c --r--> c' holds iff cond is true and lh ⊆ c
                 and c' = (c - lh) ∪ rh.

--r--> is called the transition relation of r.
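
The transition relation of a single concrete rule instance is a small set computation: if cond holds and lh is contained in c, then lh is removed from c and rh is added. The sketch below is illustrative; the function name and the encoding of an agent as a (name, message, expression) tuple of strings are assumptions, with the empty string standing for the initial agent name.

```python
def step(c, lh, rh, cond=True):
    """Return c' = (c - lh) ∪ rh if the rule instance applies, otherwise None."""
    c, lh, rh = frozenset(c), frozenset(lh), frozenset(rh)
    if cond and lh <= c:                 # lh must be a subset of the configuration
        return (c - lh) | rh
    return None

# A concrete instance of the first binomial rule, with x = "" (the initial
# agent), n = 2 and m = 1; "E" stands for the empty message.
c = {("", "E", "bin(3,2)")}
lh = {("", "E", "bin(3,2)")}
rh = {("", "E", ".0+.1"), ("0", "E", "bin(2,1)"), ("1", "E", "bin(2,2)")}

c_next = step(c, lh, rh, cond=(2 > 1))
not_applicable = step(c_next, lh, rh)
```

After the step the original agent has been replaced by three agents, so the same instance no longer applies to the new configuration.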

The transition relation corresponding to a
given sequence s = r1,...,rk of instances of rules
is defined as the composition of the transition
relations:

     --r1--> ∘ ... ∘ --rk-->

and it is denoted by --s-->.

The transition relation of a program P
(denoted by -->) is defined as follows:

c --> c' holds iff there is a non-empty
finite sequence s of instances of rules in P
(derived by the same substitution) s.t. for an
arbitrary permutation s' of s we have:

     c --s'--> c'.

As an example, let us consider the parallel
program B0' for computing the binomial coefficients. B0'
is a set of rewriting rules which implements B0.


      {<x,E>::bin(n+1,m+1)} <= {<x,E>::.x0+.x1,
                                <x0,E>::bin(n,m),
                                <x1,E>::bin(n,m+1)}   if n>m
      {<x,E>::.x0+e, <x0,E>::n} <= {<x,E>::n+e}
B0':  {<x,E>::e+.x1, <x1,E>::n} <= {<x,E>::e+n}
      {<x,E>::bin(n,0)} <= {<x,E>::1}
      {<x,E>::bin(n,n)} <= {<x,E>::1}
      {<x,E>::bin(n,m)} <= {<x,E>::0}   if n<m
      {<x,E>::n+m} <= {<x,E>::k}   where k=n+m
      {<x,E>::n-m} <= {<x,E>::k}   where k=n-m

where E is the empty message. The messages will play
a significant role in other examples; here they do
not.

We see the above rules in action in the following
computation of bin(3,2), where we write x::exp
instead of <x,E>::exp, ε is the name of the initial
agent, and by ⇒ we denote the transition relation
of B0'.


{ε::bin(3,2)} ⇒
{ε::.0+.1, 0::bin(2,1), 1::bin(2,2)} ⇒
{ε::.0+.1, 0::.00+.01, 00::bin(1,0), 01::bin(1,1), 1::1} ⇒
{ε::.0+1, 0::.00+.01, 00::1, 01::1} ⇒
{ε::.0+1, 0::1+1} ⇒
{ε::.0+1, 0::2} ⇒
{ε::2+1} ⇒
{ε::3}.
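
The rewriting process above can be imitated by a small sequential simulation. The following is a minimal sketch, not the parallel semantics itself: it applies one rule instance at a time, stores agents in a dictionary keyed by name (the empty string stands for the initial agent), and uses tuples as a hypothetical encoding of expressions. The names rewrite_once and run are introduced only for this illustration.

```python
def rewrite_once(agents):
    """Apply one applicable rule instance; return False if none applies."""
    for name in sorted(agents):
        expr = agents[name]
        if isinstance(expr, int):
            continue
        if expr[0] == "bin":
            _, n, m = expr
            if m == 0 or n == m:          # bin(n,0) and bin(n,n) give 1
                agents[name] = 1
            elif n < m:                   # bin(n,m) gives 0 if n<m
                agents[name] = 0
            else:                         # generation of two sons
                agents[name] = ("+", ("ref", name + "0"), ("ref", name + "1"))
                agents[name + "0"] = ("bin", n - 1, m - 1)
                agents[name + "1"] = ("bin", n - 1, m)
            return True
        _, left, right = expr             # expr is ("+", left, right)
        if isinstance(left, int) and isinstance(right, int):
            agents[name] = left + right   # basic function evaluation
            return True
        for i, arg in ((1, left), (2, right)):
            if isinstance(arg, tuple) and isinstance(agents[arg[1]], int):
                new = list(expr)
                new[i] = agents.pop(arg[1])   # value to father, son disappears
                agents[name] = tuple(new)
                return True
    return False


def run(n, m):
    agents = {"": ("bin", n, m)}          # initial configuration
    while rewrite_once(agents):
        pass
    return agents[""]
```

For example, run(3, 2) ends with the single agent holding 3, in agreement with the computation shown above.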


4. The translation functions Sem0 and Sem1.


This section is a technical section where we
formally define the language L0 in which preliminary
program versions are written, the language LF
in which facts as equality of terms are written,
and the language L1 in which recursive programs
and facts are translated. We also present methods
for translating programs in L0 (or L1) into
parallel programs.

We will define a <calculus,fact-translation>
pair for which facts accepted by the calculus can
be correctly translated into communications among
recursive calls. A correct fact-translation tr is
one which makes the diagram of fig. 3 commute.

We will give an explanation of the various 
notions by discussing the problem of computing 
the binomial coefficients. 

The language L0 in which the functional program
for computing binomial coefficients is written
is a Hope-like language [3]. An expression e
of L0 is defined by

e ::= n | x | g(e,...) | f(e,...)

where n∈Constants, x∈Variables, g∈Basic_Functions
and f∈Recursive_Functions.

A program P in L0 is a set of recursive equations
each of which is of the form:

f(e,...)=n (base case)
or
f(e,...)=e' with f occurring in e' (recursive
case).

For L0 expressions we assume a parallel and
distributed evaluation using computing agents [9].
The translation Sem0 of functional programs in L0
into parallel programs is defined by the following
list of rules:


1. Generation of sons.
f(e0,...,ek)=g(...,f(e,...),...,f(e',...),...)
produces (the recursive calls being numbered 0,...,p):
{<x,E>::f(e0,...,ek)} <= {<x,E>::g(...,.x0,...,.xp,...),
                          <x0,E>::f(e,...),...,
                          <xp,E>::f(e',...)}

2. Base configurations.
f(e0,...,ek)=n produces:
{<x,E>::f(e0,...,ek)} <= {<x,E>::n}

3. Values to fathers.
{<x,E>::g(...,.xj,...), <xj,E>::n} <=
      {<x,E>::g(...,n,...)}

4. Basic functions evaluation.
{<x,E>::g(n1,...)} <= {<x,E>::m} if m=g(n1,...)

Initial agent. For computing f(n1,...) the initial
configuration {<ε,E>::f(n1,...)} is generated.

E is the empty message. It will play a significant
role in Sem1; here it does not.

One can check that our parallel program B0'
is the result of the above translation applied to
B0.

In order to produce an improved version of
the program B0 we may use our system by supplying
it with "facts", i.e., properties of the computation
of bin. We will restrict our attention to
facts which can be expressed as equalities of
agent expressions. They can be discovered by
symbolic evaluation.

The syntax of the language LF of facts is as
follows:

e ::= ... (as in L0) | e]s with s∈{0,1,...,k}*
fact ::= f(e,...)]s1 = f(e,...)]s2.

A fact is checked by a Calculus and, once
accepted, a Translation algorithm produces from it
an improved version of the given program.

The Calculus for checking facts is given by
the rewriting rules, which we will present for any
program P0 in L1 with one recursive case only. Let
P0 be:

f(e,...)=n1,...,f(e',...)=nk (base cases)

f(e,...)=e' (recursive case).

The rules for the Calculus are:

1. e]ε → e
2. n]s → error if s≠ε
3. x]s → error if s≠ε
4. g(e0,...,ek)]js → if 0≤j≤k then ej]s
                     else error
5. f(e0,...,ek)]s → if f(e0,...,ek)=e is an
                    instance of the recursive
                    case then e]s else error.


We write:

e]s = e']s'

iff it is possible to reduce e]s and e']s' to the
same expression (different from error) after
applying to them a finite number of times the
rules 1-5 and the rules of the Basic Functions
algebra.

For instance, in the program B0 we have for
n≥2 and m≥2:

bin(n,m)]01 → bin(n-1,m-1)]1 → bin(n-2,m-1)
bin(n,m)]10 → bin(n-1,m)]0 → bin(n-2,m-1), and

bin(n,m)]01 = bin(n,m)]10.
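
The two reductions above can be replayed mechanically. The sketch below is a simplification under explicit assumptions: it handles only the recursive case bin(n,m)=bin(n-1,m-1)+bin(n-1,m), supposes that n and m are large enough for that case to apply at every step, and represents the symbolic term bin(n-i,m-j) by the pair (i, j). The function names select and show are hypothetical.

```python
def select(path, term=(0, 0)):
    """Reduce bin(n-i, m-j)]path, with term = (i, j).

    Each digit of the path unfolds the recursive case once and then
    selects one argument of '+', i.e. one recursive call.
    """
    i, j = term
    for digit in path:
        if digit == "0":        # first recursive call: bin(n-1, m-1)
            i, j = i + 1, j + 1
        elif digit == "1":      # second recursive call: bin(n-1, m)
            i, j = i + 1, j
        else:                   # '+' has only the argument positions 0 and 1
            raise ValueError("error")
    return (i, j)


def show(term):
    """Print a pair (i, j) as the symbolic term bin(n-i,m-j)."""
    def offset(var, k):
        return var if k == 0 else f"{var}-{k}"
    return f"bin({offset('n', term[0])},{offset('m', term[1])})"


same = select("01") == select("10")
```

Both paths lead to the term printed by show as bin(n-2,m-1), so the fact is accepted, whereas the paths 00 and 11 lead to different terms.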

The program produced by the translation is
written in the following language L1. The syntax of
L1 is like that of L0, with the following additions:

e ::= ... | e(s comm l) with s∈{0,1,...,k}*,
                             l∈Locations.

The r.h.s. of an equation can be of the form:

e or e decl l.

The parallel programs which we will get after
the translation of programs in L1 have a special
form of messages.
Messages are either E, i.e. the empty message,
or em≈L, where:
i) em is the empty elementary message φ or
a constant elementary message n∈Constants,

ii) L is a set of names of son agents which may
read or write the associated elementary message.
L is represented as the list [s1,...,sn],
where each si∈{0,1,...,k}*.

For instance, if the left son of an agent has
to read/write the message of its father, that
message will initially be: φ≈[0].

Now we are ready to define the translation of
the set of programs in L1 into the set of parallel
programs.

A program P in L1 is translated as a program
in L0 (i.e., rules 2, 4 and the initial agent rule
for Sem1 are like those for Sem0) with the following
additions and changes:


1'. Generation of sons with communications.

f(e0,...,ek)=g(...,f(e,...)(s comm l),...,
               f(e',...)(s' comm l),...) decl l
produces:
{<x,E>::f(e0,...,ek)} <=
      {<x,φ≈L>::g(...,.x0,...,.xj,...),
       <x0,E>::f(e,...),...,<xj,E>::f(e',...),...}

where L={js | s comm l occurring in the j-th
recursive call}.

3. Values to fathers.
{<x,m>::g(...,.xj,...), <xj,m'>::n} <=
      {<x,m>::g(...,n,...), <xj,m'>::n}.

Notice that, contrary to the previous case, after
reporting to its father a son agent remains present.
It may need to communicate its value to a
message location.


5. Writing a message location.

{<x,φ≈L>::e, <xs,m>::n} <= {<x,n≈L-{s}>::e, <xs,m>::n}
                                                if s∈L

Notice that after writing a location a son agent
remains present because afterwards it may have to
return its value to its father.


6. Reading a message location ("reading
communications").

{<x,n≈L>::e, <xs,m>::e1} <=
      {<x,n≈L-{s}>::e, <xs,m>::n} if s∈L.

The Calculus and the Translation algorithm
enjoy the necessary properties with respect to
correctness and efficiency of the derived programs,
as the following results show.

Let 𝒫0 be the class of all programs in L0
with 1 recursive definition only and let 𝒫1 be the
class of all programs in L1. We consider the class
TR of translations from 𝒫0 into 𝒫1 such that if
tr∈TR then tr adds s comm l and decl l annotations
only.

Correctness Theorem for Communications. If
for every program P0∈𝒫0 and s comm l, s' comm l
occurring in tr(P0) in the recursive call at position
j and j' respectively, we have in our Calculus:

f(...)]js = f(...)]j's'

then tr is correct, i.e. for every P0∈𝒫0 the parallel
programs Sem0(P0) and Sem1(tr(P0)) compute the
same function.

Remark. Sem1(tr(P0)) is more efficient than
Sem0(P0) if some reading communications take place.
Let us consider the following program B1 in
L1:

      bin(n,0)=1
      bin(n,n)=1
      bin(n,m)=0                          if n<m
B1:   bin(n+1,m+1)=(bin(n,m)(1 comm l) +
                    bin(n,m+1)(0 comm l)) decl l
                                          if n>m≥1
      bin(n,1)=n                          if n>1


The program B1 is equivalent to B0, i.e.,
the parallel programs Sem0(B0) and Sem1(B1) compute
the same function bin. This follows from the
Theorem above and from the fact:

bin(n,m)]01 = bin(n,m)]10

accepted by the Calculus when n>m≥2.

The following parallel program Sem1(B1) is
the result of translating B1:

{<x,E>::bin(n+1,m+1)} <= {<x,φ≈[01,10]>::.x0+.x1,
                          <x0,E>::bin(n,m),
                          <x1,E>::bin(n,m+1)}
                                          if n>m≥1
{<x,mes>::bin(n,0)} <= {<x,E>::1}
{<x,mes>::bin(n,n)} <= {<x,E>::1}
{<x,mes>::bin(n,1)} <= {<x,E>::n} if n>1
{<x,mes>::bin(n,m)} <= {<x,E>::0} if n<m
{<x,mes>::n+m} <= {<x,E>::k} if k=n+m

{<x,mes>::e+.x1, <x1,mes1>::n} <=
      {<x,mes>::e+n, <x1,mes1>::n}
{<x,mes>::.x0+e, <x0,mes1>::n} <=
      {<x,mes>::n+e, <x0,mes1>::n}
{<x,φ≈[01,10]>::e, <x01,mes>::n} <=
      {<x,n≈[10]>::e, <x01,mes>::n}
{<x,φ≈[01,10]>::e, <x10,mes>::n} <=
      {<x,n≈[01]>::e, <x10,mes>::n}
{<x,n≈[01]>::e, <x01,mes>::e1} <=
      {<x,n≈[]>::e, <x01,mes>::n}
{<x,n≈[10]>::e, <x10,mes>::e1} <=
      {<x,n≈[]>::e, <x10,mes>::n}.
Let us consider an example of computation of
the program Sem1(B1) with the initial configuration
{<ε,E>::bin(5,3)}:


{<ε,E>::bin(5,3)} ⇒
{<ε,φ≈[01,10]>::.0+.1,
 <0,E>::bin(4,2), <1,E>::bin(4,3)} ⇒ ... ⇒
{<ε,φ≈[01,10]>::.0+.1,
 <0,φ≈[01,10]>::3+.01, <1,φ≈[01,10]>::.10+1,
 <01,φ≈[01,10]>::3, <10,E>::bin(3,2)} ⇒ (A)
{<ε,3≈[10]>::.0+.1,
 <0,φ≈[01,10]>::3+.01, <1,φ≈[01,10]>::.10+1,
 <01,φ≈[01,10]>::3, <10,E>::bin(3,2)} ⇒ (B)
{<ε,3≈[]>::.0+.1,
 <0,φ≈[01,10]>::3+3, <1,φ≈[01,10]>::.10+1,
 <10,E>::3}


In the transition A the value computed by the agent
01 is communicated to the agent ε and in B this
value is communicated to the agent 10 (by the
agent ε). That communication avoids the computation
of bin(3,2) by the agent 10.


Let us consider now a program FIB in L0:

        fib(0)=1
FIB:    fib(1)=1
        fib(n+2)=fib(n+1)+fib(n)

and a program FIB1 derived from FIB and the fact
fib(n)]01 = fib(n)]10 (n>1):

        fib(0)=1
FIB1:   fib(1)=1
        fib(n+2)=(fib(n+1)(... comm l) +
                  fib(n)(W ... comm l)) decl l

where we use the annotation W to tell the system
that the newly generated computing agent with the
task fib(n) is forced to wait until the corresponding
value appears in the location l.


The reader can easily modify our language L1 
and the fact-translation algorithm for coping with 
the new kind of annotations. 

Fact. Let FIB1' be the result of application
of that new translation algorithm to FIB1. The
parallel program FIB1' uses a linear number of
computing agents (with respect to n) in the
computation with the initial configuration

{<ε,E>::fib(n)}.

The length of this computation is a linear
function of n.
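
The effect stated in the Fact can be illustrated by counting evaluations. The following sketch is not the program FIB1' and does not model agents or message locations; it only compares, for the same recursive definition, how many fib tasks are evaluated when no value is shared and when each distinct value is computed once and then read by the other call. The function count_tasks and the dictionary standing in for the locations are assumptions of this illustration.

```python
def count_tasks(n, shared):
    """Return (fib(n), number of fib tasks actually evaluated)."""
    seen = {}      # stands in for the locations where values are written
    tasks = 0

    def fib(k):
        nonlocal tasks
        if shared and k in seen:
            return seen[k]          # the value is read, not recomputed
        tasks += 1
        value = 1 if k < 2 else fib(k - 1) + fib(k - 2)
        seen[k] = value
        return value

    return fib(n), tasks


plain = count_tasks(10, shared=False)
with_sharing = count_tasks(10, shared=True)
```

Without sharing the number of evaluated tasks grows like fib(n) itself, while with sharing it is n+1.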


In the <calculus,fact-translation> pair we use, 
it is possible to express facts as equality of 
terms, so that repeated evaluations of common sub- 
expressions are avoided. The calculus is based on 
the unfolding rule [2] and the symbolic evaluation 
technique. The associated translation realizes
communications among concurrent agents and improves
the efficiency of their computations.


5. Another class of facts. 


In this Section we will consider a new class 
of facts which turn out to be very useful for de- 
riving efficient versions of parallel programs. 

As an example we will derive a program for finding 
the connected components of undirected graphs. 

Some facts can be viewed as making "jumps into
the future". They improve efficiency by identifying
several subexpressions which can be evaluated
in a parallel way, and by allowing us to know
in advance the results of some concurrent
computations.

Other facts have more sophisticated structure, 
but all of them can be discovered by analysing the 
behaviour of the given functional programs, maybe 
using a little induction. 

We think that by developing more techniques
similar to those presented in this paper it will
be possible to construct advanced systems which
(semi)automatically derive parallel programs with
high performance (at least for some specific
classes of problems).

In this Section we will not give all the tech- 
nical details. We leave them to the reader. 

We will suggest the construction of another
<calculus,fact-translation> pair for the facts of
the form "jumps into the future". The calculus is
based on symbolic evaluation and induction, while
the translation algorithm in most cases reduces to
a straightforward inclusion of the discovered facts
as new rewriting rules for the computing agents
(see for instance, the following facts F1 and F2).

Let us derive the program for computing the
connected components of a given graph. We are given
the following functions:

dec succ: nodes → set nodes

succ(n) computes {n}∪s, where s is the set of
nodes adjacent to n.

dec f: set nodes → set nodes

f is the extension of succ to a set of nodes.

f(nilset)=nilset
f({n}⊎s)=succ(n)∪f(s) where ⊎ denotes disjoint
union.

Given a graph G, that is, a particular function
succ, we may use the following function F to compute
the connected components of G, by taking as input
for F the set of singletons of the nodes in G.

dec F: set set nodes → set set nodes

F(nilset)=nilset
F({a}⊎X)=F({f(a)}∪X) if f(a)≠a
F({a}⊎X)={a}∪F(X) if f(a)=a
F({a,b}⊎X)=F({a∪b}∪X) if a∩b≠nilset


For instance, given the graph:

[Figure: an undirected graph on the nodes 1,...,7]

where the nodes are denoted by natural numbers,
we get:

F({{1},...,{7}}) = ...
= F({{1,2,4},{2,1},{3,6,7},{4,5,1},{5,4},{6,7,3},
     {7,3,6}}) = ...
= {{1,2,4,5},{3,6,7}}.
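
What F computes on this example can be checked with a short sequential sketch. It is not the rewriting program derived later, only a direct fixed-point computation: every singleton is first replaced by its succ set, and overlapping sets are then merged until all sets are pairwise disjoint. The adjacency table is read off the succ sets shown above (the entry for node 7 is an assumption where the source is hard to read), and the function name components is hypothetical.

```python
def components(adj):
    """Merge overlapping successor sets until they are pairwise disjoint."""
    # succ(n) = {n} together with the nodes adjacent to n
    sets = [frozenset({n}) | frozenset(adj[n]) for n in adj]
    merged = True
    while merged:
        merged = False
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if sets[i] & sets[j]:           # a and b share a node
                    sets[i] = sets[i] | sets[j]
                    del sets[j]
                    merged = True
                    break
            if merged:
                break
    return set(sets)


graph = {1: {2, 4}, 2: {1}, 3: {6, 7}, 4: {1, 5},
         5: {4}, 6: {3, 7}, 7: {3, 6}}
result = components(graph)
```

On the graph above the result consists of the two sets {1,2,4,5} and {3,6,7}.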

The reader can observe that if he translates 
the functional program defining F into a parallel 
program by applying the algorithm Sem0, he will not 
get an efficient implementation. 

It is easy to see that the following fact holds:

(F1) F({a,b}⊎X) = F({fa∪fb}∪X) if fa∩fb≠nilset.

By using that fact we can simplify the rewriting
of the expression:

F({{1},...,{7}}).

The system can implement that fact as a special
rule of the form:


{x::a1...a...ak, y::b1...b...bm} <=
      {x::a1...fa∪fb...ak, y::b1...b̸...bm}
                              if fa∩fb≠nilset


where we omitted some set brackets and b̸ denotes
the erasing of b.

The computing agents with names x and y can
communicate if fa∩fb≠nilset, and if this communication
takes place, the result can be computed much
faster. Also the following fact F2 holds:

(F2) F({a1,...,ak}) = F({f'a1,...,f'ak})

where f' is equal to f or it is the identity.

The translation of that fact can improve the
efficiency, because it allows the computation of
the f'ai's in a parallel way. However, in order to
avoid the repeated evaluation of common subexpressions
it is advisable to activate as often as possible
the communications which derive from the fact
F1. In a sense a good implementation of the program
for F should balance the requirements from
facts F1 and F2.

That objective can be achieved by implementing
the following fact F3 [13].


Let us define first the following operation:

g(X,Y) = {fa∪fb | fa∩fb ≠ nilset and a∈X, b∈Y}
         ∪ {a∈X | fa∩fb = nilset for all b∈Y}
         ∪ {b∈Y | fa∩fb = nilset for all a∈X}

where X,Y are families of finite sets of nodes.
We define also the operation C on finite sequences
of families of finite sets:

C(X1,X2,X3,X4,...) = [g(X1,X2),g(X3,X4),...].

Finally we can formulate the following fact F3:

(F3) F({{n1},...,{nk}}) = C^p({{n1}},...,{{nk}})
     where p = ⌈log k⌉.
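
The operations g and C can be written down directly from their definitions. The sketch below is only an illustration: families are Python sets of frozensets, the graph is given as a hypothetical adjacency table adj, and the treatment of an unpaired last family in C is an assumption, since the definition above does not cover sequences of odd length.

```python
def f(nodes, adj):
    """Extension of succ to a set of nodes."""
    out = set()
    for n in nodes:
        out |= {n} | set(adj[n])
    return frozenset(out)


def g(X, Y, adj):
    """One pairwise merging step between the families X and Y."""
    joined = {f(a, adj) | f(b, adj)
              for a in X for b in Y if f(a, adj) & f(b, adj)}
    alone_x = {a for a in X if all(not (f(a, adj) & f(b, adj)) for b in Y)}
    alone_y = {b for b in Y if all(not (f(a, adj) & f(b, adj)) for a in X)}
    return joined | alone_x | alone_y


def C(families, adj):
    """Apply g to consecutive pairs of families."""
    out = [g(families[i], families[i + 1], adj)
           for i in range(0, len(families) - 1, 2)]
    if len(families) % 2:
        out.append(families[-1])    # assumption: unpaired family is kept
    return out


adj = {1: {2}, 2: {1}, 3: set()}
pair = g({frozenset({1})}, {frozenset({2})}, adj)
```

Each application of C halves the length of the sequence, which is why about log k applications are needed for k initial families.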


Now we are ready to present the parallel program
in which facts F1, F2 and F3 are implemented.
As usual we will write it as a set of rewriting
rules for computing agents.


Connected Components Program

1. {x0 :: a1...a...ak, x1 :: b1...b...bm} <=
      {x0 :: a1...fa∪fb...ak, x1 :: b1...b̸...bm}
      if fa∩fb ≠ nilset
2. {x :: a1...a...b...ak} <=
      {x :: a1...a∪b...b̸...ak}
      if a∩b≠nilset
3. {x0 :: a1...ak, x1 :: b1...bm, x :: .x0∪.x1} <=
      {x :: a1...ak b1...bm}
      if rules 1. and 2. cannot be applied
4. {x :: F({a1,...,ak})} <= {x :: .x0∪.x1,
      x0 :: F({a1,...,a⌈k/2⌉}),
      x1 :: F({a⌈k/2⌉+1,...,ak})}
5. {x :: F({a})} <= {x :: {a}}

where instead of <x,m>:: we write only x:: because
messages are not used. a,b,a1,...,ak,b1,...,bm are
finite sets of numbers (i.e., nodes of graphs).


6. Conclusions 


Through some examples we presented various 
kinds of facts for improving the efficiency of the 
parallel evaluation of functional programs. We also 
presented the structure of a system which is
essentially a <Calculus,Fact-translation> pair for
making those improvements in an automatic way. Facts
are to be accepted by the Calculus and then they
are used by the translation algorithm to produce
efficient parallel implementations of the functional
programs to which they refer. We think that more
work should be done in the direction we indicated 
here. In particular it would be important to de- 
velop a theory for the automatic improvements of 
functional programs and their implementations. 

In a sense we follow the program derivation
technique à la Burstall-Darlington, and we
provide a framework for overcoming some difficulties
encountered by those authors. In particular we
propose an intermediate "language of facts", used
by the programmer for writing some properties or
facts of the program to be improved. Facts accepted
by the calculus are indeed bound to increase
the efficiency, and in this way we can make sure,
contrary to the original Burstall-Darlington
approach, that efficiency is improved when new
versions are derived. Moreover the language of
facts gives the programmer freedom for expressing
intuitions or discoveries, while the "eureka steps"
à la Burstall-Darlington need to be expressed in
the recursive equation language used for programs.
Similarly, in our approach, the language of the new
versions of the program may be different from the
language of the old versions.
Therefore, efficiency improvements may also derive 
from better compilers one can construct for the 
language of the new versions. Our approach allows 
also for the incremental discovery of the facts to be 
incorporated into old program versions. Most of 
those facts can be interpreted as establishing 

some communications among computing agents, so that 
they work in a cooperative way while the computa- 
tion progresses. 
A NEW STRUCTURING MECHANISM FOR SUPPORT
OF SPATIALLY REDUNDANT DISTRIBUTED COMPUTATION 
Abstract--Spatial redundancy denotes
computation characterized by simultaneous
execution of redundant program elements on
separate computing channels. This paper
presents a mechanism called molecular activity
for the support of spatially redundant 
distributed computation. It is shown that 
molecular activities extend the domain of 
existing software fault-tolerance techniques to 
include applications constructed as sets of 
communicating processes and involving access to 
global data. Molecular activities also allow 
exploitation of high-level parallelism in a 
large class of applications, particularly in 
the area of artificial intelligence. 


I. Introduction 


Spatial redundancy is a term used to 
describe computation characterized by the 
simultaneous (or concurrent) execution of 
redundant program elements on separate 
computing channels. Some mechanism for 
reaching agreement among the redundant elements 
is employed to facilitate decisions regarding 
the results produced by the computation. 


The concept of spatial redundancy has been
employed to achieve tolerance of software
failures through a notion known as design
diversity and a technique called n-version
programming [1]. In a broader sense, spatial
redundancy could also be used to exploit 
parallelism in the solution of certain 
important types of problems. Many 
computationally difficult problems are 
characterized by the existence of alternative 
algorithms or heuristics for obtaining a 
desired solution. Examples include certain 
planning problems in the field of artificial 
intelligence, as well as various VLSI design 
problems involving heuristic search of a large 
solution space. Although it is known that one 
algorithm or heuristic may perform 
significantly better than another for a 
particular instance of the problem (in terms of 
execution time or solution optimality), there
is typically no way to determine a priori which
one will perform best. Spatial redundancy
could potentially be employed to carry out 
alternative strategies in parallel, with some 
mechanism being used to select the desired 
result (and perhaps preempt other 
still-executing alternatives). 


In a distributed system environment,
spatially redundant computation should ideally
allow a redundant program element to be 
structured as a set of communicating processes 
residing at various nodes of the system. This 
would allow for exploitation of concurrency and 
parallelism within the individual program 
elements. 


Thus, the implementation of spatial 
redundancy could lead to the existence of 
multiple distinct sets of communicating 
processes, each set representing one of the 
redundant elements. If it is allowed that 
several different distributed computations 
(each employing spatial redundancy) might exist
simultaneously in a distributed computing
system, the situation may become quite
complex. Clearly, some technique for 
structuring the interactions between these 
computations is necessary. 


In this paper we discuss mechanisms for 
structuring spatially redundant distributed 
computations. It will be shown that the 
conventional mechanism for enforcing data 
consistency among concurrent computations, the 
atomic action, is insufficient for supporting 
general spatially redundant distributed 
computation. New mechanisms known as atomic 
activities and molecular activities are 
proposed for this purpose. It is argued that 
these new mechanisms are necessary in order to 
extend the usefulness of existing techniques 
for structuring redundant computations, such as 
n-version programming, to cover a wide range of 
distributed applications. 

II. The Need for a New Structuring Mechanism 
Distributed computation is characterized by 
programmed actions on data objects that may be 
located at various nodes of a distributed 
system. Data objects accessed by a distributed 
computation can be classified into two 
categories: 


1. Local Data Objects--These are objects 
declared in, and local to, the program 
that specifies the computation. Their 
lifetime coincides with that of the 
computation. 


2. External (global) Data Objects--The
lifetime of these objects exceeds
that of the computation. They may
represent permanent or semi-permanent
information that exists independently
of the various computations which
access and manipulate them. The
collection of data objects in a
distributed database provides an
example of external data objects.


One of the fundamental differences between
local and external data objects lies in the
ability to statically determine which objects
will be accessed and/or modified by a
computation. Since local objects are declared
within, and belong to, a computation, it is
possible to enumerate all of the local objects
that will be accessed by the computation at the
time when the program specifying the 
computation is written. However, the external 
data objects to be accessed, and the nature of 
that access, may be determined, at least in 
part, during the execution of the computation. 


For instance, consider an application 
concerned with routing certain connections 
during the design of a VLSI circuit. This 
application must access a large database 
representing the layout of the circuit. 
However, the specific portions of the layout 
data to be traversed during an instantiation of 
the routing application, and the portions to be 
modified, can be determined only during the 
execution of the router. 


The basic mechanism for maintaining the 
consistency of external data objects operated 
upon by concurrent computations is the atomic 
action [2]. An atomic action is some
program-specified computation that reads or 
modifies the state of one or more data objects 
and appears as an indivisible operation from 
the point of view of computations outside the 
atomic action. 


Built-in atomic actions greatly simplify a 
programmer's task in coping with unplanned 
concurrency and failures. The programmer of an 
atomic action can write his software without 
bothering about other computations and failures 
that may affect the data that his program 
manipulates. In recent years, a number of 
distributed systems supporting atomic actions 
have been implemented. However, the 
introduction of spatial redundancy into
distributed computation introduces new problems 
in consistency maintenance that cannot be 
addressed by the simple use of atomic actions. 


Consider an example of two distinct 
computations, C1 and C2, in a distributed
system that may access and modify common 
external data objects. Let us suppose that 
those two computations are entirely logically 
independent and that they may reach execution 
concurrently (perhaps on different nodes of the 
system). For example, in the VLSI design 
example mentioned earlier, they might represent 
different design operations being carried out
upon the database representing a partially 
completed circuit. Structuring each of these 
computations as an atomic action would allow 
them to execute in a correct and serializable 
fashion. However, suppose that design 
diversity--for instance, in the form of 
n-version programming--was used to implement 
each of the computations in order to enhance 
software reliability. 


Now, consider the atomic action
representing the spatially redundant
computation C1. Note that the external data
objects used by each of the n versions must be
copied into a local data space for that
version before execution, and modifications are
made to the local copy. This is done because
the versions must execute in isolation, and
hence must not independently modify the actual
external environment. At the conclusion of the
execution of the n versions, a vote is taken on
the modifications to (copies of) external
objects, and based upon the outcome, the actual
modifications are made to external data
objects.
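

The copy, execute, vote and commit scheme just described can be sketched as follows. This is a simplified illustration and not an implementation from the paper: the versions are run one after another rather than concurrently, the external data space is a plain dictionary, a strict majority is assumed as the voting rule, and the name run_n_versions is hypothetical.

```python
import copy
from collections import Counter


def run_n_versions(external, versions):
    """Run each version on a private copy, then commit the majority result."""
    results = []
    for version in versions:
        local = copy.deepcopy(external)    # isolation: work on a local copy
        version(local)
        results.append(local)
    # Vote on the modified copies; a state is compared by its sorted items.
    ballots = Counter(tuple(sorted(r.items())) for r in results)
    winner, votes = ballots.most_common(1)[0]
    if votes * 2 > len(versions):          # assumed rule: strict majority
        external.clear()
        external.update(dict(winner))      # commit to the external objects
        return True
    return False                           # no agreement, nothing committed


data = {"x": 1}
ok = run_n_versions(data, [
    lambda s: s.update(x=s["x"] + 1),
    lambda s: s.update(x=s["x"] * 10),     # a faulty version
    lambda s: s.update(x=s["x"] + 1),
])
```

The sketch makes the first problem listed below visible: the whole external state is duplicated once per version, whether or not a version touches it.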


There are a number of problematic aspects 
to this scheme, including: 


1. It may not be possible to determine a 
priori which external objects will be 
accessed by a version of the 
computation. Even if it is possible, 
the number may be too large to 
effectively or efficiently copy into n 
different local data spaces. For 
example, duplicating the huge state 
space of a large artificial 
intelligence problem might not be 
practical. (Note that other schemes, 
such as the use of logging in each of 
the versions in lieu of copying 
external state space into local states, 
are possible, but these suffer from 
similar problems.) 


2. Since the n versions may be executed 
concurrently on distinct processing 
nodes, and since external data objects 
may also be distributed among nodes, a 
run-time structure must be constructed 
to coordinate and synchronize the 
copying of external objects to the 
local data spaces, the final voting 
upon results, and the copying of 
results back to the external data space 
of the global atomic action for C1.
This structure is over and above that 
required for the support of atomic 
actions. 

III. Atomic and Molecular Activities 
In this section we propose a new 
structuring mechanism, called a molecular 
activity, intended to deal with problems of the 
type illustrated in the above example. This 
mechanism provides a general way of structuring
spatially redundant distributed computations
that access external data objects. As such, it 
is applicable not only to n-version programming 
problems of the sort described above, but also 
to supporting other constructs for design 
diversity, such as the parallel execution of 
recovery blocks or conversations [3-5]. It is
also appropriate for supporting the parallel 
execution of alternative strategies in order to 
speed up the solution of computationally 
difficult problems. The mechanism 
automatically manages the consistency of 
external objects accessed by spatially 
redundant computations without requiring that 
the objects be identified in advance. In 
addition, the mechanism provides the support 
for committing final results of the spatially 
redundant computation. 


We define an atomic activity as a 
program-specified computation whose 
computational steps are executed such that the
intermediate states of data objects manipulated
by the atomic activity are not visible to any
computation outside the atomic activity. An
atomic activity that constitutes a molecular 
activity may be specified by a sequential or 
concurrent program, and hence, the computation 
of an atomic activity may be represented by 
either a single sequential process or a set of 
interacting processes. 


A molecular activity is defined as a 
collection of one or more atomic activities 
that may execute concurrently such that, from 
the point of view of computations outside a 
molecular activity, it appears as an 
indivisible operation whose effect on the state 
of the data objects in the system reflects the 
actions of at most one of its component atomic 
activities. The indivisibility of molecular 
activities implies that the concurrent 
execution of any two molecular activities is 
equivalent to the sequential execution of the 
two molecular activities in some order. Thus, 
during the execution of a molecular activity, 
objects whose values are read by steps in the 
atomic activities constituting a molecular 
activity cannot be modified by any computation 
outside the molecular activity. Also, 
intermediate states of data objects that arise 
during the execution of a molecular activity 
will never be observed by computations outside 
the molecular activity. In fact, even for the 
computations within the molecular activity, 
modifications made to the data objects by one 
atomic activity will not be visible to other 
atomic activities constituting the molecular 
activity. 


Let S_M denote the set of data objects
accessed by the computations in the molecular
activity M. Let A = {A1, A2, ..., An} denote
the set of atomic activities defining the
molecular activity M. The initial state I of M
is defined by the values of the data objects in
S_M before they were modified by any computation
within M. The final state F of M is defined by
the values of the elements in S_M after the
completion of the molecular activity M.


From the definition above, it is clear that 
the final state F of M is determined by the 
transformation on I performed by the 
computation of at most one of the atomic 
activities in A. Now, the criterion for 
determining which, if any, of the atomic activities in A
is to be chosen to reflect the execution of the 
molecular activity M is application dependent. 
It is specified by a set of user-defined 
procedures encapsulated within a construct 
called a decision module, which is associated 
with each atomic activity of the molecular 
activity M. 
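

The observable behaviour of a molecular activity, as defined above, can be sketched in a few lines. This is a toy model under stated assumptions and not the mechanism of [6]: the data objects are a single dictionary, each atomic activity is a function run in its own thread on a private copy of the initial state I, and the decision module is reduced to one user-supplied function decide that returns the index of the chosen atomic activity or None. Protection against computations outside the molecular activity is not modelled. All names are hypothetical.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def run_molecular_activity(state, atomic_activities, decide):
    """Commit the effect of at most one atomic activity to state."""
    initial = copy.deepcopy(state)             # initial state I

    def run(activity):
        local = copy.deepcopy(initial)         # isolated from the others
        activity(local)
        return local

    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(run, atomic_activities))

    chosen = decide(initial, candidates)       # decision module (assumed form)
    if chosen is not None:
        state.clear()
        state.update(candidates[chosen])       # final state F
    return chosen


def cheapest(initial, candidates):
    """Pick the candidate with the lowest cost, if it improves on I."""
    best = min(range(len(candidates)), key=lambda i: candidates[i]["cost"])
    return best if candidates[best]["cost"] < initial["cost"] else None


layout = {"cost": 100}
picked = run_molecular_activity(
    layout,
    [lambda s: s.update(cost=70), lambda s: s.update(cost=40)],
    cheapest,
)
```

Here the two atomic activities play the role of alternative strategies; only the cheaper result becomes visible in layout, and if no candidate were acceptable the state would stay equal to I.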


The informal definition of a molecular 
activity given above provides only an abstract 
description in terms of the effect of its 
execution on the state of the system and 
ignores details of the internal structure of a 
molecular activity. A more formal model of 
atomic activities and molecular activities is 
presented in [6]. 


IV. Molecular Activities--Implementation
Issues


The implementation of a system supporting 
molecular activities must satisfy the following 
requirements. 


R1: Let M_I denote the initial state of a
molecular activity M that is composed of
atomic activities A1, A2, ..., An. Then,
the concurrent execution of A1, A2, ..., An
should be such that each atomic activity Ai
in M is oblivious to the transformations on
M_I made by any other atomic activity of M.
We call this the isolation requirement.


R2: After the completion of the molecular 
activity M, the final state M_F of M should 
reflect the transformations on M_I made by 
at most one of the atomic activities of M. 
We call this the global data consistency 
requirement. 


R3: The transformation from the initial 
state, M_I, to the final state, M_F, of the 
molecular activity M appears as an 
indivisible and primitive step to all 
computations outside M. Requirement R3 can 
be refined as follows: 


R3a: During the execution of the 
molecular activity M, the intermediate 
states of objects accessed by 
computations in M should not be visible 
to computations outside M. 


R3b: If A_1, A_2, ..., A_n represent 
atomic activities belonging to 
different molecular activities, the 
concurrent execution of A_1, A_2, ..., A_n 
must be equivalent to the sequential 
execution of A_1, A_2, ..., A_n taken in 
some permutation. 


To satisfy requirements R1 and R3 we need a 
mechanism that controls access to data 
objects in the system. We call this 
mechanism the synchronization mechanism. 

To ensure requirement R2 we need another 
mechanism, which we call the commitment 
mechanism. The implementation of these 
mechanisms in a distributed system is 
presented in [6]. 


V. Concluding Remarks 


In this work we have described a mechanism, 
called a molecular activity, for the support of 
spatially redundant distributed computation. 
This mechanism allows each redundant element to 
be constructed as a set of distributed 
communicating processes known as an atomic 
activity. Multiple molecular activities can 
execute simultaneously in a distributed system, 
and they can access and modify a common global 
data space without loss of consistency. 


Molecular activities allow the domain of 
application of existing constructs for 
achieving software fault tolerance to be 
extended to include applications constructed as 
sets of communicating processes and involving 
access to global data. No methods have 
previously existed for allowing n-version 
programs to share global data, and no explicit 
work has addressed the viability of versions 
constructed as sets of communicating processes 
in a distributed environment. With respect to 
other proposed constructs for design diversity 
[4,5], molecular activities can also be used to 
support parallel execution of the interacting 
sessions of a conversation. Details are given in 
[6]. 


The potential use of spatial redundancy to 
allow simultaneous application of alternative 
algorithms or heuristics for solving a 
computationally difficult problem is largely 
unexplored. An extended example of this 
approach applied to a planning problem in 
artificial intelligence is given in [6]. The 
example demonstrates the viability of this 
technique. Spatial redundancy employed in this 
fashion exploits parallelism at a very high 
level. The molecular activity construct allows 
each activity to consist of a set of 
distributed processes. Hence, the use of 
spatial redundancy does not preclude the 
exploitation of parallelism within individual 
versions of the algorithm or heuristic. 


Abstract: This paper describes a technique for retargetting Poker, 
the first complete parallel programming environment, to new parallel 
architectures. The specifics are illustrated by describing the retarget 
of Poker to CalTech’s Cosmic Cube. Poker requires only three features 
from the target architecture: MIMD operation, message passing 
inter-process communication, and a sequential language (e.g. C) for 
the processor elements. In return Poker gives the new architecture a 
complete parallel programming environment which will compile Poker 
parallel programs, without modification, into efficient object code for 
the new architecture. 


Introduction 


Software portability for sequential computers became an issue in the 
early sixties as higher-level languages began to supplant the machine 
specific assembly languages and as machine varieties proliferated; it 
was a much harder problem than originally supposed and it remains 
a serious problem today. By comparison, portability of a parallel 
program should be substantially more difficult because: 


• architectural differences among parallel computers are much more 
fundamental than among sequential machines, and 


• the characteristic that makes portability difficult (the dependence 
of programs on machine specific features) arises often in 
parallel computation in order to get good performance. 


We say “should be more difficult” because to date there is little expe- 
rience: There is little production parallel software and there are few 
truly parallel languages, and few parallel machines. But in spite of 
the potential problems, there is some reason for optimism. 

The Poker[1] language has been retargetted from the Pringle Par- 
allel Computer [2] to the CalTech Cosmic Cube [3]. Thus, programs 
written for the CHiP [4] family of computers can run on one of the 
cube family of architectures without modification. This is possible 
because 


• the Poker language uses a reasonably universal program abstraction, 


• Poker programs have a (unique) structure that is both visible and 
simple, and 


• the Poker language and environment is structured to make retargetting 
simple. 


There is no impediment to porting the Poker language to other parallel 
computers as this paper will explain. 

The benefit of portable parallel software is obvious: Programs can 
be written without regard for the underlying architecture. However, 
portability only guarantees that programs will function. With Poker, 
we are making a stronger claim: Poker programs will run with an 
efficiency that is comparable to that of programs which were specifically 
written for that architecture. This leads to a key point about 
the retargetable Poker parallel programming environment: 

    Poker requires a small set of system functions of the host 
    architecture and can thus serve as the definition of the basic 
    software support required of a new parallel computer. 


Simply by creating the few basic interface systems that Poker requires, 
an architecture automatically inherits the available Poker programs 
and a complete software environment. This vastly reduces the software 
development efforts for new parallel machines. 


†Supported in part by National Science Foundation Grant DCR-8416878 
and by Office of Naval Research Contract No. N00014-85-K-0328. 


Overview 


Before explaining the details of the retarget, some familiarity with 
the Poker system must be acquired. Towards this end we present first 
a high level view of the Poker system pointing out some of the key com- 
ponents and their interactions. Later, we discuss the relevant software 
and hardware pieces, devoting a section each to: the Poker program- 
ming environment, the Cosmic Cube, and the new cross-compiler. 
Finally, we connect all the pieces and discuss the effects of the new 
extended system. 

Figure 1 shows the relationship between the Poker Parallel Pro- 
gramming Environment and the parallel computers which can be pro- 
grammed with it. Poker is a sequential system that makes extensive 
use of a relational database to represent programs. It is written in 
C [5] and built on top of UNIX(a). 

To see how Poker works, focus on two components of the system: 
the cross-compiler and the debugging environment. The cross-compiler 
accepts a Poker program as input and produces an object 
version suitable for execution on one of three machines: the Pringle, 
the Cosmic Cube, or a simulator/emulator of a parallel machine that 
runs on whatever sequential machine is hosting Poker itself. The com- 
piled version is then down-loaded to the target parallel computer and 
executed. During execution, the run-time support of the machine 
sends tracing information back to Poker’s debugging environment so 
that the programmer can view the execution of the program within 
the Poker environment. 

Thus, one writes and runs programs from an environment that pro- 
vides flexible interactive graphic support and a view of the program 
consistent with its definition. During the program debugging activ- 
ity there is communication between the front-end processor and the 
back-end parallel machine to facilitate program tracing. When the 
program is debugged, no tracing is requested and Poker serves as the 
operating system for the back-end machine, running the program “flat 
out.” 

Thus, Poker is both a language and an environment: A sequential 
system from which to compile, execute, and debug programs on 
parallel computers. A “port” of the Poker language entails retargetting 
the cross-compiler and constructing the run-time software to 
support the communication between the Poker environment and the 
parallel computer. Only the Poker programs and run-time system 
get ported to the new machine; the Poker environment runs on a 
sequential computer.(b) This paper will detail the activities required in 
crossing the vertical line of Figure 1. 


The Poker Programming Environment 


The Poker programming environment is built around a program- 
ming abstraction that is common to non-shared memory parallel al- 
gorithms, as described in the next section. However, Poker’s interface 
to the abstraction is quite non-standard. The section on Concrete 
Poker Programs outlines the structure and semantics of Poker pro- 
grams. As we shall see, this formulation of parallel algorithms lends 
itself to easy retargetting for new parallel architectures. 


Abstract Poker Programs. 


A non-shared memory parallel algorithm is conceptualized as a finite 
graph. The graph’s vertices are labeled with process names corresponding 
to sequential programs. The graph’s edges are of two types: 
Edges between vertices are communication paths between processes, 
and edges “dangling” off of the graph are channels through which 
streams of data pass to or from the graph. (Technically, graphs cannot 
have dangling edges, of course, but it is convenient to abuse the 
definition.) 


(a) UNIX is a trademark of AT&T Bell Laboratories. 

(b) Currently, the Poker environment runs on a number of computers, including 
Vaxes, Vax-stations, Sun workstations, IBM PC/RTs, and HP-9000s. 


Figure 1: An Overview of the Poker System. 


tree: parent <- max(myValue, leftValue, rightValue) 
leaf: parent <- myValue 

Figure 2: The Maximum Algorithm. 

For example, Figure 2 shows the maximum algorithm. The graph 
is a binary tree. The vertices are labeled with one of: the name of 
the leaf process, which passes its local value to its parent, or the 
name of the tree process, which receives two values from its children, 
finds the maximum of these values and its local value, and passes the 
result to its parent. The dangling edge is a stream containing a single 
output value produced at the root. 
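
The two rules of Figure 2 can be simulated sequentially in C. This is only an illustrative sketch: tree_max is a hypothetical name, the tree is stored in an array with the children of node i at positions 2i+1 and 2i+2 (an assumption of the sketch, not of Poker), and recursion stands in for the messages that the processes would exchange.

```c
#include <assert.h>

/* Returns the value that node i would send to its parent.
 * value[] holds the local value of each of the n nodes. */
int tree_max(const int value[], int n, int i)
{
    int left = 2 * i + 1;
    int right = 2 * i + 2;
    int my_value = value[i];

    if (left >= n)                   /* leaf: parent <- myValue */
        return my_value;

    /* tree: parent <- max(myValue, leftValue, rightValue) */
    int left_value = tree_max(value, n, left);
    if (left_value > my_value)
        my_value = left_value;

    if (right < n) {
        int right_value = tree_max(value, n, right);
        if (right_value > my_value)
            my_value = right_value;
    }
    return my_value;
}
```

Calling tree_max(value, n, 0) yields the single output value that the root would place on the dangling edge.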

A problem is not generally solved by a single algorithm of this 
form. Rather this form is usually one “phase” of a computation. The 
problem is partitioned into a series of phases, each possibly with a 
different graph structure. Inter-phase communication is permitted 
only among those vertices that share the same “location” as described 
by a one-to-one mapping function. Values from a previous phase may be 
inherited by the process executing at the corresponding vertex in the 
next phase. 


Concrete Poker Programs. 


Unlike a C or Pascal program, a Poker phase program is not a 
monolithic piece of text. Rather, it is composed of five components 
that correspond closely to the abstraction just presented (Figure 3 
shows the five components of the maximum algorithm encoded as a 
Poker phase program). 


• Communication Graph: A finite graph with dangling edges. The 
boxes correspond to processes and thus will be the vertices of the 
graph. The circles, which are switches for the CHiP computer 
[4], can be ignored for the present discussion. 


• Process Definition: A (usually small) set of processes written in 
a sequential language, in this case a slightly modified version of 
C [5]. These can be thought of as standard procedures with formal 
parameters called at the beginning of a phase. 


• Process Assignment: A labeling of the vertices of the graph with 
the names of the processes and actual parameters, if any, to be 
executed at that processing site. 


• Port Name Assignment: A labeling of the edges of the graph, 
but from the point of view of the vertex, i.e. each edge has two 
names, one for each end, or “port”. 


• Stream Name Assignment: A labeling of the dangling edges of 
the graph giving names to the input and output streams; these 
names will subsequently be bound to files. 


The correspondence of the five Poker program components to the 
phase abstraction given above should be clear. 

A Poker program is composed of a finite set of phase programs 
together with an execution scheduler that describes the sequence in 
which they are to be invoked. Those vertices of different phases which 
occupy the same location in the Switch Setting View may communicate 
across phases through the use of inter-phase variables. 

Inter-phase variables live for the duration of the Poker program 
and exist separately from the variables declared local to a phase. The 
local process codes may access the values of inter-phase variables by 
the import and export statements: 


import local from inter-phase 
export local to inter-phase 


Import copies the value of the inter-phase variable into the local vari- 
able. Export copies the value of the local expression into the inter- 
phase variable. This is the only way to pass information between 
phases. It provides a well-defined interface to inter-phase commu- 
nication and lets process codes load from and store to inter-phase 
memory during the course of computation. 

Two other features of Poker C are the trace list and inter-process 
communication. The optional trace list found in the declaration sec- 
tion of each routine specifies the variables whose values will be traced 
in the debugging environment, if a traced execution of the program 
is requested. Trace “variables” are not restricted to the conventional 
“variables” of Poker C. They may include variables, labels, and other 
items such as the current procedure and depth of recursion. 

Process input and output, respectively, are given by expressions of 
the form: 

variable <- port 
port <- variable 


The values transmitted are simple scalar values from the language, and 
arrays and structures not containing pointers. Messages are tagged 
with their type so that process I/O may be type checked using struc- 
tural type equivalence. 
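
The idea of tagging each message with its type can be sketched in C as follows. This is an assumption-laden illustration, not Poker's actual message format: the names TypeTag, Message, make_int_message, make_float_message and receive_int are hypothetical, and only two scalar types are covered, whereas Poker also checks arrays and structures by structural equivalence.

```c
#include <assert.h>

typedef enum { TAG_INT, TAG_FLOAT } TypeTag;

typedef struct {
    TypeTag tag;                       /* type of the payload */
    union { int i; float f; } payload;
} Message;

/* port <- expression, for an int expression */
Message make_int_message(int v)
{
    Message m;
    m.tag = TAG_INT;
    m.payload.i = v;
    return m;
}

/* port <- expression, for a float expression */
Message make_float_message(float v)
{
    Message m;
    m.tag = TAG_FLOAT;
    m.payload.f = v;
    return m;
}

/* variable <- port, for an int variable.
 * Returns 0 on success, or -1 if the message type does not match;
 * on a mismatch the variable is left unchanged. */
int receive_int(const Message *m, int *variable)
{
    if (m->tag != TAG_INT)
        return -1;
    *variable = m->payload.i;
    return 0;
}
```

A receiver that expects an int and is handed a float message thus gets an error instead of silently reinterpreting the data.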


The Programming Environment. 


The Poker system is the set of facilities that assist the programmer 
in writing and running Poker programs. Poker uses two displays: One 
is a bit-mapped display having the general form shown in Figure 4 and 
used for interactive graphical programming of the parallel aspects of 
the program. The second terminal is used with a standard editor to 
write the sequential C process text. 

Poker stores programs as a database, displaying the program in 
one of several views [6]. Figure 4 shows the display for the Switch 
Setting View; the other views are analogous, with the appropriate 
Poker program constituent displayed in the lower half of the screen. 
The seven views are 


• Switch Setting View: Used to define the Communication Graph 
(Figure 3); the programmer uses a mouse or keypad to draw a 
picture of the graph by connecting the boxes with lines. 


• Code Names View: Used to define the Process Assignment 
(Figure 3); the programmer moves from box to box entering the 
names of the process and actual parameters, if any. 


• Port Names View: Used to define the Port Name Assignment 
(Figure 3); the programmer moves from box to box entering the 
names of the ports. 


• IO Names View: Used to define the Stream Name Assignment 
(Figure 3); the programmer enters the names of the streams and 
their directions. 


• Command Request View: Used to compile, assemble, link, load, 
etc. Poker programs. The bottom of the display shows the progress 
of the computation. 


• Trace View: Used to display the execution of the Poker program; 
the programmer can start, stop, and single step the program 
while watching the successive changes of the traced variables. 


• CHiP Parameters View: Used to describe the logical CHiP architecture 
being programmed. This includes parameters such as 
the number of processors in the processor grid (16 in Figure 3). 


[Figure 3 panels: Communication Graph (graphic not reproduced) and 
Process Definition, shown below] 

    code tree; 
    trace myValue; 
    common myValue; 
    ports parent, left, right; 
    begin 
        int myValue, value; 

        value <- left; 
        if (value > myValue) then 
            myValue := value; 

        value <- right; 
        if (value > myValue) then 
            myValue := value; 

        parent <- myValue; 
    end. 

Figure 3: The five parts of a Poker program. 

Figure 4: The Poker Display showing the Switch Setting View. 


Notice that the display of most of the program constituents includes 
information defined in other program views; two examples (see 
Figure 3) are the graph shown in the Code Names View (Process Assignment), 
which is derived from the graph defined in the Switch Setting 
View (Communication Graph), and the material in the right half of 
the table of the IO Names View (Stream Name Assignment), which is 
gathered from the other views. 


The Cosmic Cube 


The Cosmic Cube [3] provides flexible processing facilities that eas- 
ily adapt to hosting the more structured Poker abstraction. Each 
Cosmic Cube program is also a graph: a set of processes, with non- 
shared memory, communicating through logical links mapped onto the 
underlying hardware [7]. However, instead of the static graphs of the 
phase program found in Poker, the Cosmic Cube programs consist of 
a single, perhaps dynamically changing, graph. Processes may create 
new communication links (edges), if the creator knows the address(c) 
of the other process, or destroy old links in order to build the best 
graph for the moment. This gives the Cosmic Cube programs more 
flexibility than we need. 

This abstraction is implemented on a MIMD non-shared memory 
computer with a binary n-cube interconnection; the processors and 
their local memory, or “nodes”, sit at the corners of the n-dimensional 
cube while the edges of the n-cube are formed from physical wires con- 
necting the node processors. Each node may host zero or more pro- 
cesses, living in separate address spaces, communicating by message- 
passing over the physical wires connecting adjacent nodes. The oper- 
ating system automatically forwards messages between non-adjacent 
nodes, preserving the message order between the sending and receiving 
processes. 
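
The structure of a binary n-cube can be made concrete with a short C sketch. If nodes are labeled with n-bit numbers, two nodes are joined by a wire exactly when their labels differ in one bit. The forwarding rule shown here (flip the lowest-order differing bit at each hop) is one common scheme for hypercubes and is purely an assumption of the sketch; the text does not say how the Cosmic Cube operating system chooses routes. The names bit_count, is_adjacent, next_hop and hop_count are hypothetical.

```c
#include <assert.h>

/* Number of 1 bits in x. */
static int bit_count(unsigned x)
{
    int count = 0;
    while (x != 0) {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
}

/* Two nodes are wired together iff their labels differ in exactly one bit. */
int is_adjacent(unsigned a, unsigned b)
{
    return bit_count(a ^ b) == 1;
}

/* Next node on the way from current to dest: flip the lowest-order
 * bit in which the two labels differ (assumed routing rule). */
unsigned next_hop(unsigned current, unsigned dest)
{
    unsigned diff = current ^ dest;
    if (diff == 0)
        return current;                    /* already there */
    return current ^ (diff & (~diff + 1u)); /* isolate lowest set bit */
}

/* Number of physical wires a message crosses under that rule. */
int hop_count(unsigned src, unsigned dest)
{
    int hops = 0;
    while (src != dest) {
        src = next_hop(src, dest);
        hops++;
    }
    return hops;
}
```

Under this rule a message between non-adjacent nodes crosses as many wires as there are differing bits in the two labels, so no path in an n-cube is longer than n hops.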

A Cosmic Cube program starts as a single process on the host 
machine(d) connected to one corner of the Cosmic Cube. This host 
process “spawns” the processes in the initial program graph, placing 
them on the nodes of the cube and establishing the graph edges by 
forwarding process addresses to the processes on the cube. These 
processes may then spawn more processes of their own or create new 
communication links by sending known process addresses to other 
processes. 


(c) On the Cosmic Cube a process’ address consists of two numbers: the number 
of the physical node in which it resides, and its process number on that node. 
(d) A typical host machine is a Sun Microsystems workstation. 

Typically, process codes are written in Cosmic Cube C, a version 
of the C programming language [5] extended with calls to routines 
in the Cosmic Cube’s operating system, or one of the other extended 
sequential languages supported for the Cosmic Cube. 


The Poker to Cosmic Cube Cross-Compiler 


Retargetting the Poker language and porting the run-time system 
to the Cosmic Cube changed nothing in the existing Poker environ- 
ment; neither the Poker programs nor the Poker environment needed 
modification (process codes written in XX [2] are preprocessed into 
Poker C). Instead we only needed to (1) retarget the cross-compiler to 
translate Poker C process codes into a single Cosmic Cube C program, 
using the information in the database to determine the configuration 
of the resultant graph, and (2) extend the run-time software on the 
Cosmic Cube to include a few routines interfacing to, and control- 
ling, the Cosmic Cube program. This Cosmic Cube C program from 
the translator is then treated as any other Cosmic Cube C program, 
compiling, loading, and executing it using the facilities provided for 
the Cosmic Cube. The difference is that the extra interface routines 
know about the Poker system, so that the user of Poker need not 
know which underlying architecture is executing the user program — 
no matter which target architecture is used, the creation and execu- 
tion of the program will be the same. This retarget and run-time 
system port is the topic of the remainder of this paper. 


Converting Poker C to Cosmic Cube C. 


The first task of the cross-compiler is to convert the Poker C process 
codes into a form acceptable for the processors of the target architec- 
ture. There are at least three ways to support a sequential language 
on the processors of a new architecture. One is to directly generate 
object code for the target machine’s processor elements. A second 
method is to provide a kernel that runs on the processors of the new 
architecture and interprets the sequential language (or an interme- 
diary language derived from it). This method could easily support a 
number of interpretive languages such as LISP [8] and Prolog [9]. The 
third approach is to translate the source process codes into a language 
already supported on the new architecture. This last approach works 
best when the languages are similar in structure, as is the case with 
Poker C and Cosmic Cube C. 

Translating one sequential language into another, source-to-source 
translation, is well understood. In our case, the only major translation 
problems come from tracing the variables in the trace list and from 
reducing all of the Poker phase graphs into one more general and less 
structured Cosmic Cube graph. 

Trace variables are handled by the first part of the cross-compiler 
which uses a Yacc [10] based C-to-C compiler to insert calls to a 
run-time trace routine after every instance of a traced variable. The 
C-to-C compiler does not look for aliases; instead it only inserts a 
trace command after each assignment whose left-hand side is an instance 
of a literal in the trace list. If aliases are a concern, Poker will, 
at the user’s discretion, insert a special trace call after each suspect 
assignment, exhaustively checking for any changes in the traced vari- 
ables. Exhaustive traces are expensive at run time but, presumably, 
a great benefit for debugging. 

The C-to-C compiler also replaces instances of the Poker’s inter- 
process communication statements 


variable <- port 
port <- expression 


with calls to the Cosmic Cube send() and receive() routines. 


The C-to-C compiler is more complex than a simple pre-processor 
such as UNIX’s cpp. To see why, consider tracing a variable that is 
modified twice in a single expression. We want to trace both changes, 
but to capture both values of the traced variable we have to insert 
trace calls into the expression. We cannot simply append a trace call 
onto the expression since the value of the expression may be needed 
for an assignment or conditional test. Instead, we have to store the 
expression value in a temporary variable, trace the changed trace vari- 
ables, and then append the temporary variable to return the expres- 
sion value. 

For instance, given the variables 


    float f; 
    int i, j; 


and a request to trace f and j, the C-to-C compiler converts the 
conditional expression 


    (((f += 4.3) < 10) ? 
        f++ + j++, i-- : 
        f--) 


into 


    (((_tempint = ((f += 4.3) < 10)), _Trace(&f, FLOAT), 
      _tempint) ? 

        f++ + j++, _Trace(&f, FLOAT), _Trace(&j, INT), i-- : 

        ((_tempfloat = f--), _Trace(&f, FLOAT), _tempfloat)) 


where _Trace is a trace routine taking a pointer to a value and a 
constant telling it the type of the value, and _tempint and _tempfloat 
are reserved variables declared in the enclosing routine. 

In the first line, _tempint determines which of the next two expres- 
sions to execute: the comma expression before the colon, or the simple 
expression after the colon. We have to insert a trace of f immediately 
after the < expression, since if we waited to trace the value of f until 
after executing the entire conditional expression, we would miss the 
first value of f. 

For the same reason we needed _tempint (the value of the expression 
is used outside of the expression), we also need to use _tempfloat 
to save the value of the conditional expression before tracing f’s new 
value. 

Clearly, this is not a simple textual substitution. Before we insert a 
temporary variable, we have to know the type of the expression. This 
requires keeping a symbol table, to hold the types of the variables, and 
using a parse tree to calculate the types of expressions. In other words, 
the C-to-C compiler is truly a compiler even though the languages it 
consumes and generates are nearly identical. 
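
The instrumented expression above can be checked with a small compilable C sketch. Several details are assumptions made only for the sketch: the trace routine is called trace_value and merely counts its calls, the temporaries are named tmp_int and tmp_float (leading underscores are avoided because such names are reserved in standard C), the variables are passed by pointer so the caller can inspect them afterwards, and a (void) cast is added to silence compiler warnings about the discarded sum.

```c
#include <assert.h>

enum { TRACE_INT, TRACE_FLOAT };

int trace_calls;                 /* number of trace records produced */

/* Stand-in for the run-time trace routine: only counts the calls. */
static void trace_value(void *address, int type)
{
    (void)address;
    (void)type;
    trace_calls++;
}

static int   tmp_int;            /* holds the value of the test */
static float tmp_float;          /* holds the value of f-- */

/* Instrumented form of  ((f += 4.3) < 10) ? (f++ + j++, i--) : f--
 * with f and j traced. */
float traced_conditional(float *f, int *i, int *j)
{
    return ((tmp_int = ((*f += 4.3f) < 10)),
            trace_value(f, TRACE_FLOAT),
            tmp_int)
        ? ((void)((*f)++ + (*j)++),
           trace_value(f, TRACE_FLOAT),
           trace_value(j, TRACE_INT),
           (*i)--)
        : ((tmp_float = (*f)--),
           trace_value(f, TRACE_FLOAT),
           tmp_float);
}
```

Starting from f = 1.0 the test succeeds, so three trace records are produced (f after the +=, then f and j after the increments); starting from f = 7.0 the test fails and two records are produced. In both cases the value of the whole expression is the same as that of the uninstrumented one.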

The end result of the C-to-C compiler is two code segments from 
each of the Poker C process codes. One segment contains information 
about the inter-phase variables imported to, and exported from, that 
process code. The other segment contains the routines defined for that 
process code. These code segments are used by the cross-compiler, as 
discussed in the next section. 


Compiling the Poker Database. 


The second part of the cross-compiler combines the pieces of the 
newly translated code with the rest of the information in the Poker 
program’s database to create a single Cosmic Cube C program. The 
first problem here is to figure out how to collapse the phases of the 
Poker program into one graph, while maintaining efficient inter-phase 
communication(e). This is easily achieved by the technique suggested 
in Figure 5. When the phases are stacked as in Figure 5 the processes 
above one another are exactly those processes that share inter-phase 
variables. Combining them into one “aggregate” process keeps the 
Poker processes in a single process space. Inter-phase variables are 
globally declared for that process so that inter-phase communication 
requires no extra computation. 


(e) Keeping efficient inter-process communication is a much more complex issue, 
discussed in the Results. 


Building these aggregate processes is done as follows. The cross-compiler 
computes the inter-phase data space of each aggregate process 
from the inter-phase variables required by each of the Poker processes 
being placed into the aggregate process. The main() routine 
of each aggregate process is simply an infinite loop containing a call 
to the controlling host process asking for the number of the phase to 
execute, followed by a switch statement to call the main routine of 
the appropriate phase (see Figure 6). Since Poker C does not allow 
externally declared variables, the only potential source of name conflicts 
among different phases comes from the routines, typedef’s, and 
externally defined structures and unions. Prefixing these names with 
a unique phase ID eliminates any possible name conflicts. Linking 
the main() code with the routines from the phases and routines supporting 
the calls to receive(), send(), and the like, results in the 
Cosmic Cube C code for the aggregate process. 

The last problem with the aggregate processes is mapping the edges 
of the different phases onto one graph. Since the Cosmic Cube does 
not have any restriction on the degree of the vertices (processes) in 
its program graph, we can project all of the phase graphs onto one 
graph. In actuality, each aggregate process gets a different logical 
interconnection map for each phase. The map holds the eight pairs 
of (process address, port number) that correspond to the terminal 
ends of the wires connected to the ports of the process for that phase. 
Since process addresses are known only after the processes are placed 
on the Cosmic Cube nodes, these tables are initialized at run time. 
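
One possible layout for such a per-phase interconnection map is sketched below in C. The type and function names (ProcessAddress, EdgeEntry, EdgeMap, clear_map, connect_port, lookup_port) and the two-phase limit are hypothetical; only the eight (process address, port number) pairs per phase and the two-number process address come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PORTS  8     /* eight (address, port) pairs per phase */
#define NUM_PHASES 2     /* hypothetical program with two phases */

typedef struct { int node; int process; } ProcessAddress;

typedef struct {
    ProcessAddress peer;   /* process at the other end of the wire */
    int peer_port;         /* port number at that end */
    int connected;         /* 0 if this port is unused in this phase */
} EdgeEntry;

typedef struct { EdgeEntry edge[NUM_PHASES][MAX_PORTS]; } EdgeMap;

/* Marks every port of every phase as unconnected. */
void clear_map(EdgeMap *map)
{
    memset(map, 0, sizeof *map);
}

/* Filled in at run time, once process addresses are known.
 * Returns 0 on success, -1 if phase or port is out of range. */
int connect_port(EdgeMap *map, int phase, int port,
                 ProcessAddress peer, int peer_port)
{
    if (phase < 0 || phase >= NUM_PHASES || port < 0 || port >= MAX_PORTS)
        return -1;
    map->edge[phase][port].peer = peer;
    map->edge[phase][port].peer_port = peer_port;
    map->edge[phase][port].connected = 1;
    return 0;
}

/* Where does this port lead in the given phase?
 * Returns NULL if the port is out of range or unconnected. */
const EdgeEntry *lookup_port(const EdgeMap *map, int phase, int port)
{
    if (phase < 0 || phase >= NUM_PHASES || port < 0 || port >= MAX_PORTS)
        return NULL;
    return map->edge[phase][port].connected ? &map->edge[phase][port] : NULL;
}
```

With a table of this kind, changing phase amounts to indexing a different row; no links need to be created or destroyed.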

This packing results in extremely efficient inter-phase communica- 
tion and phase change. One of the advantages of Poker programs is 
that the phase graphs are known at compile time so that the processors 
do not have to expend any run-time effort constructing or modifying 
the program graph or dynamically allocating more processes. 

After emitting the aggregate vertex codes, the cross-compiler 
runs them through the Cosmic Cube C compiler to get executable 
versions ready to be placed on the Cosmic Cube. 


[Figure 5 label: Aggregate Vertices] 


Figure 5: Collapsing the phases of a Poker program to create the 
aggregate processes. 


Extending the Cosmic Cube’s Run-time System. 


The third task of the cross-compiler is to augment the operating 
system of the target computer with special processes implementing 
the features we need for the source language. In the case of Poker, we 
have three such special processes: the File Server providing file system 
support, the Spawner acting as the controller, and the Trace Handler 
providing debugger support. All three of these special processes live 
on the host computer attached to the Cosmic Cube. 


• Spawner: The Spawner has two tasks: 


  - Initializing a Poker program onto the Cosmic Cube. This 
    includes: 


    * Mapping aggregate processes to Cosmic Cube nodes. 
      Currently, the aggregate processes are mapped naively 
      onto Cosmic Cube nodes, with no attention to the interconnection 
      between the processes. Better allocations 
      are often possible, but the realization of one of these is 
      a complex matter discussed in the Results. 


[Figure 6, left panel: Poker C process codes] 

code for phase 1: 

    trace phase1.i, phase1.foo; 
    ports in, out; 

    main() 
    { 
        int i, foo; 
        import foo from foobar; 
        sub1(foo); 
        /* rest of main(), phase 1 code */ 
    } 

    sub1(val) 
    int val; 
    { 
        int temp; 

        /* computation... */ 
        export temp to result; 
    } 

code for phase 2: 

    code phase2; 

    main() 
    { 
        /* code for main(), phase 2 */ 
    } 

    sub1() 
    { 
        /* code for sub1(), phase 2 */ 
    } 


    * Placing the aggregate processes on the Cosmic Cube 
      nodes. The Spawner takes the executable codes for 
      the aggregate processes and places (“spawns”) them on 
      the Cosmic Cube nodes. 

    * Initializing the edge maps of the “aggregate” processes. 
      After the Spawner places all of the aggregate processes 
      on the Cosmic Cube, the Spawner sends each aggregate 
      process a message containing the addresses of the processes 
      on the other end of its edges, so that the aggregate 
      processes can initialize their edge mapping tables. 
      The Cosmic Cube operating system provides these addresses 
      as a result of the Spawner placing the aggregate 
      processes. 


  - Controlling execution order of the phases. The Spawner is 
    the Master Control Program that sends messages to all of 
    the aggregate processes telling them which phase to execute. 
    When each aggregate process finishes its phase code, it sends 
    a “Done!” message to the Spawner. The Spawner waits to 
    hear the “Done!” messages from all of the processes before 
    sending the next “Execute phase n!” message to all of the 
    aggregate processes. 


    /* phase-global variables */
    int _global_foobar, _global_result;

    main()
    {
        while (TRUE)
        {
            receive(SPAWNER, nextPhase);
            switch (nextPhase)
            {
                case 1:     /* phase 1 */
                    phase1();
                    break;

                case 2:     /* phase 2 */
                    phase2();
                    break;

                case EXIT:  /* terminate */
                    _ExitFromProgram();
            }
        }
    }

    phase1()
    {
        int i, foo;
        foo = _global_foobar;       /* import */
        _Trace(&foo, INT);
        sub1(foo);
        /* rest of main(), phase 1 code */
    }

    _p1_sub1(val)
    int val;
    {
        int temp;

        /* computation... */

        _global_result = temp;      /* export */
    }

    phase2()
    {
        /* code for main(), phase 2 */
    }

    _p2_sub1()
    {
        /* code for sub1(), phase 2 */
    }

Figure 6: Packaging the Poker C process codes into a Cosmic Cube
C aggregate process.


• File Server: Processes access Poker’s file system through dangling
  edges. The File Server keeps “input” edges filled with values from
  files and stores values from “output” edges into files. All messages
  on dangling edges route through this File Server.(†)


• Trace Handler: Trace variables are variables whose values are
  always known to the outside world, where a programmer may view
  them in a special Trace View within Poker. The Trace Handler
  collects messages from the processes indicating new values for the
  traced variables and sends them to Poker’s debugging environment.
  These values come from send() statements that the C-to-C
  compiler inserts after each assignment to a traced variable.


The Cosmic Cube Poker Program. 


In summary, the Cosmic Cube program corresponding to a Poker 
program consists of the aggregate processes on the Cosmic Cube 
nodes and three special processes running on the host computer: the 
Spawner, the File Server, and the Trace Handler. These processes 
implement the run-time support, while Poker provides the program- 
ming environment, traces program execution via the Trace Handler, 
and cross-compiles the Poker program to create the Spawner, the File 
Server, and the aggregate processes that run on the Cosmic Cube. 
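
The phase-control protocol summarized above (an "Execute phase n!"
message to every aggregate process, then a wait for one "Done!" from
each) can be sketched as a small single-host simulation. This is only an
illustration, not Poker's implementation: the names `run_spawner`,
`aggregate_process`, and `EXIT`, and the use of Python threads and
queues in place of Cosmic Cube messages, are assumptions.

```python
import queue
import threading

EXIT = -1  # hypothetical sentinel standing in for the "terminate" message


def aggregate_process(pid, inbox, to_spawner, log):
    """Stand-in for one aggregate process: wait for a phase, run it, report Done!."""
    while True:
        phase = inbox.get()              # "Execute phase n!" from the Spawner
        if phase == EXIT:
            return
        log.append((phase, pid))         # placeholder for the real phase code
        to_spawner.put(("Done!", pid))   # report completion to the Spawner


def run_spawner(num_processes, phases):
    """Run the phases in order; start a phase only after every Done! has arrived."""
    log = []
    to_spawner = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_processes)]
    workers = [
        threading.Thread(target=aggregate_process,
                         args=(pid, inboxes[pid], to_spawner, log))
        for pid in range(num_processes)
    ]
    for worker in workers:
        worker.start()
    for phase in phases:
        for inbox in inboxes:            # broadcast the phase number
            inbox.put(phase)
        for _ in range(num_processes):   # barrier: one Done! per process
            to_spawner.get()
    for inbox in inboxes:                # tell every process to terminate
        inbox.put(EXIT)
    for worker in workers:
        worker.join()
    return log
```

Because the Spawner collects every "Done!" before sending the next phase
number, all log entries for one phase precede those of the next, which is
the ordering guarantee the text describes.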


Results 


Machine dependencies. Most of the Poker system is machine inde- 
pendent. The machine dependent pieces fall into three parts: 


• The Cross-compiler. The C-to-C compiler inserts calls to run-time
  routines for inter-process communication, inter-phase
  communication, tracing, and execution control. Changing these calls
  requires slight changes to the C-to-C compiler. If the target
  processors support C, the C-to-C compiler should not need any
  changes. However, if the target language is not C, the C-to-C
  compiler may need major reworking.


• The Run-time system. The routines supporting the interface
  between the Poker environment and operating system of the target
  processors are highly machine dependent. Large parts of these
  may have to be written from scratch.


• The mapper from logical to physical interconnections. Not only
  is this the least efficiently implemented feature of the system, as
  described below, but it also is extremely machine dependent.


Changing these three parts is sufficient to retarget the Poker lan- 
guage to a new architecture. 


Mapping interconnections. When the programmer defines a
communication graph in Poker and it is run on the Pringle (or, one day,
the CHiP Computer), the graph is “directly implemented” in the sense
that the figures produced in the Switch Setting View will be quite
literally compiled into the object code of the machine. This is because
the machines are configurable: the switches route messages according
to the interconnection drawn in the Switch Set View. For any
fixed-interconnection machine, such as the Cosmic Cube, the situation is
different: the communication graph, which will be considered the
logical communication structure, must be mapped onto the physical
interconnection structure of the architecture. Of course, programmers
implicitly perform this mapping when programming in other parallel
languages.

As described in the section on extending the Cosmic Cube’s run-time
system, the mapping is done by the Spawner using an arbitrary
allocation. If the programmer defined a logical cube graph, the system
might not, as things now stand, allocate it so that logically adjacent
vertices are physically adjacent. In general, the best allocation is not
so obvious, yet the system should try to minimize the distance
between communicating processes. The problem of automatically
finding an optimal allocation instead of our ad hoc allocation remains
unsolved. Moreover, achieving optimality is complicated by multiple
phases with the one-to-one correspondence of vertices between phases;
a good mapping for one phase may be quite poor for another phase
of the same algorithm. We are hopeful that the work of Berman and
colleagues [11] will lead to an automated solution, though the
retargetting in no way depends on Berman’s software; it only improves the
quality of the end result. Presently, we advocate a programmer-assisted
mapping, but we have not yet provided the software for it. No
matter what enhancement is chosen, there is an entry point in Poker
for the appropriate software.

(†) This is, of course, a potential bottleneck. Multiple File Servers
could be used, but the architecture of the Cosmic Cube still requires all
messages between the Cosmic Cube and the host machine, and thus the
file system, to pass through the single physical edge connecting the
host computer to the Cosmic Cube. Without additional hardware or a
different basic IO design, Cosmic Cube programs risk becoming IO
bound even with “reasonable” IO overhead.


Conclusion


We have shown that it is easy to port Poker to a new parallel archi- 
tecture, the Cosmic Cube, in such a way as to produce object code as 
efficient as any written for that architecture. Furthermore, Poker can 
easily be ported to any parallel computer simply by providing three 
basic features for the new architecture: 


• MIMD operation,
• message passing facilities, and
• a compiler for the process codes.

These are very modest requirements considering that a small amount
of work would then provide the new architecture with a complete, easy
to use programming, debugging, and cross-compiling environment, as
well as all of the existing Poker parallel programs. It is for this reason
that we claim that Poker can be the definition of the basic software
support required by a new parallel computer.


Abstract

A mathematical model for predicting the performance of the 
direct binary n-cube interconnection scheme in a packet switched 
environment is presented. These predictions are checked against 
simulations of a comparable system. The results for the network 
are compared to known results for indirect interconnection schemes 
like the crossbar and indirect binary n-cube networks. Several 
modes of operation of the network are analyzed, and it is shown
that the direct cube performs better than its indirect counterpart
for reasonable sizes of switches in the latter.


1. Introduction 

In the current climate of increased focus on large concurrent 
computer systems, it is recognized that one critical segment of 
such systems is the communication network used to connect the 
processing elements together. Various network schemes have 
been proposed for this purpose, including crossbars, indirect 
binary n-cube networks (also referred to as omega networks 
[Lawr75] [Peas77]), and direct binary n-cube networks [Seit85] 
(also referred to as the hypercube connection scheme). These 
interconnection schemes can be classified as direct or indirect. 
A direct interconnection scheme is one in which nodes (process- 
ing elements) are connected together by point-to-point links. In 
an indirect interconnection scheme on the other hand, the net- 
work is a separate entity with inputs and outputs; processors and 
memories are connected to the inputs and outputs respectively 
(most commonly in a shared memory architecture), or processors 
are connected to both inputs and outputs of the system (message 
passing architecture). Examples of the direct interconnection 
scheme are the ring, mesh, and binary n-cube architectures. 
Examples of the indirect interconnection scheme, other than the 
crossbar, are various multistage interconnection networks of the 
omega type, listed above. 

Indirect binary n-cube type networks have been analyzed 
extensively in the literature ([DiJu81], [Pate81], [KrSn83]) for
various aspects of their performance under different operating 
conditions. In this paper we analyze the performance of an 
important direct connection scheme, the direct binary n-cube 
network, in a packet switched mode of operation. We develop 
several mathematical models for the network, provide simula- 
tion results, and compare its performance with that of the 
indirect binary n-cube network. 


2. Principles of Operation 

The binary d-cube [Peas77] consists of $N = 2^d$ nodes connected
together using dedicated edges. If the nodes are numbered 0 to
$2^d - 1$, then the d-cube connection is defined by the statement
that two nodes whose binary representations differ in exactly
one bit position are connected together. The graph of such a
system has degree d. See [Peas77] and [Seit85] for detailed
discussions of the binary d-cube interconnection scheme.
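
The connection rule above translates directly into bit operations. The
following minimal sketch is an illustration only; the function names
`neighbors` and `hamming_distance` are hypothetical and do not come from
the paper.

```python
def neighbors(node, d):
    """Return the d neighbors of `node` in a binary d-cube.

    Each neighbor is obtained by flipping exactly one of the d address bits.
    """
    return [node ^ (1 << bit) for bit in range(d)]


def hamming_distance(a, b):
    """Number of bit positions in which two node addresses differ."""
    return bin(a ^ b).count("1")
```

For example, in a 3-cube node 5 (binary 101) is adjacent to nodes 4, 7,
and 1, each at Hamming distance 1.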

Each node in the d-cube consists of a processing element (PE)
and a $(d+1)\times(d+1)$ crossbar switch as shown in Figure 1a. An
input-output line pair constitutes an edge in the binary d-cube
and the switch is connected to d of its neighbors using d such
pairs. (See Figure 1b for the implementation of a binary 2-cube.)
When a switch receives a message, the message’s destination
address is compared to the current node address. If there is
a match, then the switch will attempt to route the message to the
node’s PE. Otherwise, the message will be routed to one of the
node’s neighbors, chosen at random from the bit positions that
differ between the message and node addresses. When the
switch attempts to route multiple packets to a single output line
at the same time, either the multiple messages must be buffered
at the output or all but one will be lost. We will analyze the
performance of the network with and without buffers at the switch
outputs.
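
The routing decision described above can be sketched for a single
message with no contention. The names `next_hop` and `route` and the
optional `rng` parameter are assumptions made for this illustration;
output conflicts, buffering, and message loss are deliberately left out.

```python
import random


def next_hop(current, destination, rng=random):
    """Pick the next node for a message, or None if it has reached its PE.

    The next hop corrects one randomly chosen bit among those that differ
    between the current node address and the destination address.
    """
    diff = current ^ destination
    if diff == 0:
        return None  # addresses match: deliver to this node's PE
    differing_bits = [b for b in range(diff.bit_length()) if (diff >> b) & 1]
    return current ^ (1 << rng.choice(differing_bits))


def route(source, destination, rng=random):
    """Return the list of nodes visited from source to destination."""
    path = [source]
    while True:
        nxt = next_hop(path[-1], destination, rng)
        if nxt is None:
            return path
        path.append(nxt)
```

Whatever random choices are made, the path length is the number of
differing address bits, since each hop corrects exactly one of them.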

Two versions of the switch will be considered for analysis: the
single-accepting PE scheme (Figure 1a), where one message can
be accepted by the PE each cycle, and the multiple-accepting PE
scheme, where multiple (up to d) messages can be accepted by
the PE in a single cycle. The latter is more realistic in a scheme
where the routing is done in software [Seit85] and the PE
takes on the switching functions.

For simplicity of analysis, all nodes will be assumed to be 
identical. Furthermore, in Sections 3 and 4 each node will be 
assumed to generate requests for other nodes in an egalitarian 
fashion. As the traffic is being generated by identical nodes, for 
uniform destinations, it is reasonable to conclude that at each 
node the traffic received from any neighbor is identical to that 
received from any other neighbor. (In Section 5, we generalize 
the results to arbitrary distributions of message destinations.) 

Let us define $m_g$ to be the message generation rate at each PE.
(It is also equal to the probability of receiving a message from
the PE in each cycle.) The rate at which messages arrive from a
particular neighboring node is defined to be $m$. ($m$ is also the
probability of receiving a message from a particular neighbor in
a cycle.) Also, let the rate of messages going to the PE be
defined as $m_a$. (If the single-accepting PE model is used, then
$m_a$ is also the probability of a message being accepted by the PE
in a cycle.) Furthermore, let $P_t$ (the probability of termination)
be the probability that a message received from a neighboring
node is for the PE.


3. Analysis of Unbuffered Networks 

The performance measures we are interested in are the proba- 
bility of acceptance of a generated message, and the bandwidth 
of the network as a whole. When multiple messages for the 
same output line are received at a switch in an unbuffered net- 
work, one randomly selected message “wins”, and the rest of the 
messages are blocked. All blocked messages are lost from the 
system. 

Let $P_A$ be the probability that a generated message is
ultimately accepted. The total number of messages generated in the
cube in one cycle is $N m_g$. The total number of messages
accepted in the cube in one cycle is $N m_a$. Thus
$P_A = N m_a / (N m_g) = m_a / m_g$. Network bandwidth is simply
$N m_a$. $m_a$ is computed in the following two sections.


3.1. Calculating the termination probability, $P_t$

We define the distance between two addresses on a cube to be
the number of cycles needed to travel from one node to the
other. This is exactly the number of differing bit positions in the
two addresses. Hence the distance between any two node
addresses is an integer between 0 and d. Given a node address,
the number of node addresses that differ by exactly $k$ bits is
$\binom{d}{k}$.

Let the distance between the message destination and the
current node be defined as the hopcount of the message.
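
The distance count can be checked by brute-force enumeration. The sketch
below is illustrative only; the function names are hypothetical, and
`hopcount_distribution` assumes the equiprobable reference model, in
which a destination is chosen uniformly among the other $N - 1$ nodes.

```python
from math import comb


def distance_histogram(d, origin=0):
    """Count the nodes of a d-cube at each Hamming distance 0..d from `origin`."""
    counts = [0] * (d + 1)
    for node in range(2 ** d):
        counts[bin(node ^ origin).count("1")] += 1
    return counts


def hopcount_distribution(d):
    """Probability of hopcount 1..d for a uniformly chosen destination."""
    n = 2 ** d
    return [comb(d, hops) / (n - 1) for hops in range(1, d + 1)]
```

The enumeration reproduces the binomial coefficients for any choice of
origin, which is why the analysis can treat all nodes as identical.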

Starting with an empty network, on cycle 1 each PE can
generate a message that will have hopcount $l$ with probability
$\binom{d}{l}/(N-1)$. At all subsequent cycles, this will also be the
distribution of messages coming from each PE. $1/d$ of these messages
will be sent to each of the neighboring nodes, where they will be
combined with the messages from $d-1$ other nodes. Let us
define $P_a$, the probability that a (non-terminating) message
received from a node is successfully passed on to another node,
to be

$$P_a = \frac{\text{Messages going out to the nodes}}{\text{Messages trying to go out to the nodes}} = \frac{dm}{dm(1-P_t) + m_g}.$$

(Recall that $m$ is the rate at which messages arrive at an input to
the switch.) At the end of cycle 2 the distribution of messages
sent to the neighboring nodes will be as follows:

$$\begin{array}{l|ccc}
\text{Hops:} & 0 & i-1 & d-1 \\ \hline
\text{Prob:} & \dfrac{P_a\binom{d}{1} + P_a^2\binom{d}{2}}{A} & \dfrac{P_a\binom{d}{i} + P_a^2\binom{d}{i+1}}{A} & \dfrac{P_a\binom{d}{d}}{A}
\end{array}$$

where $A = P_a^2(N-1-d) + P_a(N-1)$. The first term in each
probability is contributed by the PE on cycle 2, and the second term is
contributed by the d neighboring nodes. In general, at the end
of cycle $i$, messages from nodes at distance $i-1$ away
($0 < i-1 < d$) can have hopcounts from 0 to $d-i$. (This is because
a message can travel no further than distance d.) After d cycles, the
distribution reaches steady state. This distribution is given by

$$\Pr[\text{hopcount} = i-1] = \frac{\sum_{k=1}^{d-i+1} \binom{d}{i+k-1} P_a^k}{\sum_{k=1}^{d} \sum_{j=k}^{d} \binom{d}{j} P_a^k}.$$

Thus the probability of receiving a message with hopcount = 0
(i.e., the termination probability) can be derived to be

$$P_t = \frac{(P_a - 1)\left((1 + P_a)^d - 1\right)}{P_a\left((1 + P_a)^d - N\right)}.$$


3.2. Calculating message rates $m$ and $m_a$

Given a message from some other node, it terminates at this
node with probability $P_t$. Otherwise, it is passed to one of $d-1$
other nodes, as it is never passed back to the node it came from.
Therefore, with probability $(1-P_t)/(d-1)$ a message is routed to
one particular output line. (Actually, the message is routed to
one of $k$ outputs, where $k$ is the number of differing bit positions
between node and message addresses. However, this is a
reasonable assumption, as comparisons with simulations will show.)
The probability of getting a message from the PE is simply $m_g$.
This message is routed to one particular output line with
probability $1/d$.

Considering one particular output, the probability that no
message arrives at this output is
$(1 - m_g/d)\,(1 - m(1-P_t)/(d-1))^{d-1}$.
Since the output line of this node is the input line for some other
identical node, the rate at which messages leave the node must
be equal to the rate messages arrive at the node. Thus,

$$m = 1 - (1 - m_g/d)\,\bigl(1 - m(1-P_t)/(d-1)\bigr)^{d-1}.$$

Using the single-accepting PE model and employing a similar
argument,

$$m_a = 1 - (1 - mP_t)^d.$$

When the multiple-accepting PE model is used we obtain:

$$m_a = d\,m\,P_t.$$

3.3. Unbuffered network results and discussion 

These results, along with corresponding simulations are 
presented in Figures 2 and 3. Figure 2 plots the probability of 
acceptance of a message in a direct binary cube as a function of 
the message generation rate for various network sizes. (A 
multiple-accepting PE model is assumed.) The model and the 
simulations agree to within 5% in all cases. The performance of 
the direct cube compared with that of the crossbar and indirect 
binary cube networks is presented in Figure 3. (Fairly standard 
models are considered for the crossbar and indirect binary cube 
networks, as for instance [KrSn83] and [Pate81].) Since the 
crossbar and indirect cube networks are by definition single 
accepting networks, a single-accepting PE model is assumed for 
the direct network figures in these graphs. Indirect networks 
can be constructed out of switches of different sizes and the 
choice of switch size affects both cost and performance. Figure 
3 considers indirect binary n-cube networks constructed using 
switches of size 2, 4, and 8. We notice that it takes a switch size 
of about 8 in the indirect network to equal the performance of
the direct network.

Various measures of cost for these two types of networks are
shown in the table below (where the symbol “k” means
multiplied by 1024).


                   Direct n-cube           Indirect n-cube
Network size     256     1024    4096    256         1024        4096
Switch size                              2     4     2     4     2     4     8
Switches         256     1k      4k      --    --    --    --    --    6k    2k
Lines            2304    11k     52k     --    1280  11k   6k    52k   --    20k
Crosspoints      20736   121k    676k    4k    4k    20k   20k   96k   96k   128k

(-- marks an entry that is illegible in the source.)

The ability of the direct binary cube to outperform its indirect 
counterpart comes at the expense of increased hardware (using 
any of the cost measures), as the table illustrates. However, the 
hardware is not used in the most effective manner, as a com- 
parison with the indirect cube using 8x8 switches shows. (The 
two networks have similar performance.) Thus it would seem 
that given a certain amount of hardware (essentially a number 
of switches and PEs), it would be preferable to connect them in 
the form of an indirect binary cube. It should be pointed out 
that this observation applies to the equiprobable reference 
model where a PE wishes to talk to any of the others with equal 
probability. To our knowledge, this is the mode in which 
existing implementations of the system software for direct cube 
architectures operate (for example, the Intel Hypercube). A 
more complex allocation of tasks to PEs, utilizing knowledge 
about locality of communication, would improve the perfor- 
mance of the direct binary cube. 


4. Analysis of Buffered Networks 

In a buffered network, multiple messages heading for the
same output in a cycle are saved up in a queue at the output side
of the switch. The line going to the PE needs a buffer only in
the single-accepting PE model; the multiple-accepting PE model
assumes some mechanism of giving up to d messages to the PE
at once. In the first part of this analysis, the queue lengths are
assumed to be infinite; finite length queues are analyzed in
Section 4.5.

When infinite buffers are assumed, no messages are lost.
($P_A = 1$, and bandwidth $= N m_g$.) A good measure of network
performance in this case is $T$, the mean message delay time.
This time is defined to be the average time a message spends in
the system before reaching the destination PE.


4.1. Calculating the termination probability, $P_t$, and message
rate, $m$

By following an analysis similar to that for unbuffered
networks (setting $P_a$ to 1) we obtain $P_t = (2N - 2)/(dN)$.

To derive expressions for $m$ and $m_a$, we observe that with
infinite length queues, no messages can be lost. Hence, $m_a = m_g$.
If we allow multiple messages to be accepted by the PE, then
$m_a = dmP_t$. Therefore,

$$m = m_g/(dP_t) = N m_g/(2N - 2).$$

If the PE accepts only single messages in each cycle, then the
above expression for $m$ is still valid if we consider $dmP_t$ to be the
rate at which messages are sent to the PE’s buffer.


4.2. Multiple-accepting model

In order to determine $T$ for the multiple-accepting model, we
first must determine the length of the buffers at the switch
outputs of each node. Consider a single output queue and
associated output line. We say this system is in state $i$ (with
probability $b_i$) when there are $i$ messages in the system. Examining
first only the messages received from outside the node, the
probability of receiving $i$ messages at a particular output line of the
queue is given by

$$q(i) = \binom{d-1}{i}\left(1 - \frac{m(1-P_t)}{d-1}\right)^{d-1-i}\left(\frac{m(1-P_t)}{d-1}\right)^{i}$$

for $0 \le i < d$. (We assume that a message does not go back to
the node it came from.) Considering generated messages also,
the probability of getting $i$ messages at an output line is given
by:

$$a_i = \begin{cases}
(1 - m_g/d)\,q(0) & i = 0 \\
(1 - m_g/d)\,q(i) + (m_g/d)\,q(i-1) & 0 < i < d \\
(m_g/d)\,q(d-1) & i = d \\
0 & i > d
\end{cases}$$

Given these arrival rates at an output queue, the state
distribution for the queue is

$$b_i = \sum_{j=0}^{i} a_j\, b_{i-j+1} + a_i b_0.$$

Then the mean number of messages in the system can be shown
to be

$$b = m + \frac{m^2\left(d(1 - P_t^2) - 2(1 - P_t)\right)}{2(d - 1)(1 - m)}.$$

Using Little’s identity, the mean time a message spends at an
intermediate node is given by $b/m$. As an average message goes
through

$$\frac{1}{N-1}\sum_{j=1}^{d} j\binom{d}{j} = \frac{1}{P_t}$$

nodes, the delay to reach the destination node is $b/(mP_t)$.
Including an additional cycle for the message to be transferred to
the PE from the switch, the total delay is obtained as

$$T = 1 + b/(mP_t).$$
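
The closed form for the mean number of messages $b$ can be cross-checked
against the arrival distribution $a_i$. The sketch below builds $a_i$
for the infinite-buffer case and compares the paper's expression with the
standard mean occupancy of a discrete-time queue that serves one message
per cycle, $\lambda + E[A(A-1)]/(2(1-\lambda))$. The function names are
hypothetical, and the use of that general queueing formula as the
reference is an assumption of this illustration, not a step taken from
the paper.

```python
from math import comb


def buffered_rates(m_g, d):
    """Link rate m and termination probability P_t for infinite buffers."""
    n = 2 ** d
    p_t = (2 * n - 2) / (d * n)
    m = m_g / (d * p_t)
    return m, p_t


def output_arrivals(m_g, d):
    """Distribution a_0..a_d of messages arriving at one output queue per cycle."""
    m, p_t = buffered_rates(m_g, d)
    p = m * (1 - p_t) / (d - 1)          # per-neighbor arrival probability
    q = [comb(d - 1, i) * (1 - p) ** (d - 1 - i) * p ** i for i in range(d)]
    g = m_g / d                          # arrival probability from the local PE
    arrivals = []
    for i in range(d + 1):
        without_pe = q[i] if i < d else 0.0
        with_pe = q[i - 1] if i >= 1 else 0.0
        arrivals.append((1 - g) * without_pe + g * with_pe)
    return arrivals


def mean_queue_closed_form(m_g, d):
    """Mean number of messages b, using the closed form of Section 4.2."""
    m, p_t = buffered_rates(m_g, d)
    return m + m * m * (d * (1 - p_t ** 2) - 2 * (1 - p_t)) / (2 * (d - 1) * (1 - m))


def mean_queue_from_moments(arrivals):
    """Mean occupancy from the first two factorial moments of the arrivals."""
    lam = sum(i * a for i, a in enumerate(arrivals))
    second = sum(i * (i - 1) * a for i, a in enumerate(arrivals))
    return lam + second / (2 * (1 - lam))
```

The two routes agree to rounding error, which supports reading the
numerator of the closed form as $d(1 - P_t^2) - 2(1 - P_t)$.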


4.3. Single-accepting model

Now we consider the case where only one message can be
accepted at the PE at a time. We need to determine the length
of the PE’s buffer and this can be done in the same manner as
before. For the PE’s queue, the arrival rates are given by

$$a_i = \binom{d}{i}(1 - mP_t)^{d-i}(mP_t)^{i}$$

for $0 \le i \le d$, and $a_i = 0$ otherwise. The probability of being in
state $i$ is $b_i$. Using this and following the same procedure as
before the mean message delay can be derived as:

$$T = b/(mP_t) + m_g(d-1)/(2d(1-m_g)) + 1.$$
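
The two delay expressions can be evaluated side by side. The sketch
below is a direct transcription of the formulas of Sections 4.1 to 4.3
for infinite buffers; the function name `mean_delays` is hypothetical,
and the single-accepting result is only meaningful for $m_g < 1$.

```python
def mean_delays(m_g, d):
    """Return (T_multiple, T_single) for a d-cube with infinite buffers."""
    n = 2 ** d
    p_t = (2 * n - 2) / (d * n)              # termination probability
    m = m_g / (d * p_t)                      # message rate on each link
    b = m + m * m * (d * (1 - p_t ** 2) - 2 * (1 - p_t)) / (2 * (d - 1) * (1 - m))
    network_delay = b / (m * p_t)            # delay to reach the destination node
    t_multiple = 1 + network_delay
    t_single = network_delay + m_g * (d - 1) / (2 * d * (1 - m_g)) + 1
    return t_multiple, t_single
```

At very low load both delays approach $1 + 1/P_t$, the mean number of
hops plus one cycle for delivery to the PE; as $m_g$ approaches 1 the
single-accepting delay grows without bound while the multiple-accepting
delay stays finite.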


4.4. Buffered network results and discussion 

Figure 4 presents the results in the infinite buffered model 
and compares them with the same model for indirect binary 
cube networks. Figure 4a verifies the accuracy of the analytical 
model vis a vis simulations. We notice that with the multiple- 
accepting model, about a 30% increase in delay results from a 
fully saturated network. With a single-accepting PE model how- 
ever, (Figure 4b), the receiving queue at each PE does blow up 
in size with increasing generation rates just as in the case of the 
indirect network. However, at high request rates, the difference 
between the direct and indirect networks is substantial. As in 
the case of the unbuffered networks, we notice that a switch size 
of at least 8 is needed before the indirect networks can exceed 
the performance of the direct network. The cost-performance 
discussion in Section 3.3 for unbuffered networks holds for buf- 
fered networks also. 


4.5. Finite buffers 

We now consider buffered systems where the queue lengths at
the outputs of switches are finite. We define $Q$ to be the length
of the queue. Recalling our definition of state from section 4.2,
such a queue and associated output line has $Q+1$ states. The
expressions for $a_i$ and $b_i$ in section 4.2 still apply ($b_i = 0$ for
$i > Q$); however, the rest of the analysis in that section is
particular to the infinite buffer case. For the finite buffer case, we
still must derive an expression for $m$ (and $P_t$). Since the
associated output line emits a message whenever the system is not in
state 0, $m = 1 - b_0$. $b_0$ may be calculated by solving the system
of linear equations described by the state transitions. (When
$Q = 0$, $m = 1 - a_0$, which corresponds to section 3.2.) Given this
value for $m$, $P_t$ and $P_a$ may be calculated using the expressions of
section 3.1.

For the multiple-accepting PE model, $m_a = dmP_t$ just as in
section 3.2. When using the single-accepting PE model,
$m_a = 1 - b_0$, where $\{b_i\}$ is the state distribution for the PE’s
queue. (As before, when $Q = 0$, $m_a = 1 - a_0$, which corresponds
to section 3.2.)

Figure 5 displays the results for the finite buffered network 
for varying values of Q. Note that fairly small buffer sizes yield 
more than 90% of infinite buffer performance. The perfor- 
mance difference between the single-accepting and multiple- 
accepting models is due to a bottleneck at the node to PE con- 
nection in the single-accepting model. This bottleneck is most 
constraining when the message generation rate approaches 1 and 
message delays increase (refer to Figure 4b). The behavior of 
the finite buffered direct cube is similar to that of the indirect 
binary cube network, as presented in [DiJu81]: small buffers cap- 
ture the effective performance of the infinite buffers, in all but 
the saturated mode of network operation. 


4.6. Saturation value of message generation rate, $m_g$

In Section 3.3 we observed that while a direct binary n-cube
network had more hardware than an indirect network, there was
not a corresponding performance gain. If we let $m_g = 1$, we
obtain $m = N/(2N - 2)$ (Section 4.1). For maximum utilization
of the network links, $m$ should approach 1, and this happens
when $m_g$ approaches $2(N - 1)/N$ [AbPa86]. Thus, a
multiple-accepting PE requires a multiple-generating PE to allow full
utilization of all the network links.


5. General Distributions

It is possible to generalize our expressions in the previous
sections to arbitrary reference patterns [AbPa86]. Let $h_i$ be the
probability that a message generated at a node will have
hopcount $i$. Figure 6 shows the message delay for three distributions
of hopcount. $h_i = 1/d$ corresponds to a uniform distribution of
hopcounts, while the harmonic reference pattern is given by
$h_i = 1/(\alpha i)$, where $\alpha = \sum_{i=1}^{d} 1/i$. Clearly,
exploiting the locality of reference in a program results in
substantial performance increases.
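
The effect of locality can be seen directly in the mean hopcount of each
reference pattern. The sketch below is illustrative only: the function
names are hypothetical, and including the equiprobable pattern
$h_i = \binom{d}{i}/(N-1)$ from the earlier sections as a third case is
an assumption, since the text does not name all three distributions
plotted in the figure.

```python
from math import comb


def uniform_hopcounts(d):
    """h_i = 1/d for i = 1..d."""
    return [1.0 / d for _ in range(d)]


def harmonic_hopcounts(d):
    """h_i = 1/(alpha * i), with alpha the d-th harmonic number."""
    alpha = sum(1.0 / i for i in range(1, d + 1))
    return [1.0 / (alpha * i) for i in range(1, d + 1)]


def equiprobable_hopcounts(d):
    """h_i = C(d, i)/(N - 1): every other node is an equally likely destination."""
    n = 2 ** d
    return [comb(d, i) / (n - 1) for i in range(1, d + 1)]


def mean_hopcount(h):
    """Expected hopcount, where h[0] is the probability of hopcount 1."""
    return sum(i * p for i, p in enumerate(h, start=1))
```

For $d = 6$ the harmonic pattern gives a mean hopcount of $6/2.45$,
roughly 2.45, against about 3.05 for equiprobable destinations and 3.5
for uniform hopcounts, so messages occupy fewer links on average.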


6. Summary and Conclusions 

We have presented in this paper an analytical model and simu- 
lation results for the performance of the direct binary n-cube 
interconnection scheme for multiprocessors. The results show 
that with the equiprobable reference model for interprocessor 
communication, the direct cube performs better than the 
indirect cube networks constructed using switches of size less 
than 8 for all modes of operation that we have considered. The 
indirect networks require fewer switches, especially as the 
switch sizes begin to increase. However, it is possible in the 
direct case, for each processing element to take on some or all 
of the switching function at each node. In the latter case, the 
direct binary n-cube performs especially well in a mode called 
the multiple-accepting PE mode of operation. Furthermore, this 
multiple-accepting PE mode, coupled with a multiple message 
generating PE will fully utilize all the network bandwidth; this 


bandwidth is greater than that of the indirect cube networks, 
due to the presence of more links in the direct cube. 

Results regarding buffering of packets at nodes generally are 
similar to known results in the indirect network case: not buffer- 
ing packets leads to a significant loss of packets in the network, 
and a small buffer size (1 or 2) dramatically reduces loss of 
packets. When processor allocation is done intelligently to 
exploit locality of communication, our analysis shows that signi- 
ficantly better performance can be realized. This will have to be 
done for the performance/cost ratio to be better in the direct 
cubes than in their indirect counterparts.


Abstract


High communication bandwidth in standard technologies 
is more expensive to realize than a high rate of arithmetic 
or logic operations. The effective utilization of communica- 
tion resources is crucial for good overall performance in highly 
concurrent systems. In this paper we address two different 
communication problems in Boolean n-cube configured multi- 
processors: 1) broadcasting, i.e., distribution of common data 
from a single source to all other nodes, and 2) sending person- 
alized data from a single source to all other nodes. The well 
known spanning tree algorithm obtained by bit-wise comple- 
mentation of leading zeroes (referred to as the SBT algorithm 
for Spanning Binomial Tree) is compared with an algorithm 
using multiple spanning binomial trees (MSBT). The MSBT
algorithm offers a potential speed-up over the SBT algorithm
by a factor of $\log_2 N$. We also present a balanced spanning tree
algorithm (BST) that offers a lower complexity than the SBT
algorithm for Case 2. The potential improvement is by a
factor of $\frac{1}{2}\log_2 N$. The analysis takes into account the size
of the data sets, the communication bandwidth, and the overhead in
communication. We also provide some experimental data for
the Intel iPSC/d7.


1. Introduction 


Broadcasting of data from a single source to all other 
nodes in a multiprocessor system is an important operation. 
It is used in many parallel algorithms, for instance, in matrix 
multiplication, the solution of irreducible linear systems, and 
forming transitive closure. Examples of a variety of algorithms 
using specific forms of communication are contained in [6, 12, 
11]. The reverse operation, reduction, occurs, for example, in 
computing inner products, solving linear recurrences, [14], and 
parallel prefix computation. A different situation occurs if the 
source node distributes personalized information to all other 
nodes. In this case no replication of information takes place 
during distribution (or reduction in the reverse operation). The 
collection of data to a single node and distribution of person- 
alized messages to all other nodes is a useful operation for the 
solution of tridiagonal systems under certain combinations of 
start-up times for communication, communication bandwidth, 
and problem sizes [13]. Matrix transposition is another exam- 
ple of personalized communication in that every node sends 
different data to every other node [11]. 

Data communication in Boolean cubes has received signifi- 
cant interest recently due to the success of the Caltech Cosmic 
Cube project [20] and the availability of Boolean cube con- 
figured concurrent processors from Intel Scientific Computers, 
NCUBE, Ametek, Floating-Point Systems and Thinking Ma- 
chines Corp. [7]. The embedding of complete binary trees is 
treated in [22, 11, 18, 3, 2]. Wu also discusses the embedding 
of k-ary trees, and Bhatt the embedding of arbitrary binary 
trees. Efficient routing using randomization for arbitrary
permutations has been suggested by Valiant [21]. Broadcasting of
data from a single source to all other nodes is studied in [18].
We propose a lower bound algorithm that offers a speed-up of
a factor of $\log N$ over the algorithm in [18]. We also present
lower bound algorithms for personalized communication. We
give routing algorithms, and analyze the complexity in detail.
The analysis is compared with experimental data.

A Boolean n-cube has $N = 2^n$ nodes, diameter $\log N$,
$\binom{n}{i}$ nodes at distance $i$ from a given node, and $\log N$ disjoint
paths between any pair of nodes. The paths are either of the
same length as the Hamming distance between the end points of
the paths, or the Hamming distance plus two [19]. The fanout
of every node is $\log N$, and the total number of communication
links is $\frac{1}{2} N \log N$. Any spanning tree can be used to broadcast
data from a single source to all other nodes. A node replicates
the data as many times as correspond to the out-degree of the 
node in the spanning tree. In broadcasting one element (or 
packet), the minimum number of routing steps is log N. Any 
spanning tree with height log N can achieve this lower bound, 
if each node can send out data through all the links connected 
to it during one step. In case each node can send or receive 
data through only one link during one step, then only the class 
of spanning binomial trees can attain the lower bound, since 
after each broadcasting step the number of nodes that own the 
desired data is at most twice that of the previous step. The 
log N lower bound is attained only if the number of nodes that 
own the desired data doubles at each step. This is exactly the 
definition of a binomial tree. A 0-level binomial tree has only
1 node. An n-level binomial tree is constructed out of two
(n - 1)-level binomial trees by adding one edge between the
roots of the two trees, and by making either root the new root
[1, 4]. It follows from this recursive construction that:

1. An n-level binomial tree has $\binom{n}{i}$ nodes at level $i$.

2. The n-level binomial tree is composed of n subtrees³, each of
which is a binomial tree of 0, 1, ..., n - 1 levels respectively.
The k-level subtree has $2^k$ nodes.

3. An n-level binomial tree can be obtained from a k-level
binomial tree, k < n, by replacing each node of the k-level
binomial tree by an (n - k)-level binomial tree. The child
nodes of a node in the k-level tree become children of the
root of the replacing (n - k)-level binomial tree.

Since an n-level binomial tree can be embedded in an n-cube
as a spanning tree, we call it a Spanning Binomial Tree (SBT).
Note that the number of nodes at each level $i$ of the binomial
tree is equal to the number of nodes at distance $i$ from a node
in an n-cube.
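The recursive construction and the first two properties can be checked with a short sketch. This is an illustration only: the tuple-of-children representation and the helper names `binomial_tree`, `size` and `level_counts` are assumptions made for this example and are not part of the paper.

```python
from math import comb

def binomial_tree(n):
    """Return an n-level binomial tree as a tuple of child subtrees."""
    tree = ()  # 0-level tree: a single node without children
    for _ in range(n):
        # Join two copies of the current tree: the root of one copy
        # becomes an additional child of the root of the other copy.
        tree = tree + (tree,)
    return tree

def size(tree):
    """Number of nodes in the tree, root included."""
    return 1 + sum(size(child) for child in tree)

def level_counts(tree):
    """Number of nodes at each level, with the root at level 0."""
    counts = []
    frontier = [tree]
    while frontier:
        counts.append(len(frontier))
        frontier = [child for node in frontier for child in node]
    return counts

n = 4
tree = binomial_tree(n)
print(level_counts(tree))                   # nodes per level
print([comb(n, i) for i in range(n + 1)])   # binomial coefficients
print([size(child) for child in tree])      # sizes of the root's subtrees
```

The level counts can be compared with the binomial coefficients, and the subtree sizes with the powers of two stated in property 2.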

In broadcasting M elements using a packet size of B elements,
and by pipelining the communication from the root
towards the leaves along any log N height spanning tree, the
number of routing steps becomes $\lceil M/B \rceil + \log N - 1$, which is
not optimal. Since each node has a fanout of log N, a lower
bound for the number of routing steps is
$\lceil \frac{M}{B \log N} \rceil + \log N - 1$.
In order to achieve this lower bound, the data set has to be
split into log N subsets, each of which is communicated over
a distinct communications link from the source node. It follows
that the nodes adjacent to the source node must be roots
of subtrees spanning all but one node of the cube (the source
node). The depth of the subtrees is log N, and a tight lower
bound for the number of routing steps, assuming concurrent
bi-directional communication, is $\lceil \frac{M}{B \log N} \rceil + \log N$.

¹ log N = $\log_2 N$ throughout this paper.
² In particular, a Hamiltonian Path is also a spanning tree.
³ In this paper "subtree" refers to "subtree of the root" unless stated otherwise.

In sending personalized information from a single source
to all other nodes, no replication of information takes place
and the total number of packets that the source must send
is $\frac{(N-1)M}{B}$ for M elements per destination node. The number
of routing steps for the Spanning Binomial Tree algorithm
(SBT) is log N if the maximum packet size is sufficiently large
($NM/2$); even so, the communication bandwidth is poorly
utilized since the transfer time is at least proportional to $NM/2$.
In the SBT, half of the nodes belong to one subtree, one quarter
to another subtree, etc. A Balanced Spanning Tree (BST) is
defined such that each subtree has approximately $\frac{N}{\log N}$ nodes. The
data transfer on any link is limited to approximately $\frac{N}{\log N} M$.
The BST algorithm offers a potential log N speed-up over the 
SBT algorithm in sending personalized information from a sin- 
gle source to all other nodes. In fact, lower bound algorithms 
for broadcasting from every node to every other node and send- 
ing personalized data from every node to every other node on 
a Boolean cube can be attained by using N BST’s rooted at 
each node concurrently. See [8] for details. 

In section 2 we introduce the notation and some definitions 
used throughout the paper. Section 3 considers broadcasting 
from a single source to all other nodes, and section 4, person- 
alized communication from a single source to all other nodes. 
Experimental results on the Intel iPSC/d7 are presented in
section 5. 


2. Notation and Definitions 


In the following, N denotes the number of nodes in the
Boolean n-cube and $n = \log_2 N$ the dimension of the cube.
Nodes in the cube are assigned binary addresses such that
adjacent nodes differ in precisely one bit. Address bits are
numbered from 0 through n - 1 with the lowest order bit being bit
0. Node $i$ is the node that has a binary address equal to $i$,
i.e., $i = (a_{n-1} a_{n-2} \ldots a_0)$. Let $\oplus$ be the bit-wise exclusive-or
operation. The $j$th port of node $i$ connects to the node $k$ that
differs from $i$ in the $j$th bit, i.e., $i \oplus k = (00 \ldots 0 1_j 0 \ldots 0)$. There
is a port for each address bit, and ports are numbered from 0
through n - 1. Let $|i|$ denote the number of bits with value
one in the binary number $i$; hence $|i \oplus j|$ denotes the Hamming
distance between the binary numbers $i$ and $j$.

Let $i = (a_{n-1} a_{n-2} \ldots a_0)$. Define $R$ to be the right rotation
function, i.e., $R(i) = (a_0 a_{n-1} a_{n-2} \ldots a_1)$, and $R^j = R^{j-1} \circ R$ to
mean a right rotation of $j$ steps. The rotation of a graph with
binary node addresses is accomplished by applying the same
rotation function to all its addresses. This is similarly the case
for the translation of a graph. Clearly, adjacency is preserved
under rotation and translation. The period of a binary number
$i$, $P_i$, is the least $j > 0$ such that $i = R^j(i)$. For example, the period
of (011011) is 3. A binary number is cyclic if its period is less
than its length; otherwise it is non-cyclic. The relative address of
node $i$ in a spanning tree rooted at node $s$ is $i \oplus s$. A cyclic node
is a node with a cyclic relative address.⁴ If one binary number
can be derived by rotating another binary number, then they
are in the same generator set $G$ (or necklace [15]). For example,
(001001), (010010) and (100100) are in the same generator set.
The number of elements in the generator set $G_i$ of $i$ is $P_i$.

⁴ A cyclic node is defined in terms of a spanning tree.
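The rotation, period and generator-set definitions can be made concrete with a small sketch. Addresses are held as Python integers with bit 0 as the lowest-order bit; the function names `rotate_right`, `period`, `is_cyclic` and `generator_set` are hypothetical helpers introduced only for this illustration.

```python
def rotate_right(i, n, steps=1):
    """Apply R `steps` times to the n-bit address i."""
    steps %= n
    mask = (1 << n) - 1
    return ((i >> steps) | (i << (n - steps))) & mask

def period(i, n):
    """Least j > 0 such that rotating i right by j steps gives i again."""
    for j in range(1, n + 1):
        if rotate_right(i, n, j) == i:
            return j

def is_cyclic(i, n):
    """A binary number is cyclic if its period is less than its length."""
    return period(i, n) < n

def generator_set(i, n):
    """All addresses reachable from i by rotation (its necklace)."""
    return {rotate_right(i, n, j) for j in range(n)}

n = 6
print(period(0b011011, n))                      # period of (011011)
print(is_cyclic(0b011011, n))                   # cyclic or not
print(sorted(generator_set(0b001001, n)))       # necklace of (001001)
```

The size of the set returned by `generator_set` equals the period of the address, which matches the last sentence of the paragraph above.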

In the graph model of the Boolean cube there is a node 
(vertex) for each node (processor with local memory) of the 
cube, and a pair of directed edges for each pair of nodes that 
differ in precisely one bit. The directed edges between a pair 
of nodes form a communication link. A source node, or a root 
node, is a node that only has edges directed away from it. A 
sink node, or a leaf node, only has edges directed to it. Nodes 
that are neither source (root) nor sink (leaf) nodes are internal 
nodes. The root of a tree is at level 0 and traversing the edge 
away from the root increases the level by one. The height of 
a tree is equal to the label of the last level. The MSBT and 
BST are constructed out of n subtrees labelled 0 through n— 1 
(from left to right in the Figures of this paper). 

The communication is assumed to be packet switched. M
denotes the number of elements to be received by a node, $t_c$
the transfer time for an element, and $\tau$ the start-up time for
the communication of a packet of maximum B elements. With
concurrent bi-directional communication we assume that a pair
of adjacent nodes can exchange a pair of messages during the
same communication step, or cycle. In a port-oriented routing
algorithm, all information to be communicated over a port is
sent before any communication is performed on any other port.
In packet-oriented communication, a piece of information
corresponding to a packet is communicated on all ports before a
second packet is sent on any port.


3. Broadcasting 


In this section, we will describe and compare routing al- 
gorithms for broadcasting based on a Spanning Binomial Tree 
(SBT) and Multiple Spanning Binomial Trees (MSBT). We 
first define the SBT and MSBT topologies, then state the com- 
munication complexity of the routing algorithms for the two 
topologies, assuming communication on only one port at a time 
(which, effectively, is the case with the Intel iPSC), and on all
log N ports concurrently. 


3.1. Spanning Binomial Trees 

The familiar spanning tree rooted at node 0 of a Boolean
n-cube contains the edges that connect a node $i$ with the subset
of its neighbors having addresses obtained by complementing
any bit of leading zeroes of the binary encoding of $i$ [5, 11,
16, 18, 19]. For an arbitrary source node $s$ the spanning tree
is simply translated by a bit-wise exclusive-or operation on all
addresses with the address of the source node; i.e., $c = i \oplus s$
is formed. Complementation of those bits of $i$ that correspond
to the leading zeroes of $c$ defines the edges of the translated
spanning tree. More precisely, let $s = (s_{n-1} s_{n-2} \ldots s_0)$,
$i = (a_{n-1} a_{n-2} \ldots a_0)$, and $c = (c_{n-1} c_{n-2} \ldots c_0)$, where $c_m = s_m \oplus a_m$.
Let $c_k = 1$ and $c_m = 0, \forall m > k$, with $k = -1$ for $c = 0$, i.e., $k$
is the highest order bit of $c$ that is 1. Let $\mathrm{children}_{SBT}(i,s)$ be the
set of child nodes of node $i$ in the SBT rooted at node $s$ and
$M_{SBT}(i \oplus s) = \{k+1, \ldots, n-1\}$. Then,

$$\mathrm{children}_{SBT}(i,s) = \{(a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \quad \forall m \in M_{SBT}(i \oplus s).$$

In implementing the routing algorithm for the SBT topology
it is also convenient to introduce the inverse function, i.e., a
function that for each node defines its parent. Let $\mathrm{parent}_{SBT}(i,s)$
be the parent of node $i$ in the spanning tree rooted at node $s$.
Then

$$\mathrm{parent}_{SBT}(i,s) = \begin{cases}
\phi, & i = s; \\
(a_{n-1} a_{n-2} \ldots \bar{a}_k \ldots a_0), & i \ne s.
\end{cases}$$

It is easy to verify that the parent and children functions
are consistent. Figure 1 shows a spanning tree generated by
the children (or parent) function for the root located at node
0 in a 4-cube.
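The children and parent functions of the SBT translate directly into bit operations. The following is a minimal sketch under the assumption that addresses are Python integers; the names `highest_one`, `children_sbt`, `parent_sbt` and `levels` are chosen for this example only.

```python
def highest_one(c):
    """Index k of the highest-order one bit of c, or -1 if c == 0."""
    return c.bit_length() - 1

def children_sbt(i, s, n):
    """Children of node i in the SBT rooted at s in an n-cube."""
    k = highest_one(i ^ s)
    # Complement each bit above the highest one bit of c = i xor s.
    return [i ^ (1 << m) for m in range(k + 1, n)]

def parent_sbt(i, s, n):
    """Parent of node i in the SBT rooted at s, or None for the root."""
    if i == s:
        return None
    k = highest_one(i ^ s)
    return i ^ (1 << k)

def levels(s, n):
    """Nodes of the SBT rooted at s, grouped by level."""
    out, frontier = [], [s]
    while frontier:
        out.append(sorted(frontier))
        frontier = [c for node in frontier for c in children_sbt(node, s, n)]
    return out

for level, nodes in enumerate(levels(0, 4)):
    print(level, [format(x, "04b") for x in nodes])
```

Running the loop for a 4-cube lists the nodes of the tree level by level, which can be compared against Figure 1.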


3.2. Multiple Spanning Binomial Trees 
The Multiple Spanning Binomial Trees (MSBT) graph can
be viewed as being composed of log N SBT’s with one tree 
rooted at each of the nodes adjacent to the source node. The 
SBT’s are rotated such that the source node of the MSBT 
graph is in the smallest subtree of each SBT. The MSBT graph 
is then obtained by reversing the edges from the roots of the 
SBT’s to the source node. After the edge reversal each SBT 
becomes an ERSBT (Edge Reversed Spanning Binomial Tree). 
The MSBT graph is not a tree. The diameter of the MSBT 
graph is log N + 1, since the source node is adjacent to all the 
roots of the SBT’s used in the definition of the MSBT graph, 
and each SBT is of height log N. The total number of edges in 
the log N SBT’s is (N — 1) log N, which is log N less than the 
total number of directed edges in the cube. Hence, if the log N 
SBT’s are edge-disjoint, then all edges are used, except the 
edges directed from the roots of the SBT’s towards the source 
node. The SBT’s used for the construction of the MSBT graph 
can be obtained by translation and rotation of the SBT defined 
before. We refer to the SBT rooted at node $(00 \ldots 0 1_j 0 \ldots 0)$ as
the $j$th SBT of the MSBT graph. The $j$th ERSBT is obtained
from the $j$th SBT by reversing the edge directed to node 0 (the
source).

Let $i = (a_{n-1} a_{n-2} \ldots a_0)$ and $k$ be such that $a_k = 1$, and
$a_m = 0, \forall m \in M_{MSBT}(i,j)$, where
$M_{MSBT}(i,j) = \{(k+1) \bmod n, (k+2) \bmod n, \ldots, (j-1) \bmod n\}$.
Hence, $k$ is the first bit to
the right of bit $j$, cyclically, which is equal to one, if $k \ne j$. For
the special case of $i = 0$ we define $k = -1$. For the $j$th ERSBT
of the MSBT graph with source node 0, the set of child nodes,
and the parent node, of node $i$ are defined as follows:

$$\mathrm{children}_{MSBT}(i,j,0) = \begin{cases}
\{(a_{n-1} a_{n-2} \ldots \bar{a}_j \ldots a_0)\}, & \text{if } k = -1; \\
\{(a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in M_{MSBT}(i,j) \cup \{j\}, & \text{if } a_j = 1, k \ne j; \\
\{(a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in M_{MSBT}(i,j), & \text{if } a_j = 1, k = j; \\
\phi, & \text{if } a_j = 0, k \ne -1.
\end{cases}$$

$$\mathrm{parent}_{MSBT}(i,j,0) = \begin{cases}
\phi, & \text{if } k = -1; \\
(a_{n-1} a_{n-2} \ldots \bar{a}_j \ldots a_0), & \text{if } a_j = 0, k \ne -1; \\
(a_{n-1} a_{n-2} \ldots \bar{a}_k \ldots a_0), & \text{if } a_j = 1.
\end{cases}$$

All nodes with bit $j$ equal to zero are leaf nodes of the
$j$th ERSBT, except node 0. Conversely, all nodes with $a_j = 1$
are internal nodes of the $j$th ERSBT. Figure 2 shows an MSBT
graph with source node 0 in a 3-cube.

It can be shown that the log N directed ERSBT's are
edge-disjoint and the height of the MSBT graph is minimal among all
possible configurations of log N edge-disjoint spanning trees [8].
Figure 1: A spanning tree in a 4-cube.

Figure 2: Three edge-disjoint directed spanning trees in a
3-cube.


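For an arbitrary source node $s$ an MSBT graph is defined
by translating the MSBT graph rooted at node 0. The only
difference in the definition of the parent and children functions
is that $k$ is determined from $c = i \oplus s$. Hence, for a source node
$s$, $k$ is such that $c_k = 1$, and $c_m = 0, \forall m \in M_{MSBT}(i \oplus s, j)$.
For the special case of $c = 0$, $k = -1$.

$$\mathrm{children}_{MSBT}(i,j,s) = \begin{cases}
\{(a_{n-1} a_{n-2} \ldots \bar{a}_j \ldots a_0)\}, & \text{if } k = -1; \\
\{(a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in M_{MSBT}(i \oplus s, j) \cup \{j\}, & \text{if } c_j = 1, k \ne j; \\
\{(a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in M_{MSBT}(i \oplus s, j), & \text{if } c_j = 1, k = j; \\
\phi, & \text{if } c_j = 0, k \ne -1.
\end{cases}$$

$$\mathrm{parent}_{MSBT}(i,j,s) = \begin{cases}
\phi, & \text{if } k = -1; \\
(a_{n-1} a_{n-2} \ldots \bar{a}_j \ldots a_0), & \text{if } c_j = 0, k \ne -1; \\
(a_{n-1} a_{n-2} \ldots \bar{a}_k \ldots a_0), & \text{if } c_j = 1.
\end{cases}$$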
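As an illustration of the MSBT definition, the sketch below builds the log N edge-reversed trees for a given source and collects their directed edges, so that the edge-disjointness stated above can be checked by counting. The helper names (`first_one_right`, `zero_run`, `children_msbt`, `parent_msbt`, `tree_edges`) are assumptions of this example, not names from the paper.

```python
def first_one_right(c, j, n):
    """k: first one bit of c cyclically to the right of bit j
    (k == j if only bit j is set), or -1 if c == 0."""
    if c == 0:
        return -1
    for d in range(1, n + 1):
        m = (j - d) % n
        if (c >> m) & 1:
            return m

def zero_run(k, j, n):
    """The set M_MSBT: positions (k+1) mod n, ..., (j-1) mod n."""
    run, m = [], (k + 1) % n
    while m != j:
        run.append(m)
        m = (m + 1) % n
    return run

def children_msbt(i, j, s, n):
    c = i ^ s
    k = first_one_right(c, j, n)
    if k == -1:                    # the source has one child in tree j
        return [i ^ (1 << j)]
    if not (c >> j) & 1:           # c_j = 0: leaf of the j-th ERSBT
        return []
    bits = zero_run(k, j, n)
    if k != j:
        bits.append(j)
    return [i ^ (1 << m) for m in bits]

def parent_msbt(i, j, s, n):
    c = i ^ s
    k = first_one_right(c, j, n)
    if k == -1:
        return None
    if not (c >> j) & 1:
        return i ^ (1 << j)
    return i ^ (1 << k)

def tree_edges(j, s, n):
    """Directed (parent, child) edges of the j-th tree, source included."""
    edges, frontier = [], [s]
    while frontier:
        nxt = []
        for p in frontier:
            for ch in children_msbt(p, j, s, n):
                edges.append((p, ch))
                nxt.append(ch)
        frontier = nxt
    return edges

n, s = 3, 0
all_edges = [e for j in range(n) for e in tree_edges(j, s, n)]
print(len(all_edges), len(set(all_edges)))   # equal if the trees are edge-disjoint
```

Each tree contributes N - 1 directed edges, so (N - 1) log N distinct edges in total indicates that no directed edge is used by two trees.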
3.3. Communication Complexity of SBT- and MSBT-based 
Broadcasting 


3.3.1. Spanning Binomial Trees 

With the communication restricted to one port at a time
the data is first sent to the node that is the root of the largest
subtree. Since the binomial tree is composed of two (n - 1)-level
binomial trees, the broadcasting operation is now reduced to
the broadcasting of data in two same-sized, disjoint subtrees
of the cube. The process is repeated log N times and the
complexity is $T = \lceil M/B \rceil (\tau + B t_c) \log N$. Clearly $B_{opt} = M$ and
$T_{min} = \log N (\tau + M t_c)$. The data transfer time is independent
of the packet size, but the number of start-ups decreases as the
packet size increases.

With a capability of concurrent communication on all ports,
pipelining can be employed extensively. The propagation time
to the node farthest away from the source is at least
$\log N (\tau + B t_c)$. When this node has received all packets the
broadcasting is terminated. Hence, $T = (\lceil M/B \rceil + \log N - 1)(\tau + B t_c)$,
$B_{opt} = \sqrt{\frac{M \tau}{(\log N - 1) t_c}}$, and
$T_{min} = (\sqrt{M t_c} + \sqrt{\tau (\log N - 1)})^2$.

The communication complexity estimates for the SBT are also
given in [18] and are included here for easy reference.
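The SBT estimates are easy to evaluate numerically. The sketch below codes the two expressions for T and the closed forms for the optimal packet size and minimum time with all ports active; the function names and the sample values of M, n = log N, tau and t_c are illustrative assumptions. Note that the closed forms treat M/B as a real number, so the value with the ceiling can be somewhat larger.

```python
from math import sqrt

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def t_sbt_one_port(M, B, n, tau, tc):
    """T = ceil(M/B) (tau + B tc) log N, one port at a time."""
    return ceil_div(M, B) * (tau + B * tc) * n

def t_sbt_all_ports(M, B, n, tau, tc):
    """T = (ceil(M/B) + log N - 1)(tau + B tc), all ports concurrently."""
    return (ceil_div(M, B) + n - 1) * (tau + B * tc)

def b_opt_all_ports(M, n, tau, tc):
    """Packet size minimizing the all-ports estimate (real-valued)."""
    return sqrt(M * tau / ((n - 1) * tc))

def t_min_all_ports(M, n, tau, tc):
    """Minimum of the all-ports estimate at the optimal packet size."""
    return (sqrt(M * tc) + sqrt(tau * (n - 1))) ** 2

M, n, tau, tc = 1024, 5, 100, 1          # assumed sample values
print(t_sbt_one_port(M, M, n, tau, tc))  # B = M is optimal for one port
print(b_opt_all_ports(M, n, tau, tc))
print(t_min_all_ports(M, n, tau, tc))
print(t_sbt_all_ports(M, 160, n, tau, tc))
```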


3.3.2. Multiple Spanning Binomial Trees 

We consider the cases with communication restricted to 
one send or one receive operation at a time, one send and one 
receive operation concurrently, and concurrent communication 
on all ports. The minimum number of routing steps to broad- 
cast log N packets is 2log N with communication restricted to 
one send and one receive operation at a time. To realize this 
lower bound it is required that a routing algorithm be found 
that allows concurrent communication within all subtrees with- 
out violating the constraint on concurrent communication. We 
describe such a routing algorithm in terms of labelling the 
MSBT graph with the least label being 0. A valid labelling 
for the restriction of one receive and one send operation con- 
currently (per node) and allowing pipelining every log N cycles, 
requires that the following three conditions be satisfied: 


1. For any node of each subtree the least label on the output 
edges is greater than the label on the input edge. 


2. For any cube node the labels on its input edges are distinct 
modulo log N. (If there is more than one packet per subtree 
then the root can send out a new packet to every subtree 
every log N cycles.) 

3. For any cube node the labels on the output edges are dis- 
tinct modulo log N. 

Let $i = (a_{n-1} a_{n-2} \ldots a_0)$ and $f(i,j)$ be the label of the input
edge of node $i$ in the $j$th subtree for an MSBT graph with source
node $s$. Let $c = i \oplus s$, $c_k = 1$ and $c_m = 0, \forall m \in M_{MSBT}(i \oplus s, j)$.
If $c = 0$ then $k = -1$. Define

$$f(i,j) = \begin{cases}
\phi, & \text{if } k = -1; \\
j + n, & \text{if } c_j = 0, k \ne -1; \\
k, & \text{if } c_j = 1, k \ge j; \\
k + n, & \text{if } c_j = 1, k < j.
\end{cases}$$

It can be proved that function $f$ satisfies these three
conditions [8]. From the labelling scheme, the largest label of all
the input edges is $2n - 1$, i.e., broadcasting the first log N
packets (one packet per subtree) can be done in 2 log N steps. The
MSBT graph allows M elements to be broadcast in $\lceil M/B \rceil + \log N$
routing steps under the constraint of one receive operation
concurrent with one send operation. This is a strict lower bound
for $M/B > 1$ [8]. Figure 3 shows an MSBT graph for a 3-cube
labelled by the function $f$ defined above.
Figure 3: Routing in an MSBT graph with communication
on one port at a time.
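The three labelling conditions can be checked mechanically for small cubes. The sketch below is an illustration under assumed helper names (`first_one_right`, `parent_msbt`, `label`, `check_labelling`); it recomputes the parent function of the MSBT graph so that the block is self-contained, and it does not replace the proof referenced in [8].

```python
def first_one_right(c, j, n):
    """First one bit of c cyclically to the right of bit j, or -1 if c == 0."""
    if c == 0:
        return -1
    for d in range(1, n + 1):
        m = (j - d) % n
        if (c >> m) & 1:
            return m

def parent_msbt(i, j, s, n):
    c = i ^ s
    k = first_one_right(c, j, n)
    if k == -1:
        return None
    return i ^ (1 << j) if not (c >> j) & 1 else i ^ (1 << k)

def label(i, j, s, n):
    """f(i, j): label of the input edge of node i in the j-th subtree."""
    c = i ^ s
    k = first_one_right(c, j, n)
    if k == -1:
        return None                  # the source has no input edge
    if not (c >> j) & 1:
        return j + n
    return k if k >= j else k + n

def check_labelling(s, n):
    """Return True if the three conditions hold for source s in an n-cube."""
    N = 1 << n
    inputs = {i: [] for i in range(N)}
    outputs = {i: [] for i in range(N)}
    for j in range(n):
        for i in range(N):
            p = parent_msbt(i, j, s, n)
            if p is None:
                continue
            f = label(i, j, s, n)
            inputs[i].append(f)
            outputs[p].append(f)
            # Condition 1: an output label exceeds the input label.
            if p != s and f <= label(p, j, s, n):
                return False
    for i in range(N):
        # Conditions 2 and 3: labels are distinct modulo log N.
        for labels in (inputs[i], outputs[i]):
            if len({f % n for f in labels}) != len(labels):
                return False
    return True

def max_label(s, n):
    return max(label(i, j, s, n) for j in range(n)
               for i in range(1 << n) if i != s)

print(check_labelling(0, 3), max_label(0, 3))
```

For a 3-cube the largest label can be compared with the value $2n - 1$ stated above.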


For communication restricted to one send or one receive
operation per node, we can transform each previously defined
cycle into two cycles. Notice that in the previous routing
algorithm, all the communication links are used in only one
direction during the first n routing steps, and in the last routing
step. Hence the MSBT graph allows M elements to be broadcast
in $2 \lceil M/B \rceil + \log N - 1$ routing steps under the constraint
of at most one receive or one send operation during each step.
This is a strict lower bound for $M/B > 1$ [8].

With a maximum packet size of B and one receive operation
concurrent with a send operation the communication complexity
for MSBT-based broadcasting is $T = (\lceil M/B \rceil + \log N)(\tau + B t_c)$,
which is minimized for $B_{opt} = \sqrt{\frac{M \tau}{t_c \log N}}$;
$T_{min} = (\sqrt{M t_c} + \sqrt{\tau \log N})^2$. Restricting the
communication to one send or one receive operation at a time,
$T = (2 \lceil M/B \rceil + \log N - 1)(\tau + B t_c)$
and $T_{min} = (\sqrt{2 M t_c} + \sqrt{\tau (\log N - 1)})^2$.

For the case in which communication on all ports can
take place concurrently the communication complexity is
$T = (\lceil \frac{M}{B \log N} \rceil + \log N)(\tau + B t_c)$. With optimal packet size
$B_{opt} = \frac{1}{\log N} \sqrt{\frac{M \tau}{t_c}}$,
$T_{min} = (\sqrt{\frac{M t_c}{\log N}} + \sqrt{\tau \log N})^2$.
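The three MSBT cases differ only in the number of routing steps, which the following sketch makes explicit. The mode strings, function names and sample parameter values are assumptions made for this illustration; the closed-form minima ignore the ceiling and therefore act as lower bounds on the evaluated times.

```python
from math import sqrt

def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)

def msbt_steps(M, B, n, mode):
    """Routing steps for MSBT-based broadcasting of M elements, n = log N."""
    if mode == "one_send_or_receive":
        return 2 * ceil_div(M, B) + n - 1
    if mode == "one_send_and_receive":
        return ceil_div(M, B) + n
    if mode == "all_ports":
        return ceil_div(M, B * n) + n
    raise ValueError(mode)

def msbt_time(M, B, n, tau, tc, mode):
    """T = (number of routing steps) * (tau + B tc)."""
    return msbt_steps(M, B, n, mode) * (tau + B * tc)

def msbt_t_min(M, n, tau, tc, mode):
    """Closed-form minimum over B, treating M/B as a real number."""
    if mode == "one_send_or_receive":
        return (sqrt(2 * M * tc) + sqrt(tau * (n - 1))) ** 2
    if mode == "one_send_and_receive":
        return (sqrt(M * tc) + sqrt(tau * n)) ** 2
    if mode == "all_ports":
        return (sqrt(M * tc / n) + sqrt(tau * n)) ** 2
    raise ValueError(mode)

M, n, tau, tc = 1024, 4, 64, 1           # assumed sample values
for mode in ("one_send_or_receive", "one_send_and_receive", "all_ports"):
    best = min(msbt_time(M, B, n, tau, tc, mode) for B in range(1, M + 1))
    print(mode, best, msbt_t_min(M, n, tau, tc, mode))
```

The loop compares the best integer packet size found by exhaustive search with the closed-form minimum for each communication assumption.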


3.4. Comparison and Conclusion 

In the following, we also compare the MSBT algorithm 
with the one based on a Two-rooted Complete Binary Tree 
(TCBT) [2] or a Hamiltonian Path (HP) as a broadcasting tree 
respectively. Table 1 shows the propagation delay of various 
algorithms. Interestingly, broadcasting through a Hamiltonian
Path on a hypercube may be faster than broadcasting based on
the SBT or even the TCBT, depending on the values of M, $t_c$,
$\tau$ and N. With communication on all the ports concurrently,
the MSBT-based algorithm can send out log N distinct packets 
every cycle while the SBT- and TCBT-based algorithms can 
only send out one distinct packet every cycle. Table 2 compares 
the number of cycles per distinct packet for various algorithms. 
Some variations exist, such as using two Hamiltonian paths 
with opposite directions sending distinct data, or using one 
Hamiltonian path such that the source node is at the center of 
the path. However, these variations only affect (either increase 
or decrease) delays, and the number of cycles per packet, by at 
most a factor of two. 

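The complexity estimates are summarized in Table 3. A
potential for concurrent communication on all ports reduces
the number of sequential start-ups and the bandwidth
requirement by a factor of approximately log N for an arbitrary packet
size in both SBT- and MSBT-based broadcasting. TCBT-based
broadcasting does not fully utilize the bandwidth of a
cube. The reduction in communication complexity for concurrent
communication on all ports is a factor of 2 or 3. Optimizing
the packet size for each situation brings the number of
start-ups to O(log N), irrespective of whether communication
on one port or log N ports at a time is possible.

Table 1: Propagation delays.

Algorithm | Propagation delay
HP        | N - 1
TCBT      | log N
MSBT      | log N + 1

Table 2: Number of cycles per distinct packet.

Table 3: Complexity estimates.

Algorithm, communication assumption | T | B_opt | T_min
HP, 1 s & r | $(\lceil M/B \rceil + N - 3)(\tau + B t_c)$ | $\sqrt{\frac{M \tau}{(N - 3) t_c}}$ | $(\sqrt{M t_c} + \sqrt{(N - 3) \tau})^2$
SBT, 1 port | $\lceil M/B \rceil \log N (\tau + B t_c)$ | $M$ | $\log N (M t_c + \tau)$
SBT, log N ports | $(\lceil M/B \rceil + \log N - 1)(\tau + B t_c)$ | $\sqrt{\frac{M \tau}{(\log N - 1) t_c}}$ | $(\sqrt{M t_c} + \sqrt{\tau (\log N - 1)})^2$
TCBT, 1 s or r | $(3 \lceil M/B \rceil + 2 \log N - 5)(\tau + B t_c)$ | $\sqrt{\frac{3 M \tau}{(2 \log N - 5) t_c}}$ | $(\sqrt{3 M t_c} + \sqrt{\tau (2 \log N - 5)})^2$
TCBT, 1 s & r | $2 (\lceil M/B \rceil + \log N - 2)(\tau + B t_c)$ | $\sqrt{\frac{M \tau}{(\log N - 2) t_c}}$ | $2 (\sqrt{M t_c} + \sqrt{\tau (\log N - 2)})^2$
TCBT, log N ports | $(\lceil M/B \rceil + \log N - 1)(\tau + B t_c)$ | $\sqrt{\frac{M \tau}{(\log N - 1) t_c}}$ | $(\sqrt{M t_c} + \sqrt{\tau (\log N - 1)})^2$
MSBT, 1 s or r | $(2 \lceil M/B \rceil + \log N - 1)(\tau + B t_c)$ | $\sqrt{\frac{2 M \tau}{(\log N - 1) t_c}}$ | $(\sqrt{2 M t_c} + \sqrt{\tau (\log N - 1)})^2$
MSBT, 1 s & r | $(\lceil M/B \rceil + \log N)(\tau + B t_c)$ | $\sqrt{\frac{M \tau}{t_c \log N}}$ | $(\sqrt{M t_c} + \sqrt{\tau \log N})^2$
MSBT, log N ports | $(\lceil \frac{M}{B \log N} \rceil + \log N)(\tau + B t_c)$ | $\frac{1}{\log N} \sqrt{\frac{M \tau}{t_c}}$ | $(\sqrt{\frac{M t_c}{\log N}} + \sqrt{\tau \log N})^2$

Table 4: Communication complexity compared to the MSBT routing.
(Ratios SBT/MSBT and TCBT/MSBT by communication assumption;
conditions $B = B_{opt}$, $\tau \log N > M t_c$ and $B = B_{opt}$, $\tau \log N < M t_c$.)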

The MSBT-based broadcasting always offers a reduction 
in the bandwidth requirement for individual communication 
links by a factor of approximately log N over SBT-based broad- 
casting. With communication on all ports concurrently, the 
MSBT-based broadcasting has a communication complexity 
that is lower than that of TCBT-based broadcasting by a fac- 
tor of log N. Even with communication only on one port at 
a time, MSBT-based broadcasting still is faster than TCBT- 
based broadcasting by a factor of 1.5 or 2. The communication 
complexities of broadcasting based on the SBT and the TCBT 
are compared with that based on the MSBT in Table 4.⁵


4. Personalized Communication 


In personalized communication no replication of information
takes place during distribution, nor is there any reduction
during the reverse operation. In broadcasting, the bandwidth
requirement grows with the distance $i$ from the source node
precisely as the number of nodes grows. In personalized
communication the bandwidth requirement instead decreases in
proportion to the number of nodes at distance less than or equal to $i$
from the source. The root is the "bottleneck" in personalized
communication. In this section we define a pruning strategy
for the MSBT graph that generates a balanced spanning tree
(BST) of height log N. If concurrent communication on all
log N ports (of the root) is possible, then a lower bound for
the transmission time is $\frac{N-1}{\log N} M t_c$, and a lower bound for the
number of start-ups is log N. The BST makes possible
personalized communication in a time corresponding to this lower
bound.

⁵ Notice that the entry for the last column and the last row in the table is
based on the assumption that $B = B_{opt}$, $\tau \log^2 N < M t_c$.


4.1. A Balanced Spanning Tree 

In the SBT topology a node $i$ belongs to the $j$th subtree iff
$a_j = 1$ and $a_k = 0, \forall k < j$. In the MSBT graph a node is an internal
node of the $j$th ERSBT if $a_j = 1$. Bit $j$ can be considered as a
base for the $j$th subtree. For the BST we define the base as follows:

Let $J_i = \{j_1, j_2, \ldots, j_m\}$, where $j_1 < j_2 < \ldots < j_m$,
$R^u(i) = R^v(i)$, $u, v \in J_i$, and $R^u(i) < R^l(i)$, $u \in J_i$, $l \notin J_i$. $|J_i| =
n/P_i$ where $P_i$ is the period of $i$. Then $base(i) = j_1$ and node
$i$ is assigned to subtree $j_1$, i.e., the value of the base equals
the minimum number of right rotations such that the rotated
number has a minimum value among all the rotated values.⁶
For example, base((011010)) = 3 and base((110110)) = 1. The
period of (011010) is 6 and the period of (110110) is 3. For
ease of notation we omit the subscript on $j$ in the following.
For the definition of the parent and children functions we first
find the position $k$ of the first bit cyclically to the right of bit $j$
that is equal to 1, i.e., $a_k = 1$, and $a_m = 0, \forall m \in M_{MSBT}(i,j)$
($k = j$ if every bit but $j$ is 0). For $i = 0$, $k = -1$. Then

$$\mathrm{parent}_{BST}(i,0) = \begin{cases}
\phi, & \text{if } i = 0; \\
(a_{n-1} a_{n-2} \ldots \bar{a}_k \ldots a_0), & \text{otherwise.}
\end{cases}$$

$$\mathrm{children}_{BST}(i,0) = \begin{cases}
\{(a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in \{0, 1, \ldots, n-1\}, & \text{if } i = 0; \\
\{q_m = (a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in M_{MSBT}(i,j) \text{ and } base(q_m) = base(i), & \text{if } i \ne 0.
\end{cases}$$

⁶ The notion of base is similar to the idea of distinguished node used in [15] in
that base = 0 distinguishes a node from a generator set (necklace).

The $\mathrm{parent}_{BST}$ function preserves the base, since for any
node $i$ with base $j$, $a_k$ is the highest order bit of $R^j(i)$ that is
equal to one. Complementing this bit cannot change the base. It is
also readily seen that the $\mathrm{parent}_{BST}$ and $\mathrm{children}_{BST}$ functions
are consistent.

Figure 4 shows the spanning tree generated by the algo- 
rithm above for the root located at node 0 in a 5-cube. 
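The base and parent functions of the BST can be sketched as follows for source node 0, using right rotation as defined in Section 2. The names `rotate_right`, `base`, `parent_bst` and `subtree_sizes` are assumptions of this example; subtree membership is taken to be determined by the base, as in the definition above.

```python
def rotate_right(i, n, steps):
    """Rotate the n-bit address i right by `steps` positions."""
    steps %= n
    mask = (1 << n) - 1
    return ((i >> steps) | (i << (n - steps))) & mask

def base(i, n):
    """Least number of right rotations that yields the minimum rotated value."""
    return min(range(n), key=lambda j: rotate_right(i, n, j))

def parent_bst(i, n):
    """Parent of node i in the BST rooted at node 0, or None for the root."""
    if i == 0:
        return None
    j = base(i, n)
    # Find k, the first one bit cyclically to the right of bit j
    # (k == j when bit j is the only one bit), and complement it.
    for d in range(1, n + 1):
        k = (j - d) % n
        if (i >> k) & 1:
            return i ^ (1 << k)

def subtree_sizes(n):
    """Number of nodes assigned to each of the n subtrees of the root."""
    sizes = [0] * n
    for i in range(1, 1 << n):
        sizes[base(i, n)] += 1
    return sizes

print(base(0b110110, 6))   # base of (110110)
print(subtree_sizes(5))    # subtree sizes in a 5-cube
print(subtree_sizes(6))    # subtree sizes in a 6-cube
```

The printed subtree sizes show how close the tree comes to (N - 1)/log N nodes per subtree, with cyclic nodes accounting for the differences.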

For an arbitrary source node s we translate the BST rooted 
at node 0 to node s by performing for each node the bit-wise 
exclusive-or function of its address and the address of the source 
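node. The base of a node is determined from $c = i \oplus s$, and
the children and parent functions are readily modified.

Let $J_c = \{j_1, j_2, \ldots, j_m\}$, where $j_1 < j_2 < \ldots < j_m$,
$R^u(c) = R^v(c)$, $u, v \in J_c$, and $R^u(c) < R^l(c)$, $u \in J_c$, $l \notin J_c$.
Then $base(c) = j_1$. Then $k$ is defined by $c_k = 1$ and
$c_m = 0, \forall m \in M_{MSBT}(i \oplus s, j)$ with $k = -1$ if $c = 0$.

$$\mathrm{parent}_{BST}(i,s) = \begin{cases}
\phi, & \text{if } c = 0; \\
(a_{n-1} a_{n-2} \ldots \bar{a}_k \ldots a_0), & \text{otherwise.}
\end{cases}$$

$$\mathrm{children}_{BST}(i,s) = \begin{cases}
\{(a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in \{0, 1, \ldots, n-1\}, & \text{if } c = 0; \\
\{q_m = (a_{n-1} a_{n-2} \ldots \bar{a}_m \ldots a_0)\}, \forall m \in M_{MSBT}(i \oplus s, j) \text{ and } base(q_m \oplus s) = base(i \oplus s), & \text{if } c \ne 0.
\end{cases}$$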


Lemma 4.1. The number of nodes in a subtree is of order
$O(\frac{N}{\log N})$.

Proof. With A cyclic nodes there are at least (N - A)/log N
nodes in a subtree. Denoting the number of generator sets for
cyclic nodes by B it follows that the maximum number of nodes
in a subtree is (N - A)/log N + B - 1. To derive bounds on A we
use the complex plane diagram used by Hoey and Leiserson [9]
in studying the shuffle-exchange network. Leighton [15] shows
that $B = O(\sqrt{N})$.

Full necklaces, i.e., non-cyclic nodes, are mapped to circles.
Degenerate necklaces, i.e., cyclic nodes, are mapped to the
origin. In the context of the shuffle-exchange graph each node
that is mapped to the origin of the complex plane is adjacent
(via an exchange edge) to a node at position (1,0) or (-1,0).
Hence, for every full necklace of log N nodes there are at most
2 cyclic nodes. Node 0 is adjacent to a node of a full necklace,
and so is node N - 1 (for log N > 2). It follows that an upper
bound on A is $\frac{2N}{\log N + 2}$ and the number of nodes in a subtree
is at least $\frac{N}{\log N + 2}$. The relative difference in the number of
nodes in the maximum and minimum subtrees approaches 0
for $N \to \infty$.
Figure 4: A balanced spanning tree in a 5-cube.

Table 5: A Comparison of maximum subtree sizes of the
Balanced Spanning Tree and values of (N - 1)/log N.


Table 5 gives the sizes of the maximum subtrees generated 
according to the definition of the BST for up to 20-dimensional 
cubes. The relative difference approaches 0 rapidly. The last 
column contains the ratio of BST(max) to (N — 1)/log N. 


Some properties of the BST are listed below. For detailed 
proofs see [8]. 


1. The height of one subtree is log N, and the height of all 
other subtrees is log N — 1. 


2. The maximum fanout of any node at level i in a BST is
[etry for 1 <i <logN. 


3. Let $\phi(i,j)$ be the number of nodes at distance $j$ from node
$i$ in the subtree rooted at node $i$. Then $\phi(i,j) \ge \phi(k,j)$,
where node $k$ is a child of node $i$.⁷


4. Excluding node $i \oplus s = (11 \ldots 1)$, all the subtrees of the
BST are isomorphic if log N is a prime number.


5. Subtrees P to log N—1 contain no cyclic nodes with period 
P. 


6. Any cyclic node is a leaf node of the BST. 


4.2. The Complexity of Personalized Communication Based on 
the SBT and BST Topologies 


4.2.1. Spanning Binomial Trees 

For SBT-based distribution restricted to communication
on one port at a time, the communication complexity for a
maximum packet size of B is
$T \approx (N - 1) M t_c + \tau (NM/B + \lceil \log (B/M) \rceil - 1)$,
$M \le B \le NM/2$, which is minimized for
$B = NM/2$, yielding $T = (N - 1) M t_c + \tau \log N$. For $B < M$,
$T \approx (NM/B - 1)(B t_c + \tau)$. There exist several algorithms of
this complexity. One such algorithm sends the cumulative data
for the largest subtree to the root of this subtree first. Both
nodes then recursively execute the same algorithm in their own
(n - 1)-subcube.


Lemma 4.2. In distributing data from the root in a level-by-level
order starting from level log N, the time to complete the
distribution is determined by the root, which terminates the
distribution in a time proportional to the lower bound for a
sufficiently large packet size.

⁷ This property is required in deriving the communication complexity for the
BST routing.

Proof. With potential concurrent communication on log N
ports, the root can send data for level $\log N - i$ during step
$i$, $0 \le i \le \log N - 1$, assuming a sufficiently large packet
size. The amount of data sent from the root during step 2 
to the largest subtree is (lee : “")M . The amount of data sent 
during the same step from the node at level 7 to its largest sub- 
tree, which is the largest subtree at that level, is ghar )M, 


where 1 < 7 < logN — 1. Since (eee) > oa) for all 
i,j > 1, and any other subtree rooted at level 7 is isomorphic to 


a subgraph of the highest subtree rooted at level $j$, the transfer
time is determined by the transfer time of the root. For a
sufficiently large packet size, i.e.,
$B \ge \binom{\log N - 1}{\lfloor (\log N - 1)/2 \rfloor} M \approx \frac{NM}{\sqrt{2 \pi (\log N - 1)}}$,
$T_{min} = \frac{N}{2} M t_c + \tau \log N$.


With a potential for concurrent communication on log N 
ports, a reduction in the transfer time by a factor of 2 is 
possible compared to communication on one port at a time. 
We conclude that for SBT-based algorithms the packet size is 
of greater importance than concurrent communication on all 
ports. 


4.2.2. Balanced Spanning Trees 

With BST-based personalized communication restricted to
one port at a time the root can send data to the subtrees
cyclically. With a maximum packet size $B > M$, data for several
nodes can be merged into one packet. The receiving node has
sufficient time to retransmit pieces on all its ports, should that
be required, since a new packet only arrives every log N cycles.
The root requires a time of $T \approx \frac{(N-1)M}{B \log N} (\tau + B t_c) \log N$, which,
if the data to the most remote nodes is transmitted first, is also
the time to completion. For $B = M$, $T = (N - 1)(\tau + M t_c)$,
i.e., the same as in the SBT algorithm. For $B > \frac{N}{\log N} M$
the root of the BST need only perform one communication
per subtree, and it completes the communication in a time
of $T = \tau \log N + (N - 1) M t_c$. But, unlike the SBT
algorithm, the communication is not terminated when the root is
done. The message to the last visited subtree needs to traverse
log N — 2 communication links. The bandwidth requirement 
of each subtree can be shown to be 2Nrene® (see [8] for 
detailed proof). An upper bound on the time for personalized 
communication based on the BST with unbounded packet size 
is T = (2logN — 2)r+ N(1 + amet”) Mte. The number of 
start-ups is almost twice that of the SBT-based personalized 
communication, and the total transfer time is higher by a lower 
order term. The time for personalized communication based on 
the BST is minimized for B > ion M : 

With a potential for communication on log N ports at a 
time, the time for personalized communication based on the 


BST topology is T ~ Oe (r + Bt.) for B< M andT ~ 


log N sr (log N . log N M 
Drie (F((°% ) BrioeW| + B;t,), By = min(B, (8 ) Briogw l) 
for B > M. This complexity estimate is valid if data for nodes 
at distance 7 are sent during step log N —1, 1 <7 < logN. If 
B=M thn T ~ cow (T + Mt,). The communication time is 


minimized for B > ot nM by using a level-by-level algorithm 


as in lemma 4.2. By property p3 of the BST, it follows that
the amount of data sent from the root to any subtree during
step i is no less than the amount of data sent from any node
during the same step. Hence, T_min = τ log N + ((N - 1)M / log N) t_c, the
minimum possible.
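
To make the bound concrete, the sketch below evaluates two expressions. The first is the root's one-port time with one packet per subtree, τ log N + (N - 1)M t_c. The second is the all-port minimum, taken here as τ log N + (N - 1)M t_c / log N, which is the form listed for BST routing on log N ports in Table 6. The function names and the sample parameter values are illustrative assumptions, not measurements from the paper.

```python
def one_port_root_time(n_dims, m, tau, t_c):
    """tau*log N + (N-1)*M*t_c: all N-1 messages leave the root over one port at a time."""
    n_nodes = 2 ** n_dims
    return tau * n_dims + (n_nodes - 1) * m * t_c


def all_port_minimum(n_dims, m, tau, t_c):
    """tau*log N + (N-1)*M*t_c/log N: the same data spread evenly over log N ports."""
    n_nodes = 2 ** n_dims
    return tau * n_dims + (n_nodes - 1) * m * t_c / n_dims


# Hypothetical parameters: 6-cube, 1024-byte messages, 1 ms start-up, 1 us per byte.
n_dims, m, tau, t_c = 6, 1024, 1e-3, 1e-6
print(one_port_root_time(n_dims, m, tau, t_c))
print(all_port_minimum(n_dims, m, tau, t_c))
```

With the start-up cost set to zero the two expressions differ by exactly a factor of log N, which is the gain available from driving all ports at once.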


4.3. Comparison and Conclusion 

With communication on one port at a time, if the fixed 
maximum packet size B < M holds, then the complexity of 
SBT- and BST-based personalized communication is the same. 
For B > M, the SBT-based routing algorithm yields a lower 
complexity than the BST-based routing. For a sufficiently large 
maximum packet size the SBT-based algorithm has log N start- 
ups compared to 2log N — 2 start-ups for the BST-based al- 
gorithm. The transmission times are comparable, though the 
transmission time for the BST routing is higher. Note that, 
as in broadcasting, the minimum number of start-ups can be 
accomplished for sufficiently large maximum packet size. 

With concurrent communication on log N ports, the number
of start-ups and the transmission time of BST-based routing
are lower than those for the SBT by a factor of ½ log N for a
maximum packet size B < M. With a sufficiently large packet
size all routings yield a minimum of log N start-ups, but the
BST routing has a total transmission time that is lower than
that of SBT by a factor of ½ log N. Moreover, it is achieved at


a maximum packet size of ogy M , compared to a maximum 


ines for the SBT routing. We conclude 


that if communication can be performed on all log N ports concurrently,
the communication complexity of the BST routing
may be lower by a factor of ½ log N compared to the SBT routing.
Table 6 lists the communication complexity for optimal
packet size. The timing for TCBT routing is also included for
easy comparison. See [8] for detailed analysis.


packet size of 


< (2N — 2log N — 1)Mé, + (2log N — 2)r 
(EN — 1)Mi, + log Nr 

< N(1+ =9ES#") Mt, + (2log N — 2)r 
~ (N — 1)/log NMt, + log Nr 


Table 6: Communication complexity of personalized commu- 
nication. 


5. Experimental Results 


5.1. Single Source Broadcasting Based on the SBT and MSBT 

Figure 5 shows the measured time to completion of a 
broadcasting operation based on the SBT topology for cubes of 
various dimensions and a number of different external packet 
sizes. As expected, the communication time increases almost 
linearly for external packet sizes below 1k bytes. Figure 6 shows 
the measured time of SBT- and MSBT-based broadcasting for 
a message of 60k bytes with each packet being 1k bytes and 
for cube dimensions ranging from 2 to 6. Figure 7 shows the 
speed-up of broadcasting based on the MSBT topology over 
the SBT topology. The measured speed-up is approximately 
log N, as predicted. 


5.2. Single Source Personalized Communication

Figure 5: Broadcasting using SBT.
Figure 6: Broadcasting using SBT and MSBT.

In implementing the SBT routing on the Intel iPSC, the
root processes the data in descending order starting with the
relative address N - 1. This order implies that data is transmitted
over ports in an order corresponding to the transition
sequence in a binary-reflected Gray code [17]. Hence, port 0 is
used every other cycle, port 1 every fourth cycle, etc. Internal
nodes retransmit a message on a port chosen from among the
ones that correspond to the leading zeroes in their relative addresses.
The choice is made according to a binary-reflected Gray
code on the leading zeroes.
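
The port pattern described above (port 0 every other cycle, port 1 every fourth cycle) is the transition sequence of a binary-reflected Gray code. The following is a minimal sketch of that sequence only; the function names are illustrative, and it does not reproduce the iPSC implementation or the descending-address traversal itself.

```python
def gray(i):
    """i-th codeword of the binary-reflected Gray code."""
    return i ^ (i >> 1)


def transition_sequence(n_dims):
    """Index of the single bit (port) that changes between consecutive codewords."""
    ports = []
    for i in range(2 ** n_dims - 1):
        changed = gray(i) ^ gray(i + 1)       # exactly one bit is set
        ports.append(changed.bit_length() - 1)
    return ports


print(transition_sequence(3))  # [0, 1, 0, 2, 0, 1, 0]
```

In the 3-cube sequence, port 0 appears at every second position and port 1 at every fourth, matching the usage pattern stated in the text.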

For the implementation of the BST-based algorithm the 
routing order needs to be determined for each subtree of the 
root. Excluding cyclic nodes, the subtrees are isomorphic. The
root only needs to keep one table of length ≈ N / log N with each
entry of size log N bits. The order of the entries corresponds
to the transmission order for each port. The table entry points 
to the messages transmitted over port 0. The pointers for the 
other ports are simply obtained by (right) cyclic shifts of the 
table entries. The cyclic nodes can be handled by finding the 
period P for each cyclic table entry, and not transmitting the 
message corresponding to this table entry for ports with index 
j ≥ P.
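
The table scheme can be sketched as follows. This is a hedged illustration: it assumes that the entry for port j is obtained by rotating the port-0 entry right by j positions, and that the period P of an entry is the smallest rotation that maps it onto itself. The helper names are hypothetical and are not taken from the implementation described here.

```python
def rotate_right(addr, shift, n_bits):
    """Cyclic right shift of an n_bits-wide address."""
    shift %= n_bits
    mask = (1 << n_bits) - 1
    return ((addr >> shift) | (addr << (n_bits - shift))) & mask


def period(addr, n_bits):
    """Smallest P >= 1 such that rotating addr by P positions gives addr again."""
    for p in range(1, n_bits + 1):
        if rotate_right(addr, p, n_bits) == addr:
            return p


def entries_for_ports(entry, n_bits):
    """Entries derived from one port-0 table entry, skipping ports with index j >= P."""
    return [rotate_right(entry, j, n_bits) for j in range(period(entry, n_bits))]


print(entries_for_ports(0b0001, 4))  # non-cyclic entry: one address per port
print(entries_for_ports(0b0101, 4))  # cyclic entry with period 2: only two ports
```

Skipping ports with j ≥ P keeps a cyclic address from being sent more than once, since its rotations repeat after P shifts.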

Figure 7: Speed up of MSBT vs. SBT.
Figure 8: Personalized communication using BST and SBT.

For each subtree, a depth-first order and a reversed breadth-first
order are both viable transmission orders. By reversed breadth-first
order we mean a breadth-first traversal of the subtree
starting from the last level (log N - 1 or log N - 2 depending on the
subtree). The source node determines the order, and internal
nodes can either route according to the destination address if
it is included, or by the use of tables. If tables are used, then in
the case of depth-first communication order it suffices for each
internal node to keep a count for each port. Since the number
of ports used in each subtree is at most log N and the number
of nodes in the entire subtree is approximately N / log N, a bound
on the table size in each node is log² N bits. A breadth-first
communication order can be implemented by internal nodes 
keeping a table of how many nodes there are at a given level in 
each of its subtrees. The table has at most log² N entries. An
upper bound for the number of nodes in a subtree at any level 
is agi? re and the total table size in a node is approximately 
log³ N bits. Hence, without a more sophisticated encoding the
depth-first communication order is more effective with respect 
to table space. The measurements presented in Figure 8 are 
based on an implementation using a depth first order. 

With communication on one port at a time the expected 
time for personalized communication based on the SBT topol- 
ogy or the BST topology is the same. The observed advantage 
of the BST- over the SBT-based communication is due to the 
fact that the BST can take better advantage of the overlap be- 
tween communication on different ports. In the SBT case, the 
node with relative address (00...01) is not yet finished retrans- 
mitting the last packet received when a new packet arrives. 
In the BST a subtree receives a packet once every log N cycles,
and full advantage is taken of the 20% overlap in communication
actions.


6. Conclusion 


We have shown that the Boolean n-cube topology allows
for the embedding of n edge-disjoint binomial trees, and we
presented routing algorithms for broadcasting that have a complexity
equal to the lower bound both for communication restricted
to one port at a time and for concurrent communication
on all n ports of a node. We have also defined a balanced
spanning tree for personalized communication. Each subtree of
the balanced spanning tree has approximately
N / log N nodes. For communication on one port at a time, personalized
communication based on the balanced spanning tree
has the same complexity as personalized communication based
on a binomial tree for certain maximum packet sizes, and has
at most a factor of 2 higher complexity in other cases. With
concurrent communication on all ports, the routing based on
the balanced spanning tree is superior by a factor of ½ log N
for a variety of combinations of maximum packet sizes, start-up
times, transfer rates, and data sizes.

Experimental results on the Intel iPSC/d7 confirm the results
of the analysis.
The Architecture of a Homogeneous Vector Supercomputer


John L. Gustafson, Stuart Hawkinson, and Ken Scott 
Beaverton,

Abstract 
A new homogeneous computer architecture
developed by FPS combines two fundamental
techniques for high-speed computing: parallelism
based on the binary n-cube interconnect, and 
pipelined vector arithmetic. The design makes 
extensive use of VLSI technology, resulting in a 
processing node that can be economically 
replicated. Processor nodes incorporate high- 
speed communications and control, vector-oriented 
floating-point arithmetic, and a novel dual-ported 
memory design. Each node is implemented on a 
single circuit board and can perform 64-bit 
floating-point arithmetic at a peak speed of 16 
MFLOPS. Eight nodes are grouped together with a 
system node and disk support to form modules. 
These modules, housed in cabinet-sized packages, 
are capable of 128 MFLOPS peak performance and 
make up the smallest homogeneous units of larger 
systems. The new FPS system achieves a careful 
balance between high-speed communication and 
floating-point computation. This paper describes 
the new architecture in detail and explores some 
of the issues in developing effective software. 


Introduction


The quest for increased computational power in
scientific computing and the limits of physical
electronic devices have led to the exploration of
new architectures as alternatives to traditional
monolithic designs [9,2]. Multiprocessor designs
hold the promise of tremendous performance increases,
provided the interconnection network can
support the parallelism inherent in the computation.
Vector pipelines provide significant
performance increments, exploiting finer-grained
parallelism. Further advantage is gained by using
parallel functional units to overlap address
calculations with memory references, floating-point
adds, and floating-point multiplies [1].


Large scientific applications are sometimes 
easily partitioned among processors using a shared 
memory, yet most are just as amenable to 
distributed memory designs [3,4]. Shared memory 
systems are expensive when scaled to large 
dimensions because of the rapid growth of the 
interconnection network; the distance from memory 
to the processing elements also degrades per- 
formance by increasing latency [8]. Large system 
configurations are most readily realized with 
distributed memory based on a limited form of 
interconnection, such as the pyramid or binary 
n-cube [5]. Memory latency can be greatly reduced 
when each processor has its own high-speed store. 
Moreover, the cost of switching and the time to 
route messages is much smaller on such statically 
configured systems. With this view, much current 
computer architecture research has focused on the 
use of ensembles of identical processors in homo- 
geneous configurations that employ message passing 
over limited forms of static interconnects [7,8]. 

Floating Point Systems (FPS) has developed a
homogeneous computer, the FPS T Series, based on 
the binary n-cube interconnection scheme. The in- 
dividual nodes are 64-bit floating-point computers 
that combine vector arithmetic, dual-port memory, 
and fast communications links between nodes. The 
peak performance of these nodes is 16 MFLOPS. The 
FPS T Series is built from modules containing 
eight of these nodes connected to each other and 
to a system support ring. These modules, with an 
aggregate performance of 128 MFLOPS, may be 
combined to form even larger systems that promise
orders of magnitude increases in computing speed
per dollar over today's supercomputers.


Processor Node Architecture 


An individual processor element is called a 
node. It contains a control processor, floating- 
point arithmetic, dual-port memory, and communi- 
cation links to other nodes. The FPS T Series 
design provides all of these functions on a single 
printed-circuit board. Each of the major elements
of the node has been implemented with advanced,
cost-effective VLSI technology, in contrast with
more traditional bit-slice designs.


Control 


The ability to interpret and execute programs 
resides in the central control unit. The T Series 
control unit is a 32-bit CMOS microprocessor with 
the following functional features: 


• 7.5 MIPS instruction rate
• Byte addressability (4 GByte address space)
• 2048 bytes of on-chip RAM (single processor cycle)
• 3-cycle minimum access time for off-chip memory
• Four bidirectional serial communications links
• Stack-oriented instruction set with variable operand sizes
• Two-level process priority and interrupt services


The control processor executes system and user 
applications code and it also serves to arrange 
vector operands to be sent to the vector arith- 
metic hardware. The control processor can execute 
integer arithmetic and gather/scatter operations 
in parallel with the vector unit, and it provides 
inter-node communications via the serial links. 


Figure 1. The FPS T Series processor. (Block diagram: RAM bank of 64K words, vector registers, arithmetic unit, controller.)


All features of the microprocessor are 
directly accessed through a high-level language 
called Occam. Occam differs from languages like 
Pascal or C in that it directly provides for the 
execution of parallel, communicating processes.
Channel commands can make direct data transfers 
between concurrent processes. A single process 
can be constructed from a collection by specifying 
sequential, alternative, or parallel execution of 
the constituent processes. This combination of 
program structure and integrated communication 
allows Occam to describe the control and data flow 
for virtually any scientific computing algorithm, 
and to control the high-level operation of the 
vector arithmetic unit (see below). 


Memory


An essential feature of a computer's archi- 
tecture is its central memory, which supplies both 
instructions and operands to the processing units. 
The main memory of each FPS T Series node consists 
of 1 MByte of dual-ported dynamic RAM. The con- 
trol processor and communications links read and 
write 32-bit words through a conventional random- 
access port, while the vector arithmetic unit 
makes use of a collection of vector registers 
closely coupled with main memory. A vector
register can be loaded with an entire 1024-byte 
row of memory, in parallel (see Figure 1), in the 
same time that it would have taken to read or 
write a single 32-bit word. There is one parity 
bit for each byte in memory. 


The control processor views the memory as a 
single bank of 256K words (32-bit). The vector 
arithmetic unit views memory as two banks of 
vectors, with 256 vectors in one bank and 768 vectors
in the other, aligned on 1024-byte boundaries.
Thus, for 32-bit operations, the vectors
are 256 elements long, while for 64-bit operations,
the vectors are 128 elements in length.
The division of memory into two banks permits two
inputs in parallel to the arithmetic unit on each
cycle (125 ns). The output of the arithmetic unit
shifts results into either or both banks. Hence,
operations such as SAXPY, Vector Add, and Vector
Multiply proceed at the full speed of the
arithmetic components, without being limited by
available memory bandwidth. This dual-bank memory
organization allows the node to function without
the need for auxiliary data registers or cache.


The control processor can access a 4-byte word
in 400 ns. Its effective bandwidth is therefore


(4 bytes) / (0.4 µs) = 10 MB/s


A primary use for the control processor is to
gather operands into a contiguous vector, and
scatter results back to random locations in
memory. To move a 64-bit operand from one memory
location to another requires two 32-bit reads and
two 32-bit writes, which take a total of 1.6 µs.
This is the gather-scatter time within a node.
For 32-bit operands, it is 0.8 µs per element.


An entire row of data can be moved to or from
a vector register in only 400 ns; this means that
the effective bandwidth between memory and a
vector register is


(1024 bytes) / (0.4 µs) = 2560 MB/s


An application might make use of this
extraordinary speed by moving data physically,
rather than keeping linked lists of pointers to
vectors, as for example, in pivoting rows of a
matrix or sorting records.


The vector registers each supply data to the
arithmetic unit at a maximum rate of one 32-bit
word every 62.5 ns, or one 64-bit word every
125 ns. The vector register bandwidth supports
two vector inputs and one vector output every
125 ns in 64-bit mode. Thus, its bandwidth is


(3 words) × (8 bytes/word) / (0.125 µs) = 192 MB/s.
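

The three bandwidth figures and the gather-scatter times quoted in this section follow from the access times alone. The short sketch below reproduces the arithmetic; the helper name is illustrative, and times are kept in integer nanoseconds so the results come out exact.

```python
def mb_per_s(n_bytes, nanoseconds):
    """Bandwidth in MB/s (10**6 bytes per second) for n_bytes moved in the given time."""
    return n_bytes * 1000 / nanoseconds


ACCESS_NS = 400   # one 32-bit access by the control processor, or one row transfer
CYCLE_NS = 125    # arithmetic cycle in 64-bit mode

control_port = mb_per_s(4, ACCESS_NS)        # random-access port
row_transfer = mb_per_s(1024, ACCESS_NS)     # memory row to or from a vector register
vector_regs = mb_per_s(3 * 8, CYCLE_NS)      # two inputs and one output per cycle

gather_64_ns = (2 + 2) * ACCESS_NS           # two reads and two writes per 64-bit operand
gather_32_ns = (1 + 1) * ACCESS_NS           # one read and one write per 32-bit operand

print(control_port, row_transfer, vector_regs)  # 10.0 2560.0 192.0
print(gather_64_ns, gather_32_ns)               # 1600 800
```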


Arithmetic


The ability to perform high-speed arithmetic 
is essential in scientific computing. The arith- 
metic hardware in the FPS T Series consists of a 
floating-point adder, floating-point multiplier, 
interconnection hardware, and some sequencing 
hardware. The adder and multiplier each can pro- 
duce a 32- or 64-bit result every 125 ns, yielding 
peak performance of 16 MFLOPS per node. 
Floating-point operations are performed using the 
proposed IEEE floating-point standard format; 
however, gradual underflow is not supported. In 
64-bit mode, the mantissa has approximately 15
decimal digits of precision, and the dynamic range is
roughly 10^-308 to 10^308.


The arithmetic units operate in pipelined
mode. The adder has a six-stage pipeline. It can
perform floating-point addition and subtraction in
32- and 64-bit modes, comparisons, and data conversions.
The multiplier is five-stage in 32-bit
mode and seven-stage in 64-bit mode. These pipeline
lengths are appropriate for the vector access
described above. Scalar operations can be efficiently
performed by grouping like operations for
level-order evaluation.


The arithmetic parts are supervised by a preprogrammed
micro-sequencer that implements a
collection of vector arithmetic operations referred
to as vector forms. The programmer only
needs to describe the input and output vectors and
the vector form desired. This frees the control
processor for other tasks while vector operations
are being executed. Scalars can be held in the
input registers on each floating-point part, and
outputs from the parts can be fed directly back as
inputs to perform operations such as dot products
and sums. This provides a wide range of useful
vector forms without memory reference limitations.
The complete arithmetic unit operates in parallel
with the node control processor. The arithmetic
unit only interrupts the controller when a vector
operation has completed or an error has occurred.


Communications


In a distributed computer system, communications
channels are required for passing data
between processors participating in a common
computational process. The control processor of
the FPS T Series contains drivers for four serial,
bidirectional communications links, each with a
nominal rate of 0.9375 MByte/s. Every 8-bit
byte is sent with two synchronization bits and one
stop bit, and requires two acknowledge bits from
the receiver. This results in a maximum effective
unidirectional bandwidth of over 0.5 MB/s per
link. The total bandwidth of the four links is
thus over 4 MB/s. With all links operating, the
control processor performance is degraded only
slightly. The links operate via DMA transfers
with a startup time of about 5 µs.
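

One way to arrive at the "over 0.5 MB/s" figure is sketched below. It rests on two assumptions that the text does not state explicitly: that the nominal 0.9375 MByte/s corresponds to a raw line rate of 7.5 Mbit/s, and that the two acknowledge bits occupy bit-times that are not overlapped with the data byte. All names are illustrative.

```python
NOMINAL_MBYTES_PER_S = 0.9375
LINE_MBITS_PER_S = NOMINAL_MBYTES_PER_S * 8   # assumed raw line rate: 7.5 Mbit/s

DATA_BITS = 8
SYNC_BITS = 2
STOP_BITS = 1
ACK_BITS = 2   # assumed to add to the per-byte time rather than overlap it


def effective_mbytes_per_s(line_mbits_per_s, bit_times_per_byte):
    """Payload bytes delivered per microsecond, i.e. MB/s, at the given framing cost."""
    return line_mbits_per_s / bit_times_per_byte


bit_times = DATA_BITS + SYNC_BITS + STOP_BITS + ACK_BITS   # 13 bit-times per byte
print(round(effective_mbytes_per_s(LINE_MBITS_PER_S, bit_times), 3))
```

Under these assumptions the effective rate is a little under 0.58 MB/s per direction, consistent with the stated lower figure of 0.5 MB/s.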


Each link is multiplexed four ways to provide 
a total of 16 bidirectional sublinks per node. 
With software support, these sublinks divide the 
available bandwidth. Two sublinks are used for 
system communication, and two will often be 
utilized for mass storage and/or external I/O. 
This will typically leave 12 sublinks available 
for connection to other compute nodes. 


A convenient way to interpret the relative 
bandwidths is with respect to the arithmetic 
processing time for 64-bit operations: 


(ArithmeticTime) : (GatherTime) : (LinkTransferTime)
    0.125 µs     :    1.6 µs    :       16 µs

that are in the approximate ratios 1 : 13 : 130.
Thus, a vector should enter into about 13
operations while gathering the next vector into an
aligned, contiguous order. With this provision,
the control processor can completely overlap the
gather time with vector arithmetic, and the node
can approach peak speed. Of course, if vectors
are always aligned and elements contiguous, no
such restriction applies. Similarly, roughly 100
operations should result from every 64-bit word
that must be moved between nodes over a link.
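

The ratios can be checked directly from the three times. In this sketch the link transfer time is derived from an 8-byte word at the 0.5 MB/s effective link rate quoted earlier; the variable names are illustrative.

```python
ARITHMETIC_NS = 125              # one 64-bit result per unit per cycle
GATHER_NS = 4 * 400              # two reads and two writes of 32 bits each
LINK_NS = 8 * 1000 * 1000 // 500 # 8 bytes at 0.5 MB/s (500 bytes per millisecond)

gather_ratio = GATHER_NS / ARITHMETIC_NS
link_ratio = LINK_NS / ARITHMETIC_NS

print(GATHER_NS, LINK_NS)        # 1600 16000
print(gather_ratio, link_ratio)  # 12.8 128.0
```

The exact values, 12.8 and 128, round to the 1 : 13 : 130 quoted in the text.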


Susten D iti 


The T Series consists of a number of node processors
connected as a binary n-cube. There are
2^n processors, with n connections per node. Numbering
the processors from 0 to 2^n - 1, each processor
is directly connected to all others whose
numbers differ in only one binary digit. The
binary n-cube can be mapped onto many important
applications topologies, including meshes (up to
dimension n), rings, cylinders, toroids, and even
FFT butterfly connections of radix 2 [5,6]. Since
the maximum number of connections between any two
processors is n, long-range communication costs
grow only as O(log₂ N), where N = 2^n is the number of processors.
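

The numbering rule translates directly into bit operations: a node's neighbors are found by flipping one address bit at a time, and the number of links a message must cross is the number of differing bits. A minimal sketch, with illustrative function names:

```python
def neighbors(node, n_dims):
    """Nodes whose numbers differ from `node` in exactly one binary digit."""
    return [node ^ (1 << k) for k in range(n_dims)]


def hops(a, b):
    """Minimum number of links between nodes a and b (Hamming distance)."""
    return bin(a ^ b).count("1")


n_dims = 3
print(neighbors(0b101, n_dims))   # [4, 7, 1]
print(hops(0b000, 0b111))         # 3

# The largest distance between any two nodes equals the cube dimension n.
diameter = max(hops(a, b) for a in range(2 ** n_dims) for b in range(2 ** n_dims))
print(diameter)                   # 3
```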


Figure 3. Binary n-Cube Mappings. (Panel label: Meshes.)


Modules and System Ring


A processor node is constructed on a single
etched circuit board. Eight nodes are combined
with disk storage and a system board to form a
module. Such a module has 128 MFLOPS peak 
floating-point performance, and 8 MB of user RAM. 
The local inter-node communications bandwidth is 
over 12 MB/s, while the system board can support 
0.5 MB/s to an external connection. 


The system board provides input/output and
management functions. It is connected to the
nodes by a thread of communications links that
traverses the eight processor nodes. The system
boards are directly connected by communications
links to form a system ring that is independent
of the binary n-cube network (connecting the processor
nodes). The primary function of the system
disk is to record memory “snapshots” which
checkpoint computations for error recovery, and to
back up snapshots from other modules. The user is
able to specify the interval between snapshots.
About 10 minutes provides a good compromise
between time spent to record memory and interval
between restart points. It takes about 15 seconds
to take a snapshot, regardless of configuration.


The module requires three links for intra- 
module hypercube network communications, while the 
system board connections require two links from 
each processor node. This reduces remaining links 
to 11 for hypercube and external communications. 


Two modules (16 nodes) form a cabinet, or 4- 
cube (a tesseract). A cabinet is modular and 
self-contained in a standard 19-inch rack-mounted 
assembly. With power supplies, system disks, and 
air cooling fans, it does not require any special 
“computer room” facilities. Larger systems are 
simply assembled from these units by 
interconnecting cables. Connections up to 40 feet 
can be made without special consideration. 


I : C f a t e 


A four-cabinet (64-node) system has an aggre- 
gate peak speed of 1 GFLOP and total user memory 
of 64 MBytes. Eight system-disk units provide 
backup and restart capability. This configuration 
of the FPS T Series can be located in many labor- 
atory and computing facilities. The air-cooled 
unit requires no special facilities beyond normal 
air conditioning, and the power requirements are 
supplied by typical 220 VAC services. 


There are enough links per node to permit a 
14-cube to be constructed as the largest T Series 
configuration. Using two links per node for exter- 
nal I/O and mass storage systems, a maximum-sized 
12-cube consists of 4096 nodes arranged as 256 
cabinets (4-cubes). Such a system has over 
65 GFLOPS peak processing performance and 4 GBytes 
of primary RAM storage. Special facilities would 
be required to house the largest configurations. 
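

The configuration figures quoted above scale linearly from the per-node numbers given earlier: an adder and a multiplier each producing one result every 125 ns, and 1 MByte of memory per node. The sketch below tabulates the module, four-cabinet, and 12-cube cases; the function and constant names are illustrative.

```python
CYCLE_NS = 125
ARITHMETIC_UNITS = 2                                # adder and multiplier
NODE_MFLOPS = ARITHMETIC_UNITS * 1000 // CYCLE_NS   # 16 MFLOPS peak per node
NODE_MBYTES = 1
NODES_PER_CABINET = 16                              # two 8-node modules


def configuration(n_dims):
    """Return (nodes, peak MFLOPS, memory in MBytes) for an n_dims-dimensional cube."""
    nodes = 2 ** n_dims
    return nodes, nodes * NODE_MFLOPS, nodes * NODE_MBYTES


for n_dims in (3, 6, 12):
    nodes, mflops, mbytes = configuration(n_dims)
    print(n_dims, nodes, mflops, mbytes)

print(configuration(12)[0] // NODES_PER_CABINET)    # 256 cabinets
```

The 6-cube comes to 1024 MFLOPS (about 1 GFLOP) and the 12-cube to 65,536 MFLOPS with 4096 MBytes, in line with the figures in the text.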


Because the system is homogeneous, i.e., each 
module is identical and contains identical con- 
nections to other modules, programming is greatly 
simplified. This homogeneity also ensures that
the balance between computing speed, main storage,
mass storage, and external I/O can be preserved as
configurations become large. The specifications
of any sized FPS T Series can be derived from the 
properties of the individual modules. 


Conclusion 


The incorporation of high-speed vector processing
into a homogeneous parallel architecture
has resulted in a scientific computer with performance
scalable over three orders of magnitude.
The use of a dual-ported dynamic RAM achieves a
new level of processor integration that eliminates
the need for separate data registers or cache.
Parallel floating-point adders and multipliers, 
accessed by standard vector operations, provide a 
close match to scientific computing algorithms. 


With nodes organized into eight-processor 
modules, each with system disk and I/O services, 
the new architecture can be viewed as a truly 
homogeneous system. The FPS T Series, incor- 
porating VLSI components to make the system both 
cost-effective and compact, provides a careful 
balance between processor speed, memory access, 
and interprocessor communications bandwidth. 
Abstract: The implementation of massively parallel computers
based on hypercube architectures is discussed in this
paper. It is argued that such machines offer an alternative to 
traditional supercomputers at far lower cost. The rationale 
for using hypercube machines for supercomputing applica- 
tions is examined, including cost, node performance, com- 
munication speed, packaging, reliability, and programming 
requirements. These issues are illustrated by a recently 
introduced commercial hypercube supercomputer, the 
NCUBE/ten. The major design decisions underlying the 
NCUBE/ten’s implementation technology, system architec- 
ture and operating system are described. 


1. Introduction 


A hypercube or (binary) n-cube computer is a multiprocessor
characterized by the presence of N = 2^n processors
interconnected as an n-dimensional binary cube. Each processor
P_i forms a node (vertex) of the cube and is a self-contained
computer with its own CPU and local main
memory. P_i has direct communication paths to n other
processors (its neighbors), which correspond to the edges of
the cube that are connected directly to P_i. 2^n distinct n-bit
binary addresses or labels may be assigned to the processors
so that each processor's address differs from that of each of
its n neighbors in exactly one bit position. Figure 1 illustrates
the hypercube topology for n < 4; note that a zero-dimensional
hypercube is a conventional SISD computer. An
n-dimensional hypercube Q_n for n ≥ 2 can be defined
recursively in terms of the graph product operation × as
follows [4], where K_2 = Q_1 is the complete 2-node graph:


Q_n = K_2 × Q_{n-1}

As illustrated by Fig. 1, Q_n is composed of two copies of
Q_{n-1}. Every node P_{0i} in one copy of Q_{n-1} is the neighbor
of a node P_{1i} in the other copy.
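

The recursive definition can be exercised in a few lines. The sketch below builds the edge set of Q_n from two copies of Q_{n-1} plus the links joining corresponding nodes, and checks it against the one-bit-difference rule. Representing a node by its integer address and an edge by an ordered pair is an illustrative choice, not notation from the paper.

```python
def hypercube_edges(n):
    """Edge set of Q_n built as K_2 x Q_{n-1}; each edge is a pair (low, high)."""
    if n == 0:
        return set()                       # Q_0 is a single node with no edges
    prev = hypercube_edges(n - 1)
    top = 1 << (n - 1)                     # address bit that separates the two copies
    edges = set(prev)                                  # copy with the new bit 0
    edges |= {(a | top, b | top) for a, b in prev}     # copy with the new bit 1
    edges |= {(v, v | top) for v in range(top)}        # links between matching nodes
    return edges


def bit_flip_edges(n):
    """Edges given directly by the rule: addresses differ in exactly one bit."""
    return {(v, v | (1 << k)) for v in range(2 ** n) for k in range(n)
            if not v & (1 << k)}


print(sorted(hypercube_edges(2)))                 # [(0, 1), (0, 2), (1, 3), (2, 3)]
print(hypercube_edges(4) == bit_flip_edges(4))    # True
```

Both constructions give n · 2^(n-1) edges, so each of the 2^n nodes has degree n.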


For some time, it has been known that the hypercube 
structure has a number of features which make it a useful 
architecture for parallel computation. For example, meshes 
of all dimensions and trees can be embedded into a hyper- 
cube so that neighboring nodes are mapped to neighbors in 
the hypercube. The communication structures used in the 
Fast Fourier Transform and Bitonic Sort algorithm can simi- 
larly be embedded into the hypercube. Since a great many 
scientific applications use mesh, tree, FFT, or sorting inter- 
connection structures, the hypercube is a good candidate for 
a general-purpose parallel architecture. Even for problems 
with less regular communication patterns, the fact that the
hypercube has a maximal internode distance (the graph
diameter) of n = log₂ N means any two nodes can communicate
fairly rapidly. This diameter is larger than the unit
diameter of a complete graph K_N, but is achieved with
nodes having only degree or fanout of log₂ N, as opposed to
the N - 1 degree of nodes in K_N. Other standard architectures
with small degree, such as meshes, trees, or bus systems,
either have a large diameter (N^(1/2) for a 2-dimensional
mesh) or a resource which becomes a bottleneck in many
applications because too much communication must pass
through it (as occurs at the apex of a tree, or at a large
shared bus). Thus, from general topological arguments it
can be concluded that hypercube architectures offer a good
balance between node connectivity, communication diameter,
algorithm embeddability, and programming ease. This
balance makes them suitable for an unusually wide class of
computational problems.


Based on various considerations of the foregoing kind, 
proposals to build large hypercube computers have been 
made for more than twenty years. In 1962, Squire and Palais 
at the University of Michigan, motivated by the hypercube's 
rich interconnection geometry and programming ease, carried 
out a detailed paper design of a hypercube computer [13,14]. 
They estimated that a 4096-node (12-dimensional) version of 
their machine would require about 20 times as many com- 
ponents as the IBM Stretch, one of the largest and most 
complex contemporary computers. Around 1975 IMS Associ- 
ates, an early manufacturer of personal computers, 
announced a 256-node commercial hypercube based on the 
Intel 8080 microprocessor, but its design details were not 
published and the machine was never produced. In 1977,
Sullivan et al. presented a thorough analysis of hypercube
architectures, and a proposal to build a large hypercube
called CHOPP (Columbia Homogeneous Parallel Processor)
containing up to a million processors [15,16]. In the same
year, Pease published a study of the “indirect” binary n-cube
architecture, in which a multistage interconnection network
of the omega type is suggested for implementing the
hypercube topology [8]. A number of other interesting architectures
closely related to the hypercube have also been proposed,
for example, the cube-connected-cycles structure [11].


Fig. 1. n-dimensional hypercube for n=0,1,2,3.


It is clear that the early hypercube designs were 
impractical because of the large number of components
(logic and memory elements) they required using the then 
available circuit technologies. The situation began to change 
rapidly in the early 1980’s as advances in VLSI technology 
allowed powerful 16/32-bit microprocessors to be imple- 
mented on a single IC chip, and RAM densities moved into 
the 10°-10° bits/chip range. A working hypercube com- 
puter was not demonstrated until the completion in 1983 of 
the first 64-node Cosmic Cube at Caltech [12]. For the 
hypercube node processor, it uses a single-board microcom- 
puter containing the Intel 8086 16-bit microprocessor and 
the 8087 floating-point co-processor. Since then, Caltech 
researchers have built several similar hypercubes, and suc- 
cessfully applied them to numerous scientific applications, 
often obtaining impressive performance improvements over 
SISD machines of comparable cost [3]. 


Influenced primarily by the Caltech work, a number of 
commercial hypercubes have been developed since 1983. In 
July 1985, Intel delivered the first production hypercube, the 
128-node iPSC (Intel Personal Supercomputer) which has a 
16-bit 80286/287 CPU as its node processor. Assuming a 
peak performance of 0.07 MFLOPS per node, the 128-node 
iPSC has a potential throughput of about 8 MFLOPS, far 
below that of a traditional vector supercomputer such as the 
Cray-1 (160 MFLOPS). Other commercial hypercubes were 
also introduced in 1985 by Ametek Inc. and NCUBE Corp. 
The Ametek System/14 hypercube can have up to 256 
nodes, which employ an 80286/287-based CPU similar to 
that of the iPSC, with the addition of an 80186 processor for 
communication management. The NCUBE/ten can accom- 
modate up to 1024 nodes, each based on a VAX-like 32-bit 
custom processor with a peak performance of 0.6 MFLOPS. 
Thus a fully configured NCUBE system has a throughput 
potential of around 500 MFLOPS. This high performance 
level is supported by extremely fast communication rates
(both input/output and node-to-node), making the
NCUBE/ten a true supercomputer. NCUBE machines have
been installed at several beta test sites, including the Univer- 
sity of Michigan, since early 1985, and have been in general 
production since December 1985. Several other hypercube- 
style machines with supercomputing potential are presently 
under development, including the Caltech/JPL Mark III [9] 
and the Connection Machine [5]. Much faster successors to 
the current commercial hypercubes can also be expected to 
appear over the next few years. Because of the effort being 
devoted to the development of hardware and software for 
these machines, and their relatively low cost, hypercube 
supercomputers seem likely to provide an increasingly attrac- 
tive alternative to conventional pipelined supercomputers for 
many applications. 


This paper explores the architectural and technological
issues influencing the design of supercomputing hypercubes,
with the NCUBE/ten serving as an example. Particular
attention is devoted to the influence of component packaging,
reliability, communication speed, and the operating
system environment on the system implementation. Section 
2 discusses the general design requirements of hypercube 
supercomputers, while the specific design decisions made for 
the NCUBE/ten are covered in Sec. 3. Software issues are 
discussed in Sec. 4. 


2. General Design Issues 


Supercomputing performance requires extremely high 
integer and floating-point execution rates, as well as 
extremely high I/O throughput. Very large primary (RAM) 
and secondary (disk) memory spaces are also usually 
required. For the principal user base of scientific
programmers, the programming environment needs to pro- 
vide FORTRAN, and a powerful operating system such as 
UNIX. Low cost and high reliability imply minimizing the 
component count at all levels, particularly the numbers of 
chips and boards used. Some degree of fault tolerance is also 
very desirable. Since a very large amount of RAM storage is
needed, memory fault detection and correction via an error- 
correcting code (ECC) is an important consideration, despite 
the fact that it increases the chip count. Reliability is 
increased, and operating cost decreased, by employing an 
air-cooled configuration suitable for a standard office 
environment. Based on an examination of various existing 
computer systems, it can be concluded that the air cooling 
limits the machine complexity to under 50,000 chips. Off- 
the-shelf parts decrease costs and usually increase reliability. 
If custom chips are needed, then computer manufacturers 
who rely on outside suppliers should use conservative design 
rules which will be accepted by multiple silicon foundries. In 
large-scale numerical calculations, the possibility of large 
error accumulation forces the individual calculations to be as 
accurate as possible. Numerical accuracy can be increased 
by adhering to the IEEE 754 floating-point standard and 
providing double-precision floating-point operations. 


A key decision in the design of a parallel computer is 
the choice of the interconnection network to be used. Mul- 
tistage interconnection networks have been advocated as 
simplifying the programming process by providing a global 
shared memory, but it did not seem possible to build a large 
multistage network using the technology available in 1983 
without suffering significant delay in passing information 
through the network. Since this strongly affects perfor- 
mance, it was felt that to achieve supercomputer perfor- 
mance it would be necessary to use a direct connection net- 
work with local memory at every node. Many direct inter- 
connection schemes have been analyzed and implemented 
but, as discussed in Sec. 1, the hypercube structure has a 
number of inherent advantages. The ease with which effi- 
cient application programs were developed for the hyper- 
cubes at Caltech has also shown the hypercube to be supe- 
rior to alternative architectures such as meshes or trees. The 
neighbor-to-neighbor links of the hypercube provide almost 
the same communication capabilities as a complete graph, 
while using nodes with only a logarithmic degree. The 
achievable degree is constrained by a variety of packaging 
considerations, but with current technology it is possible to 
build hypercubes with thousands of nodes. In contrast, a 
complete graph connection of a few tens of nodes may not be 
possible. 
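
The degree argument above can be made concrete with a short calculation. The following Python sketch compares the per-node degree and total link count of an n-dimensional hypercube with those of a complete graph on the same number of nodes. The function names are illustrative and are not taken from any NCUBE software.

```python
def hypercube_links(dimension):
    """Total number of links in a hypercube: each of 2^n nodes has n links,
    and every link is shared by two nodes."""
    return dimension * (1 << dimension) // 2


def complete_graph_links(nodes):
    """Total number of links when every node is wired to every other node."""
    return nodes * (nodes - 1) // 2


def neighbors(node, dimension):
    """Neighbors of a hypercube node: flip each of the n address bits in turn."""
    return [node ^ (1 << k) for k in range(dimension)]


if __name__ == "__main__":
    n = 10
    nodes = 1 << n
    print("nodes:", nodes)
    print("hypercube degree:", n, "links:", hypercube_links(n))
    print("complete-graph degree:", nodes - 1, "links:", complete_graph_links(nodes))
```

For a 10-dimensional cube this gives 10 links per node and 5120 links in all, against 1023 links per node and 523,776 links for a complete graph on 1024 nodes.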

There are additional features of the hypercube that are
particularly useful in designing a supercomputer, but have
not been previously exploited. For example, the hypercube
is homogeneous in that all nodes look the same, hence it is
natural to attach an I/O channel to each node. This provides
the potential of extremely high system I/O rates. Also,
since there are numerous ways to divide a hypercube into 
subcubes, it is easy to support multiprocessing where each 
user has a dedicated subcube. These subcubes can be allo- 
cated so that all processor-to-processor and I/O communica- 
tions occur without using processors or communications lines 
in other subcubes. Further, by writing programs in which 
the size of the subcube is a user-defined parameter, it is pos- 
sible to develop programs in small subcubes and then do 
production runs in larger subcubes. This partitionability 
also makes it easier to tolerate faults, since the operating 
system can allocate subcubes which avoid faulty processors 
or faulty communication lines. 
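
As a rough illustration of fault-avoiding subcube allocation, the Python sketch below searches for a k-dimensional subcube of an n-dimensional cube that contains no unavailable node. It is a brute-force search written only for this illustration; the allocation policy actually used by the operating system is not described here, and the names `subcube_nodes` and `find_subcube` are hypothetical.

```python
from itertools import combinations


def subcube_nodes(base, free_dims):
    """All nodes reached from `base` by varying the address bits in free_dims.
    `base` is assumed to have zeros in those bit positions."""
    nodes = [base]
    for d in free_dims:
        nodes += [v | (1 << d) for v in nodes]
    return nodes


def find_subcube(n, k, unavailable):
    """Return the nodes of some k-dimensional subcube of an n-cube that avoids
    every node in `unavailable` (faulty or already allocated), or None."""
    unavailable = set(unavailable)
    for free_dims in combinations(range(n), k):
        free_mask = sum(1 << d for d in free_dims)
        for base in range(1 << n):
            if base & free_mask:
                continue  # the fixed part must not touch the free dimensions
            nodes = subcube_nodes(base, free_dims)
            if unavailable.isdisjoint(nodes):
                return nodes
    return None


if __name__ == "__main__":
    # A 3-cube in which node 0 is faulty: a 2-dimensional subcube still exists.
    print(find_subcube(3, 2, {0}))
```

Because a subcube is defined purely by which address bits are held fixed, a program that takes the subcube dimension as a parameter runs unchanged on whichever subcube is returned.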


As discussed in Sec. 1, technology developments have 
enabled a hypercube computer to be built reliably with a 
large number of processors. A fine-grained supercomputer 
architecture, i.e., one with a large number (over 1000) of 
very simple processors, has a high ratio of communication to 
computation. The Connection Machine is an example of a 
fine-grained hypercube-class computer, but its suitability for 
scientific computations is unclear. A very coarse-grained 
architecture with, say, 10 to 100 large and fast processors, 
requires that the nodes achieve extremely high performance. 
For example, to achieve $10^9$ instructions/sec with 10 processors
requires processors capable of $10^8$ instructions/sec. The
Caltech/JPL Mark III will be an example of a coarse-grained
hypercube. It was felt by the designers of this machine that
achieving $10^9$ instructions/sec is best done with 1000 processors
running at $10^6$ instructions/sec.


Experience with the Caltech machines has demon- 
strated that a medium-grain MIMD hypercube architecture 
can attain high efficiency on a variety of scientific problems, 
with a tolerable amount of revision of serial code and algo- 
rithms [3]. This can be contrasted with the much greater 
amount of program and algorithm redesign required of users 
of fine-grained SIMD machines such as the MPP [10]. 
MIMD machines require each node processor to perform
instruction fetching, decoding, and other functions that are
not performed by SIMD nodes. Distributed-memory MIMD
machines must also supply a program to each node, so
MIMD machines may pay a large penalty in chip area and
chip count. In general, for the same chip area and number of
chips one can build more SIMD processors and have a
greater potential system throughput; however, the gain in
programming simplicity obtained by using an MIMD
machine more than compensates for this, except for a narrow
range of applications in which almost any penalty can be
tolerated if it yields the required speed. Furthermore, MIMD
machines can accommodate multiple independent users,
while SIMD machines cannot.


Since there may be hundreds or thousands of nodes in a 
hypercube supercomputer, their chip count is the most signi- 
ficant component of the total system chip count. Using the 
densest possible memory chips available is the key factor in 
decreasing the number of chips. The NCUBE/ten, for exam- 
ple, uses 256K DRAM chips to implement the local memories 
of the hypercube nodes. The next most significant reduction 
in chip count can be achieved by putting all node functions 
onto a single chip. This implies that the processor chip must 
perform all communication, memory management, floating-point
operations, and other data-processing functions. Unlike
RAMs, there is not yet widespread market pressure to produce
standard processor chips of this type; consequently,
they are not available off the shelf. In 1983, when design of 
the NCUBE/ten started, the only way to achieve supercom- 
puter performance with a one-chip node processor was to 
undertake the risky step of custom-designing such a complex 
chip. INMOS made a similar decision with the Transputer 
processor chip, with the important difference that the initial 
version of the Transputer does not provide floating-point 
operations, and has four rather than eleven I/O channels [6]. 
The performance and functionality demands on the NCUBE 
processor chip are quite severe, and numerous tradeoffs were 
needed to enable it to be built with current technology. 


3. NCUBE/ten Architecture 


The overall goal of the NCUBE designers was to use 
massive parallelism to build an inexpensive and reliable 
range of software-compatible machines achieving supercom- 
puter performance at the high end. The largest model in the 
series, the NCUBE/ten, is a 10-dimensional hypercube con- 
taining 1024 powerful 32-bit processors of custom design, 
each with a 128-Kbyte local memory. Up to eight front-end 
host processors are used to manage I/O operations under 
control of a multiuser UNIX-based operating system. An 
unusually high level of system integration is employed that 
allows 64 processors with their memories and interconnections
to be placed on a single printed-circuit board. A
maximum-sized NCUBE/ten system is composed of 16 processor
and 8 I/O boards (including host processors) and is
housed in a small air-cooled enclosure.


The NCUBE node processor provides the functions of a 
32-bit supermini-class CPU, including a full floating-point 
instruction set, and all the logic needed for memory manage- 
ment and interprocessor communication on a single VLSI 
chip; see Fig. 2. The design of the processor was started by 
NCUBE in 1983, constrained by the desire to use conserva- 
tive design rules acceptable to several silicon foundries. The 
chip was designed using 2 µm (approx.) nMOS design rules,
and contains about 160,000 transistors. It is housed in a 
pin-grid-array package with 68 pins. Combined with six 
256K-bit DRAM chips (each of which is organized as 64K x 
4 bits), an entire NCUBE/ten node requires only seven chips. 
Because of this, a 6-dimensional hypercube with 64 nodes 
and 8 Mbytes of memory can be packed into a single 
16" x 22" board, a photograph of which appears in Fig. 
3. The backplane connections are rather formidable since 
each node has off-board bidirectional channels to four more 
processors of the hypercube, plus one bidirectional channel 
to an I/O board, resulting in 640 backplane connections just 
for communication channels. 


The instruction set of the NCUBE/ten is conventional 
and quite orthogonal, being similar to the VAX instruction 
set without the latter’s 3-address addressing modes [7]. 
There are three main classes of information: addresses 
(unsigned integers), integers and floating-point numbers 
(reals). Addresses are 32 bits long, but the current node 
implementation only supports a 17-bit physical address 
space. Integers can be 8, 16 or 32 bits long. Floating-point 
numbers can contain either 32 or 64 bits, and conform to the 
IEEE 754 floating-point standard. There are 16 general- 
purpose registers of 32 bits each. A variety of addressing 
modes are available, including literal (immediate), register
direct, autodecrement/increment, autostride, offset, direct,
indirect, and push/pop. The instruction set contains a full
complement of logical, shift, jump, and arithmetic operations
(including square root). One instruction of particular use in
hypercube routing is Find First One, which finds the bit
position of the first 1 in a word, via a right-to-left scan.
Using a 10-MHz clock, nonarithmetic instructions can be
executed at about 2 MIPS, with single-precision floating-point
operating at 0.5 MFLOPS, and double-precision at 0.3
MFLOPS. (These performance figures assume that register-to-register
operations predominate.) A 32-byte instruction
cache allows loops of up to 16 bytes to be executed directly
from the cache. The node processor has a vectored interrupt
facility, and generates different interrupts to indicate program
exceptions such as numerical overflow or address
faults, software debugging commands such as breakpoint
and trace, I/O signals such as input ready, and hardware
errors such as correctable or uncorrectable memory errors.


Fig. 2. Organization of the NCUBE processor chip. (Legible diagram labels: 11 serial I/O channels; execution; clock; reset; error; memory address; data.)


Fig. 3. The 64-node NCUBE processor board.


Pin and silicon space limitations forced a number of 
design compromises in the selection of the width of various 
system data paths. The node memory supplies data in 16-bit 
halfwords, plus an extra byte containing ECC check bits. 
The processor performs single-error correction and double- 
error detection (SECDED) on all memory words, generating 
an interrupt in case of an error. This is an example of a 
situation where the pin limitations affect performance, for it 
requires two memory fetches to obtain a full 32-bit word. It 
also increases the number of memory chips required, since 
the SECDED code used for 32-bit data could be supplied by 
five RAM chips organized as 32K X 8 bits, if such chips 
were available. 
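
The chip counts quoted above can be checked with a short calculation. The Python sketch below uses the standard Hamming bound for a SECDED code (the smallest r with 2^r >= m + r + 1, plus one overall parity bit); whether the NCUBE processor uses exactly this code is an assumption made only for this illustration, and the function names are hypothetical.

```python
import math


def secded_check_bits(data_bits):
    """Minimum check bits for single-error correction, double-error detection:
    a Hamming code needs the smallest r with 2^r >= data_bits + r + 1,
    and one extra overall parity bit adds double-error detection."""
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def chips_needed(word_bits, chip_width):
    """Number of RAM chips of the given data width needed to store one word."""
    return math.ceil(word_bits / chip_width)


if __name__ == "__main__":
    # 16-bit halfword plus a full byte of check bits, stored in 4-bit-wide chips.
    print("16-bit path:", chips_needed(16 + 8, 4), "chips")
    # 32-bit word plus minimum SECDED check bits, stored in 8-bit-wide chips.
    print("32-bit path:", chips_needed(32 + secded_check_bits(32), 8), "chips")
```

Under these assumptions a 16-bit halfword needs at least 6 check bits (the node stores a full byte) and occupies six 4-bit-wide chips, whereas a 32-bit word needs 7 check bits and would fit in five 8-bit-wide chips.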


Communication with other nodes is performed via 
asynchronous DMA operations over 22 bit-serial I/O lines. 
The I/O lines are paired into 11 bidirectional channels, 
which permit formation of a 10-dimensional hypercube, and 
also allow one connection to an I/O board. Each node-to- 
node channel operates at 10 MHz with parity check, yielding 
a data transfer rate of about 1 Mbyte/sec per channel in 
each direction. A channel has two 32-bit write-only registers 
associated with it: an address register for the message buffer 
location in the node RAM, and a count register indicating 
the number of bytes left to send or receive. There is also a 
ready flag and an interrupt enable flag for each channel. 
Once a send or receive operation has been initiated by a pro- 
cessor checking its flags and setting the appropriate regis- 
ters, the processor can continue with other operations while 
the DMA channel completes the internode communication 
operation. Interrupts can be used to signal when a channel is 
ready for a new operation. For general applications, this 
requires less processor overhead than would occur in a pol- 
ling communication protocol. An interrupt is also generated 
if there is a channel overrun, which can occur only on an 
input operation if more than 9 channels are transmitting 
data into the node. To reduce DMA activity, a broadcasting 
feature is supported which transmits the same data word 
along an arbitrary set of output channels in a single DMA 
operation. 


The NCUBE/ten’s I/O boards provide the connections 
between the hypercube and the external world. Each system 
must have at least one host board, and may have as many 
as eight. The host board uses an Intel 80286 to run the 
operating system, and has 4 Mbytes of RAM used as a 
shared memory by the various processors on the host board.
It has support for a variety of different peripherals, including
eight ASCII-standard terminals, four SMD disks (which can 
currently be as large as 500 Mbytes), and three Intel iSBX 
connectors that can accept daughter boards for functions 
such as graphics control or networking. Miscellaneous func- 
tions found on the host board include a real-time clock, and 
temperature sensors for automatic shutdown on overheating. 
Besides the host board, other I/O boards currently available 
are a graphics board with a 2K X 1K X 8-bit frame buffer, 
an intersystem board to connect two NCUBE systems, and 
an open system board with about 75% of the board left for 
custom design. 


A distinguishing feature of the I/O boards in the 
NCUBE/ten is the fact that each has 128 bidirectional chan- 
nels directly connected to a subcube of the hypercube; see 
Fig. 4. This permits extremely high I/O data-transfer rates 
into the hypercube enabling, for example, a single I/O board 
to transfer 1024 X 1024 X 8-bit images at video rates (30 
frames per second). To accomplish this, each I/O board contains
16 NCUBE processor chips, each of which serves as an
I/O processor and is connected to eight nodes in the main 
hypercube. Like the hypercube node processors, an I/O pro- 
cessor has a 128-Kbyte RAM which occupies a fixed slot in 
the 80286 host’s 4-Mbyte memory space. An input operation 
from the outside world, e.g., a disk read, is performed by 
first transferring the input data to the host’s 4-Mbyte 
memory. Then the data is transferred through the DMA 
channels of the I/O processors directly to the target hyper- 
cube nodes. Output operations are handled in a similar 
fashion. In addition to sharing access to the host’s memory, 
the 16 I/O processors on each I/O board are interconnected
as two disjoint 3-dimensional cubes. (This disjointness occurs
because each node has only 11 bidirectional channels.) Note
that in a maximum-configuration NCUBE/ten system, the 
hypercube nodes do not have to redistribute I/O data to 
other nodes. This is not the case in smaller NCUBE systems 
where fewer channels are available for external I/O opera- 
tions. 


An NCUBE system has from one to eight I/O boards
(at least one of which must be a host board), from one to 16
processor boards, and attached peripheral devices. All the
I/O and processor boards of a fully configured system, along
with their fans and power supplies, fit into a single enclosure
that is less than 3' on each side. A full-sized system dissipates
about 8 kW, and can be housed in a standard air-conditioned
environment. A peripheral enclosure is about
3' x 2' x 3' and contains a 65-Mbyte cartridge tape drive
and up to four disk drives. A minimal standalone NCUBE
system consists of one host board and one processor board
containing a 6-dimensional hypercube, and can handle up to
8 user terminals. By adding a second processor board, one
obtains a 7-dimensional hypercube. Since the operating system
can allocate subcubes of arbitrary size, it is possible to
have a number of processor boards which do not form a
complete hypercube. For example, three boards provide a
7-dimensional and a 6-dimensional hypercube, which could
also be allocated as three 6-dimensional hypercubes or as
numerous smaller hypercubes. A full-sized system (Fig. 4)
contains a 10-dimensional hypercube (which explains the
“ten” in NCUBE/ten). The 1024 processors of such a system
together have a potential instruction execution rate of
about 2 billion instructions/second, or about 500 MFLOPS.
The total amount of memory in the nodes is 128 Mbytes. If
all of the I/O boards are host boards, it is possible to support
64 terminals, and provide as many as 16 billion bytes of
storage. A host board can provide input or output at up to
90 Mbytes/sec, giving a system input or output rate of
about 720 Mbytes/sec.


Fig. 4. Maximum-configuration NCUBE/ten system. (Legible diagram label: sixteen processor boards, 1024-node hypercube.)
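
The totals quoted for a maximum configuration follow directly from the per-node and per-board figures given in this section. The Python sketch below reproduces that arithmetic; the constant names are illustrative, and the values are the ones stated in the text.

```python
NODES_PER_BOARD = 64          # one 6-dimensional hypercube per processor board
PROCESSOR_BOARDS = 16
NODE_MEMORY_KBYTES = 128
NODE_MIPS = 2.0               # nonarithmetic instructions at 10 MHz
NODE_MFLOPS_SINGLE = 0.5      # single-precision floating point at 10 MHz
HOST_BOARDS = 8               # all eight I/O slots used for host boards
TERMINALS_PER_HOST = 8
DISKS_PER_HOST = 4
DISK_MBYTES = 500
HOST_IO_MBYTES_PER_SEC = 90


def system_totals():
    """Aggregate figures for a fully configured system with eight host boards."""
    nodes = NODES_PER_BOARD * PROCESSOR_BOARDS
    return {
        "nodes": nodes,
        "memory_mbytes": nodes * NODE_MEMORY_KBYTES // 1024,
        "mips": nodes * NODE_MIPS,
        "mflops": nodes * NODE_MFLOPS_SINGLE,
        "terminals": HOST_BOARDS * TERMINALS_PER_HOST,
        "disk_mbytes": HOST_BOARDS * DISKS_PER_HOST * DISK_MBYTES,
        "io_mbytes_per_sec": HOST_BOARDS * HOST_IO_MBYTES_PER_SEC,
    }


if __name__ == "__main__":
    for name, value in system_totals().items():
        print(f"{name}: {value}")
```

The exact products (2048 MIPS, 512 MFLOPS, 16,000 Mbytes of disk) are rounded in the text to about 2 billion instructions/second, about 500 MFLOPS, and 16 billion bytes.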


Figure 5 summarizes the results of some performance
experiments designed by D. Winsor at the University of
Michigan, which compare the NCUBE node processor to two
other representative CPUs with floating-point hardware: the
Intel 80286 (the NCUBE host processor served for this) and
Digital Equipment Corp.’s VAX-11/780. The measurements
were made with the NCUBE node and host processors running
under 8 MHz clocks. Extrapolated figures for the 10
MHz version of the NCUBE node processor now nearing production
are also given, assuming no wait states. Two widely
used synthetic benchmark programs were employed in this
study, the Dhrystone and the Whetstone codes [2,18]. The
Dhrystone benchmark is intended to represent typical system
programming applications and contains no floating-point
or vectorizable code. The original Dhrystone Ada code
[18] was translated into a FORTRAN 77 version with 32-bit
integer arithmetic that attempts to preserve as much of the
original program structure as possible. This entailed changes
such as replacing Ada records by FORTRAN arrays, which
produce a substantial performance degradation compared to
Dhrystone benchmarks in Ada, Pascal or C. However, any
such degradation here appears to apply uniformly to all processors
considered, since all were given the same FORTRAN
source code and used very similar FORTRAN compilers.
The Whetstone benchmark, which aims to represent scientific
programs with many floating-point operations, was used
in a single-precision FORTRAN 77 version that closely
resembles the original ALGOL code [2]. The Dhrystone
results in Fig. 5 are reported in “Dhrystones per second,”
each of which corresponds roughly to one hundred FORTRAN
statements executed per second. The Whetstone figures
represent the number of hypothetical Whetstone
instructions executed per second. It can be concluded from
the data of Fig. 5 that the NCUBE node processor is quite
fast and fully meets its performance targets cited above.


                                          FORTRAN          FORTRAN
Processor                                 Dhrystones/sec   Whetstones/sec
NCUBE node processor at 8 MHz                    999          381,000
NCUBE node processor at 10 MHz (est.)          1,249          476,000
Intel 80286 (NCUBE host) at 8 MHz with
  80287 floating-point co-processor              510          101,000
DEC VAX-11/780 with floating-point
  accelerator                                    741          426,000

Fig. 5. Summary of processor benchmark results.


4. System Software


The emergence of several commercial hypercube computers
has demonstrated the feasibility of constructing low-cost
massively parallel machines. The focus of research can
now be expected to shift to the issue of how these machines
can be programmed effectively. Indeed, the recent report on
the Supercomputing Research Center concludes that the
absence of appropriate parallel programming languages and
software tools is the single biggest impediment to the successful
use of parallel machines [1]. The operating system is
also a major design issue, since memory management and
interprocessor communication are critical to the functioning
of the programming languages. Three software issues need to
be considered. The first is the operating system that is used
for developing application programs for the hypercube. The
second is the operating system that provides run-time support
for application programs running on the hypercube
nodes. The third is the set of application languages to be
used.


An attractive choice for a development operating system
that provides the kind of environment associated with a
“programmer’s workbench” is UNIX. Unfortunately, there
are two different major versions of UNIX (System V and 4.2BSD),
and a large number of lesser-known variants. This leaves
the system designer with something of a dilemma: on the one
hand UNIX offers a proven development environment that is
widely known; on the other hand a UNIX standard has yet
to emerge. The solution chosen by NCUBE was to develop a
UNIX-like operating system called AXIS [7] that embodies
the features common to the major UNIX dialects. Subsequent
changes or additions can be readily made to AXIS
when a true UNIX standard is agreed upon. There are two
features of AXIS that we shall elaborate on here because
they are pertinent to the management of a very large hypercube.
The first is the ability to share files, and the second is
the way in which the main cube array is managed.


AXIS runs on the 80286 host processor that acts as the 
CPU for each I/O board. (Recall that up to eight I/O subsystems
can be accommodated in a 1024-processor
NCUBE/ten.) It provides the large number of utilities for
editing, debugging and file management that one has come 
to expect in a UNIX-like operating system. Consistent with 
the UNIX philosophy, the file system is the most prominent 
feature of AXIS, and almost all of the system resources are 
treated as files. Massively parallel systems require high I/O 
bandwidth if they are to be useful for applications that are 
not simply computation-intensive. This problem of I/O 
management was not foreseen in the earlier generation of 
massively parallel machines, and has proved to be a great 
limitation [10]. The ability to incorporate up to eight I/O 
subsystems in the NCUBE/ten is intended to avoid this 
problem. However, it introduces the potential for eight 
separate file systems. To avoid this, AXIS provides the 
capability to organize the eight file systems as one distri- 
buted file system; AXIS further allows complete systems to 
be networked through iSBX connections to provide a single 
multiuser file system. The principal mechanism for doing 
this is the device directory pointer (ddir). ddirs are items 
that can be placed in a file directory. Instead of represent- 
ing a file name, they are pointers to disk drives. Each disk 
drive has a unique device identifier, which includes a system 
number, an I/O board number, and a drive number. Within 
a disk drive, the files are organized as a typical UNIX tree. 
ddirs can be placed in the root directory of a disk to point to 
the root of the other drives. This permits file sharing across 
all physically connected NCUBE systems. Figure 6 illustrates
how the directory structure for two host boards, each
having two physical disk drives, might be organized, if logical
disk0 and disk4 are shared systemwide. Typically, the
device directory of a physical disk0 (called “//”) will contain
the following:


1. The name of another directory which acts as the real
root of the file structures on disk0 (“cb0” on host 0)


2. ddir’s for the other drives connected to the same host


3. ddir’s of disk directories on any other host board in the
system


4. ddir’s for disks on any other physically connected
NCUBE system. Since system number is one of the
components of a ddir, it can refer to disks on other
NCUBE systems.
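
The role of a ddir as a pointer from one directory tree into another disk can be modeled in a few lines. The Python sketch below is purely illustrative: the classes `DeviceId` and `DDir`, the function `resolve`, and the sample directory contents are assumptions made for this example and do not reflect the actual AXIS data structures or path syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceId:
    """Unique disk identifier: system number, I/O board number, drive number."""
    system: int
    io_board: int
    drive: int


@dataclass(frozen=True)
class DDir:
    """Directory entry that points to another disk instead of naming a file."""
    target: DeviceId


def resolve(disks, start, components):
    """Walk `components` from the device directory of disk `start`,
    hopping to another disk whenever a ddir is encountered."""
    device = start
    entry = disks[device]
    for name in components:
        entry = entry[name]
        if isinstance(entry, DDir):
            device = entry.target
            entry = disks[device]
    return device, entry


if __name__ == "__main__":
    disk0 = DeviceId(system=0, io_board=0, drive=0)
    disk1 = DeviceId(system=0, io_board=0, drive=1)
    disks = {
        disk0: {"cb0": {"readme": "file on disk0"}, "disk1": DDir(disk1)},
        disk1: {"data": {"results": "file on disk1"}},
    }
    print(resolve(disks, disk0, ["disk1", "data", "results"]))
```

Because the identifier carries a system number, the same lookup would cross from one networked system to another without any change to the walk itself.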


AXIS manages a hypercube of node processors as a device,
which is simply one type of file. It can be opened,
closed, written to, and read from as if it were a normal file.
AXIS permits users to allocate subcubes that have the
appropriate size for their application. Thus, one or two
large problems, or several small problems may share the
hypercube. This flexibility greatly increases the system efficiency,
and gives a hypercube supercomputer a significant
advantage over conventional supercomputers. Partitioning
the main hypercube into subcubes is simplified by the fact
that each subcube is protected from access by any other subcube.


Fig. 6. Distributed files on the NCUBE/ten. (Legible diagram labels: host board 0; all directories on disk 1.)


VERTEX, the operating system for the hypercube node 
processors, is a small nucleus (less than 4K bytes) that is 
resident in each of the NCUBE/ten nodes. Its primary func- 
tion is to provide communication between the nodes, in the 
form of send and receive functions that transfer messages 
between any two nodes in the hypercube. The node proces- 
sor has instructions that are used as primitives in the VER- 
TEX communication calls, nwrite and nread, which imple- 
ment the internode send and receive functions, respectively. 
The messages transferred by nwrite and nread are arrays 
of bytes having four associated attributes: source, destina- 
tion, length and type. The first two attributes are numbers 
in the range 0 to 1023, and indicate the logical nodes being 
used for source and destination. The length attribute is the 
number of bytes in the message; messages as long as 64K 
bytes are supported. Finally, the type attribute can be used 
to distinguish messages, and so permit their selective recep- 
tion at a destination node. 


The subroutine nwrite may be represented as

nwrite (length, message, dest, type, status, error)
where length is the length of the outgoing message in bytes, message
is the name of the buffer from which the message is to
be taken, dest is the logical number of the node in the hypercube
that is to receive the message, type is the type number of the
message, status indicates when the message has left the
buffer, i.e., when the buffer is reusable, and error is an error
code. Message transmission breaks the message into packets
of 512 bytes (or some other user-defined size), and sends
them to the destination node using the following routing
algorithm. Assume that in an n-dimensional cube, the logical
number of the source node is $s_n s_{n-1} \ldots s_2 s_1$, and that of
the destination is $d_n d_{n-1} \ldots d_2 d_1$. The bit-wise exclusive-or
$x_n x_{n-1} \ldots x_2 x_1$ of the two numbers is formed as follows:
$x_i = s_i \oplus d_i$ for $i = 1, \ldots, n$. The values of the $x_i$'s are used
to control the routing process. Those values of $i$ for which
$x_i = 1$ indicate the dimensions that must be traversed to
transfer a message from source to destination. The routing
algorithm was chosen for its simplicity; however, as noted by
Valiant [17], there is a potential for congestion in some situations.
He defines an alternative routing algorithm that
avoids congestion by routing each message to a randomly
chosen node; from there the message is forwarded to its originally
intended destination. The randomization assures
that message congestion at nodes will be dispersed. Unfortunately,
Valiant’s router does not perform as well as the
straightforward algorithm in many routine parallel processing
tasks, and its more complex implementation requirements
discouraged use of it in the initial NCUBE/ten design.
Future insights into the behavior of parallel algorithms may
change this, however.
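
The exclusive-or routing rule can be sketched in a few lines of Python. The sketch assumes that differing dimensions are corrected lowest dimension first, which matches the right-to-left scan of Find First One but is not stated explicitly in the text. Bit positions are numbered from 0 here rather than from 1, and `find_first_one`, `route`, and `valiant_route` are illustrative names, not VERTEX routines.

```python
import random


def find_first_one(word):
    """Bit position of the lowest-order 1 in word (right-to-left scan);
    returns -1 if word is zero."""
    if word == 0:
        return -1
    return (word & -word).bit_length() - 1


def route(source, dest):
    """Nodes visited when each differing address bit is corrected in turn,
    lowest dimension first (assumed order)."""
    path = [source]
    current = source
    diff = current ^ dest          # the x bits: 1 where source and dest differ
    while diff:
        k = find_first_one(diff)
        current ^= 1 << k          # cross the link in dimension k
        path.append(current)
        diff = current ^ dest
    return path


def valiant_route(source, dest, n, rng):
    """Two-phase randomized routing: go to a random intermediate node first,
    then on to the real destination."""
    intermediate = rng.randrange(1 << n)
    return route(source, intermediate) + route(intermediate, dest)[1:]


if __name__ == "__main__":
    print(route(0b000, 0b101))
    print(valiant_route(3, 12, 4, random.Random(1)))
```

The number of hops taken by `route` equals the number of 1 bits in the exclusive-or, so it is never more than n; the randomized variant may take up to 2n hops, which is one way to see why it can lose to the direct rule on routine traffic.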


In addition to determining the routing path, VERTEX 
must perform the store-and-forward function at each node 
along the path. At the destination node the message is 
placed in a queue that is allocated from a heap of 20K bytes. 
The receive function, which can be represented by 

nread (length, message, source, type, status, error) 
looks for the first message from source of type type in the 
input queue, and copies it to buffer message. Don't-care
conditions are indicated for source or type by setting these
parameters to -1. This allows the next message from a particular
source to be received regardless of type, the
next message of a particular type to be received from any
source, or the next message of any type from any
source to be received. Messages with negative types other
than -1 are system messages for VERTEX and are used for 
process control at a node, e.g., for node program debugging. 
In summary, the calls nwrite and nread provide a fast 
internode message communication mechanism. The main
contributors to this speed are the machine instructions provided
explicitly for internode communication, and the fact
that messages enter nodes through DMA channels.
Measurement of the internode communication performance 
of the NCUBE system is presently under way. 
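
The source/type matching rule of nread can be illustrated with a small Python sketch. It models only the selection logic described above; the queue representation, the name `nread_match`, and the return value when nothing matches are assumptions, and the handling of VERTEX system messages (negative types other than -1) is not modeled.

```python
ANY = -1  # don't-care value for source or type, as described in the text


def nread_match(queue, source=ANY, mtype=ANY):
    """Remove and return the first (source, type, payload) entry that matches.

    `queue` is a plain list used as a stand-in for the node's input queue.
    """
    for idx, (src, typ, payload) in enumerate(queue):
        if source in (ANY, src) and mtype in (ANY, typ):
            del queue[idx]
            return src, typ, payload
    return None  # assumed behaviour; the text does not say what the real call does


if __name__ == "__main__":
    q = [(2, 7, b"a"), (3, 5, b"b"), (2, 5, b"c")]
    print(nread_match(q, source=3))   # first message from node 3, any type
    print(nread_match(q, mtype=5))    # first message of type 5, any source
    print(nread_match(q))             # first message of any type from any source
```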


The current NCUBE/ten application languages, apart 
from the node and host assembly languages, are FORTRAN 
77 and C. The choice of FORTRAN 77 and C was made 
because the computer is targeted for a user community 
interested primarily in scientific problems; this group has 
traditionally programmed in FORTRAN. Compilers for 
other languages, including Occam, are presently being 
developed. The programming model adopted for the initial 
set of languages (FORTRAN and C) is a simple extension of 
the conventional uniprocessor model. Each node is treated 
as a separate processor. No symbols are shared between 
nodes: the naming scope is contained within a node. Values 
of variables are shared by calls to the VERTEX subroutines 
nwrite and nread. 


5. Conclusion 


Hypercube architectures are well suited to implementing
massively parallel supercomputers, given the constraints
imposed by current technology. They offer an unusually 
good combination of high node connectivity, software flexi- 
bility, and system reliability. The NCUBE/ten is an exam- 
ple of a new generation of low-cost and compact hypercube 
machines with the capability of supercomputing perfor- 
mance. Unlike earlier machines, it exploits the inherent 
homogeneity of the hypercube to provide a multiuser UNIX- 
like programming environment, along with support for 
extremely high I/O data-transmission rates. 
Abstract


The concept of scalability is important for the next 
generation of super-multiprocessors. Two aspects of 
scalability of an architecture are resource scalability and 
application scalability. Both types of scalabilities are 
qualitative measures of goodness of an inductive architec- 
ture [4]. The application scalability measures the utilization 
of resources and the efficiency of execution of application. 
The resource scalability is a measure of growth rate of ar- 
chitectural properties. 


In this paper, the resource scalability characteristics of 
the hypercube architecture are assessed and the concept of 
application scalability is applied to it in the context of a bi- 
nary tree structured application graph. The criterion used 
is the utilization of processing nodes. Results are also 
derived for a modification of the application graph. A
logN¹ time complexity distributed algorithm that can be
used to set up the modified structure is outlined. This algo- 
rithm is shown to be useful for handling a single node fault 
in the architecture as well. 


1 Introduction 


An important consideration in designing an architec- 
ture for a super-multiprocessor system is its scalabilities 
[4,5]. We recognize two types of scalabilities: resource 
scalability and application scalability. 


By resource scalability we imply the asymptotic 
growth rates of architectural properties and their associated 
costs. The smaller the cost, the better scalable is the ar- 
chitecture. For example, the multistage interconnection net- 
works such as the omega, the banyan and the baseline are 
preferred for multiprocessor systems over the crossbar for 
the reason that these networks scale better than the 
crossbar. The hardware cost increases as NlogN for these
networks compared to $N^2$ for the crossbar, for N interconnected
resources.
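
To make the difference in growth rates concrete, the following Python sketch tabulates the two cost expressions quoted above for a few sizes. The function names are hypothetical and the costs are in arbitrary units; only the growth rates come from the text.

```python
def multistage_cost(N: int) -> int:
    """N log2 N, for N a power of two."""
    return N * (N.bit_length() - 1)


def crossbar_cost(N: int) -> int:
    """N squared."""
    return N * N


if __name__ == "__main__":
    for n in (4, 8, 10, 16):
        N = 1 << n
        print(N, multistage_cost(N), crossbar_cost(N))
```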


Application scalability is a measure of utilization of
architectural resources and efficiency of execution for a particular
application under asymptotic growth in their size. In
other words, it is a measure of optimality of mapping between
the application and the architecture for all sizes.
This concept is of relevance for inductive architectures [4],
which are regular in structure and have the number of
processing resources as a parameter. Some examples of inductive
architectures are TRAC [13], PASM [14], Hypercube
[3], CHiP [15], etc. For these, the mappings can be
evaluated as a function of size. By evaluating their application
scalability, we can predict the asymptotic execution behavior
of an application.


1 Throughout this paper, unless otherwise specified, logN is used
as short for $\log_2 N$.


0190-3918/86/0000/0661 $01.00 © 1986 IEEE


Consider an application mapped on a given architec- 
ture. The mapping results in certain utilization of resources 
and execution efficiency. If, when the architecture and the 
application are equally increased in size, the same utilization 
and execution efficiency is achieved, the architecture is con- 
sidered to be well scalable with respect to that application. 
Of course, in some cases, a drop in execution efficiency can 
be allowed if it is the result of increased hardware delays. 


It is not sufficient for a super-multiprocessor architec- 
ture to have good resource scalability; it must also have 
good application scalability for a number of applications. 
For instance, a bus oriented architecture has excellent 
resource scalability, but its bandwidth is exceeded for even 
modest communication requirements per processor, as the 
number of processors is increased beyond a certain value. 
The evaluation process should therefore account for ar- 
chitectural constructs embedded in software as well as 
hardware. These include, but are not limited to, software 
communication switches, polling routines, priority encoders, 
data translators, etc. If these constructs are not accounted
for, they may result in bottlenecks and degraded perfor- 
mance as the application grows in size. 


The evaluation of application scalability of an ar- 
chitecture involves mapping of an application graph on the 
architectural topology. Conventionally, the mapping im- 
plies embedding of an application graph within the resource 
graph of the supporting architecture. The mapping problem 
arises because the two graphs are topologically non-isomorphic [6].
The solution results in assignment of one or more computation
nodes of the application graph to a processing node
in the system such that the processing and
storage requirements are satisfied. 


We have chosen an application graph which has the 
structure of a balanced binary tree. Binary tree graphs are 
fundamental structures and appear in several computation 
graphs. They naturally result from divide and conquer 
methodology of formulating parallel algorithms for sorting, 
merging and min-max problems. A binary tree based algorithm
is used for the recursive doubling method of computing
a global histogram [7]. The Dictionary Machine of Atallah
and Kosaraju uses a binary tree structured machine architecture
[8]. Binary tree structures result from search
trees in inference systems. Horowitz and Zorat suggest application
of binary tree shaped interconnection networks for
multiprocessor systems [9]. It would therefore be interesting
to see how such an important computational topology is 
supported on a popular architecture. 


Recently, substantial attention has been given to the
hypercube architecture for multiprocessors [1, 2]. Many applications
have been successfully implemented on it and impressive
speed-ups have been obtained [3]. Most previous
architectures have failed to furnish respectable speed-ups for
more than just a few applications. On the other hand, the
hypercube architecture seems to have lived up to its promise
of being a good supporting architecture for a wide variety of
application areas. This effectiveness of the hypercube architecture
is seen to derive from the rich internode connection
topology afforded by the underlying graph. But, before
hypercube topology can be considered as a defining topology
for a future generation super-multiprocessor, its resource
scalability should be evaluated. Also of consequence to its
applicability for a dedicated high performance computing
engine is its application scalability for the targeted use. Since
binary tree structured algorithms are so common in the area
of parallel processing, we focus here on their scalability on
the hypercube.


2 Resource Scalability of Hypercube 
Architectures 


A boolean or binary hypercube graph of order n
(alternately called n-cube) has $2^n$ nodes connected with
edges as defined in Appendix I. Figure 1 shows an example
of a hypercube of order 3. In a hypercube based multiprocessor,
nodes of the graph are occupied by independent
processing elements. The edges between the nodes represent
the point-to-point communication links between the processors.
Each link or edge is dedicated to the corresponding
node-pair. In some architectures, like Intel's iPSC System,
there is a separate processor called the Cube Manager
to coordinate the parallel execution of a job [10]. The Cube
Manager is connected to the processors in the cube by a
broadcast bus for global communication, I/O and control.
However, the concepts of Cube Manager and the global bus
are not inherent to the hypercube architecture, and we will
therefore ignore them. We will use the well-known rectangular,
multistage, single-sided interconnection networks
(MSINs) such as banyan, omega and baseline as a reference
when comparing properties. These networks have been
shown to be topologically isomorphic in [18]. We assume in
this paper that the MSINs have logN stages of N/2 switches
each, and that switches allow "turn-around". Thus a message
going from one processor to another travels a certain
number of levels into the network and then turns around
and travels to the destination in the reverse direction.


2.1 Communication Interface Costs 


Each processing element in an n-cube has communication
links to n other processors. If we focus on the number
of processors, $N = 2^n$, in the cube, then there are logN
communication interfaces on each processor. Thus the total
communication interface cost of the network is NlogN. This 
is the same as the cost of the MSINs, although there is one 
minor difference: when the size of the hypercube is doubled, 
each of the original nodes has to be modified to allow one 
more communication link. The nodes of the original cube 
could be provided with ports for extra links, and when the 
cube is doubled, one of these could be utilized. However, it 
is important to remember that modification is necessary 
throughout the original system in the case of hypercube; 
whereas in the multistage network architectures, the 
modifications are localized to the boundary of the original 
system. 


2.2 Communication Reliability 


The communication reliability of the hypercube 
(n > 3) is better than that of the MSINs. For them, un- 
less extra stages are specifically provided for [16, 17], there 
are a constant number of disjoint paths (= out-degree of a 
node) between two processing nodes. On the other hand, in 
the hypercube architecture, there are (logN) disjoint paths 
available for any given pair of nodes [11]. 


Note that, in the hypercube architecture, the number 
of contingent paths and the reliability of interprocessor com- 
munication increases with logN, while those for the MSINs 
remain constant. Thus for the hypercube architecture, the 
communication reliability also scales better than that for 
the multistage networks. 
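
One standard way to exhibit logN node-disjoint paths between two distinct hypercube nodes is sketched below in Python. This construction is an illustration and is not taken from the paper or from [11]: paths through the differing dimensions are obtained by rotating the order in which those dimensions are crossed, and each remaining dimension contributes one detour path that crosses it first and last. The function name `disjoint_paths` is hypothetical.

```python
def disjoint_paths(src: int, dst: int, n: int) -> list[list[int]]:
    """Return n node-disjoint paths from src to dst (src != dst) in an n-cube."""
    diff = [i for i in range(n) if ((src ^ dst) >> i) & 1]   # differing dimensions
    same = [i for i in range(n) if i not in diff]            # agreeing dimensions

    def walk(dims):
        node, path = src, [src]
        for d in dims:
            node ^= 1 << d          # cross one link
            path.append(node)
        return path

    # Rotations of the differing dimensions give shortest paths.
    paths = [walk(diff[r:] + diff[:r]) for r in range(len(diff))]
    # Each agreeing dimension j gives a detour: leave along j, correct, return along j.
    paths += [walk([j] + diff + [j]) for j in same]
    return paths


if __name__ == "__main__":
    for p in disjoint_paths(0b0000, 0b0101, 4):
        print([format(v, "04b") for v in p])
```

The paths share only their two endpoints, so a single failed intermediate node or link blocks at most one of them.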


2.3 Interprocessor Distance 


The diameter of a network can be defined as that of 
the underlying graph structure in which the processing 
resources and switches are represented as nodes and the 
communication paths are represented as edges. For a graph, 
the diameter is the maximum distance between any pair of 
nodes. For the MSINs, the diameter is equal to twice the 
number of stages in the network; that is 2logN. For a hy- 
percube network, the message passes at most logN links be- 
fore reaching the destination [12], since for any given node, 
there is one node which is at distance logN from it. So for 
the hypercube, the diameter scales as logN. 


It can be shown that the average distance from a given 
node, of all nodes including itself, is 2(logN-1) + 2/N, for 
MSINs. The same is logN/2 for a hypercube. Thus, under 
asymptotic growth, although both scale as logN, the hypercube
has a better scaling coefficient².
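
The hypercube figure can be checked by brute force, as in the Python sketch below. The hypercube average is computed by counting differing bits from node 0 (by symmetry any node gives the same result); the MSIN value simply evaluates the expression quoted in the text and is not derived here. Function names are hypothetical.

```python
def average_hypercube_distance(n: int) -> float:
    """Average distance from one node to all 2**n nodes, itself included."""
    N = 1 << n
    return sum(bin(v).count("1") for v in range(N)) / N


def average_msin_distance(n: int) -> float:
    """The expression 2(logN - 1) + 2/N quoted in the text, with logN = n."""
    N = 1 << n
    return 2 * (n - 1) + 2 / N


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, average_hypercube_distance(n), average_msin_distance(n))
```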


3 Application Scalability for a Binary Tree 


For the sake of this analysis, we will assume that the 
binary tree application is the sole task being executed in the 
system. Under this condition, we would like to execute the 
largest possible size of the application. Ideally, we would 
like to utilize the hypercube architecture fully, so that when 
the architecture and the application are scaled, no wastage 
results. If that is not possible, a constant resource overhead 
is allowable since the percentage of overhead would diminish 
with equal increase in the application size and architecture. 
The overhead that grows linearly with the number of nodes 
in the hypercube is unacceptable. 


Let us consider mapping a binary tree on an order n 
hypercube. A balanced binary tree with n levels of nodes 
(called in this paper an n level tree), has a total of N-1
nodes of which N/2 are at the leaf level. It should therefore 
be possible to map an n level binary tree on an n-cube. This 
would result in all but one of the nodes being utilized and 
would represent constant under-utilization. If such mapping 
is possible for all n, the binary tree application could be con- 
sidered well-scalable on a hypercube architecture. We ask 
precisely this question: Is such mapping possible? 


It is possible to map a two-level binary tree on a 2-cube
that has a square topology (see Figure 2). For all
n ≥ 3, it is impossible to map an n level binary tree on a
hypercube of order n. Appendix I contains the proof for this
result. For n ≥ 3, the largest sized binary tree that can be
mapped on an n-cube is of n-1 levels. Thus the tree occupies
only about half ($2^{n-1} - 1$) nodes out of the $2^n$ available.
Resultant under-utilization of computing resources is $2^{n-1} + 1$
nodes, which is slightly more than 50%. In terms of the total
computing resource available in the system, the overhead
is linear and therefore unacceptable.
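
The utilization figures above follow from simple counting, as the Python sketch below shows. The function names are hypothetical; the node counts are those stated in the text for an (n-1)-level balanced tree on an n-cube.

```python
def idle_nodes(n: int) -> int:
    """Cube nodes left unused when an (n-1)-level balanced tree is mapped on an n-cube."""
    used = 2 ** (n - 1) - 1
    return 2 ** n - used


def utilization(n: int) -> float:
    """Fraction of the n-cube occupied by the (n-1)-level balanced tree."""
    return (2 ** (n - 1) - 1) / 2 ** n


if __name__ == "__main__":
    for n in range(3, 11):
        print(n, idle_nodes(n), round(utilization(n), 4))
```

The idle count grows as $2^{n-1} + 1$, which is the linear overhead the text calls unacceptable.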


Proof that an n-1 level binary tree can be mapped on 
an n-cube forms a part of Appendix II. In fact, two n-1 
level binary trees can be simultaneously mapped on an n- 
cube. 


The result obtained in Appendix I tends to question 
the suitability of hypercube topology for super- 
multiprocessor architecture on the grounds that it cannot 
support the scaling of a fundamental application. A ques- 
tion we may ask is: Is it possible to modify the application 
graph such that the new structure scales well on the hyper- 
cube architecture without introducing overheads that grow 
inordinately? 


Appendix II shows that it is indeed possible to slightly
modify the original tree graph and make it well-scalable on
the cube. That is, a modified n level tree is optimally mappable
on an n-cube. By the introduction of a single two-degree
node as a son of its root, and thereby stretching (or
equivalently double-rooting) it, the tree can be made to utilize
the cube completely (see Figure 3b). The extra node so
introduced is used only for communication between the root
and one of its sons. It is interesting to note that only one
such node is required for any n. This node therefore
represents a constant overhead for all n ≥ 3 which
diminishes in percentage as n becomes larger.


2 Certain other multi-tree networks, like the KYKLOS [19], are
known to offer average interprocessor distances which are better
than that for the MSINs.


Since each processing node in the cube accommodates 
a single computation node of the tree, equal increase in sizes 
of both application and architecture does not change the 
computational load on an individual processor. Upon scal- 
ing, the links between the processors continue to correspond 
to edges between neighboring nodes in the tree on a one-to- 
one basis and handle the same level of traffic as before. 
This ensures that the computational efficiency of the ap- 
plication too does not drop significantly, although a con- 
stant time is added because of the communication through 
the spacer node. 


4 A Distributed Tree Set-Up Algorithm 


In the previous section we saw that an n level
stretched binary tree can be mapped onto a hypercube of order
n. In order to prove that result, we have introduced
transformations of the node labels. It is possible to use 
these transformations, namely FT3 and BT3, to obtain a 
distributed algorithm to set up the referred mapping. This 
algorithm has practical importance for executing a binary 
tree shaped application on a hypercube. It is used first to 
set up the communication topology among the processors of 
the cube, and then the application is executed. The algo- 
rithm does not require that the entire system be configured 
as a tree. It works as well within any subcube of the overall 
system. The algorithm is explained below. 


The algorithm is executed by each processor in the hy- 
percube. It is divided into two phases; namely compute and 
distribute. During the first phase, the nodes of the cube 
compute the loci of the mapped trees as they are trans- 
formed and merged to form larger trees. In the second 
phase, the port configurations of the nodes forming the final
tree are sent to the appropriate nodes of the cube.


The algorithm starts by setting up initial $2^{n-3}$ three
level trees having a predetermined configuration. That is,
every node in the cube is a member of a three-level tree; its
position in the tree is determined by the least significant
three bits of its ID. Each three-level tree is defined over a
set of eight nodes which have identical n-3 most significant
bits. The configuration of these trees is shown in Figure 5.
The labels of the nodes of these mapped trees are henceforth 
referred to as virtual addresses. At each successive itera- 
tion of the algorithm, trees are merged to form larger trees 
until eventually the final tree is attained. These inter- 
mediate trees are not realized by activating links between 
appropriate nodes in the cube. However, the cumulative in- 
formation available in the cube is sufficient to make this 
possible. 
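
The initial three-level configuration can be reconstructed from the port assignments listed in the case statement of the tree set-up algorithm in the appendix. The Python sketch below transcribes that table and derives father and son links from it; the names `PORTS`, `sons` and `father` are hypothetical, and the sketch covers only the initial trees, not the merging iterations.

```python
# Port roles per 3-bit node label, transcribed from the appendix case statement:
# port p is the link across bit p; "f" = link to father, "s" = link to son.
PORTS = {
    0: {0: "s", 2: "s"},
    1: {0: "f", 1: "s", 2: "s"},
    2: {2: "f"},
    3: {1: "f"},
    4: {1: "s", 2: "f"},
    5: {2: "f"},
    6: {0: "s", 1: "f", 2: "s"},
    7: {0: "f"},
}


def sons(node: int) -> list[int]:
    """Labels reached over this node's "s" ports."""
    return sorted(node ^ (1 << p) for p, role in PORTS[node].items() if role == "s")


def father(node: int):
    """Label reached over this node's "f" port, or None for the root."""
    links = [node ^ (1 << p) for p, role in PORTS[node].items() if role == "f"]
    return links[0] if links else None


if __name__ == "__main__":
    for node in range(8):
        print(format(node, "03b"), "father:", father(node), "sons:", sons(node))
```

Under this table node 000 is the root, node 100 has a single son (110) and so acts as the spacer, and every father/son link crosses exactly one bit, that is, one cube edge.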


At the outset, the virtual address of a tree node cor- 
responds to the label of the physical node in the cube to 
which it is mapped. After this, the physical nodes compute 
the loci in the n-cube of these initial tree nodes as the map- 
pings are transformed and merged to form larger trees. At 
the end of the compute phase of the algorithm, each physi- 
cal node holds the node label to which the virtual address 
would eventually be mapped, and also the port assignments 
required at this final node to configure the tree properly. 
These port assignments are relayed to the final node during 
the distribution phase of the algorithm. 


For an n level tree in an n-cube, O(logN) iterations are 
required, and if we assume no collisions during the distribu- 
tion phase, it too is O(logN) long, since each node in the 
cube emits one message. It therefore implies that the algorithm
is of O(logN) time complexity. The algorithm is
given in Appendix III.


5 Tree Mapping in Presence of a Node 
Fault 


We consider here the mapping of a stretched binary 
tree topology in presence of single node faults. We make an 
assumption while formulating the strategies: the knowledge 
of this failure is global. That is, all processing nodes know
beforehand the label or ID of the failed node. This prior
knowledge is not essential, and an algorithm to distribute 
the labels of the failed nodes can be formulated if necessary. 


We will first consider a multiple step approach where 
each step represents one mapping of a portion of the com- 
putation graph and its subsequent execution. Clearly, an 
n-1 level sub-tree with a spacer node occupies $2^{n-1}$ nodes
and thus is mappable into half of a hypercube of order n. 
By utilizing this partitioning of the cube, the corresponding 
computational graph must be segmented into three com- 
ponents: the root, the left sub-tree and the right sub-tree. 
In this way, each sub-tree segment of the graph can be 
mapped onto the valid half partition of the cube in which no 
nodes are faulty, while the faulty node is isolated to the 
other half. 


Because the proof in Appendix II works inductively for 
successively higher order cubes, the reverse fragmentation to 
lower cubes follows directly, provided the cube partition is 
done on a dimensional boundary. Remembering that each
successively higher order cube adds one bit to the node ad- 
dresses, a mapping partition can consist of all nodes whose 
high order bit is the complement of that of the faulty node. 
In this way the problem is reduced to that of mapping a 
sub-tree to a subcube. Figure 6 shows the valid half of the 
cube as those nodes with the high order bit in their labels to 
be 1 since it is assumed that the faulty node has a 0 as the 
high order bit of its label. 


The disadvantage with this approach, however, is that 
it may lead to a three-step process of map-and-execute (left 
tree, right tree and then root), and additionally, because of
a single faulty node, $2^{n-1} - 1$ nodes are left unutilized. This
seems a high price to pay to avoid a fault. We now address 
the question of whether one can map-and-execute in a single 
step. 


A three-level tree with the spacer node is shown in 
Figure 5. In this figure, we can see that the root of the tree 
occupies physical node ID 000 and the spacer node ID is 100. 
After mapping, only one node within the cube has no cor- 
responding computational requirement in the stretched tree. 
We observe that the spacer node (S100 in Figure 4a) has as 
its only responsibility, the communication of results from 
the right sub-tree to the root. Certainly, this communica- 
tion is an essential part of the computation graph. But, 
S100 is not the only node linking S000 and S110 within the 
hypercube, and because the node S010 likewise links them, 
it could be used instead. 


Let us assume for the moment that we have a means 
to "force" the faulty node to be S100. In this way, all com- 
putation nodes are mapped to working processors and S010 
doubles as both a computation node and an alternate link 
from the right sub-tree to the root (see Figure 9). In all of 
the tree algorithms discussed, at the time when this com- 
munication is needed all members of the right sub-tree have 
completed their computation and gone idle. This implies 
that no timing conflict arises as a result of this dual role for 
S010. 


The remaining open question concerns the ability to 
"force" the faulty node to the spacer node. We answer this 
question via an example of a 3-cube. The method can be ex- 
tended easily to an n-cube. 


We will assume a single fault at node 001, which ac- 
cording to the mapping in Figure 7, is a computational 
node. We assume that the faulty node address (001) is 
available to all nodes in the cube. We now provide a means 
to remap the initial three level tree such that the fault will 
map to a logical node 100 and all other nodes will be trans- 
formed through three dimensional space conserving their ad- 
jacency relationships. 


Figure 8 shows that an XOR bit mask controls the
reflection or non-reflection of each bit in the node label. If
the bit in the mask is 0, no reflection occurs; but if it is a 1,
then a pair of node labels are swapped in that dimension.
Thus, a remapping of the labels occurs, but adjacency is
preserved. This mask can be generated by taking the exclusive
OR (XOR) between the fault node ID and the spacer
ID (100). The XOR of this mask and each node ID creates
virtual addresses which force the fault to be mapped to the
spacer node as shown in Figure 9.
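
The mask construction can be sketched in Python as follows. The function name `remap_for_fault` is hypothetical; the sketch only computes the physical-to-virtual label map and checks nothing about the tree itself.

```python
def remap_for_fault(fault: int, spacer: int, n: int) -> dict[int, int]:
    """Map each physical label of an n-cube to its virtual address."""
    mask = fault ^ spacer                      # XOR of fault ID and spacer ID
    return {node: node ^ mask for node in range(1 << n)}


if __name__ == "__main__":
    # Example from the text: fault at 001, spacer at 100 in a 3-cube.
    mapping = remap_for_fault(0b001, 0b100, 3)
    for phys, virt in mapping.items():
        print(format(phys, "03b"), "->", format(virt, "03b"))
```

Because XOR with a fixed mask flips the same bits in every label, two labels that differ in one bit before the remapping still differ in exactly that bit afterwards, which is why adjacency is preserved.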


The above method can be extended to an n-cube if we 
can designate a particular label as that of the spacer node. 
The tree set-up algorithm of Appendix III always produces a 
spacer node at S100, where S is a string of n-3 zeros. Thus, 
in the case of an n-cube, all nodes use their own physical 
labels as virtual addresses to start with. Subsequent to the 
tree set-up algorithm, the resultant virtual addresses are 
XORed with the mask to obtain an appropriate tree with a 
spacer node mapped onto the faulty physical node. 


6 Summary and Conclusions 


In this paper the concept of application scalability was 
introduced and applied to an example of a hypercube ar- 
chitecture and a binary tree application. For inductive ar- 
chitectures like the hypercube, application scalability is an 
indicator of how well a given application will be supported 
with increasing sizes. The parameters of interest, while 
evaluating application scalability, are utilization of 
hardware resources and efficiency of execution. The first of 
the parameters is evaluated by mapping a maximum sized 
application graph onto the architecture graph and perform- 
ing a calculation of the utilization as a function of size. The 
second is analyzed by measuring the growth of computation 
and communication delays as a function of size. 


It was concluded that a balanced binary tree is not 
well scalable on a hypercube. It was also proved that a 
stretched binary tree results in a scalable mapping giving 
near 100% utilization of processing resource in the architec- 
ture. In addition, an O(logN) distributed algorithm was 
derived to generate a stretched binary tree in the hypercube. 
The tree construction algorithm was also shown to be 
adaptable to generate a mapping in presence of a single 
node fault by augmenting it with a subsequent fault- 
avoidance step. 


Of the two approaches evaluated for executing the tree 
application in the presence of a single node fault, one 
provides complete utilization of fault-free nodes for an n 
level tree in an order n hypercube. Thus, even with single 
node fault, the binary tree application can be mapped and 
run with virtually no degradation in performance. 


Appendix I 


Definition: A hypercube of order n (an n-cube) is an undirected
graph with $2^n$ vertices labelled 0 through
$2^n - 1$. There is an edge between a given pair of vertices
if, and only if, the binary representations of their
labels differ by one and only one bit. A hypercube of order
3 is shown in Figure 1.


Definition: Parity of a node is odd or even and is deter- 
mined by the number of 1’s in the label of the node. 


Lemma 1: If nodes a and b are connected by an edge, a and 
b have opposite parities. 
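
The two definitions and Lemma 1 can be checked mechanically for small cubes. The Python sketch below builds the edge list directly from the definition and computes label parity; the function names are hypothetical.

```python
def hypercube_edges(n: int) -> list[tuple[int, int]]:
    """All edges (a, b) with a < b whose labels differ in exactly one bit."""
    return [(a, a ^ (1 << i))
            for a in range(1 << n)
            for i in range(n)
            if a < (a ^ (1 << i))]


def parity(label: int) -> int:
    """0 for an even number of 1 bits in the label, 1 for an odd number."""
    return bin(label).count("1") % 2


if __name__ == "__main__":
    edges = hypercube_edges(3)
    print(len(edges), "edges")
    print(all(parity(a) != parity(b) for a, b in edges))
```

Crossing an edge flips exactly one bit, so the count of 1 bits changes by one and the parity always changes, which is Lemma 1.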


If we try to map a binary tree on a hypercube, we can 
observe the following lemma: 


Lemma 2: The node parities alternate over the levels of a 
mapped binary tree. 


We now count the number of odd and even nodes 
necessary to map a binary tree, and the number of odd and 
even nodes available in a hypercube: 


By symmetry, the number of odd and even nodes available
in the hypercube is the same and is half the total number of
nodes, that is $\frac{1}{2}(2^n) = 2^{n-1}$. For a binary tree, we get the
following values:


Case 1: n is even.

Number of even nodes = $2^0 + 2^2 + \cdots + 2^{n-2} = \frac{1}{3}(2^n - 1)$

Number of odd nodes = $2^1 + 2^3 + \cdots + 2^{n-1} = \frac{2}{3}(2^n - 1)$

Case 2: n is odd.

Number of even nodes = $2^0 + 2^2 + \cdots + 2^{n-1} = \frac{1}{3}(2^{n+1} - 1)$

Number of odd nodes = $2^1 + 2^3 + \cdots + 2^{n-2} = \frac{2}{3}(2^{n-1} - 1)$


In both cases the number of even and odd nodes required by
the tree do not match those of the cube: for n ≥ 3 the larger
of the two counts exceeds the $2^{n-1}$ nodes of either parity
that the cube provides. This completes the
proof.
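
The counting argument can be reproduced numerically. The Python sketch below sums the level sizes of a balanced n-level tree by level parity (root at level 0) and compares the larger count with the $2^{n-1}$ nodes of each parity in the cube. The function names are hypothetical, and the check is only the necessary condition used in the proof, not a mapping procedure.

```python
def tree_parity_counts(n: int) -> tuple[int, int]:
    """(nodes on even levels, nodes on odd levels) of a balanced n-level tree."""
    even = sum(2 ** level for level in range(0, n, 2))
    odd = sum(2 ** level for level in range(1, n, 2))
    return even, odd


def passes_parity_check(n: int) -> bool:
    """True if neither parity class of the tree exceeds half of the n-cube."""
    return max(tree_parity_counts(n)) <= 2 ** (n - 1)


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, tree_parity_counts(n), 2 ** (n - 1), passes_parity_check(n))
```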


Appendix II


Figure 3a shows a binary tree of n=3. The figure also 
shows the even and odd levels of the tree, assuming level 0 
of the tree is at an even level. Figure 3b shows the con- 
struction of a 4-level tree out of two 3-level trees using an 
extra node at level 1 of the right hand subtree. The resul- 
tant tree has equal number of odd and even nodes which can 
be matched by those in a 4-cube. The following is a constructive
proof that such a mapping is possible for n ≥ 3.
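
The parity balance of the stretched tree can be confirmed by counting, as in the Python sketch below. It assumes the shape described above: the root at level 0, the left (n-1)-level subtree rooted at level 1, the extra node at level 1, and the right (n-1)-level subtree rooted one level lower, at level 2. The function name is hypothetical.

```python
def stretched_parity_counts(n: int) -> tuple[int, int]:
    """(even-level nodes, odd-level nodes) of an n-level stretched binary tree."""
    counts = [0, 0]
    counts[0] += 1                                # root at level 0
    counts[1] += 1                                # extra (spacer) node at level 1
    for depth in range(n - 1):                    # depth inside each subtree
        counts[(1 + depth) % 2] += 2 ** depth     # left subtree, rooted at level 1
        counts[(2 + depth) % 2] += 2 ** depth     # right subtree, rooted at level 2
    return counts[0], counts[1]


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, stretched_parity_counts(n), 2 ** (n - 1))
```

Shifting the right subtree down by one level swaps its even and odd counts relative to the left subtree, so the two subtrees together are balanced, and the root and the extra node add one node to each class.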


Definition: An n-cube is said to be k-dimensionally transformed
when k bits in the label of each node of the
cube are transformed according to a transformation.
(Note that the bits undergoing transformation need 
not be contiguous.) 


We define two 3-dimensional transformations, FT3
and BT3, as shown in Table 1. The $x_i$, $x_j$ and $x_k$ are the
original values of the bits, and $y_i$, $y_j$ and $y_k$ are the corresponding
resultant values.³


Definition: Distance between a pair of nodes in a graph is
the minimum number of edges that have to be
traversed when going from one node to the other. For
an n-cube, this is also equal to the number of bits by
which the binary representations of labels of the two
nodes differ [11]. The adjacency in an n-cube is the
distance relationship of a node with the rest of the
nodes in the cube.
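
The equality between graph distance and the number of differing label bits can be verified for small cubes with a breadth-first search, as in this Python sketch. The function names are hypothetical.

```python
from collections import deque


def bfs_distances(start: int, n: int) -> dict[int, int]:
    """Minimum number of edges from start to every node of an n-cube."""
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for i in range(n):
            nxt = node ^ (1 << i)          # neighbor across dimension i
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                frontier.append(nxt)
    return dist


def differing_bits(a: int, b: int) -> int:
    """Number of bit positions in which labels a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    d = bfs_distances(0b0101, 4)
    print(all(d[v] == differing_bits(0b0101, v) for v in range(16)))
```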


We state the following two theorems proved in [12]. 


Theorem 1: The FT3 preserves cube adjacency. 


Theorem 2: The BT3 preserves cube adjacency. 


We can now state the following corollaries: 


Corollary 1: Nodes with distinct labels map to distinct 
resultant labels. 


Corollary 2: If two nodes are neighbors and thus have 
labels differing in one bit position, their new labels 
after the transformations also differ in one bit posi- 
tion. Thus, neighborhood between the two nodes is 
preserved. 


Theorem 3: A graph G mapped on an n-cube remains un- 
changed in structure after the transformations. 


Proof: The proof follows as a direct consequence of corol- 
laries 1 and 2, and is given in [12]. 


We can therefore state the following corollary: 


Corollary 3: The transformations FT3 and BT3 preserve a 
tree structure mapped on an n-cube. 


Theorem 4: A stretched binary tree (of the form shown in 
Figure 3b) of n levels can be mapped on an n-cube, for 
n > 2.


Proof: The case for n=2 is trivial. The case of n=3 is 
shown in Figure 5. The construction of a four level 
tree from a pair of 3-cubes can be found in [12]. We 
will now describe the inductive step. We assume that 
an n_ level stretched binary tree, as shown in 
Figure 4a, is mapped on an n-cube. To obtain an n+l 
level tree in an n+1-cube: 


e Duplicate the cube and_ the 


Figure 4b. 


mapping. See 


3The two transformations define two of the many rigid trans- 
formations of a 3-cube. The rigid transformations include rota- 
tions and reflections of the original cube structure which preserve 
the adjacency between its nodes. 
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e Apply FT3 to the duplicate mapping using bit-2 
as X,, bit-1 as x; and bit-0 as x, to obtain a new 
mapping as in Figure 4c. 

e Form an n+1-cube by connecting the nodes with 
like labels in the two n order subcubes. Append a 
0 to the left of labels in the original subcube and 
a 1 to the left of those in the duplicate subcube 
(see Figure 4d). 

e Deallocate links: 0S100-0S110 and 1S100-1S000, 
and allocate links: OSO000-1S000, 0S110-1S110 and 
0S100-1S100. See Figure 4e. | 

e Apply the BT3 to the n+1 cube to obtain an n+1 
level stretched binary tree rooted at OS000. 
While applying the BT3, use the most significant 
bit as x,, the third least significant bit (bit-2) as 
Xj, and the least significant bit. (bit-0) as x,. 
Replace the string 0S by S’ to obtain a structure
similar to the base structure we started with. See 
Figures 4f and 4g. 


Appendix II 
The tree set-up algorithm uses the following variables:


current-port: Every node in the cube has logN ports, each
corresponding to a bit position. This variable keeps a
running pointer to the current bit position being con-
sidered.

physical-id: Original label of a node.

current-id: A node label during the current iteration.

port-relation(1..logN): An array of values specifying the
current active connections of a node. All are initial-
ized to "null" (inactive).


The following values are assigned to the port-relation(i) vari-
able:

"null": No active connection.

"f": Connection to "father" node.
"s": Connection to "son" node.


The following is the compute phase of the tree set-up algo-
rithm. This is followed by a distribution phase during
which the connectivity information is sent to the ap-
propriate physical node specified by the current-id variable.


for-all nodes do
begin
  current-id = physical-id;
  {Set up 3 level trees.}
  case current-id( bits: 2..0 ) of
    0: port-relation(0) = port-relation(2) = "s";
    1: port-relation(0) = "f";
       port-relation(1) = port-relation(2) = "s";
    2: port-relation(2) = "f";
    3: port-relation(1) = "f";
    4: port-relation(1) = "s";
       port-relation(2) = "f";
    5: port-relation(2) = "f";
    6: port-relation(0) = port-relation(2) = "s";
       port-relation(1) = "f";
    7: port-relation(0) = "f";
  end-case
  for current-port = 3 to logN-1 do
  begin
    {Form larger trees iteratively.}
    if (current-id(current-port) = 1) then
    begin
      Apply FT3 to current-id's bits 2, 1 and 0;
    end
    if (Bits 3 through current-port-1 are 0) then
    begin
      case current-id( bits: current-port, 2, 1, 0 ) of
        0:  port-relation(2) = "f";
            port-relation(current-port) = "s";
        4:  port-relation(2) = "s";
            port-relation(current-port) = "s";
            port-relation(1) = "null";
        6:  port-relation(1) = "null";
            port-relation(current-port) = "f";
        8:  port-relation(2) = "null";
            port-relation(current-port) = "f";
        12: port-relation(2) = "null";
            port-relation(current-port) = "f";
        14: port-relation(current-port) = "s";
      end-case
    end
    Apply BT3 to bits current-port, 2 and 0 of current-id;
  end
end
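The first case statement of the compute phase can be checked mechanically. The sketch below is only an illustration in Python (the names BASE_PORTS, neighbor and build_tree are hypothetical, not from the paper): it copies the port table for the 3-cube, verifies that every "s" port faces an "f" port on the neighbouring node, and recovers the father/son structure of the 3 level tree.

```python
# Port table copied from the first case statement of the set-up pseudocode:
# node label (bits 2..0) -> {port number: "f" (father) or "s" (son)}.
BASE_PORTS = {
    0: {0: "s", 2: "s"},
    1: {0: "f", 1: "s", 2: "s"},
    2: {2: "f"},
    3: {1: "f"},
    4: {1: "s", 2: "f"},
    5: {2: "f"},
    6: {0: "s", 1: "f", 2: "s"},
    7: {0: "f"},
}


def neighbor(node, port):
    """Cube neighbour reached through `port`: the label with that bit flipped."""
    return node ^ (1 << port)


def build_tree(ports):
    """Return (parent, children) maps; raise if the two ends of a link disagree."""
    parent = {}
    children = {node: [] for node in ports}
    for node, relation in ports.items():
        for port, kind in relation.items():
            other = neighbor(node, port)
            expected = "f" if kind == "s" else "s"
            if ports[other].get(port) != expected:
                raise ValueError(f"link {node}-{other} is inconsistent")
            if kind == "s":
                children[node].append(other)
            else:
                parent[node] = other
    return parent, children


parent, children = build_tree(BASE_PORTS)
root = [node for node in BASE_PORTS if node not in parent]
print(root, children)
```

Under this reading of the table, node 0 is the root with sons 1 and 4, node 4 has the single son 6 (the "stretched" link), and the remaining nodes hang off nodes 1 and 6.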


ABSTRACT


We present systolic tree architectures for data structures
such as stacks, queues, priority queues, deques and dictionary
machines. Except for the dictionary machine, all data structures
have a unit response time. In each node of the tree the mechanism
for controlling the transmission and distribution of data is a finite
state machine. The depth of a tree with n data elements is O(log
n). For stacks, queues, and deques, no compress signals are
required.


1. Introduction 


Several researchers have explored systolic machines for
implementing data structures. Leiserson [LEI79], and Guibas and
Liang [GUI82] have given linear systolic array implementations of
priority queues, stacks and queues. Ottman et al. [OTT82], Atal-
lah and Kosaraju [ATA85], and Somani and Agarwal [SOM85]
have shown that dictionary operations can be implemented on a
tree. These schemes require additional clock cycles [LEI79][GUI82],
or extra connections [OTT82], or storage of several addresses
[ATA85].

This paper presents a relatively uniform approach to han-
dling several types of data structures. All the data structures use
finite state control, and with the exception of the dictionary machine,
have a unit response time. The response time of the dictionary
machine is O(log n). The input and output in these machines is
done through the root of a tree. Throughout this paper, n is the
number of data elements in the tree. We will first present a tree-
based implementation of stacks, queues and deques. Using a dif-
ferent approach, a priority queue and a dictionary machine are
developed with a wide repertoire of operations. Compress cycles
are necessary in the second approach to allow successive deletions.


2. Stack, Queue, Deque


An output restricted deque (Knuth [KNU68]), shown in Fig-
ure 1, is a data structure which permits insertions to both the front
and the rear. Deletions, however, are only permitted from the
front. The output restricted deque is a generalization of both the
stack and the queue, since disallowing insertions at the rear
makes this a stack while disallowing insertions at the front makes
this a queue.


Consider a binary tree each of whose nodes has a data
register for storing data, a buffer for pipelining instructions to its
children, and two flags. The root of the tree corresponds to the
front of the output restricted deque. An instruction entering a
node is executed, and the instruction to be pipelined is stored in
the buffer of the node. In the next cycle, the buffer contents are
sent to either the left child or the right child.


There are two insert instructions: a Stack Insert (SI) and a
Queue Insert (QI). A Stack Insert into a node causes the inserted
value to be stored in the node, while the previous contents are
directed to one of the children. A Queue Insert causes the inserted
data to be pipelined to one of the children while retaining the con-
tents of the data register. Each node has two binary flags, a queue
flag (QF) and a stack flag (SF). When an empty node receives an
insert instruction, its flags are assigned 0. In every non-empty
node a stack insertion is directed towards the left (right) subtree if
the SF is 1 (0). A queue insert is directed left (right) if the Queue
Flag (QF) is 0 (1). A deletion is from the left (right) subtree if the
SF is 0 (1). The SF is complemented if either a Stack Insert or a
Delete enters a node, while the QF is complemented if there is a
Queue Insert. The behaviour of the output restricted deque is
given in Table 1. Figure 2 shows the evolution of the output res-
tricted deque under a sequence of instructions. A deque can be
implemented by combining two output restricted deques [CHA85].
The complete implementation of the stack and queue is also
described in [CHA85].
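The flag rules above can be exercised with a small sequential simulation. The following Python sketch is an illustration only: it ignores pipelining and the instruction buffer, creates children lazily, uses None to mark an empty node, and assumes that contents displaced by a Stack Insert travel to the child as a Stack Insert. The class and function names are hypothetical.

```python
import random
from collections import deque


class Node:
    """One tree node: a data register plus the stack flag SF and queue flag QF."""

    def __init__(self):
        self.data = None          # None marks an empty node (assumption)
        self.sf = 0
        self.qf = 0
        self.kids = [None, None]  # index 0 = left child, 1 = right child

    def _child(self, side):
        if self.kids[side] is None:
            self.kids[side] = Node()
        return self.kids[side]

    def stack_insert(self, value):
        if self.data is None:
            self.data, self.sf, self.qf = value, 0, 0
            return
        old, self.data = self.data, value
        # Displaced contents go left if SF is 1, right if SF is 0.
        self._child(0 if self.sf == 1 else 1).stack_insert(old)
        self.sf ^= 1

    def queue_insert(self, value):
        if self.data is None:
            self.data, self.sf, self.qf = value, 0, 0
            return
        # Pipelined left if QF is 0, right if QF is 1.
        self._child(self.qf).queue_insert(value)
        self.qf ^= 1

    def delete(self):
        if self.data is None:
            return None
        out = self.data
        # Refill from the left subtree if SF is 0, from the right if SF is 1.
        self.data = self._child(self.sf).delete()
        self.sf ^= 1
        return out


def agrees_with_reference(steps, seed):
    """Compare the tree with collections.deque on a random instruction stream."""
    rng = random.Random(seed)
    tree, ref = Node(), deque()
    for value in range(steps):
        op = rng.choice(["SI", "QI", "D"])
        if op == "SI":
            tree.stack_insert(value)
            ref.appendleft(value)
        elif op == "QI":
            tree.queue_insert(value)
            ref.append(value)
        elif tree.delete() != (ref.popleft() if ref else None):
            return False
    return True


root = Node()
root.stack_insert(1)
root.stack_insert(2)
root.queue_insert(3)
outputs = [root.delete(), root.delete(), root.delete()]
print(outputs)
```

In this model the front of the deque is always the root register, so a Stack Insert behaves like an insertion at the front and a Queue Insert like an insertion at the rear.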


3. Priority Queue 


In our tree implementation of a priority queue and a dic-
tionary machine, each node stores up to three data elements. Data
is ‘snaked’ as in [ATA85]. Balancing the tree is achieved by a flag
at each node. All operations can be pipelined.


First let us consider the priority queue. Our implementation
supports Deletemax and Deletemed in addition to Insert and
Deletemin. Deletemin, Deletemax and Deletemed applied to a
node x remove the smallest, largest and the [n/2]th smallest (i.e.
median) elements from the list of data elements contained in the
subtree generated by x.


Every node x has three storage registers (L(eft), M(iddle)
and R(ight)), the contents of which are denoted by x.L, x.M and
x.R. Each non-leaf node x has x.L, x.M and x.R as the smallest,
median and largest data elements respectively in the subtree gen-
erated by x. A non-leaf node is called "balanced" if the number of
data elements of its left subtree is equal to that of its right sub-
tree. A node is called "right biased" if its right subtree has one
more data element than its left subtree. Every node is either "bal-
anced" or "right biased", which is denoted by a flag. If a node x
does not contain three data elements (i.e. it is a leaf node), insert
and delete operations at that node are obvious. For example, if x
has two data elements, x.L and x.M, and if I[Y] is to be executed,
we only need to adjust x.L, x.M, and x.R accordingly. No instruction
will be pipelined further.


Consider operations on a node x which has three data ele-
ments. First consider I[Y] applied at a node x with the left child y
and the right child z. Temporarily, we ignore the balancing.

(case 1) If Y<x.L, x sends I[x.L] to its left child y and sets x.L to
Y.

(case 2) If x.L<Y<x.M, x sends I[Y] to its left child y.

(case 3) If x.M<Y<x.R, x sends I[Y] to its right child z.

(case 4) If Y>x.R, x sends I[x.R] to its right child z and sets x.R
to Y.
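The four cases can be written as a single routing function. The Python sketch below is illustrative only; route_insert is a hypothetical name, keys are assumed distinct, and balancing is ignored exactly as in the cases above.

```python
def route_insert(L, M, R, Y):
    """Apply I[Y] at a full node with registers (L, M, R), ignoring balancing.

    Returns (child, forwarded_key, new_registers), where child is "left" or
    "right" and forwarded_key is the operand of the insert sent to that child.
    """
    if Y < L:                       # case 1: Y becomes the new minimum
        return "left", L, (Y, M, R)
    if Y < M:                       # case 2: Y belongs to the left subtree
        return "left", Y, (L, M, R)
    if Y < R:                       # case 3: Y belongs to the right subtree
        return "right", Y, (L, M, R)
    return "right", R, (L, M, Y)    # case 4: Y becomes the new maximum


print(route_insert(2, 5, 9, 1))
print(route_insert(2, 5, 9, 7))
```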


Now we consider the balancing. If a node is balanced and
sends an insert instruction to its right child, or if it is right biased
and sends an insert instruction to its left child, the node simply
changes its flag. Otherwise, the tree needs to be balanced. To sim-
plify the description, we assume that if a leaf node w has only one
element, then w.L=w.M. Also, if a leaf node has two data ele-
ments, they are stored in the registers L and R, accordingly.

(i) If node x is balanced before the insertion I[Y] is applied and
Y<x.M, an insert instruction must be sent to its left child (as in
cases 1 and 2 above). The following modification is therefore
required.
1. x sends I[x.M] to its right child z and sets x.M to
max(Y,y.R).

2. If Y<x.L, x sends an instruction IR[x.L] to the left
child y and sets x.L to Y. (The instruction IR[b]
when applied to a node y will delete y.R and insert
b at y.)

If x.L<Y<x.M and Y>y.R, then no instruction is sent
to y (note that x.M is set to Y).

If x.L<Y<x.M and Y<y.R, then x sends IR[Y] to its
left child y (note that x.M is set to y.R).

3. x changes its flag to 1.


(ii) If x is right biased before I[Y] is applied and Y>x.M, then the
following modification is necessary because an insert instruction
must be sent to its right child.
1. x sends I[x.M] to its left child y and sets x.M to
min(Y,z.L).

2. If Y>x.R, x sends an instruction IL[x.R] to its right
child z and sets x.R to Y, where an instruction IL[W]
applied at node z will delete z.L and insert W at z.

If x.M<Y<x.R and Y<z.L, then no instruction is sent
to z (note that x.M is set to Y).

If x.M<Y<x.R and Y>z.L, then x sends IL[Y] to its
right child z.

3. x changes its flag to 0.


As shown above, some insert instructions may generate
instructions IL and IR in the next stage. Let us consider the pro-
cess of IL[Y] at node x. IR[Y] can be executed similarly. If node x
is a leaf node, IL[Y] simply deletes x.L and inserts Y into node x.
Consider an IL[Y] to a nonleaf node x with left child y and right
child z.

If Y<y.L, x.L is set to be Y and no further processing is necessary.
Otherwise, x sets x.L to y.L and does the following process:
(a) If y.L<Y<x.M, x sends IL[Y] to y.
(b) If x.M<Y<z.L, x sends IL[x.M] to y, sets x.M to Y.
(c) If z.L<Y<x.R, x sends IL[x.M] to y, sets x.M to z.L and
sends IL[Y] to z.
(d) If Y>x.R, x sends IL[x.M] and IL[x.R] to y and z respec-
tively, and sets x.M to z.L and x.R to Y.


Note that neither the IL nor the IR instruction changes the flag of
the node. The case when node x has only one child can be handled in
a similar way. Figure 3 shows the data distributions after a
sequence of insert operations is done.


A Deletemin instruction at node x with left child y and
right child z causes the following actions. First, x sets x.L to y.L.
If x is balanced (before Deletemin is applied), x sends Deletemin
to its left child y. If x is right biased, x sends IL[x.M] to y, sets
x.M to z.L and sends Deletemin to z. Then, x changes its flag.


Similar operations can be derived for Deletemax and
Deletemed [CHA85]. Note that each delete operation must be fol-
lowed by a compress cycle so that when a delete operation is
applied at a non-leaf node x, x must have three data elements
after the operation.


4. Dictionary Machine 


The dictionary machine, in addition to the above priority
queue operations, permits Delete[Y], which removes the key Y
from the dictionary and is denoted by D[Y]. First, we discuss the
action taken at each node as a result of each operation under the
assumption that no redundant operation is allowed. Later the issue
of redundancy is considered. D[Y] applied at a leaf node can be
handled in an obvious way. Suppose that D[Y] is applied at node x
with left child y and right child z.

(i) If x.L=Y, x sets x.L to y.L and sends down Deletemin to its
left child y. If node x is balanced (before D[Y] is applied), no
further processing is necessary for balancing. If node x is right
biased, then x sends I[x.M] together with Deletemin to its left
child y, sets x.M to z.L and sends Deletemin to its right child z.


(ii) If x.L<Y<x.M, D[Y] must be sent to the left child y. If x is
balanced, no further process at node x is necessary for balancing.
If x is right biased, x sends I[x.M] together with D[Y] to its left
child y, sets x.M to z.L and sends down Deletemin to its right
child z.


(iii) If x.M=Y and x is balanced, x sets x.M to y.R and sends
Deletemax to its left child y. If x.M=Y and x is right biased, x
sets x.M to z.L and sends Deletemin to the right child z.


(iv) If x.M<Y<x.R, D[Y] must be sent to the right child z. If x
is right biased, no further process is needed at node x for balanc-
ing. If x is balanced, x sends I[x.M] together with D[Y] to its
right child z, sets x.M to y.R and sends Deletemax to its left
child y.


(v) If x.R=Y, x sets x.R to z.R and sends Deletemax to its right
child z. If x is right biased, no further process is needed at node
x for balancing. If x is balanced, x sends I[x.M] (together with
the Deletemax) to its right child z, sets x.M to y.R and sends
Deletemax to its left child y.

In each of the above cases, node x changes its flag.


As described above, a delete instruction at a node may
require the node to send a pair of instructions to its child in the
next stage. The execution of an instruction pair is very similar to
the execution of the instructions IL and IR generated in a priority
queue. Also, the processes that take place for an instruction pair
arriving at a leaf node or a node having only one child can be
described in an obvious way. Figure 4 shows the configurations of
a dictionary machine before and after a deletion.


When redundant inserts and deletes are allowed, we need
an extra O(log n) time cycles to check the redundancy. Each instruc-
tion has three phases: broadcasting phase, verifying phase and exe-
cution phase. During the broadcasting phase, the given instruction
is broadcast down the tree. The verifying phase starts when
the instruction arrives at the leaf nodes. During the verifying
phase, the instruction arriving at a leaf node moves up to the root
by checking data stored in nodes and the instructions which are in
the execution phase. Thus, at the root node, the redundancy of the
instruction is finally verified. For example, the instruction I[X]
which follows I[X], and D[X] which follows D[X], should be flagged
as redundant. Special care must be taken to consider the following
two cases.

(i) I[X] is in the execution phase and two D[X] instructions are in
the verifying phase. The first D[X] should be non-redundant.

(ii) D[X] is in the execution phase and two I[X] are in the verifying
phase. The first I[X] is non-redundant.
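The intended outcome of the redundancy check can be stated as a sequential rule: an insert is redundant when the key is already present once all earlier instructions are taken into account, and a delete is redundant when the key is absent. The Python sketch below models only that outcome, not the three-phase tree protocol; flag_redundant and its arguments are hypothetical names.

```python
def flag_redundant(instructions, initial_keys=()):
    """Return one boolean per instruction: True if it is redundant.

    instructions is a list of ("I", key) or ("D", key) pairs, processed in
    issue order, so earlier instructions still in flight are accounted for.
    """
    present = set(initial_keys)
    flags = []
    for kind, key in instructions:
        if kind == "I":
            flags.append(key in present)
            present.add(key)
        elif kind == "D":
            flags.append(key not in present)
            present.discard(key)
        else:
            raise ValueError(f"unknown instruction {kind!r}")
    return flags


# Special case (i): I[X] executing, followed by two D[X] being verified.
print(flag_redundant([("I", "X"), ("D", "X"), ("D", "X")]))
# Special case (ii): D[X] executing, followed by two I[X] being verified.
print(flag_redundant([("D", "X"), ("I", "X"), ("I", "X")], initial_keys={"X"}))
```

In both special cases only the last of the three instructions is flagged.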
Figure 2. Evolution of the Output Restricted Deque
after successive instructions.

Figure 3. Data distribution of the priority queue
after successive inserts.

Figure 4. Configuration of the dictionary machine
before and after a deletion.

Abstract

This paper identifies equivalences between two 
systematic methodologies for the design of systolic arrays 
and illustrates the benefits of understanding these 
relationships. The methods are the parameter method of Li 
and Wah and the dependency method of Moldovan and 
Fortes. After a review of the core ideas, models and 
parameters of each method, mathematical relations between 
them are derived. The usefulness of these relationships is 
illustrated by showing how (1) optimization procedures for 
the parameter method suggest similar procedures for the 
dependency method, and (2) new systolic equations for the 
parameter method result from the knowledge of equivalent 
equations in the dependency method. Also, systolic designs 
for convolution, obtained through different methods, can be 
mathematically proven to be identical [5]. 


I - Introduction 

In this paper, two proposed methodologies for the 
systematic design of systolic arrays are comparatively 
studied. They are the data dependency method of 
Moldovan and Fortes, [1]-[3], and the parameter method of 
Li and Wah [4]. We find and expose the recondite 
relationships and equivalences between the two 
methodologies and use this information to improve them 
and verify similar designs. 

Section II provides a short description of both 
methodologies. Section III establishes equivalences between 
the mathematical expressions used to systematically design 
systolic arrays in the two methods. The equivalences of 
section III are used in section IV to propose optimization 
procedures and improvements for both methodologies. 


II - Introduction to the Parameter and Data 
Dependency Methods 


2.1 Parameter Method [4] 

This methodology considers the design of optimal pure
planar systolic arrays for a class of linear recurrences which
take the general form


$z_{i,j}^{k} = f\left[z_{i,j}^{k-\delta},\ x_{\bar{x}(i,k)},\ y_{\bar{y}(k,j)}\right], \quad \delta = \pm 1$   (2.1.1)


where f is the function to be executed by each cell of the
array and $\bar{x}(i,k)$, $\bar{y}(k,j)$ are linear indexing functions for the
two-dimensional input variables X and Y. In the following
presentation, the coefficients of i, j, k are either 1 or -1.
One-dimensional recurrences have the general form


$z_{i}^{k} = f\left[z_{i}^{k-\delta},\ x_{\bar{x}(i,k)},\ y_{\bar{y}(k)}\right], \quad \delta = \pm 1$   (2.1.2)


Three sets of parameters are used to characterize a
systolic array: velocities of data flow, data distributions, and
periods of computation. The velocity of a datum x is the
directional distance passed by that datum in one clock cycle
and is denoted by $x_d$. The distance between two PEs is
defined to be one. Thus, $x_d$ must be less than or equal to

* This work was supported in part by the National Science Foundation 
under Grant DMC-8419745 and in part by the Innovative Science and 
Technology Office of the Strategic Defense Initiative Organization and 
was administered through the Office of Naval Research under contract no. 
00014-85-k-0588. 
one because broadcasting is not allowed in pure systolic
arrays.

Data distributions are defined using row and column
displacements. For two-dimensional input and output
matrices, the elements along a row or column are arranged
in a straight line and the distance between adjacent
elements in a row or column remains constant as the data
flows through the array. To define the row displacement of
array X, suppose that the row and column indices of X are
i and j, respectively. The row displacement of X is the
directional distance between $x_{\bar{x}(i,j)}$ and $x_{\bar{x}(i+1,j)}$ and is
written as $x_{is}$. Similarly, the column displacement is the
distance between $x_{\bar{x}(i,j)}$ and $x_{\bar{x}(i,j+1)}$ and is written as $x_{js}$.

Periods of computation are described using two
functions, $\tau_c$ and $\tau_a$. $\tau_c$ is defined as the time at which a
computation is performed, whereas $\tau_a$ defines the time at
which a variable is accessed. The periods of i, j and k for
two-dimensional outputs are defined as


$t_i = \tau_c(z_{i+1,j}^{k}) - \tau_c(z_{i,j}^{k})$   (2.1.3)
$t_j = \tau_c(z_{i,j+1}^{k}) - \tau_c(z_{i,j}^{k})$   (2.1.4)
$t_k = \tau_c(z_{i,j}^{k+1}) - \tau_c(z_{i,j}^{k})$   (2.1.5)


It will be assumed that $t_k$ is positive. If this is not true
for a given recurrence, the recurrence can be rewritten to
satisfy this condition. In computing $z_{i,j}^{k}$, $x_{\bar{x}(i,k)}$ and $y_{\bar{y}(k,j)}$
are accessed and two additional periods can be included to
describe this interaction. They are


$t_{kx} = \tau_a(x_{\bar{x}(i,k+1)}) - \tau_a(x_{\bar{x}(i,k)})$   (2.1.6)
$t_{ky} = \tau_a(y_{\bar{y}(k+1,j)}) - \tau_a(y_{\bar{y}(k,j)})$   (2.1.7)


Depending on the order of access, $t_{kx}$ and $t_{ky}$ may be
negative. Since operands to be used in a computation must
arrive at a PE simultaneously, the magnitude of the periods
must equal $t_k$, i.e., it must be true that


$t_k = |t_{kx}| = |t_{ky}|$   (2.1.8)


The periods are independent of the indices i, j, and k,
and they must be greater than or equal to one to prevent
broadcasting.

These parameters (velocity, data distribution, and
periods) can be combined into a set of equations which
describe the operations of a systolic array. These equations,
for the two-dimensional case, are


$t_{kx} x_d + x_{ks} = t_{kx} y_d$   (2.1.9)
$t_{ky} y_d + y_{ks} = t_{ky} x_d$   (2.1.10)
$t_i x_d + x_{is} = t_i z_d$   (2.1.11)
$t_i z_d + z_{is} = t_i y_d$   (2.1.12)
$t_j y_d + y_{js} = t_j z_d$   (2.1.13)
$t_j z_d + z_{js} = t_j x_d$   (2.1.14)


For the one-dimensional case, the last two equations are
removed from the set of systolic equations given above.


The optimal systolic array for a given recurrence can
be found by systematically enumerating the possible
solutions using a search order that guarantees that the first
feasible solution found is, in fact, the optimal one. Consider
optimizing T, the total time needed to complete the
computation. First set the magnitudes of the directional
distance traversed by the variables equal to 1, i.e.,
$k_1 = k_2 = k_3 = 1$ [4]. Then, set the magnitudes of the
periods $t_i$, $t_j$, and $t_k$ equal to one and determine if a
feasible solution exists. If a feasible solution is found it is
the optimal solution for T because T is a linear function of
the periods that increases monotonically with increases in
the magnitude of these periods. If no feasible solution is
found with $t_i = t_j = t_k = 1$, one of the periods is increased
by one and the search for a feasible solution repeats. If no
feasible solution can be found with $k_1 = k_2 = k_3 = 1$, one
of the $k_i$, $1 \le i \le 3$, is increased by one and the search begins
again. A flowchart describing this optimality procedure is
shown in [5]. A similar procedure is used to optimize the
$AT^2$ measure [4],[5].


2.2 Data Dependency Method [1]-[3] 

The essence of the data dependency method is the
representation of the dependency structure of an algorithm
in concise, matricial form. That is, the dependencies
between computations are condensed into a matrix that
can be easily understood and manipulated.

Let $Z^n$ denote the nth cartesian power of Z, the set of
nonnegative integers. To describe an algorithm A, a five
tuple $A = (J^n, C, D, X, Y)$ is used where $J^n \subset Z^n$ is
the index set, C is the set of computations, D is the set of
dependence vectors, X is the set of input variables, and Y is
the set of output variables. The data dependencies describe
the structure of the algorithm and are given as a set of
triples (d,v,j) such that the computation indexed by j
requires the variable v, generated at index j−d, as an
operand.

Linear indexing functions [3] describe how variables are
referenced. A linear indexing function $F: J^n \to Z^m$ is
defined by an equation of the form $F(j) = C_0 + Cj$ where
$C_0 \in Z^{m \times 1}$ is called the index displacement and $C \in Z^{m \times n}$
is called the indexing matrix. For example, the variable
$a(j_1 - j_2,\ j_2 + j_3 - 1,\ j_3 - j_1)$ has a linear indexing function
for which

$C = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & 1 \\ -1 & 0 & 1 \end{bmatrix}$ and $C_0 = \begin{bmatrix} 0 \\ -1 \\ 0 \end{bmatrix}$.
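As a quick check of the example, the indexing function F(j) = C_0 + Cj can be evaluated directly. The Python sketch below is illustrative; linear_index is a hypothetical helper, and the matrices are the ones read off the example variable a(j1 − j2, j2 + j3 − 1, j3 − j1).

```python
# Indexing matrix and index displacement for a(j1 - j2, j2 + j3 - 1, j3 - j1).
C = [
    [1, -1, 0],
    [0, 1, 1],
    [-1, 0, 1],
]
C0 = [0, -1, 0]


def linear_index(C, C0, j):
    """Evaluate F(j) = C0 + C j for an index point j given as a tuple."""
    return tuple(
        c0 + sum(c * x for c, x in zip(row, j))
        for row, c0 in zip(C, C0)
    )


# For j = (2, 1, 3): (2 - 1, 1 + 3 - 1, 3 - 2)
print(linear_index(C, C0, (2, 1, 3)))
```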


A transformation matrix T can be used to describe a linear
bijection which transforms the dependency matrix and index
set of an algorithm so that it can be executed in a VLSI
array. T can be partitioned into two matrices, π and S:

$T = \begin{bmatrix} \pi \\ S \end{bmatrix}$

The π matrix defines the time transformation whereas S
defines the space transformation to be applied to the
dependence matrix and index set of an algorithm. The time
at which a computation indexed by j is executed is
determined by πj, while Sj specifies which processor is to
execute this computation. Of course, π and S must satisfy
certain conditions if they are to be considered valid
transformations. Let m be the number of columns in the
dependency matrix. Time transformations must satisfy
$\pi d_i > 0$, i=1,2,...,m, where $d_i$ is a column vector in the
dependence matrix. This constraint results from the
requirement that a variable must be generated before it is
used in a computation.
The space transformation S maps the computation
indexed by j into processor Sj. This assumes a processor
array model consisting of a grid which has the
dimensionality of the array. Each point of the grid
corresponds to a processor and the coordinates of the point
are the index of the processor. Certain restrictions must be
placed on possible solutions for S due to the limited
interconnections available in VLSI arrays. These restrictions
can be embodied in the P and K matrices. The P matrix
describes the interconnection primitives available within an
array, i.e., the vector differences between indices of
connected processors. The utilization matrix K describes
the interconnections used by the transformed algorithm
during execution. The relationship between K, P, S, and D
is


SD = PK   (2.2.1)


where the entries of K must satisfy the following constraint


$\sum_{j=1}^{r} k_{ji} \le \pi d_i, \quad i = 1,\ldots,m$   (2.2.2)

This last constraint requires that the time between the
generation and use of a variable must be greater than or
equal to the number of interconnection primitives needed by
the datum to travel from the PE in which it was generated
to the PE in which it will be used. Optimization procedures
for the dependency method are given in [2],[3].
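The three conditions on a candidate design (π d_i > 0, SD = PK, and the column-sum bound on K) are easy to test numerically. The following Python sketch is an illustration under stated assumptions: matrices are lists of rows, the entries of K are taken to be nonnegative, and the example values of D, π, S, P and K are hypothetical, chosen only to exercise the check.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def is_valid_mapping(pi, S, D, P, K):
    """Check pi*d_i > 0, S D = P K, and sum_j k_ji <= pi*d_i for every column i."""
    columns = list(zip(*D))
    delays = [sum(p * d for p, d in zip(pi, col)) for col in columns]
    if any(delay <= 0 for delay in delays):
        return False                       # a variable would be used before it exists
    if matmul(S, D) != matmul(P, K):
        return False                       # space equations violated
    hops = [sum(col) for col in zip(*K)]   # primitives used per dependence
    return all(h <= delay for h, delay in zip(hops, delays))


# Hypothetical example: three unit dependences mapped onto a 2-D array.
D = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pi = [1, 1, 1]
S = [[0, 1, 0], [0, 0, 1]]
P = [[0, 1, 0], [0, 0, 1]]                 # columns: stay, step in x, step in y
K = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(is_valid_mapping(pi, S, D, P, K))
print(is_valid_mapping([1, 1, 0], S, D, P, K))
```

With these values the first call succeeds and the second fails, because the third dependence would have zero delay.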


III - Equivalences between the Parameter and Data
Dependency Methods 


Lemmas 1-3 provide equivalences between the different 
parameters of the two methods, while Lemma 4 describes 
the form of the dependency matrices for algorithms 
considered in the data dependency method. These lemmas 
are then applied in Theorem 1 to show that the space 
equations and systolic processing equations are equivalent. 
The proofs are omitted here but can be found in [5]. 

The first lemma gives expressions for the data 
distribution and velocity vectors of the parameter method in 
terms of the transformations and indexing matrices of the 
data dependency method. 


Lemma 1

Let S, π be as defined previously in section 2.2, and let
v be any of the variables x, y, z as given for the parameter
method. Also, let $C^v$ represent the indexing matrix for
variable v. Then relationships hold, for the two-dimensional
case and for the one-dimensional case, which give the data
distribution and velocity vectors of v in terms of S, π and $C^v$.
(The relations themselves are illegible in this copy; see [5].)

The following lemma describes the relationship between
the π vector of the data dependency method and the periods
$t_i$, $t_j$, $t_k$ of the parameter method. The relationship is
remarkably simple.


Lemma 2

$\pi = [\, t_i \;\; t_j \;\; t_k \,]$


Thus, the periods of the parameter method are the
elements of the π matrix. The next lemma relates the
elements of the data dependency method’s K matrix and
the constants $k_i$ ($1 \le i \le 3$), as defined in [4]; i.e.,
$|t_k|\,|z_d| = k_1$, $|t_{ky}|\,|y_d| = k_2$, $|t_{kx}|\,|x_d| = k_3$.


Lemma 3

Let $k'_i$ be the single nonzero entry of the i’th column of K,
$1 \le i \le 3$. (The stated relation between $k'_i$ and $k_i$ is
illegible in this copy; see [5].)


The next lemma describes the form of the dependency 
matrices for the class of recurrences considered in the 
parameter method. 


Lemma 4

The dependency matrices for the class of recurrences
considered in the parameter method have the following
structure:


Two-dimensional Recurrence:

$D = \begin{bmatrix} \pm 1 & 0 & 0 \\ 0 & \pm 1 & 0 \\ 0 & 0 & \pm 1 \end{bmatrix}$


One-dimensional Recurrence:

$D = \left[\begin{array}{cc|ccc} \pm 1 & 0 & & & \\ 0 & \pm 1 & d_1 & \cdots & d_r \end{array}\right]$


where $d_1,\ldots,d_r$ are dependency vectors which are a
function of the recurrence, as is r, the total number of these
additional dependencies.

The following theorem states that the equations used in 
both methods to describe the operation of a systolic array 
are equivalent. 


Theorem 1

The constraint equations (2.1.9 - 2.1.14) of the
parameter method are equivalent to the space equations,
SD = PK, of the data dependency method.


IV - Optimization Procedures and Examples 


4.1 Optimization Procedures 

Optimization procedures for the parameter method
were discussed previously in section II. By directly
translating the parameters and constraints of this method
into the corresponding elements of the dependency method,
we can devise a similar procedure which is applicable to the
recurrences considered in [4]. However, by using a slightly
different approach, it is possible to propose a related
optimization procedure applicable to all cases for which
disp π = 1 in the dependency method. Note that disp π refers
to the number of parallel wavefronts in the array. The new
procedure differs from that proposed for the parameter method
in that it checks all possible values of K before considering
longer execution times (i.e., different π’s). The flowchart of
Figure 1 describes the new optimization procedure. In words, it
starts by finding all transformations π which minimize
execution time. This is relatively easy, since only the case
disp π = 1 is considered and execution time is therefore a
monotonic function of the entries of π. Hence, one can start
with all entries of π being zero and progressively increase
their absolute values considering all possible combinations of
signs and magnitudes (while, of course, checking for the
validity of each π). Possible π’s, which might result from
further increases in the absolute value of the entries of a
particular π for which execution time is larger than the
known minimum, need not be considered due to the
monotonicity property mentioned above. Thus, the search
space is finite, and, in fact, rather small for most cases.
Once the set of π’s is known, it is necessary to check if there
exists a solution to the equation SD=PK for at least one of
the possible values of K. If a solution is found, then the
corresponding π (as well as the design determined by π and
S) is optimal with respect to execution time. Otherwise, a
new set of π’s must be found which increase execution time
by the least amount and the process is repeated again. The
procedure always terminates, since, in the worst case, serial
execution is reached as a feasible solution.
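The search just described can be sketched in a few lines of Python. This is an illustration under assumptions that are not taken from the paper: the index set is a rectangular box with side lengths given in sizes, so that for disp π = 1 the execution time is 1 + Σ |π_i| (N_i − 1); the entries of π are searched within a fixed bound; and the test for a solution of SD = PK is supplied by the caller as the hypothetical predicate has_space_solution.

```python
import itertools


def execution_time(pi, sizes):
    """Execution time for a box-shaped index set, assuming disp(pi) = 1."""
    return 1 + sum(abs(p) * (n - 1) for p, n in zip(pi, sizes))


def optimal_pi(D, sizes, has_space_solution, bound=3):
    """Return (pi, time) for the fastest valid pi that admits a space mapping."""
    columns = list(zip(*D))
    candidates = [
        pi
        for pi in itertools.product(range(-bound, bound + 1), repeat=len(sizes))
        # validity: pi * d_i > 0 for every dependence vector d_i
        if all(sum(p * d for p, d in zip(pi, col)) > 0 for col in columns)
    ]
    # Visiting candidates in order of increasing execution time means the
    # first one with a solution of SD = PK is optimal.
    candidates.sort(key=lambda pi: execution_time(pi, sizes))
    for pi in candidates:
        if has_space_solution(pi):
            return pi, execution_time(pi, sizes)
    return None


D = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(optimal_pi(D, (4, 4, 4), lambda pi: True))
print(optimal_pi(D, (4, 4, 4), lambda pi: pi[2] >= 2))
```

Sorting the whole candidate list is a simplification of the level-by-level search in the text; it visits the same π's in the same order of execution time.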

A similar reasoning can be used to optimize measures
combining area and execution time, e.g., AT or $AT^2$. Figure
2 illustrates such a procedure. It differs from that of Figure
1 in that the search space is reduced to the set of π’s which
result in execution time bounded above by a constant factor
[4]. In this finite space, all valid values of π and S are
considered and those which optimize the combined measure
of area and time determine the optimal solution. This is
exactly the same approach used in the parameter method.
The key idea consists of limiting the search space by
choosing bounds for π and, thus, for the execution time.


4.2 New systolic equations 


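Convolution can be expressed as the following
recurrence equation

$y_i^0 = 0, \quad 1 \le i \le n$
$y_i^k = y_i^{k-1} + a_k x_{i+k-1}, \quad 1 \le i \le n,\ 1 \le k \le m, \quad x_j = 0 \text{ for } j > n$   (4.1)


Another possible description with the order of access of the
input terms reversed is

$y_i^0 = 0, \quad 1 \le i \le n$
$y_i^k = y_i^{k-1} + a_{m-k+1} x_{m-k+i}, \quad 1 \le i \le n,\ 1 \le k \le m, \quad n \ge m$   (4.2)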
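The two descriptions accumulate the same products in opposite order. The Python sketch below checks this numerically; it assumes the recurrences read y_i^k = y_i^(k-1) + a_k x_(i+k-1) and y_i^k = y_i^(k-1) + a_(m-k+1) x_(m-k+i), with 1-based indices and x_j = 0 for j > n, which is one reading of the printed equations. The function names are hypothetical.

```python
def x_at(x, j):
    """1-based access to x, with x_j = 0 outside the stored range."""
    return x[j - 1] if 1 <= j <= len(x) else 0


def convolve_forward(a, x, n):
    """Accumulate y_i^k = y_i^(k-1) + a_k * x_(i+k-1) for k = 1..m."""
    m = len(a)
    y = [0] * (n + 1)                       # y[0] is unused padding
    for k in range(1, m + 1):
        for i in range(1, n + 1):
            y[i] += a[k - 1] * x_at(x, i + k - 1)
    return y[1:]


def convolve_reversed(a, x, n):
    """Accumulate y_i^k = y_i^(k-1) + a_(m-k+1) * x_(m-k+i) for k = 1..m."""
    m = len(a)
    y = [0] * (n + 1)
    for k in range(1, m + 1):
        for i in range(1, n + 1):
            y[i] += a[m - k] * x_at(x, m - k + i)
    return y[1:]


a, x = [1, 2], [3, 4, 5]
print(convolve_forward(a, x, 3), convolve_reversed(a, x, 3))
```

Both calls give the same result, since the reversed form only changes the order in which the terms for each y_i are added.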


Theorem 1 showed that the systolic equations of the 
parameter method are equivalent to the space equations, 
SD = PK, of the data dependency method. Systolic 
equations _ for the one-dimensional case are equivalent to 
Sd, = Pk, and Sd, = Pk,, respectively. The subscripts on 
the vectors k and “d indicate which variable is associated 
with a particular vector. Thus, the Sd, = Pk, space 
equation is not contained within the systolic equations of the 
parameter method for the one-dimensional case. 

The systolic equations for the z dependency will take 
the general form 


{8a + 8, = fXa (4.3) 
fy¥a + ¥, = SyXa (4.4) 
where f, and f, are linear functions of t; and t,. To 


determine the functions fi,fy, the equivalences defined in 
ra 1 and 2 are applied to (4.3-4.4) resulting in 
= -(t; +t), fy a + t,) and these equations become 


—(t, + t)ay + a, = {t, + tty 


(t; + ty)¥a + ¥, = (ty + ty 


The above analysis was applied to the recurrence of equation 
(4.2). A similar analysis of the recurrence expressed in 
equation (4.1) results in f, = (t,—t;) and fy = -(t, — t)) 


in equations (4.3) and (4.4). This new set of systolic
equations, derived from the data dependency method
through the equivalences described in Section III, can be
added to the set of systolic equations. From this new set,
only four equations are needed to provide equivalent
solutions to those derived from the original four equations.


[1] 


Figure 1 — 
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ABSTRACT 


In this paper we propose several data structure partitioning
and transformation schemes for the efficient execution of
various matrix algorithms without any size restriction.
The following matrix operations are considered:

- Matrix-matrix multiplication

- Triangular matrix equations

- L-U decomposition

- Inverses of triangular and dense matrices

All these algorithms are to be executed on a problem-size
independent spiral systolic array processor. The array
topology is fixed, and only simple feedback and control are
needed. For all the algorithms that have been considered,
the PE utilization tends to the maximum possible value.
1. INTRODUCTION 
With the advent of VLSI, cheap processors can now be
designed to solve one subset of numerical or symbolic
applications at very high speed. An interesting class of such
systems is the systolic array processor (SAP) /1/, /2/.

In a strict sense, a SAP allows the hardware implementation
of an algorithm, which determines the functions to be
performed in every PE and the system topology. Many systolic
algorithm implementations have been proposed /2/, /3/.
Nevertheless, a clear trend exists towards the realization of
SAPs connected to a host, in order to execute several
algorithms.


This work was supported by the Ministry
of Education of Spain (CAICYT) under
Grant Number 2906-83 C03-03 and by CTNE.


0190-3918/86/0000/0676 $01 .00 © 1986 IEEE 


the number of PEs: 
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Two important problems arise when this 


kind of processor is considered: the PE operations, and
sometimes the interconnection topology, must be
programmable, and computations on data structures of
different sizes must be executed on a fixed number of PEs.
To obtain the maximum efficiency of the global system, the
partitioning algorithms must impose the lowest possible
overhead with respect to the execution time of the complete
problem, the extra control needed to modify the PEs'
functionality and their interconnections, and the access from
the boundary PEs to the data structures. To evaluate the
system efficiency, either the utilization factor of the
systolic array or the number of time units (T) needed to
execute the algorithm on the system can be used.


In this paper, some data structure partitioning and
transformation schemes are proposed for the efficient
execution of various matrix algorithms, with no size
restriction, on a single array processor of fixed dimension.
All the algorithms are to be executed on a contraflow
hexagonal SAP with w-by-w PEs (see fig. 1). The array
topology includes connections to recirculate partial results
and feed them back to the PEs (a SAP with this
interconnection topology has been named spiral SAP (SSAP)
/4/). Some of the boundary PEs must be capable of executing
several functions, such as those described in section 2.


The notation to be used in the rest of 
this paper is explained in the following 
lines: 


- Capital letters A, B, C, D, X, Y, ... denote matrices, and
  capital letters N, M, P, ... the dimensions of those
  matrices.

- Without any loss of generality, the matrix dimensions are
  assumed to be multiples of w. When this assumption does not
  hold, we extend the matrix with the minimum sufficient
  number of rows and/or columns of zero-valued elements.
  Accordingly, we define: $\bar{N} = N/w$, $\bar{M} = M/w$,
  $\bar{P} = P/w$.

- Regarding the partitioning, if A is an N-by-M matrix, we
  shall consider it as an $\bar{N}$-by-$\bar{M}$ block matrix,
  where $A_{i,j}$ is the (i,j) block with w-by-w elements.


- The following submatrices of A are defined:

  - $A^k_{i,*} = (A_{i,1}, A_{i,2}, \ldots, A_{i,k})$ for
    $1 \le k \le \bar{M}$ and $1 \le i \le \bar{N}$.
    $A^k_{i,*}$ is a block-row matrix with w-by-kw elements.

  - $A^k_{*,j} = (A_{1,j}, A_{2,j}, \ldots, A_{k,j})^T$ for
    $1 \le k \le \bar{N}$ and $1 \le j \le \bar{M}$.
    $A^k_{*,j}$ is a block-column matrix with kw-by-w elements.

  - $A^k_{*,*} = (A^k_{1,*}, A^k_{2,*}, \ldots, A^k_{k,*})^T
    = (A^k_{*,1}, A^k_{*,2}, \ldots, A^k_{*,k})$ for
    $1 \le k \le \min(\bar{N}, \bar{M})$. $A^k_{*,*}$ is a
    k-by-k block matrix.

Every $A_{i,j}$ is, in turn, decomposed into the following
matrices:

$A_{i,j} = A_{Li,j} + A_{Di,j} + A_{Ui,j} = A_{LDi,j} + A_{Ui,j} = A_{Li,j} + A_{UDi,j}$

where $A_{Li,j}$ ($A_{Ui,j}$) are strictly lower (upper)
triangular matrices, $A_{Di,j}$ is a diagonal matrix, and
$A_{LDi,j}$ ($A_{UDi,j}$) are lower (upper) triangular
matrices.


The outline of this paper is as follows: in section 2 the
concepts of partitioning and transformation of a dense
problem into an equivalent band problem are introduced; in
section 3 the problem D = A·B + C is solved, and the solution
presented there is reused in subsequent sections; in section
4 the methodology is applied to solve triangular matrix
equations; in section 5 the LU decomposition without pivoting
is solved; and in section 6 the inversion of triangular and
dense matrices is presented. Finally, in section 7 the main
conclusions of this work are summarized.


2. PARTITIONING AND DATA RESTRUCTURATION 


Several partitioning algorithms for matrix problems have
been recently proposed. These algorithms are intended to work
on SIMD array processors, pipeline vector supercomputers /5/,
modular VLSI arithmetic systems /6/, or systolic array
processors /7/, /8/.

It is imperative to attain a good match between the
resulting partitioned submatrices and the array processing
system, because maximum utilization of the SAP PEs is
desired. The data restructuration defines the sequence in
which the execution of subproblems is chained in a SAP. It
must ensure the correctness of the solution and must allow
simple control and maximum PE utilization. Overlapping the
loading of a subproblem with the unloading of the preceding
one is an adequate feature to maximize the system
utilization.

In our work a contraflow SAP is used. This SAP constitutes a
good solution to problems requiring the computation of
recurrences (e.g. the solution of a linear triangular system,
LU decomposition, ... /9/).

A triangular partitioning, and various types of data
restructurations to transform dense matrices into band
matrices, are proposed in /10/. Those transformations yield
maximum efficiency of PE utilization. Such a methodology is
applied to matrix-matrix and matrix-vector multiplications.

To achieve this goal, each N-by-M matrix A is partitioned
into $\bar{N}$-by-$\bar{M}$ block matrices ($\bar{N} = N/w$,
$\bar{M} = M/w$). Each w-by-w block matrix $A_{i,j}$ is split
into two triangular matrices $A_{LDi,j}$ and $A_{Ui,j}$ (or
$A_{Li,j}$ and $A_{UDi,j}$). The optimal restructuration is
attained by juxtaposing these triangular submatrices, by
means of different algorithms, in order to obtain a
$(\bar{N}\bar{M}+1)$-by-$\bar{N}\bar{M}$ (or
$\bar{N}\bar{M}$-by-$(\bar{N}\bar{M}+1)$) block lower (or
upper) band matrix. Maximal efficiency is achieved because
the band contains only elements from the original matrix, A,
with no empty positions.

When problems that include inner product operations are
addressed, several considerations must be taken into account
to obtain the band matrix:


a) For 1<k<NM, if A A then 


kk “LDi,j’ 


must be equal to A . for 
Up,j 


ALt1, kk : 
any p such that 1<p<Nn. 


b) For 2<k<NM, if A 


Ana 4 then 


LDi,p 


k,k-1— 


Mi must be equal to A 


any p such that 1<p<M. 


for 


c) Obviously, dependencies in the 
computational sequence needed by 


some algorithms must be 
respected. 
We shall name this type of transformation Dense to Band
Matrix Transformation by Triangular Block Partitioning (DBT).

The PRT /11/ and PCT /12/ transformations, proposed by R.W.
Priester et al., can be regarded as particular cases of the
DBT. The transformation of an N-by-N dense matrix into a band
matrix with bandwidth w=N is proposed in the mentioned
papers; they attain a 50% reduction of the number of PEs in
the SAP, with no overhead in the algorithm time under certain
conditions.


Our DBT method serves to transform all the algorithms
mentioned above, and the resulting algorithms are to be
executed on a single hexagonal SSAP with w-by-w PEs (fig. 1).
The feedback needed to recirculate all partial results
introduces a delay of w cycles, except in the case of the
main diagonal, for which the delay is 2w cycles. As the SSAP
must execute various algorithms, some PEs must be capable of
executing several operations, according to the specific
algorithm and the specific cycle. Figure 2 shows the
elementary processor, as well as the types of operations
needed, named OP(A), OP(B), etc. The PEs(i,j), $2 \le i,j \le w$,
perform only OP(A), which corresponds to the inner product
step processor /19/. The PEs(1,j), with $2 \le j \le w$, must
perform OP(B) and OP(D). The PEs(i,1), with $2 \le i \le w$,
must perform OP(A) and OP(F). Finally, the PE(1,1) performs
all types of operation.


3. MATRIX-MATRIX MULTIPLICATION 
We consider now the operation:

$D := A \cdot B + C$ (1)

to be performed on a w-by-w systolic array. When A, B and C
are M-by-N, N-by-P and M-by-P matrices, the problem is
partitioned into $\bar{M}\bar{P}$ disjoint subproblems,
according to the algorithm:

Algorithm 1:

For m := 1 to $\bar{M}$ do
  For p := 1 to $\bar{P}$ do
    $D_{m,p} := A^{\bar{N}}_{m,*} \, B^{\bar{N}}_{*,p} + C_{m,p}$ (2)


Each one of the subproblems in (2) is 
solved sequentially by the SAP. Every 
subproblem is of the type: 
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D:=A.Bte (3) 
where A, B and C are, respectively,
w-by-N, N-by-w and w-by-w matrices. We


focus our attention on the execution of 
one of these subproblems. 
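The partitioning of Algorithm 1 can be illustrated in software. The following is only a minimal NumPy sketch, not the array implementation: the function names `pad_to_multiple` and `partitioned_matmul` are hypothetical, and the zero padding follows the convention stated in the notation (dimensions extended to multiples of w with zero-valued elements).

```python
import numpy as np

def pad_to_multiple(X, w):
    """Zero-pad X so that both dimensions are multiples of w."""
    extra_rows = -X.shape[0] % w
    extra_cols = -X.shape[1] % w
    return np.pad(X, ((0, extra_rows), (0, extra_cols)))

def partitioned_matmul(A, B, C, w):
    """Compute D = A @ B + C as independent w-by-w output subproblems."""
    M, P = C.shape                      # original (unpadded) result size
    A, B, C = (pad_to_multiple(X, w) for X in (A, B, C))
    D = np.zeros_like(C, dtype=float)
    for m in range(0, C.shape[0], w):       # block rows of D
        for p in range(0, C.shape[1], w):   # block columns of D
            # One subproblem of type (3):
            # (w-by-N) times (N-by-w) plus a (w-by-w) block of C.
            D[m:m + w, p:p + w] = A[m:m + w, :] @ B[:, p:p + w] + C[m:m + w, p:p + w]
    return D[:M, :P]                    # drop the zero padding
```

Each pass of the inner loop touches a disjoint block of D, which is why the subproblems can be executed one after another on the same w-by-w array.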


By means of DBT transformations on the dense matrices A and
B, the problem (3) is mapped onto a banded problem (see
fig. 3)

$\hat{D} := \hat{A} \cdot \hat{B} + \hat{C}$ (4)

To define the transformed problem, partitioning of
$A = (A_1\ A_2 \cdots A_i \cdots A_{\bar{N}})$ and
$B = (B_1\ B_2 \cdots B_i \cdots B_{\bar{N}})^T$ into w-by-w
triangular submatrices is required, as follows:

$A_i = A_{LDi} + A_{Ui}$, $1 \le i \le \bar{N}$
$B_i = B_{UDi} + B_{Li}$, $1 \le i \le \bar{N}$

Applying one DBT to A, $\hat{A}$ is obtained. $\hat{A}$ is a
$(\bar{N}+1)$-by-$\bar{N}$ block lower two-diagonal matrix,
with w-by-w blocks $\hat{A}_{i,j}$


Fig.2. PE's ports and operations.


for $1 \le i \le \bar{N}+1$ and $1 \le j \le \bar{N}$:

$\hat{A}_{i,j} = A_{LDj}$ if $i = j$
$\hat{A}_{i,j} = A_{Uj}$ if $i = j+1$
$\hat{A}_{i,j} = 0$ if $i > j+1$ or $i < j$


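In a similar way, we define another DBT on B, yielding
$\hat{B}$, which is a $\bar{N}$-by-$(\bar{N}+1)$ block upper
two-diagonal matrix with blocks $\hat{B}_{i,j}$
for $1 \le i \le \bar{N}$ and $1 \le j \le \bar{N}+1$:

$\hat{B}_{i,j} = B_{UDi}$ if $j = i$
$\hat{B}_{i,j} = B_{Li}$ if $j = i+1$
$\hat{B}_{i,j} = 0$ if $j > i+1$ or $j < i$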


We define now $\hat{C}$ as a $(\bar{N}+1)$-by-$(\bar{N}+1)$
block tridiagonal matrix, for $1 \le i,j \le \bar{N}+1$:

$\hat{C}_{i,j} = C$ if $i = j = 1$
$\hat{C}_{i,j} = 0$ otherwise


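Then $\hat{D}$ is a $(\bar{N}+1)$-by-$(\bar{N}+1)$ block
tridiagonal matrix:

$\hat{D}_{1,1} := A_{LD1} \cdot B_{UD1} + C$
$\hat{D}_{k,k} := A_{U(k-1)} \cdot B_{L(k-1)} + A_{LDk} \cdot B_{UDk}$ for $2 \le k \le \bar{N}$
$\hat{D}_{\bar{N}+1,\bar{N}+1} := A_{U\bar{N}} \cdot B_{L\bar{N}}$
$\hat{D}_{k,k+1} := A_{LDk} \cdot B_{Lk}$ for $1 \le k \le \bar{N}$
$\hat{D}_{k+1,k} := A_{Uk} \cdot B_{UDk}$ for $1 \le k \le \bar{N}$
$\hat{D}_{i,j} := 0$ for $j < i-1$ or $j > i+1$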
Figure 3.a shows the triangular block decomposition for
problem (3). Figure 3.b shows the resulting band problem (4).
Matrix $\hat{D}$ can be efficiently obtained on a hexagonal
array with w-by-w PEs, similar to the one proposed by H.T.
Kung in /19/. This can be accomplished in the w-by-w SSAP by
programming all the PEs with OP(A). The D matrix is the
result of the original problem, and can be found from
$\hat{D}$ by means of the following computation:

$D = \sum_{k=1}^{\bar{N}+1} \hat{D}_{k,k} + \sum_{k=1}^{\bar{N}} \left( \hat{D}_{k,k+1} + \hat{D}_{k+1,k} \right)$

This computation may be performed 
inside the array by using the spiral 
systolic array feedbacks, causing no 
calculation time overhead. 
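The relation between the banded problem (4) and the dense subproblem (3) can be checked numerically. The sketch below is an illustration under assumptions, not the systolic implementation: it builds the two-diagonal block matrices with `np.tril`/`np.triu` (lower-with-diagonal and strictly-upper parts for A, upper-with-diagonal and strictly-lower parts for B), forms the banded product in software, and adds up its w-by-w blocks. The name `dbt_product` is hypothetical.

```python
import numpy as np

def dbt_product(A, B, C, w):
    """Build the banded problem for D = A @ B + C and fold it back.

    A is w-by-N, B is N-by-w, C is w-by-w, with N a multiple of w.
    Returns the banded product and the sum of its w-by-w blocks.
    """
    n = A.shape[1] // w                       # number of w-by-w blocks
    A_hat = np.zeros(((n + 1) * w, n * w))    # block lower two-diagonal
    B_hat = np.zeros((n * w, (n + 1) * w))    # block upper two-diagonal
    C_hat = np.zeros(((n + 1) * w, (n + 1) * w))
    C_hat[:w, :w] = C                         # C enters only at block (1,1)
    for k in range(n):
        rows = slice(k * w, (k + 1) * w)
        nxt = slice((k + 1) * w, (k + 2) * w)
        Ak, Bk = A[:, rows], B[rows, :]
        A_hat[rows, rows] = np.tril(Ak)       # lower part incl. diagonal
        A_hat[nxt, rows] = np.triu(Ak, 1)     # strictly upper part
        B_hat[rows, rows] = np.triu(Bk)       # upper part incl. diagonal
        B_hat[rows, nxt] = np.tril(Bk, -1)    # strictly lower part
    D_hat = A_hat @ B_hat + C_hat
    # Sum every w-by-w block of the banded result.
    D = D_hat.reshape(n + 1, w, n + 1, w).sum(axis=(0, 2))
    return D_hat, D
```

Because each block A_k B_k expands into four triangular products, and these four products are exactly the nonzero blocks of the banded result, the block sum reproduces A·B + C.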
To specify the required feedback, we define two matrices:
I (input) and O (output), which are
$(\bar{N}+1)$-by-$(\bar{N}+1)$ tridiagonal block matrices. I
contains data and partial results which must come in through
the SE ports of the South and East PEs. O contains the
partial results coming out through the NW ports of the North
and West PEs.


We have, consequently, the following 
inputs to the array: 
os C 
for2<i<Nt+l1 
i=l... i-1l_ 
I = 0 = + D 
Li,i Li-l,i (>, Lk,k © 27, Lk, k+l 
i=). sae 
I = 0., = D + D 
Ui,i Ui,i-1l ma Uk,k x Uk+l,k 
Fig.3. a) Partitioning D = A*B + C in triangular submatrices.
b) Band problem originated through a DBT transformation.
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i i-1 

I := 0 ==" D ie D 

Li,itl Li,i ho Lk, k = Lk, k+1 
i i-1 

fuged 4° OURS ae Punk * = PuKt1,k 


And the last output from the array is 
the desired result for (3): 


N+1_ N | _ 
Dit Oey cert Pune kD 


OWN re ia 
UN+1,N+1 k=l 


k=1 


Note that the elements of the output submatrices $O_L$ or
$O_U$ are required in the SE input ports w cycles later than
their appearance at the output NW ports. The elements of
$O_D$ must be delayed by 2w cycles instead.


To compute D := AB + C in the SSAP, when A and B are w-by-N
and N-by-w matrices, the elapsed time is T = 3N + 4w. When A
and B are M-by-N and N-by-P matrices, respectively, the
resulting $\bar{M}\bar{P}$ subproblems must be chained,
giving a computation elapsed time of
$T = 3MPN/w^2 + 3MP/w + w$. The value of T can be reduced by
$3(MP - w^2)/w$ time units if the DBT algorithm presented in
/10/ is used, because a total chaining of the
$\bar{M}\bar{P}$ subproblems is attained.
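As a quick consistency check of these timing expressions, the small sketch below evaluates them for integer sizes. The function names are hypothetical, and the formulas are taken as stated in the text, assuming M, N and P are multiples of w so that the integer divisions are exact.

```python
def t_single(N, w):
    """Elapsed time for one (w-by-N)(N-by-w) subproblem: T = 3N + 4w."""
    return 3 * N + 4 * w

def t_chained(M, N, P, w):
    """Elapsed time for the chained subproblems: 3MPN/w^2 + 3MP/w + w."""
    return 3 * M * P * N // w**2 + 3 * M * P // w + w

def t_fully_chained(M, N, P, w):
    """Elapsed time when the saving of 3(MP - w^2)/w time units applies."""
    return t_chained(M, N, P, w) - 3 * (M * P - w**2) // w
```

With M = P = w there is a single subproblem, and `t_chained` reduces to `t_single`. For example, with M = P = 8, N = 12 and w = 4 the chained time is 196 time units and the fully chained time is 160, which equals 3MPN/w^2 + 4w.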


4. TRIANGULAR MATRIX EQUATIONS


One of the basic problems in linear algebra is the solution
of a linear system of equations. A matrix equation

AX = B (5)

is a multiple set of linear systems, all working with the
same matrix, A, and different column-vectors. If the leading
principal submatrices of A are all nonsingular, then there
exists a unique lower-triangular matrix L with $l_{i,i} = 1$
and a unique upper-triangular matrix U so that A = LU /24/.
As a consequence, (5) is equivalent to LUX = B, leading to a
decomposition into two triangular matrix equations:

LY = B (6)
and UX = Y (7)
In this section we consider the 


solution of the lower triangular matrix 
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equation (6). The solving of equation 
(7) is briefly discussed at the end of this section, as it
resembles the solution of (6). Let L be an N-by-N lower
triangular matrix, and let Y and B be N-by-M matrices. The
problem is first partitioned into $\bar{M}$ disjoint
subproblems, according to the following algorithm:


Algorithm 2:
For m := 1 to $\bar{M}$ do
  Compute $Y^{\bar{N}}_{*,m}$ from: $B^{\bar{N}}_{*,m} - L \, Y^{\bar{N}}_{*,m} = 0$ (8)

Each subproblem (8) is solved by the array sequentially. We
now focus on the solution of one of these subproblems, e.g.
the first one (m := 1):

$B^{\bar{N}}_{*,1} - L \, Y^{\bar{N}}_{*,1} = 0$ (9)
The direct solution of (9), by forward substitution, is
again decomposed as follows:

Algorithm 3:
Compute $Y_{1,1}$ from: $B_{1,1} - L_{1,1} Y_{1,1} = 0$ (10)
for n := 2 to $\bar{N}$ do
  $B_{n,1} := B_{n,1} - L^{n-1}_{n,*} \, Y^{n-1}_{*,1}$ (11)
  Compute $Y_{n,1}$ from: $B_{n,1} - L_{n,n} Y_{n,1} = 0$ (12)

The submatrices $Y_{n,1}$, with $1 \le n \le \bar{N}$,
appearing in (10) and (12) must satisfy an equation of the
form:

B - LY = 0 (13)

where L, Y and B are w-by-w matrices.
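The structure of Algorithms 2 and 3 can be sketched in software as a block forward substitution. This is only an illustration under assumptions: the name `block_forward_substitution` is hypothetical, N and M are assumed to be multiples of w, and the w-by-w problem (13) is solved here with the general-purpose `np.linalg.solve` rather than with the triangular decomposition used on the array.

```python
import numpy as np

def block_forward_substitution(L, B, w):
    """Solve L @ Y = B for lower-triangular L using w-by-w blocks."""
    N, M = B.shape
    Y = np.zeros((N, M))
    for m in range(0, M, w):            # Algorithm 2: one block column at a time
        for n in range(0, N, w):        # Algorithm 3: forward substitution by blocks
            # Step (11): subtract the block-row by block-column product
            # of the part of Y that is already known.
            rhs = B[n:n + w, m:m + w] - L[n:n + w, :n] @ Y[:n, m:m + w]
            # Steps (10)/(12): w-by-w problem of type (13).
            Y[n:n + w, m:m + w] = np.linalg.solve(L[n:n + w, n:n + w], rhs)
    return Y
```

For the first block row (n = 0) the product in step (11) is empty and contributes zero, which corresponds to step (10).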


To find the Y matrix satisfying (13) using the SSAP, the
problem (13) must be decomposed into subproblems with
triangular matrix structures:

$B_{UD} - (L \, Y_{UD})_{UD} = 0$ (14)
$B_{L} - (L \, Y_{UD})_{L} - L \, Y_{L} = 0$ (15)


If the boundary PEs are programmed in 
a way such that a) PE(1,1) performs 
OP(C) (see fig. 2), b) PEs (1,j) 2<j<w 
perform OP(B) and c) the rest perform 
OP(A), the SSAP can solve equations of 
the type 


(16) 


where L is a w-by-w lower triangular matrix, Y is a w-by-2w
upper band matrix with bandwidth w, and B and C are w-by-2w
band matrices with bandwidth 2w-1. In order to get the
desired solution, L must be input through the W ports of the
west boundary PEs, and B through the SE ports of the south
and west PEs. The result Y is obtained at the ports of the
north PEs, and the following equations


are See HeeLess Biya CL Y1 1 pu = 0 
ene Bri,2 L Yi 02 = Q. Besides, matrix 
C is obtained at the NW_ports of the 


North and West PEs, with: C 


Z i _"DUL,1__ 

Cen = eee Si ae Pe a 
By making L = L and Pina = B, You pam be 
obtained to satisfy (14), and Yy ? 
_ aan 5) 
Crist are attained as: oe You and 
a ee = ss aes CL Yow L: Its in addition, 
Bri,2 os oe the 219 resulting 
submatrix is equal to the Y, submatrix 


L 
that satisfies (15). 


To complete the solution of problem (9) according to
algorithm 3, the subproblems (11) must be solved. These
subproblems consist of a block-row matrix
($L^{n-1}_{n,*}$) by block-column matrix ($Y^{n-1}_{*,1}$)
multiplication with accumulation of $B_{n,1}$. Each one of
these subproblems can be solved in the array just as
described in section 3, except that now the PEs(1,j),
$1 \le j \le w$, must perform OP(D) to change the sign of the
previously calculated matrix Y.

The chaining of the subproblems to be executed in the SSAP,
according to algorithm 3, is shown in fig. 4. This figure
shows the I/O block-data stream at the W, N, SE ports of the
West, North, South and East PEs, respectively, for the case
N=3. The blocks labeled as f indicate that the corresponding
data come from the feedback links.

Fig.4. I/O block-data stream to solve matrix equations
(with N = 3).


The computational time of one subproblem (8) is

$T = 3N^2/2w + 3N/2 + w$ (17)

The computational time of the $\bar{M}$ subproblems needed to
solve the matrix equation LY = B, where L is an N-by-N lower
triangular matrix and Y, B are N-by-M matrices, is

$T = 3MN^2/2w^2 + 3MN/2w + w$ (18)


Problem (7) is solved by back substitution, that is, through
reordering of matrices and solution by forward substitution
of the problem
xt U! = y! (19) 
where X' and Y' are block-row matrices 
t = ' = = 
with x i XweGdl for 1<i<N and y i 
Yy-i41 for 1<i<N; and U' is upper 
° ' 
triangular, with u i,j Foe eee ere ol 
for 1<i, j<N. 
If equation (19) is transposed, we 
arrive to the problem 
ut xrt 2 yrl (20) 
which is of type of problem (6), 
previously analyzed in this same 
section. 
Normally, problem (7) is solved after performing an LU
decomposition. In this case, as will be shown in the next
section, the final results of matrix U are obtained at the
North boundary PEs. Consequently, in the execution of (20) we
shall consider that matrix $U^t$ is available at the North
boundary PEs, and $X^t$ is obtained at the West boundary PEs.
In order to get the solution, in those cycles where the
PE(1,1) performed OP(C), OP(G) must be performed now; and in
the cycles where PEs (1,j), $2 \le j \le w$, performed OP(B),
the PEs (i,1), $2 \le i \le w$, must now perform OP(F)
instead. The values of T for problem (7) are the same as for
(6).


5. LU DECOMPOSITION 


In this section we consider the LU 
decomposition 

A= LU (21) 

for a N-by-N matrix in the w-by-w SSAP. 


The partitioning method and the 
execution sequence of the resulting 


subproblems are specified in the 
following algorithm: 
Algorithm 4:
Compute $L_{1,1}$ and $U_{1,1}$ from $A_{1,1} = L_{1,1} U_{1,1}$ (22)
For n := 2 to $\bar{N}$ do
  Compute $U^{n-1}_{*,n}$ from: $L^{n-1}_{*,*} \, U^{n-1}_{*,n} = A^{n-1}_{*,n}$ (23)
  Compute $L^{n-1}_{n,*}$ from: $L^{n-1}_{n,*} \, U^{n-1}_{*,*} = A^{n-1}_{n,*}$ (24)
  $A_{n,n} := A_{n,n} - L^{n-1}_{n,*} \, U^{n-1}_{*,n}$ (25)
  Compute $L_{n,n}$ and $U_{n,n}$ from $A_{n,n} = L_{n,n} U_{n,n}$ (26)


Subproblems (22) and (26) involve the LU decomposition of a
w-by-w matrix, $A_{n,n}$, $1 \le n \le \bar{N}$. This can be
directly accomplished in the w-by-w SSAP (without need of any
transformation), by programming the boundary PEs in such a
manner that PE(1,1) performs OP(E), the PEs (1,j),
$2 \le j \le w$, perform OP(B), and the PEs (i,1),
$2 \le i \le w$, perform OP(F).


Every subproblem (23) requires finding a block-column matrix,
$U^{n-1}_{*,n}$, that satisfies the lower triangular matrix
equation

$L^{n-1}_{*,*} \, U^{n-1}_{*,n} = A^{n-1}_{*,n}$

Every subproblem (24) requires finding a block-row matrix,
$L^{n-1}_{n,*}$, that satisfies the upper triangular matrix
equation

$L^{n-1}_{n,*} \, U^{n-1}_{*,*} = A^{n-1}_{n,*}$

Both types of subproblems, and their solution in a w-by-w
SSAP, have been presented in section 4 of this paper.

The subproblems (25) involve a block-row matrix
($L^{n-1}_{n,*}$) by block-column matrix ($U^{n-1}_{*,n}$)
multiplication, with accumulation of $A_{n,n}$. This was
considered in section 3.


Therefore, all the subproblems present 
in the partitioning algorithm 4 have 
been solved. The chained execution of 
these subproblems can be seen in fig. 5. 
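The sequence of subproblems in algorithm 4 can be mirrored in software as a bordering-style block LU decomposition. The sketch below is an assumption-laden illustration, not the SSAP data flow: `lu_nopivot` and `block_lu` are hypothetical names, N is assumed to be a multiple of w, no pivoting is performed (so every leading principal submatrix must be nonsingular), and the triangular systems are solved with the general-purpose `np.linalg.solve`.

```python
import numpy as np

def lu_nopivot(A):
    """Doolittle LU of a small square block, without pivoting."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])
    return L, np.triu(U)

def block_lu(A, w):
    """Block LU of an N-by-N matrix, one w-wide border at a time."""
    N = A.shape[0]
    L, U = np.zeros((N, N)), np.zeros((N, N))
    for n in range(0, N, w):
        d = slice(n, n + w)                 # current diagonal block
        if n > 0:
            # Block column of U: lower triangular system L U_col = A_col.
            U[:n, d] = np.linalg.solve(L[:n, :n], A[:n, d])
            # Block row of L: upper triangular system L_row U = A_row,
            # solved in transposed form.
            L[d, :n] = np.linalg.solve(U[:n, :n].T, A[d, :n].T).T
        # Update of the diagonal block by a block-row by block-column product.
        S = A[d, d] - L[d, :n] @ U[:n, d]
        # LU decomposition of the w-by-w diagonal block.
        L[d, d], U[d, d] = lu_nopivot(S)
    return L, U
```

The two triangular solves reuse the kind of problem treated in section 4, and the diagonal update reuses the multiplication with accumulation of section 3, which is the reuse the text describes.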


The computational time of the LU decomposition of an N-by-N
matrix on the w-by-w SSAP is
$T = N^3/w^2 + N^2/2w + N/6 + w$.


6. MATRIX INVERSION 


In most cases, the computation of inverse matrices can be
avoided. Nevertheless, in some statistical and engineering
applications, this calculation is mandatory.

Let us consider the case of obtaining the inverse $A^{-1}$ of
an N-by-N matrix, A, which has previously undergone an LU
decomposition. We have

$A^{-1} = U^{-1} L^{-1}$ (27)

The matrix equation L X = I must be solved to find $L^{-1}$,
where I is the N-by-N identity matrix and $L^{-1}$ is the
unknown matrix, X.

This computation can be performed in a w-by-w SSAP, taking
into account that $L^{-1}$ is also a lower triangular matrix;
in this case

$T = N^3/2w^2 + 3N^2/2w + N + w$ (28)
In a similar way, by means of the 
solution of the matrix equation YU=I we 


Fig.5. I/O block-data stream for an LU decomposition problem
(with N = 3).


obtain $Y = U^{-1}$; taking into account that $U^{-1}$ is
upper triangular, T is given by (28).

To find $A^{-1}$ from (27), the algorithm of section 3 may be
used, eliminating the multiplications by zero blocks. Then,
we have $T = N^3/w^2 + 3N^2/2w + N/2 + w$. Therefore, the
elapsed computational time to calculate the inverse of an
N-by-N matrix, A, by using the LU decomposition, inverting
the triangular matrices and multiplying, is:

$T = 3N^3/w^2 + 5N^2/w + 8N/3 + 4w$


7. CONCLUSIONS

In this paper, an efficient execution of several matrix
algorithms with variable dimensions on a single systolic
array of fixed size has been presented. The processor is an
SSAP with w-by-w PEs. Each PE is of the hexagonal type, and
some of them must be able to perform several different
operations. The array topology is fixed, and only simple
feedback and control are needed.

A partitioning method to divide the problem into subproblems
matching the array size and characteristics has been
developed. Further conditions have been defined for the
transformation algorithm that maps the original problem onto
a different problem, which has then been solved. The
algorithms that have been described in this paper are: matrix
multiplication, matrix equations, LU decomposition and
calculation of inverse matrices.

The array control is very simple and regular. Every partial
result obtained in one boundary PE that must be reused is
input through that same PE. The ordering of production and
reuse of these data follows a FIFO policy by blocks. No
storage of intermediate results outside the systolic array is
needed. The feedback network uses storage elements to
transmit the partial results in pipelined fashion.

For all the algorithms, the PE utilization tends to the
maximum possible value (1/3) and the computing time to the
minimum. Nevertheless, with an adequate implementation, a PE
utilization approaching 1 can be obtained.
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Abstract -- The need for fast solutions of a 


class of linear systems has led to the design of a
special purpose parallel processor. The SOR
(successive overrelaxation) iterative algorithm
allows a high degree of parallelism in
computations, and is well suited to certain types
of problems. An architecture is described which
executes this algorithm efficiently, and achieves
parallel operation at several levels. A small
scale prototype of this architecture has been
constructed. Simulation results for the complete
system and test results for the prototype are
outlined.
INTRODUCTION 
There are many applications for linear
systems which are large (involving thousands or
tens of thousands of variables) and sparse, with a
symmetric positive definite (SPD) matrix (since
its entries represent some physical quantity such
as resistivity or elasticity which is isotropic),
for which solutions must be computed quickly. For
example, Kim et al. [1] devised a finite element
model of the human body for use in impedance
imaging. In this technique, an iterative process
consisting of the solution to Laplace's equation
along with several other steps is repeated up to
several hundred times. The finite element method
[2] converts Laplace's equation to the type of
linear system described above. Since hundreds of
linear system solutions are required for image
reconstruction, each instance of solving Laplace's
equation must be done quickly in order to produce
images in a reasonable length of time. A parallel
processor can potentially provide the required
speed.


Gauss-Seidel iteration with successive
overrelaxation, known as the SOR method [3], is a good
algorithm for this application. In this method,
the system Ax = b is solved iteratively.
Successive iterations of each component $x_i$ are
computed from a combination of the previous value
of that component, the corresponding component of
vector b, and products of matrix entry $a_{ij}$ and
$x_j$. This algorithm allows for a high degree of
parallelism of computation. For example, the
computation for a variable can begin with the
available values of $x_i$ and $b_i$. As values of $x_j$
become available, products involving them can be
computed and added to the summation. With careful
node ordering, such computations can be in
progress for many nodes simultaneously.
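For reference, the SOR update described above can be written as a short sequential sketch. This is the standard textbook form of the method, not the parallel architecture of this paper; the function name `sor`, the relaxation factor `omega`, and the stopping rule are assumptions chosen for illustration.

```python
import numpy as np

def sor(A, b, omega=1.5, tol=1e-10, max_iter=10_000):
    """Solve A x = b by Gauss-Seidel iteration with overrelaxation."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # Products a_ij * x_j, using the newest available values of x_j.
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            gauss_seidel = (b[i] - sigma) / A[i, i]
            # Blend the previous value of x_i with the Gauss-Seidel value.
            x[i] = (1 - omega) * x[i] + omega * gauss_seidel
        if np.linalg.norm(x - x_old, ord=np.inf) < tol:
            break
    return x
```

For a sparse matrix, most of the products in `sigma` vanish, so each component depends only on a few neighbors; this is the property the parallel design relies on.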


“David Arpin is now at the Dept. of Electrical 
Engineering, US Air Force Academy, CO 80840. 


U.S. Government Work. Not protected by 
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THE ParSOR ARCHITECTURE 


We have devised an architecture which
implements the SOR algorithm efficiently. We call
this the ParSOR (for Parallel SOR solution)
architecture.


Since the SOR algorithm allows computation to
proceed for many of the variables at the same
time, separate processors for each variable in the
system can be utilized efficiently. The value $x_i$
computed in each processor is needed for the
computations in several other processors, so the
processors must be arranged to allow efficient
inter-communication.

The "one processor per variable" assumption 
requires that each processor be inexpensive, so 
that the system can be applied to problems of 
reasonable size. Therefore, the architecture is 


designed for implementation in VLSI, with each 
processor requiring a single chip. This limits the 
number of I/O pins for each processor, as well as 
the die area which can be dedicated to various 
functions. 


Since matrix A is sparse, most elements
($a_{ij}$) are zeroes, so corresponding values $x_j$ need
not be communicated to compute $x_i$. For example,
in the finite element method, $a_{ij}$ is non-zero
only if i and j are node numbers of nodes in a
common element. For typical two-dimensional
finite element models with triangular or
rectangular elements, each node is connected by
common elements to 6 - 8 others. In such problems,


each processor will almost always need to 
communicate with its nearest neighbors. In other 
cases, some longer communication paths will be 
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required, depending on how well the problem can be 
mapped onto the grid of processors. 


Thus, the communication scheme must be
capable of routing data to distant processors, but
should be optimized for communicating over short
distances. To enhance parallelism, it should be
possible to communicate with several nodes at the
same time. Finally, since data required in a
computation can arrive at the processor while
another portion of the computation is in progress,
it should be possible to perform communications
without interfering with the computation. These
requirements must be weighed against the limited
number of I/O pins and die area available in the
single chip processor.


Based on all these considerations, we
selected a simple nearest neighbor rectangular
grid for communication between PEs, as shown in
Fig. 1. Data to be sent to more distant PEs is
routed through intermediate nodes. This
arrangement provides very short data paths for
regular finite element grids, and the flexibility
to handle longer communications paths. In order
to minimize the impact of inter-processor
communication on the performance of the system,
each processor is partitioned into an Arithmetic
Unit (AU) and a Communications Unit (CU) (Fig. 2),
which can operate independently.


The Arithmetic Unit at node i receives data
consisting of values $x_j$ and associated matrix
elements $a_{ij}$ from the Communications Unit,
performs the multiplications and additions
necessary to compute the next iterated value of
$x_i$, and sends this result back to the CU. To do
these tasks, the AU needs a multiplier, an adder,
a few registers, and a small amount of logic to
keep track of when it has finished an iteration.


When an iteration is finished, the result
($x_i$) is sent to the nodes which need that value.
In order to use this value, the receiving nodes
need to know which value it is, i.e., the
subscript i, so this value is appended to the
message. Some means of routing the message to its
destination is also needed. In the SOR algorithm,
the structure of matrix A and the mapping of
variables onto processors determine where the
messages must be sent, so a number of rows (R) and
a number of columns (C) of displacement in the
network can be precomputed to specify this
routing. Thus, a complete message would consist
of $x_j$, j, R and C.


For a message arriving at node i, two
separate activities are required. The value of
the index j must be examined to determine whether
the value $x_j$ is needed for the iteration
computation at node i. Also, the values of R and
C must be examined to determine whether the
message should be sent on to other nodes, and
updated to reflect the fact that the message has
moved one step closer to its ultimate destination.
These two determinations are independent of each
other, and so can be performed in parallel. To
prevent these activities from interfering with the
iteration computations, they are performed by the
Communications Unit.
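The two determinations made for an arriving message can be sketched as follows. This is a software illustration only: the `Message` fields mirror the message contents named in the text ($x_j$, j, R, C), but the sign convention for R and C, the direction names, and the rule of exhausting the row displacement before the column displacement are assumptions made for this example, not details given in the text.

```python
from dataclasses import dataclass

@dataclass
class Message:
    value: float   # x_j
    index: int     # j
    rows: int      # R: remaining row displacement (assumed signed)
    cols: int      # C: remaining column displacement (assumed signed)

def step(msg, needed_indices):
    """Process one arriving message at a node.

    Returns (deliver, direction, forwarded): whether the local arithmetic
    unit needs the value, which neighbor (if any) receives the message
    next, and the message with R or C updated by one step.
    """
    deliver = msg.index in needed_indices        # index check
    if msg.rows != 0:                            # displacement check
        d = 1 if msg.rows > 0 else -1
        direction = "south" if d > 0 else "north"
        return deliver, direction, Message(msg.value, msg.index, msg.rows - d, msg.cols)
    if msg.cols != 0:
        d = 1 if msg.cols > 0 else -1
        direction = "east" if d > 0 else "west"
        return deliver, direction, Message(msg.value, msg.index, msg.rows, msg.cols - d)
    return deliver, None, None                   # final destination reached

def route(msg):
    """List the hops a message takes until R and C are both zero."""
    hops = []
    while msg is not None:
        _, direction, msg = step(msg, needed_indices=set())
        if direction is not None:
            hops.append(direction)
    return hops
```

The delivery decision and the forwarding decision do not depend on each other, which reflects the statement that the two determinations can be performed in parallel.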


The CU is partitioned into four separate
Communication Processors (CPs) (Fig. 2), each
dedicated to communication with one neighboring
node in the network. This allows messages to be
sent to or received from several nodes
simultaneously. Each CP can determine whether the
data is needed by the AU and/or other nodes, and
route the data on as required, with no
intervention from the AU. Figure 3 shows the
basic architecture of one CP. This processor is
divided into six autonomous functional units:


External Bus Interface - This unit provides 
the communication with another processing element. 
Control lines are used to allow the processors to 
coordinate their use of the data lines efficiently 
and prevent conflicts. In order to reduce the
number of input/output pins, the data lines are
multiplexed.

Associative Memory - This unit examines the
value of the index, j, of the incoming message to
determine whether the data is needed by the AU.
If the value of j, as received by the External Bus
Interface, is stored in this memory, it signals
that the data ($x_j$) is needed by the AU for this
node, and retrieves matrix entry $a_{ij}$.


RC Update - This unit examines the values of
R and C received in the message. If either of
these is nonzero, the message must be sent to
another node. This unit signals the fact that the
message must be sent on, updates the value of
either R or C to reflect the fact that the message
has moved to this node, and determines which of
the other 3 CPs in this node should send the
message out to the next node in its path to the
specified destination.


Internal Bus Interface - This unit provides
communication with other CPs in the same
processing element via a single shared bus. This
communication is required when the message
received by the External Bus Interface of a CP
must be sent on to another node in the network.


AU Interface - This unit receives data
consisting of an x value and a matrix entry from
the Associative Memory, and sends this data to the
Arithmetic Unit. All the CPs share a single data
path to the AU, and the AU uses this same path to
send results to the CPs. In order to send data
over this bus, the CPs must request and be granted
access. However, the AU need not be granted
access, since it is not possible for the AU and
the CPs to send data over this bus at the same
time (this is caused by the data requirements for
each step in the SOR iteration and the fact that
the A matrix is symmetric). When the AU finishes
an iteration, it sends the new value to the AU
Interface of all four CPs.


Message Generator - When a result is received 
from the AU, this unit generates new messages to 
carry that value to the other processing elements 
which need it. 
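

The division of labor between the Associative Memory and the RC Update unit can be illustrated with a small software sketch. It is only a model under stated assumptions: the names `Message` and `handle_message` are hypothetical, R and C are assumed to be signed displacements that are stepped toward zero, rows are assumed to be routed before columns, and the two checks run one after the other here although the hardware performs them in parallel.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Message:
    x: float  # value x_j computed at another node
    j: int    # index of that value
    r: int    # remaining row displacement (assumed signed)
    c: int    # remaining column displacement (assumed signed)


def handle_message(
    msg: Message, assoc_memory: Dict[int, float]
) -> Tuple[Optional[Tuple[float, float]], Optional[Message]]:
    """Model one CP handling one incoming message at node i.

    assoc_memory maps an index j to the matrix entry a_ij held at this node.
    Returns (data for the AU or None, message to forward or None).
    """
    # Associative Memory: is x_j needed here, and with which coefficient?
    au_data = None
    if msg.j in assoc_memory:
        au_data = (msg.x, assoc_memory[msg.j])

    # RC Update: a nonzero displacement means the message must travel on.
    forwarded = None
    if msg.r != 0:
        step = 1 if msg.r > 0 else -1
        forwarded = Message(msg.x, msg.j, msg.r - step, msg.c)
    elif msg.c != 0:
        step = 1 if msg.c > 0 else -1
        forwarded = Message(msg.x, msg.j, msg.r, msg.c - step)
    return au_data, forwarded
```

A message whose index is stored locally and whose displacement is nonzero is both handed to the AU and forwarded, which mirrors the "AU and/or other nodes" case described above.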


This architecture achieves parallelism at 
several levels - e.g., between computations at 
different nodes, between computation and 
communication, and between the various aspects of 
message routing and handling. Thus, it can 
implement the SOR algorithm with high efficiency. 


SIMULATION and PROTOTYPE 


A finely detailed, parameterized model of
this architecture was constructed using the SIMULA
language. We used this model to measure the
number of clock cycles needed to complete each
iteration, in order to obtain a performance
measure which is independent of the technology
used to implement the system.


The performance of the system depends in part
on the number of interconnections between
variables. More interconnections require more
computations at each node, and cause data to be
communicated over longer paths. For most of our
runs, we simulated a network of 15 nodes, arranged
as a 3 x 5 finite element grid with triangular
elements, with some additional connections between
more distant nodes added. This model provided a
reasonable degree of complexity, so the results
would not just represent the best possible
performance.


System performance also depends on
assumptions such as the number of cycles required
to send a message over the external bus, and the
time required to complete multiplications.
Assuming nominal values of such parameters, we 
determined that this system can complete each 
iteration in approximately 330 clock cycles. 
Processors in the interior of the grid achieve 
efficiencies (i.e., percent of time when the AU is 
performing computations) of over 70%. 


A hardware prototype of the processing 
element was constructed to validate these results, 
and further refine the details of the 
architecture. The complexity of the design, and 
the size of the data paths, made it impossible to 
build an entire processing node, so the prototype 
was scaled down substantially. We eliminated the 
Arithmetic Unit completely, and constructed two 
CPs. The AU and neighboring nodes in the grid are 
simulated by a microprocessor (Motorola 68000). 
Extensive testing of this prototype verified the 
operation of the communication processors and 
supported the conclusions of the simulation 
programs. 


CONCLUSIONS 


Problems in this architecture such as how to 
load data into the processors, control the 
stopping of the iteration, and retrieve the 
results have been addressed [4]. Evaluation of 
these results suggests that this architecture is 
capable of solving linear systems involving tens 
of thousands of variables in less than one second.


Abstract


A parallel algorithm to generate optimum test-and- 
treatment decision trees is presented. Constructing such trees is 
NP-hard. The algorithm is designed for a machine whose
number of connections is 3p/2, where p is the number of
processing elements (PEs), and where the PEs are simple enough
such that a machine with 2” PEs is currently implementable and 
a 2°° PE machine is feasible. A speedup of O(p /logp ) over a 
sequential dynamic programming algorithm for this task is 
achieved, by paying careful attention to the communication 
problem. This algorithm is realized on the Boolean Vector 
Machine, a cube-connected-cycle system with 27° PEs which is 
currently running in prototype form, with 512 PEs. The algo- 
rithm is concrete, in that all processor allocation and other con- 
trol issues have been solved. The particular NP-hard problem is 
of independent interest; it generalizes the binary testing problem 
by introducing treatments on an equal basis with tests. Applica- 
tions of this test-and-treatment problem are found in medical 
diagnosis, systematic biology, machine fault location, laboratory 
analysis and many other fields. 


1. Introduction 


1.1. The Test and Treatment Problem 


The test-and-treatment (TT) problem, originally defined by
D.W. Loveland, is a generalization of the binary testing problem
studied by many researchers (see [1][2][5][6][7]). This problem is of
independent interest since it has numerous real-world
applications.


The test-and-treatment problem requests the selection of a
"best" test-and-treatment procedure under a minimum expected
cost criterion. The problem specification consists of a universe
U = {0, 1, ..., k−1} of k objects, each with an associated weight
$P_i$, and a set of tests and treatments {$T_i$, 1 ≤ i ≤ N}, each
with an associated cost. The $T_i$, 1 ≤ i ≤ m, denote tests, and
the $T_i$, m < i ≤ N, denote treatments. We assume that only
one object is actually faulty, its identity is unknown, and each
object i has a priori likelihood $P_i$ of being the faulty object.
Each test and treatment is specified by a subset of the universe;
if the unknown object is in the test or treatment set then the test
responds positively, or "is successful", or the treatment is
successful. If the test is successful, the objects not in the test set
are eliminated from consideration (and if negative, the test set of
objects is eliminated), while a successful treatment ends the
procedure. A failed treatment means the processing must continue.
A successful TT procedure must provide for each object to be
treated; a TT problem specification is adequate if there exists a
successful TT procedure. With each test and treatment $T_i$ a
cost $t_i$ of executing that test or treatment is given with the
problem specification.


From the above description we see that a TT procedure is
a binary decision tree, with both test and treatment nodes. A
typical TT procedure is given in Fig. 1, where a single-line arc is
used for both test outcomes (the positive outcome to the left by
convention) and a treatment failure, and a double-line arc
denotes a treatment set. (The double-line arc is for emphasis
only since every branch of a successful TT procedure must
terminate in a treatment set.)


* Work reported herein is partially supported by the Air Force
under grant number AFOSR 81-0221 and AFOSR 83-0205. This
paper is a shortened version of a paper submitted for publication.


** As in the C language [4], %, /, &, |, ^ are the modulo, integer
division, and, or, and exclusive-or operations respectively.


0190-3918/86/0000/0688 $01.00 © 1986 IEEE


U = {0,1,2,3,4,5}, $P_i$ = 1/6 for all i,
$T_1$ = {2}, $T_2$ = {0,4}, $T_3$ = {0,1}, $T_4$ = {0,1,3},
$T_5$ = {4}, $T_6$ = {0,2}, $T_7$ = {5}, m = 3,
$t_i$ = 1 for all i.


Fig. 1.


The TT procedure tree has an expected cost defined as:

$$\mathrm{Cost}(\mathit{Tree}) = \sum_{i \in U} (\text{cost of all tests and treatments encountered if } i \text{ is the faulty object}) \cdot P_i .$$

The desired solution is the procedure which minimizes this cost. Thus

$$\mathrm{Cost} = \min_{\text{all trees}} \mathrm{Cost}(\mathit{Tree}).$$

It has been shown [3][5] that finding an optimal solution to the
binary testing problem is in general NP-hard. Since the
test-and-treatment problem generalizes the binary testing problem,
the test-and-treatment problem is also NP-hard. A parallel
algorithm for this problem is presented which is implemented on the
Boolean Vector Machine (BVM). We are able to achieve time
complexity O(ks(k + log N)) using $p = O(N 2^k)$ processors,
where k is the size of the universe of objects containing the
faulty one, s is the precision required, and N is the total
number of tests and treatments available. This result represents
a speedup of O(p / log p) with regard to the known sequential
algorithm, which could be obtained by modifying the backward
induction algorithm given by Garey [1].


1.2. The Boolean Vector Machine 


The BVM is a CCC [8] parallel machine. Logically the BVM
can be viewed as a bit array. Each row of bits forms a register.
Each column forms a PE. The registers are denoted R[0], R[1],
R[2], ... and are termed the PE's "memory". Each PE also
contains 2 non-memory registers A and B which act like 1-bit
accumulators. Let r be a positive integer and $Q = 2^r$; there are a
total of $2^{r+Q}$ PEs, as required by a complete CCC network.
The address of a PE can be represented by (i, j) with the first
component being the cycle number and the second the address
within the cycle. Within cycle i, PE (i, j) is only connected to
its predecessor (i, (j + Q − 1) % Q)** and its successor
(i, (j + 1) % Q). In addition each PE (i, j) is connected to its
lateral neighbor (i ^ $2^j$, j), thus connecting the cycles together.
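

The three connections of a PE can be computed directly from its address. The following is a minimal sketch of that addressing rule only; the function name `ccc_neighbors` and the dictionary layout are illustrative choices, not part of the BVM.

```python
from typing import Dict, Tuple


def ccc_neighbors(i: int, j: int, r: int) -> Dict[str, Tuple[int, int]]:
    """Neighbors of PE (i, j) in a complete CCC with cycle length Q = 2**r.

    i is the cycle number (0 <= i < 2**Q) and j the position in the cycle.
    """
    q = 1 << r
    if not (0 <= i < (1 << q) and 0 <= j < q):
        raise ValueError("PE address out of range")
    return {
        "predecessor": (i, (j + q - 1) % q),
        "successor": (i, (j + 1) % q),
        # The lateral link flips bit j of the cycle number.
        "lateral": (i ^ (1 << j), j),
    }
```

Because the lateral link flips a single bit, following it twice returns to the starting PE, and each PE has exactly three links, which agrees with the 3p/2 connection count quoted in the abstract.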


The BVM is a bit-oriented machine. Only Boolean function
operations are allowed. Each of its instructions involves possibly
registers A and B and at most one other register. An instruction
has the form:


{A or R[j]}, B = f, g(F, D, B) {IF or NF} <set>;


Two assignment operations are performed simultaneously
by executing this instruction. The first assigns f(F, D, B)
to either A or R[j]; the second assigns g(F, D, B) to B. f and g
are any Boolean functions of three arguments. F may be A or
R[j]. D may be A.N or R[j].N. N denotes a neighbor PE of PE
(c, p). It can be:


S: successor PE (c, (p + 1) % Q);
P: predecessor PE (c, (p + Q − 1) % Q);
L: lateral PE (c ^ $2^p$, p);
XS: even successor exchange PE (c, p ^ $2^0$);
XP: even predecessor exchange: XP = P if p is even; XP = S if
p is odd;
I: input one bit to PE (0, 0); PE ($2^Q$ − 1, Q − 1) outputs one bit
at the same time. All other PEs get bits from their
predecessors except PEs (c, 0), which get bits from PEs
(c − 1, Q − 1).

The {IF or NF} <set> denotes the activate/deactivate
set. <set> is a subset of {0, 1, ..., $2^r$ − 1}. IF <set> means all
the PEs (i, j), 0 ≤ i < $2^Q$ and j ∈ <set>, will be activated
while the remaining PEs will be deactivated. The meaning of
NF <set> is just the opposite. If the part {IF or NF} <set> is
not present in the instruction, then all the PEs are activated.
There is also a special memory register, E, which is used as an
enable/disable register. PE i will be enabled or disabled
according to whether its bit of the E register is 1 or 0. The E register
itself is always enabled. The value of a PE's memory (except for
register E) will not be changed while that PE is disabled. The
entire state of a PE will remain unchanged by an instruction
which deactivates that PE.


For further details of the BVM, the reader is referred to [9],
[10].


1.3. ASCEND/DESCEND Algorithms 
An algorithm is in the ASCEND (DESCEND) [8] form if it
consists of a sequence of basic operations on pairs of data, where
the addresses of the pairs differ successively in bit 0, bit 1, ..., bit
p−1 (bit p−1, bit p−2, ..., bit 0); here and henceforth bits are
counted from the least significant bit.


Preparata and Vuillemin showed in [8] that
ASCEND/DESCEND algorithms can be simulated on a CCC at
a slowdown of a factor of 4 to 6, regardless of the network size.
Thus designing an ASCEND/DESCEND algorithm for a
hypercube, and transforming it into a CCC algorithm, seems to be a
reasonable way of designing an efficient CCC algorithm. The
algorithm presented in this paper is in the ASCEND/DESCEND
form. The processor control is largely ignored here. We will
address this issue in the expanded version of this paper.


2. A Parallel BVM Algorithm for the Test-and-Treatment Problem


Rather than enumerate all the possible TT trees and take
the minimum cost directly as in section 1.1, we use the approach
of dynamic programming and note that the optimal tree must
apply the minimum cost action (test or treatment) to already
optimal subtrees. The optimal subtrees are obtained by
beginning with the empty tree and combining trees as just described.
Thus we start with $C(\emptyset) = 0$ and $C(S) = \infty$ for $S \ne \emptyset$. For an
arbitrary nonempty set S of objects we compute the cost C(S)
as follows:

$$C(S) = \min\Big[\, \min_{0 \le i < m} \big(t_i\, p(S) + C(S \cap T_i) + C(S - T_i)\big),\; \min_{m \le i < N} \big(t_i\, p(S) + C(S - T_i)\big) \Big],$$

where $p(S) = \sum_{j \in S} p_j$.

This definition is from first principles: the value $t_i$ is charged to
each object subject to that action and the total weight of those
objects to be charged is $p(S)$. For tests, one adds in the cost
$C(S \cap T_i)$ of the set $S \cap T_i$ to which the test responds
positively (the test set) plus the cost $C(S - T_i)$ of the set $S - T_i$ of
objects not responding to the test. Treatments terminate action
on the objects of $T_i$, $m \le i < N$ (i.e., treat them), so the only
objects needing further action are the objects in $S - T_i$; we add
in the cost $C(S - T_i)$. The essence of an argument by induction
that $C(S)$ is correctly computed uses the assumption that
$C(S \cap T_i)$ and $C(S - T_i)$ are the correct costs for the subtrees
and then we note from the above description that the correct
minimum is taken to compute $C(S)$. We see that
$C(U) = \mathrm{Cost}(\mathit{tree})$ as desired.


In actual computation we will assign an array M[S, i] to
calculate C(S): $M[S,i] = t_i\, p(S) + C(S \cap T_i) + C(S - T_i)$,
$0 \le i < m$, and $M[S,i] = t_i\, p(S) + C(S - T_i)$, $m \le i < N$;
therefore

$$C(S) = \min\{ M[S,i] \mid 0 \le i < N \}.$$

The above observation is expressed in the following
algorithm.

TT()
{
    foreach i: 0 ≤ i < N do {
        TP[S, i] = t_i * p(S), if |S| > 0,
        M[S, i] = 0 if |S| = 0, ∞ otherwise.
    }
    for (j = 1; j ≤ |U|; j++) {
        foreach (S, i): U ⊇ S and |S| = j and 0 ≤ i < N do {
            M[S, i] = M[S − T_i, i];
            M[S, i] += TP[S, i];
            if (i < m) M[S, i] += M[S ∩ T_i, i];
        }
        foreach (S, i): U ⊇ S and |S| = j and 0 ≤ i < N do
            M[S, i] = min(M[S, r] | 0 ≤ r < N);
    }
}


|S| in the algorithm denotes the size of set S.
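

For comparison, the recurrence can be written as a short sequential program. This is a sketch under stated assumptions rather than the BVM algorithm: subsets of U are encoded as bitmasks, the function name `tt_costs` and its parameters are hypothetical, and subsets are visited in increasing numeric order (every proper subset of S is numerically smaller than S) instead of by increasing cardinality.

```python
import math
from typing import List, Sequence


def tt_costs(
    k: int,
    weights: Sequence[float],
    actions: Sequence[int],
    costs: Sequence[float],
    m: int,
) -> List[float]:
    """Return C, where C[S] is the minimum expected cost for subset S.

    Subsets of U = {0..k-1} are bitmasks. actions[i] is the bitmask of T_i,
    and the first m actions are tests (0-based, as in the algorithm above).
    """
    size = 1 << k
    C = [math.inf] * size
    C[0] = 0.0
    for S in range(1, size):
        p = sum(weights[a] for a in range(k) if S >> a & 1)
        best = math.inf
        for i, T in enumerate(actions):
            # C[S] is still infinite here, so an action that leaves S
            # unchanged (or a test that does not split S) is never chosen.
            cost = costs[i] * p + C[S & ~T]
            if i < m:
                cost += C[S & T]
            best = min(best, cost)
        C[S] = best
    return C
```

With two equally likely objects, one test {0}, and treatments {0} and {1}, all of unit cost, the cheapest procedure treats one object and then the other, for an expected cost of 1 + 0.5 = 1.5; the call `tt_costs(2, [0.5, 0.5], [0b01, 0b01, 0b10], [1, 1, 1], 1)` returns that value for the full set.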


Can our parallel TT algorithm be transformed into the
ASCEND/DESCEND form? Observe that if we assign a PE to
each (S, i) pair with M[S, i] and TP[S, i] placed in different
portions of that PE's memory, the instruction
M[S, i] += TP[S, i] can be executed in parallel by all PEs at
once. (The necessary bit-serial addition algorithm is a subroutine,
executed in parallel on all PEs.) The minimization part of the
algorithm can be transformed into the following ASCEND form:


    for (t = 0; t < log N; t++)
        foreach (S, i): U ⊇ S and |S| = j and 0 ≤ i < N do
            M[S, i] = min(M[S, i], M[S, i#t]);


where i#t is the binary number obtained by complementing the
t-th bit (from the right) of i.
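

The effect of this ASCEND loop on one fixed set S can be checked with a small simulation. The sketch below assumes N is a power of two and emulates the simultaneous update of all PEs by building a new list at each step; the name `ascend_min` is illustrative.

```python
from typing import List, Sequence


def ascend_min(values: Sequence[float]) -> List[float]:
    """All-to-all minimum over a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    cur = list(values)
    t = 0
    while (1 << t) < n:
        # Position i pairs with i#t, i.e. i with bit t complemented.
        cur = [min(cur[i], cur[i ^ (1 << t)]) for i in range(n)]
        t += 1
    return cur
```

After log N steps every position holds the overall minimum, which is why M[S, i] can afterwards be read as C(S) for any i.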


Now consider the instruction:

    foreach (S, i): M[S, i] = M[S − T_i, i].


Can this operation be transformed into the
ASCEND/DESCEND form?


Let us begin by expanding it into its component operations
as follows:


    foreach (S, i) do {
        R[S, i] = M[S − T_i, i];
        M[S, i] = R[S, i];
    }

Consider now the operation


    R[S, i] = M[S − T_i, i].


For each S and i, this operation is well-defined, i.e., each
PE which receives information during this activity receives
information from only one PE. However, one PE must send
information to several PEs. In general, M[S − T_i, i] must be broadcast to
R[(S − T_i) ∪ V, i], for each V such that S ∩ T_i ⊇ V. The
following loop accomplishes the required broadcast, for all i,
0 ≤ i < N:

    R[S, i] = M[S, i];
    for (e = 0; e < k; e++)
        foreach (S, i): U ⊇ S and 0 ≤ i < N and e ∈ S ∩ T_i do
            R[S, i] = R[S − {e}, i];
    M[S, i] = R[S, i];

Let $L_t = \{ j \in U \mid j < t \}$. Using induction one can show that
just before e takes on value t, $R[(S - T_i) \cup (S \cap T_i \cap L_t), i]$
holds $M[S - T_i, i]$. After all iterations of the loop on e, $L_t = U$,
and $S \cap T_i \cap L_t = S \cap T_i$. Thus, for all S and i,
$R[S, i] = M[S - T_i, i]$.
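

The broadcast loop can be exercised for a single action T on bitmask-encoded sets. This is a sequential sketch with a hypothetical function name; updating in place is safe here because, within step e, every source set S − {e} lacks element e and is therefore not itself rewritten in that step.

```python
from typing import List, Sequence


def broadcast_remove(M: Sequence[float], T: int, k: int) -> List[float]:
    """Return R with R[S] == M[S - T] for every subset S of {0..k-1}."""
    R = list(M)
    for e in range(k):
        if not T >> e & 1:
            continue  # e can never belong to S ∩ T
        for S in range(1 << k):
            if S >> e & 1:  # e ∈ S ∩ T
                R[S] = R[S & ~(1 << e)]
    return R
```

Calling the same function with the complement of T removes the elements of S − T instead, which yields M[S ∩ T] and corresponds to the analogous loop for the test term.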

Similarly, the operation

    if (i < m) M[S, i] += M[S ∩ T_i, i]

can be transformed into:


    Q[S, i] = M[S, i];
    for (e = 0; e < k; e++)
        foreach (S, i): U ⊇ S and 0 ≤ i < N and e ∈ S − T_i do
            Q[S, i] = Q[S − {e}, i];
    if (i < m) M[S, i] += Q[S, i];

The complete algorithm now appears as:


TT()
{
    foreach i: 0 ≤ i < N do {
        if (|S| > 0) TP[S, i] = t_i * p(S);
        M[∅, i] = 0;
        if (|S| > 0) M[S, i] = ∞;
    }
    for (j = 1; j ≤ k; j++) {
        foreach (S, i): P_1(S, i) {
            Q[S, i] = R[S, i] = M[S, i];
        }
        for (e = 0; e < k; e++) {
            foreach (S, i): P_1(S, i) and e ∈ S ∩ T_i {
                R[S, i] = R[S − {e}, i];
            }
            foreach (S, i): P_1(S, i) and e ∈ S − T_i {
                Q[S, i] = Q[S − {e}, i];
            }
        }
        foreach (S, i): P(S, i, j) {
            M[S, i] = R[S, i];
            M[S, i] += TP[S, i];
            if (i < m) M[S, i] += Q[S, i];
        }
        for (t = 0; t < log N; t++)
            foreach (S, i): P(S, i, j) {
                M[S, i] = min(M[S, i], M[S, i#t]);
            }
    }
}


where P_1(S, i) = U ⊇ S and 0 ≤ i < N, and P(S, i, j) = P_1(S, i)
and |S| = j.

On the BVM each PE will stand for a pair (i, j), where i
and j are binary numbers and ij, the concatenation of i and j,
is the address of the PE. |i|, the number of bits in i, is k.
The component i denotes a subset S of U: a ∈ S iff the a-th bit of
i is 1. j is the index of a test or a treatment.
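

The address layout can be made concrete with a pair of helper functions. They are a sketch only: the names are hypothetical, and it is assumed that the subset bits form the high-order part of the address and that the action index occupies a fixed number of low-order bits (`index_bits`, i.e. log N).

```python
from typing import Iterable, Set, Tuple


def pe_address(subset: Iterable[int], action: int, index_bits: int) -> int:
    """Concatenate the subset bitmask and the action index into one address."""
    if not 0 <= action < (1 << index_bits):
        raise ValueError("action index does not fit in index_bits")
    mask = 0
    for a in subset:
        mask |= 1 << a  # object a is in S iff bit a of the mask is 1
    return (mask << index_bits) | action


def pe_decode(address: int, index_bits: int) -> Tuple[Set[int], int]:
    """Recover (subset, action index) from a PE address."""
    mask = address >> index_bits
    action = address & ((1 << index_bits) - 1)
    subset = {a for a in range(mask.bit_length()) if mask >> a & 1}
    return subset, action
```

With this layout, complementing bit t of the action index (the i#t operation used in the minimization) changes only a low-order address bit, while removing object e from S changes one high-order bit.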


The predicates e ∈ S ∩ T_i and e ∈ S − T_i can be
implemented by using the processor-ID. The processor-ID is a bit
pattern in which the memory of processor (i, j) contains ij [9].
The processor-ID bits will let each PE know the set S it
represents. T_i should be input to the BVM. The most interesting
part of the algorithm is the loop indexed by the variable e. Note
that by imposing the conditions e ∈ S ∩ T_i and e ∈ S − T_i the
result becomes R[S, i] = R[S − T_i, i] and Q[S, i] = Q[S ∩ T_i, i].


3. Conclusion 


Many NP-complete problems can be solved on the BVM
fairly efficiently, as we illustrate using the test-and-treatment
problem. Indeed, the test-and-treatment problem itself is of real
interest as it has many important applications. A parallel
algorithm for this problem is presented and implemented on the
Boolean Vector Machine. The communication problem and the
PE allocation problem have been solved so that a speedup of
O(p / log p) is achieved.


ABSTRACT


“Truth maintenance” is an important artificial
intelligence technique. Although some recent research
has increased the efficiency of the status assignment
process, it is still computationally expensive.
Fortunately, there is a significant amount of parallelism
inherent in the process. We present a distributed
computation for status assignment in a Truth Maintenance
System (TMS) where each node in the system is
implemented as an independent processor.


1 Introduction 


Doyle first presented a formal Truth Maintenance Sys- 
tem (TMS) in [2]. (This was later renamed a Reason 
Maintenance System [3], but TMS has come to mean 
a class of techniques within artificial intelligence and 
we use this generic term.) One of the major functions 
of such a system is to properly assign the status of 
IN or OUT to each node in a network. The nodes 
correspond to assertions and the statuses to belief in 
the assertions. We discuss below the structure of the 
network and the criteria for proper status assignment. 

The status assignment process in a network of sig- 
nificant size requires a large amount of computation. 
This is evident not only from the algorithm but from 
general experience with systems that include a TMS, 
such as DUCK [6] and MRS [4]. Recent work has at- 
tempted to improve both the efficiency and function- 
ality of the original algorithm [7] and [5]. However, 
the process is inherently expensive. Fortunately, sta- 
tus assignment in a typical network allows the status 
of many nodes to be determined based on only that of 
a relatively few other nodes. This allows a large degree 
of parallelism in status assignment. 

Dijkstra has proposed a distributed computation 
called “diffusive computation” in [1]. In such a sys- 
tem, a computation proceeds by passing messages in 
a finite, directed graph of processors. A processor is 
activated only when it receives a message. It makes 
a calculation and sends the result in a message to its 
successor. The end of the computation is signaled by 
a system of replies. A TMS network of nodes can be 
implemented as a finite, directed graph of processors 
which seek to determine their status. We give a diffu- 
sive computation which implements status assignment 
in such a network of processors. In the discussion that
follows, we assume that the reader is less familiar with
the TMS proposed by Doyle than with Dijkstra’s
diffusive computation.


2 TMS Status Assignment 


Each node in a TMS network is to be assigned a status 
of IN or OUT. A set of justifications is associated 
with each node. Each justification in the set consists of
an ordered pair of sets of nodes. The first element of 
the pair is called the Inlist of the justification, and the 
second is called the Outlist. A justification is valid 
exactly when each element of the Inlist is IN and each 
element of the Outlist is OUT. An assignment of sta- 
tuses to a TMS network is consistent when each node 
is assigned a status of IN iff it has at least one valid 
justification and OUT otherwise. The nodes form a 
directed graph where the union of the Inlists and Out- 
lists of the justifications for a node are its predecessors, 
and those nodes which include it in a justification are 
its successors. 
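
The definitions of a valid justification and a consistent assignment translate directly into code. The sketch below uses a hypothetical representation: each node maps to a list of (Inlist, Outlist) pairs of node names, and statuses are the strings "IN" and "OUT".

```python
from typing import Dict, List, Set, Tuple

Justification = Tuple[Set[str], Set[str]]  # (Inlist, Outlist)


def is_valid(justification: Justification, status: Dict[str, str]) -> bool:
    """Valid exactly when every Inlist node is IN and every Outlist node is OUT."""
    inlist, outlist = justification
    return all(status[n] == "IN" for n in inlist) and all(
        status[n] == "OUT" for n in outlist
    )


def is_consistent(
    justifications: Dict[str, List[Justification]], status: Dict[str, str]
) -> bool:
    """Each node must be IN iff it has at least one valid justification."""
    for node, js in justifications.items():
        expected = "IN" if any(is_valid(j, status) for j in js) else "OUT"
        if status[node] != expected:
            return False
    return True
```

For a hypothetical pair of nodes that justify each other through their Inlists, both the all-IN and the all-OUT assignments pass `is_consistent`, so consistency alone does not single out one assignment.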

It is convenient to name certain subsets relative to 
a node. Its successors are also its consequences. If 
a node is IN, the nodes in one valid justification are 
designated its supporters. If a node is OUT, its sup- 
porters will include exactly one node from each justi- 
fication: either an OUT node in the Inlist or an IN 
node in the Outlist. The affected consequences of 
a node are those consequences for which the node is a 
supporter. The transitive closure of the affected con- 
sequences constitutes the repercussions of the node. 
The believed consequences are those affected conse- 
quences which are IN. The transitive closure of these 
is the set of believed repercussions. The ances- 
tors of a node are those in the transitive closure of 
its supporters. Any of the sets mentioned so far may 
be empty. If the set of justifications is empty, then 
the node has no justification and is OUT. If the In- 
list and Outlist of a justification are both empty, the 
justification is valid and is said to be a premise. 
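
Repercussions and ancestors are both transitive closures, so one helper covers them. In this sketch the mapping `affected` (node to its affected consequences) is assumed to be given, and the function name is illustrative.

```python
from typing import Dict, Iterable, Set


def repercussions(node: str, affected: Dict[str, Iterable[str]]) -> Set[str]:
    """Transitive closure of the affected-consequence relation from node."""
    seen: Set[str] = set()
    stack = list(affected.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(affected.get(n, ()))
    return seen
```

Passing a map from each node to its supporters instead yields the ancestors of the node; a node that appears in its own closure lies on a cycle of the relation.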

We consider the problem of adding a justification 
to a node in a TMS network in which the status as- 
signments are consistent and well-founded. This last 
means that no node is in its own believed repercus- 
sions. The status assignment computation should pre- 
serve the conditions of consistency and well-founded- 
ness after the justification is added if it is possible to do 
so. As an example, consider Figure 1. In this graph 
and those which follow, each circle corresponds to a 


Figure 1: Unique Well-founded Consistent State


justification, with an arrow pointing to the justified 
node, positive arcs connected to the elements of the 
Inlist, and negative arcs to elements of the Outlist. 
The requirement of consistency is met if each node in 
this graph is assigned a status of IN. However, such a 
state of assigned statuses would not be well-founded. 
The only consistent well-founded state has each node 
labeled OUT. 

We will show next how status assignment can be 
accomplished by a simple diffusive computation that 
generally follows Doyle’s original algorithm. It is im- 
portant to note that our computation, like that of 
Doyle’s, is incomplete in some respects. A new jus- 
tification may also introduce an unsatisfiable cir- 
cularity into the system by creating a graph, such 
as that of Figure 2, for which no consistent assign- 
ment of statuses can be found. We improve upon the 
original algorithm by ensuring termination if a con- 
sistent labeling of statuses is not found. However, it 
may be possible to consistently assign statuses only 
if the supporters of some of the nodes are examined, 
and our computation only examines consequences of 
nodes. Our computation will also assign statuses if it 
is possible to do so consistently even though there is no 
assignment which is both consistent and well-founded. 
Russinoff has solved these problems in [7] with a more 
complex truth maintenance algorithm. 


3 The Diffusive Computation 


Imagine now that the nodes in a TMS network are 
processors connected to each other via two-way chan- 
nels that carry messages in one direction and replies 
in the other. Each processor stores a set of justifica- 
tions and its status. It is connected to other processors 
by channels which carry messages to its consequences 
and replies to the processors named in any of its
justifications. We will describe below programs for the
processors that implement the diffusive computation
of status assignment.

Figure 2: Unsatisfiable Dependencies


Given a “root” processor which is initially OUT and
which acquires a valid justification, we begin as Doyle
does: the status of the processor and every element
of its repercussions is set to NIL. This status
assignment is special and only occurs during updating. The
diffusive computation for this is quite simple. Each 
processor sends a message giving its status as NIL to 
each of its consequences. If a processor receives such a 
message from a processor which is not a supporter, or 


if the processor has already received a message from 
a supporter, it replies immediately. Otherwise, it sets 
its status to NIL, and sends the same message to each 
of its consequences. The originator of the first mes- 
sage is known to the processor as its “engager”. When 
each one of the consequences of the processor, if any, 
has replied, then it replies to the engager. The root 
processor recognizes the termination of the NIL sweep 
when all of its consequences have replied because it 
marks itself as its own engager. At this point, we can 
attempt an assignment of IN or OUT to the designated 
processors. 

The basic idea of the diffusive computation for sta- 
tus assignment is that upon receiving a message for the 
first time, a processor records the sender as the engager 
and checks to see if its own status is changed by the
new information. This is determined by whether or 
not there is now a valid justification for the processor 
based on the new status of the engager. If the status 
of the receiving processor is unchanged, it replies im- 
mediately. If the processor’s status is changed, then it 
sends a message containing its new status to each of its 
consequences. A processor also replies to any sender 
if it has already received another message to which it 
has not replied or if it has no consequences. It replies 
to the engager when it receives replies from all of its 
consequences. 


Figure 3: New Justification 


We illustrate the use of such a diffusive computa- 
tion with Figure 3. Suppose that initially P had no 
justification and was thus OUT. The justification be- 
ing added is shown in the dotted lines. Processors Q 
and S were IN and R was OUT. Thus P is an OUT 
processor which acquires an initially valid justification 
and so becomes the root node in a diffusive computa- 
tion to determine status assignments. 

After the NIL sweep, both P and Q will each
determine its status to be OUT because of the NIL status
of S. Q will make its determination when it receives
a message from P that it is OUT. Q then sends a
message to R, which is able to determine that its
status is IN. R’s message to S confirms the assumption
that the NIL status of S would eventually result in an
OUT status. S sends a message to Q and P. Since they
are already engaged, they take note of the new status
and reply immediately. Each node in the graph then
replies to its engager after determining that its status
is unchanged since it first received a message. In the
final state, only R is IN and the other processors are
OUT.

The above descriptions of diffusive computations 
would suffice for status assignment were it not for un- 
satisfiable circularities. Obviously, some graphs may 
not admit a consistent assignment of statuses. In such 
a case, the diffusive computation described above would 
not terminate. The simple fix for this problem pro- 
posed here is to pass a message which contains a string 
of processor names as well as a status. The string con- 
tains the ancestors of the engager. Thus a processor 
can determine if its current status depends on itself. 
If an engaged processor switches status twice as a re- 
sult of being its own ancestor, then it is involved in 
an unsatisfiable circularity. (In [2], Doyle claims that 
an unsatisfiable circularity can be detected if a node 
is its own ancestor after finding a valid justification 
with a NIL status in the Outlist. Unfortunately, this 
is not the case.) We propose that a processor signals a 
special “trouble” reply to its engager when an unsat- 
isfiable circularity is so detected. 

Detection of an unsatisfiable circularity, given a par- 
ticular state of assignments, is by itself insufficient. It 
may be that the diffusive computation has not yet ter- 
minated for some possible supporters of processors in- 
volved in the apparent circularity. We must specify 
what action is to be taken by a processor when it re- 
ceives a trouble reply. Our solution is for the processor 
to wait until all replies have been received and then re- 
send its status to its consequences if it has received the 
trouble report for the first time from a particular con- 
sequence. If a trouble reply is received twice from the 
same consequence, then the processor replies trouble 
to its engager. 

Figure 4: Apparent Circularity


For example, in Figure 4, only S is IN before P
acquires its premise justification. After the NIL sweep,
P determines that it is IN and sends its new status to
Q and T with the message < P,IN >. Suppose that
T is much slower than the other processors. While T is
very slowly determining its status, Q determines that
its own status is OUT since the status of S is NIL.
R then receives the message < QP,OUT >. R then
sends < RQP,OUT > to S which then sends
< SRQP,IN > to Q. The latter records R’s new
status and replies immediately since it is already engaged.
Eventually, Q receives its outstanding reply from R.
Now Q finds that its status has changed to IN. Also,
one of its supporters includes Q in its ancestors: S is
recorded as having the status of IN due to the message
< SRQP,IN >. Since Q is its own ancestor, it sets a
circularity flag before it resends its status. It is easy
to see that its status will change again due to its being
its own ancestor if it does not receive a message from
T before it receives its next reply from R. If T is so
slow, then, since the circularity flag has been set, Q
replies trouble to P.

Processor P does nothing until it hears from T. At 
that point, P rebroadcasts its status regardless of the 
fact that it has not changed. This time, T will be 
IN and Q will have a valid justification regardless of 
the status of S. Thus, the final determination of status 
assignments in this example will be the inverse of the 
state before P received a justification: only S will be
OUT.
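
The ancestor strings in the Figure 4 example can be traced mechanically. The following Python sketch is illustrative only: the names `Message`, `forward`, and `is_own_ancestor` are assumptions introduced here, and only the message format (a string of processor names plus a status) comes from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Message:
    path: Tuple[str, ...]  # sender first, followed by the sender's ancestors
    status: str            # "IN", "OUT", or "NIL"


def forward(msg: Message, sender: str, new_status: str) -> Message:
    """Message that `sender` passes on after being engaged by `msg`."""
    return Message((sender,) + msg.path, new_status)


def is_own_ancestor(msg: Message, receiver: str) -> bool:
    """True when the receiver's own name appears in the ancestor string."""
    return receiver in msg.path


# Message chain from the Figure 4 walk-through: P -> Q -> R -> S -> Q.
m_p = Message(("P",), "IN")
m_q = forward(m_p, "Q", "OUT")
m_r = forward(m_q, "R", "OUT")
m_s = forward(m_r, "S", "IN")

print("".join(m_s.path), m_s.status)   # SRQP IN
print(is_own_ancestor(m_s, "Q"))       # True
```

When Q receives the last message it finds its own name in the ancestor string, which is the condition under which the text has Q set its circularity flag.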

In the next section, we give the algorithm for a pro- 
cessor to carry out the diffusive computation described 
above. We omit the NIL sweep as trivial and assume 
that we start with all of the affected consequences of 
the root processor NIL and all of their consequences 
notified of their new status. We also assume that the 
root processor has already made itself its own engager 
and has initiated the message sending. 


4 Algorithm for Processor i


Each processor P_i has the following stores:


Status: IN, OUT, or NIL.


Justification Set: This is the set of justifications
for P_i and will be abbreviated J. Each justification
consists of a pair of sets of processor ids. For
a particular justification J_k, we denote the first
element of the pair as Inlist(J_k) and the second
element as Outlist(J_k). J itself, or either element
of any justification, may be empty.


Supporter-Status: This is a vector of pairs indexed
over the union of the processor ids in any set in
J. For any element j of the vector, the first
element of its pair is a set of processors called
Contributors(j). The second is Update(j): the
last known Status of j. Thus, each element of
this vector has the same format as a message.


Deficit: The obvious integer required for a diffusive 
computation. 


New-Status: IN or OUT. 


Circularity: A boolean flag indicating a possible
unsatisfiable circularity.


Ancestors: The set of processors upon which the cur- 
rent value of Status depends as determined by the 
procedure Status(i). 


Consequences: The set of processors which mention 
P; in their justifications. 


Trouble: The set of processors which have reported
an unsatisfiable circularity with a trouble reply to
P_i.


Define Processor(i):
begin
  If receive message < P_j w, T > then
  begin
    Supporter-Status(j) := < P_j w, T >
    If Engager = 0 then
    begin
      Engager := P_j
      New-Status := Det-Status(i)
      Circularity := false
      Trouble := {}
      If New-Status = Status then
        Send-Reply(i,‘N’)
      else Resend-Status(i)
    end
    else Send-Reply(i,‘N’)
  end
  If receive normal reply from P_j then
  begin
    Deficit := Deficit - 1
    If Deficit = 0 then
    begin
      New-Status := Det-Status(i)
      If Trouble = {} then            *** No unsatisfiable circularities reported
        If New-Status = Status then   *** No status change
          Send-Reply(i,‘N’)           *** Send normal reply
        else If i ∈ Ancestors then    *** Possible unsatisfiable circularity
          If Circularity then         *** Definite unsatisfiable circularity
            Send-Reply(i,‘T’)         *** Trouble Reply
          else
          begin
            Circularity := true
            Resend-Status(i)          *** Try to resolve possible problem
          end
        else Resend-Status(i)         *** P_i is not its own ancestor
      else                            *** Trouble set is not empty
        If Circularity then Send-Reply(i,‘T’)   *** Trouble Reply
        else Resend-Status(i)         *** Problem may be resolvable
    end
  end
  If receive trouble reply from P_j then
    If P_j ∉ Trouble then
      Trouble := Trouble ∪ {P_j}
    else Circularity := true          *** Trouble Reply
end.

Define Send-Reply(i,b):
begin
  if Engager = P_i then halt
  else
  begin
    Engager := 0
    If b = ‘N’ then send normal reply
    else send trouble reply
  end
end.


Define Resend-Status(i):
begin
  Status := New-Status
  Deficit := size of Consequences
  If Deficit = 0 then Send-Reply(i,‘N’)
  else send message < w, S > to each consequence
       where w := {P_i} ∪ Ancestors
end.


Define Status(i):
  If ∃ J_k ∈ J | ∀ n ∈ Inlist(J_k), Update(n) = IN
     and ¬∃ o ∈ Outlist(J_k) | Update(o) = IN, then
  begin
    Ancestors := ∪ ({j} ∪ Contributors(j)) ∀ j ∈ Inlist(J_k) ∪ Outlist(J_k)
    Return IN
  end
  else
  begin
    Ancestors := ∪ ({j} ∪ Contributors(j)) ∀ J_k ∈ J
      for some j | either j ∈ Inlist(J_k) and Update(j) ≠ IN
                   or j ∈ Outlist(J_k) and Update(j) = IN
    Return OUT
  end.
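
The status procedure above can be sketched in Python to make the two branches concrete. This is one reading of the pseudocode, not the authors' implementation: the function name `det_status`, the dictionary-based stores, and the choice of the alphabetically first blocking supporter (where the pseudocode says "for some j") are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Justification = Tuple[FrozenSet[str], FrozenSet[str]]  # (Inlist, Outlist)


def det_status(justifications: List[Justification],
               update: Dict[str, str],
               contributors: Dict[str, Set[str]]) -> Tuple[str, Set[str]]:
    """Return (status, ancestors) for one processor."""
    for inlist, outlist in justifications:
        valid = (all(update.get(n) == "IN" for n in inlist)
                 and not any(update.get(o) == "IN" for o in outlist))
        if valid:
            # IN: every supporter named in the valid justification is an ancestor.
            ancestors: Set[str] = set()
            for j in inlist | outlist:
                ancestors |= {j} | contributors.get(j, set())
            return "IN", ancestors

    # OUT: take one blocking supporter from each (invalid) justification.
    ancestors = set()
    for inlist, outlist in justifications:
        blockers = sorted([j for j in inlist if update.get(j) != "IN"] +
                          [j for j in outlist if update.get(j) == "IN"])
        j = blockers[0]  # non-empty, because this justification is invalid
        ancestors |= {j} | contributors.get(j, set())
    return "OUT", ancestors


just = [(frozenset({"S"}), frozenset())]   # hypothetical: IN only if S is IN
contrib = {"S": {"R"}}
print(det_status(just, {"S": "NIL"}, contrib)[0])   # OUT
print(det_status(just, {"S": "IN"}, contrib)[0])    # IN
```

A premise (a justification with both lists empty) is trivially valid, so the sketch returns IN with no ancestors, and an empty justification set yields OUT.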


5 Conclusion 


The amount of computation required and parallelism 
inherent in the truth maintenance process makes it a 
good candidate for a distributed computation. Dijk- 
stra’s diffusive computation is a good fit to at least 
Doyle’s original algorithm for truth maintenance. In 
fact, if, like Doyle, we do not provide for termination 
given a graph which cannot be labeled consistently, 
the diffusive computations required are very simple. 

Providing for such termination has the effect of com- 
plicating the final computation and significantly in- 
creasing the length of the necessary messages, as well 
as their number in some cases. And our algorithm 
does not properly identify just when it is possible to 
give a consistent and well-founded labeling of statuses 
to a graph. 

However, the diffusive computation presented here 
is still relatively straight-forward and accomplishes as 
much or more than all but the most recent previous
algorithms for this task. Furthermore, the remaining
problems seem likely to be resolved by extensions of 
the current algorithm. For example, when the root 
processor halts with a trouble reply, it seems likely 
that a diffusive computation could be devised to ex- 
amine its supporters as does Russinoff’s algorithm. We 
conclude that diffusive computations are a good model 
for distributed implementations of truth maintenance. 
ABSTRACT


Several Lisp application kernels were modified for parallel exe- 
cution. Their simulated behavior showed speed-ups ranging from 
one to 850 times faster, suggesting that the technique is of no value 
to a few programs, of great value to a few programs, and of 
moderate value to most programs. The modifications were minimal
and could be done mechanically.


OVERVIEW 


To assess how much faster Lisp programs would 
run on large multiprocessor systems, we modified 
several application kernels for parallel execution and 
simulated their behavior. We obtained speed-ups 
ranging from one to 850 times faster, suggesting that 
the technique is of no value to a few programs, of 
great value to a few programs, and of moderate value 
to most programs. The modifications were minimal 
and could be done mechanically. 


An approximation of this technique is used to 
modify several application kernels by hand to run in 
parallel. The effective parallelism obtained is meas- 
ured by means of simulation. These results are com- 
bined with results previously published [6] to identify 
areas of research needed for parallel execution of Lisp 
programs with minimal programmer intervention. 


REQUIREMENTS FOR 
“SAFE” PARALLEL EXECUTION 


Unlike many languages, Lisp contains a viable 
“pure function” subset. Moreover, applications writ- 
ten in Lisp are typically composed of many functions 
which are “pure.” Such functions can be rendered 
into parallel routines without fear of destroying the 
meaning of the program. 


Unlike the functional languages, however, Lisp 
also supports applications which are not pure, which 
make use of “blackboards,” global data structures, 
and side-effects in order to be useful. These side-
effects must be carefully handled to ensure correct
parallel program execution.


*On Assignment to MCC from NCR Corporation 
The Detection of Side-effects


Much of the effort in converting a program in 
Lisp (or any other language) to run in parallel comes 
in determining which parts may execute simultane- 
ously without interference. We propose to study the 
restricted case of executing side-effect-free functions in 
parallel. Given a “black box” model of computation, 
a side-effect-free function may be viewed as a compu- 
tational entity which accepts certain values as input 
parameters, and returns one or more values through 
the well-isolated interface of the function call. As 
such, the black box operates independently of its 
implementation -- it is behavior not structure that is 
important. Whether the function is implemented as 
software, firmware or hardware, nothing in the calling 
environment is altered, save the values returned 
through the function call interface. 


A side-effect is therefore a change to the calling 
environment not associated with the function call 
interface. Note that this definition is usually
weakened somewhat to exclude changes in internal
addresses, memory allocation, timers, etc., which are
beyond the access of the program. 


There are two types of side-effect phenomena in 
Lisp: (1) functions which by their nature cause side- 
effects, and (2) functions which make use of global or 
free variables. Functions which use free variables
can have their computation affected by any other
part of the computation that uses those variables.
Side-effecting functions include
rplaca, rplacd, set, and the Input/Output functions. 
Moreover, any function which calls a side-effecting 
function may be considered side-effecting. In addi- 
tion, any function which touches the property list of 
symbols (such as get, putprop) implicitly references 
globally available data. 


Three answers may occur when performing side- 
effect analysis on a function: (1) no side-effects exist, 
(2) side-effects exist, and (3) don’t know. We assume 
that only functions in the first category may be exe- 
cuted in parallel. We recognize that some algorithms 
exist which correctly use overlapping execution with 
side-effects, but we will not consider such cases here. 


Static syntactic analysis should approach 100% 
accuracy in identifying the presence of global vari- 
ables and side-effecting functions if they exist, but will 
be less accurate in establishing the absence of global 
variables or the lack of side-effecting functions. 


In this determination, however, “don’t know” 
may be the most common answer. Certain constructs 
are so suspicious ((setq z (format ...)) or (eval ...) ) 
that we will flag them as side-effecting without further 
analysis. Others, such as setq, will require further
consideration. If a global variable is used as the assignee,
then setq will cause side-effects. The use of a local 
variable as the assignee of a setq encapsulates the 
side-effect to the scope of the local variable. Hence 
the function itself acts as a black box, and can be exe- 
cuted in an applicative and parallel fashion. 
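
The three-way outcome of side-effect analysis can be illustrated with a small Python sketch over s-expressions written as nested lists. It is a deliberately simplified assumption-laden example: the tables `SIDE_EFFECTING` and `KNOWN_PURE`, the function `classify`, and the decision to ignore free-variable references in atoms are all introduced here for illustration and are not the analyzer the authors have in mind.

```python
# Heads flagged without further analysis (named in the text as side-effecting
# or as touching property lists / being "suspicious").
SIDE_EFFECTING = {"rplaca", "rplacd", "set", "get", "putprop", "format", "eval"}
# Hypothetical table of primitives assumed to be pure.
KNOWN_PURE = {"car", "cdr", "cons", "null", "add1", "plus", "sin", "cos"}


def classify(form, local_vars):
    """Return 'none', 'exists', or 'unknown' for a nested-list s-expression."""
    if not isinstance(form, list) or not form:
        return "none"  # atoms are not analysed in this sketch
    head, args = form[0], form[1:]
    if head in SIDE_EFFECTING:
        return "exists"
    if head == "setq":
        if args[0] not in local_vars:
            return "exists"      # assignment to a global variable
        verdict, args = "none", args[1:]   # local assignment stays encapsulated
    elif head in KNOWN_PURE:
        verdict = "none"
    else:
        verdict = "unknown"      # call to a function we know nothing about
    for arg in args:
        sub = classify(arg, local_vars)
        if sub == "exists":
            return "exists"
        if sub == "unknown":
            verdict = "unknown"
    return verdict


print(classify(["setq", "x", ["add1", "y"]], {"x", "y"}))  # none
print(classify(["setq", "g", ["add1", "y"]], {"y"}))       # exists
print(classify(["foo", ["car", "l"]], {"l"}))              # unknown
```

Only forms classified as `none` would be candidates for parallel execution under the assumption stated in the text.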


Obtaining Parallelism 


After functions have been identified as safe candi- 
dates for parallel execution, some mechanism must be 
supplied for creating a separate task for their execu- 
tion and observing the degree of parallelism obtained. 
We selected the future construct from Multilisp [5] as 
our primary construct for specifying parallel execu- 
tion. The future is a form which spawns a new task 
for the evaluation of a given Lisp form. For example, 


(setq x (future (add1 y)))
(setq z x) 


spawns a task to evaluate (add1 y), assigns to x a
pointer to the future of (add1 y), and continues
execution in line. When (add1 y) completes, the value of x
becomes a pointer to the value of (add1 y). If the
value of x is needed before the future is completed,
that task will suspend itself. However, if only x as a
whole is needed (as in the subsequent assignment), the
task is NOT blocked; instead z is set to point to the
future also.


Note that the word future is used in two senses: 
as the name of the function which spawns the tasks, 
and as the (uncompleted) value of the task spawning. 
When it is necessary to distinguish between the two,
we will refer to the value of the task spawning as the 
*-future-* data-structure. In the example above, if x 
were inspected before the value was available, then
something like the following might be seen: 


*-FUTURE-*: (future (add1 y)) 


Two other forms are used in our study: touch and 
pmapcar. The function touch will return the value of 
its argument, blocking if necessary on an as-yet 
undetermined future. The function pmapcar is also 
built from futures. It is the analog of mapcar: 


(pmapcar fn '(1 2 3 4 5))


will spawn tasks to evaluate (fn 1), (fn 2), ..., (fn 5) 
simultaneously, and return the list of values (or their 
futures). 
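
For readers more familiar with thread pools than with Multilisp, the three forms can be approximated in Python with the standard `concurrent.futures` module. This is only an analogy under stated assumptions: the helper names `future`, `touch`, and `pmapcar` mirror the paper's forms, but the pool size and the use of operating-system threads are choices made for this sketch.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=8)  # arbitrary pool size for the sketch


def future(fn, *args):
    """Spawn a task for fn(*args) and return a placeholder immediately."""
    return _pool.submit(fn, *args)


def touch(value):
    """Return the value, blocking only if it is a still-pending future."""
    return value.result() if isinstance(value, Future) else value


def pmapcar(fn, items):
    """Analog of mapcar that returns a list of futures, one per element."""
    return [future(fn, item) for item in items]


x = future(lambda y: y + 1, 41)   # like (setq x (future (add1 y)))
z = x                             # copying the placeholder does not block
print(touch(z))                   # 42

squares = [touch(f) for f in pmapcar(lambda n: n * n, [1, 2, 3, 4, 5])]
print(squares)                    # [1, 4, 9, 16, 25]
```

Unlike the simulator described in the paper, an ordinary value passed to `touch` here is simply returned unchanged, and touching is explicit rather than inserted by a preprocessor.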


It is important to note that a future does not 
automatically allow parallel execution. Consider the 
following:


(plus 10 (future (sin x))) 


No useful parallelism is obtained. While (future (sin 
x)) does create a parallel task, the main line task tries 
to execute the plus instruction, and immediately 
blocks until the computation of (sin x) is completed. 
If future has any overhead, it would take less time to 
execute: 


(plus 10 (sin x)) 


Therefore, a future should only be inserted when both 
paths of computation can be expected to have more 
computation to perform than the overhead of invok- 
ing a future before the invoker requires the value of 
the future. 


The Insertion Process 


Because the future is a data structuring concept, 
and because it behaves like a function call, it is not 
only easy for the programmer to use, but it is also 
easy for a pre-processor to insert futures in a program 
to be parallelized. Since we do not currently have a 
program which detects side-effects or estimates com- 
putation requirements of code fragments, we inserted 
futures in the application kernels where we judged 
such programs would make those insertions. 


As an example, consider the following: 


(defun rewrite-args (lst)
  (cond ((null lst) nil)
        (t (cons (FUTURE (rewrite (car lst)))
                 (rewrite-args (cdr lst))))))


It is known that rewrite is side-effect-free; hence, 
the future can be inserted on the first argument of 
cons. 


However, guaranteeing freedom from side effects 
is only a sine qua non for insertion of futures. The 
candidate function must also satisfy two other cri- 
teria: 


1. The work being done by the spawned task must
be sufficient in the light of the process scheduling
overheads.

2. The work remaining in the spawning task until
the future value is needed must also exceed the
scheduling overhead.


For example, consider the following function: 


(defun test (x) (+ (sin x) (cos x))) 


One may insert futures indiscriminately as follows:


(defun test1 (x)(+ (future (sin x)) 
(future (cos x)))) 


Such insertion on the second argument of the addition
is pointless -- the mainline task must block while it
waits for the completion of computation of both
arguments.


A better insertion is 


(defun test2 (x)(+ (future (sin x))
                   (cos x)))


Now, after spawning the (sin x), the mainline will 
compute the (cos x) coincidentally with the computa- 
tion of (sin x). The parallelism is now useful provided 
that (sin x) and (cos x) take longer to compute than a 
task takes to be spawned. 
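
The two insertion criteria reduce to a pair of comparisons against the spawn overhead. The Python sketch below states them directly; the function name `worth_spawning` and all cost figures are hypothetical, chosen only to contrast the second future in test1 with the future in test2.

```python
def worth_spawning(spawned_cost, remaining_cost, overhead):
    """Both criteria: the spawned work and the work left in the spawning
    task before the value is needed must each exceed the spawn overhead."""
    return spawned_cost > overhead and remaining_cost > overhead


OVERHEAD = 10   # hypothetical cost units
SIN_COST = 50
COS_COST = 50

# test1, future around (cos x): the mainline has nothing left to do
# before the addition needs the value.
second_future_in_test1 = worth_spawning(COS_COST, 0, OVERHEAD)

# test2, future around (sin x): the mainline computes (cos x) meanwhile.
first_future_in_test2 = worth_spawning(SIN_COST, COS_COST, OVERHEAD)

print(second_future_in_test1, first_future_in_test2)   # False True
```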


We note that Sharon Gray at MIT has developed 
a program for inserting futures in code that is already 
known to be side-effect-free [4]. 


MEASUREMENT TOOLS 


Several tools were developed to allow measure- 
ment of the parallel execution of Lisp programs. 
Since a multiprocessor was not readily available to the 
authors, a simulator was developed to approximate 
the behavior of parallel Lisp programs on a multipro- 
cessor. It generates a trace of events related to task 
scheduling and timing, such as task start, task end, 
and task blocking, using a value generated by another 
task. These trace events are used by a post- 
processing program to determine a running average 
degree of parallel execution and the overall average 
number of processors in use. 
The “Pure-Lisp” Simulator


The “pure-Lisp” simulator extends Common Lisp 
by implementing two interrelated primitives, future 
and touch to support parallel execution. As men- 
tioned before, the future primitive returns a *-future-* 
data-structure. However, Common Lisp primitives do 
not support the *-future-* data type. This support is
provided by means of a preprocessor for Common Lisp 
source code. The arguments to each system primitive 
such as car which requires access to their contents are 
enclosed in a touch primitive. The touch primitive 
treats normal Lisp values like the identity function.
When its argument is a *-future-* data-structure, it 
simulates the appropriate delay, if necessary, and 
returns the value computed by the associated *- 
future-*. Thus, no changes are needed to the underly- 
ing Lisp system primitives. 

During program execution, the various tasks are 
executed in a depth-first order, with the most recently 
created task being executed until it completes or 
starts another task. When it completes, then the task 
that started it is resumed. A simulated time is main- 
tained throughout this execution. Whenever a future 
primitive is encountered, the current simulated time is 
recorded as the start-time for that future. Then the 
argument of that future is evaluated as a normal Lisp 
expression. The new simulated time is recorded as the 
end-time for that future. Then the simulated time is 
reset to the start-time of the future. Thus, the next 
instruction in the task that executed the future 
appears to occur during the time that the future’s 
argument is being evaluated. When a task executes a 
touch of a *-future-* data structure, the simulated 
time is set to the greater of the current simulated 
time and the simulated completion time for that *- 
future-*. This adjustment represents waiting for a 
future to complete if necessary. 
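
The clock manipulation described above is compact enough to sketch in Python. The class and method names (`Simulator`, `SimFuture`, `work`) and the cost figures are assumptions for illustration; only the start-time, end-time, reset, and max-on-touch rules come from the text.

```python
class SimFuture:
    def __init__(self, value, end_time):
        self.value = value
        self.end_time = end_time   # simulated completion time


class Simulator:
    def __init__(self):
        self.now = 0.0             # simulated time

    def work(self, cost):
        """Charge `cost` units of simulated computation to the current task."""
        self.now += cost

    def future(self, thunk):
        start = self.now           # record the start-time
        value = thunk()            # depth-first: run the new task to completion
        end = self.now             # record the end-time
        self.now = start           # spawning task resumes at the start-time
        return SimFuture(value, end)

    def touch(self, v):
        if isinstance(v, SimFuture):
            self.now = max(self.now, v.end_time)   # wait if not yet complete
            return v.value
        return v                   # ordinary values pass through unchanged


sim = Simulator()


def sin_task():
    sim.work(5.0)                  # hypothetical cost of (sin x)
    return "sin-value"


f = sim.future(sin_task)           # clock is back at 0.0 after the spawn
sim.work(3.0)                      # hypothetical cost of (cos x)
sim.touch(f)                       # clock jumps to 5.0
sim.work(1.0)                      # the addition itself
print(sim.now)                     # 6.0, versus 9.0 if run sequentially
```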


The “pure-Lisp” simulator described above has 
the advantage of relatively straightforward
implementation. Further, so long as the arguments to the
future primitive have no side-effects, the results com-
puted will be correct and the estimated degree of 
parallelism and simulated timing will also be correct. 
However, if the arguments to the future primitive 
have side-effects, the simulator can be arbitrarily 
optimistic or pessimistic. 


Suppose that a single computation is needed in
several places in a parallel program. Further, suppose
whatever portion of the parallel program needs that
computation first sets a flag and computes the value
while other portions of the program just use the result
of the computation when they find the flag set. The
future construct supports this kind of parallel
programming, but the “pure-Lisp” simulator does not. If
the shared computation first occurs inside of a future, 
then it could end up being used by another portion of 
the program at a simulated time earlier than it was 
computed. Or, if it is the value of a future which also 
includes a longer computation, another task could be 
delayed longer than necessary. Therefore, the “pure- 
Lisp” simulator used in these experiments is only 
appropriate for studying futures with no side-effects. 
Since the current study is only concerned with mutu- 
ally independent sub-tasks, this limitation is not 
harmful. 


The Trace and Post-Processing Facility 


The following events are traced: task start, task 
end, task block waiting on a future, and task unblock 
when a future is ready. Each trace event records the 
simulated event time and the event type. The first 
two events surround the future primitive, and the 
second two events surround the touch primitive. The 
act of recording the event data is hidden from the 
simulated timer. This series of events can readily be 
transformed into a time-weighted average of the 
number of active tasks and the number of blocked 
tasks. A small program computes the average number 
of active tasks and the maximum number of tasks 
that a parallel program uses. Also computed are the 
total computation for all tasks and the linear time 
required to execute the parallel program. These times 
represent the results that would be obtained if the 
program were executed on an ideal parallel processor
that has as many processors as were needed, with no 
interference between the processors. 
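
A post-processing pass of the kind described can be sketched in a few lines of Python. The event encoding (a list of `(time, kind)` pairs) and the function name `summarize` are assumptions; the sketch treats a task as active from start to end except while blocked.

```python
# Change in the number of active tasks caused by each trace event type.
DELTA = {"start": 1, "end": -1, "block": -1, "unblock": 1}


def summarize(events):
    """events: list of (simulated_time, kind) pairs; returns summary figures."""
    events = sorted(events, key=lambda e: e[0])   # stable: ties keep trace order
    active = peak = 0
    area = 0.0                  # integral of active tasks over simulated time
    prev = events[0][0]
    for t, kind in events:
        area += active * (t - prev)
        prev = t
        active += DELTA[kind]
        peak = max(peak, active)
    elapsed = events[-1][0] - events[0][0]
    return {"total_computation": area,
            "elapsed": elapsed,
            "average_active": area / elapsed,
            "peak_active": peak}


# Hypothetical trace: a main task spawns one child, blocks on it, then finishes.
trace = [(0.0, "start"), (0.0, "start"), (3.0, "block"),
         (5.0, "end"), (5.0, "unblock"), (6.0, "end")]
result = summarize(trace)
print(result["average_active"], result["peak_active"])   # 1.5 2
```

In this trace the total computation is 9.0 units executed in 6.0 units of simulated time, so the average number of active tasks equals the speedup of 1.5.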


SAMPLE APPLICATION KERNELS 


This report examines four sample programs to 
assess the potentials of obtaining automatic parallel 
execution. Two of these, Mandelbrot and TAK are 
small programs, intended to exercise the primitives 
essential to parallel execution. The other two, EMY 
and REWRITE are kernels of major AI programs, 
intended to represent the computation of those pro- 
grams. 
Mandelbrot


The Mandelbrot program determines, for each of 
a set of points on a square grid in the complex plane, 
whether the point is in the Mandelbrot set [2]. Thus, 
Mandelbrot can be considered a representative kernel 
of a class of numerical algorithms. The determination 
for a single point is iterative, but independent of all 
other points. In the sequential version of the
program, the determination for each point is done inside of
a double loop. The outer loop controls which row in
the grid is being evaluated, and the inner loop
controls which column within the row is being
evaluated. Given an assumption that at least five
instructions should be performed by a parallel task to 
justify its creation overhead, the smallest computation 
that is suitable for execution as a separate task is the 
determination of whether a single point is or is not in 
the Mandelbrot set. Since each such determination is 
independent of the others, this unit of computation 
could be surrounded by a future. 


To obtain experimental results, a grid of 50 by 50 
points was selected, with an iteration limit of 100. 
Without futures, the program takes 27.8 seconds. Fig- 
ure 4-1 shows the parallel execution profile with a sin- 
gle future. The y-axis is simulated time, and the x- 
axis is average number of executing tasks during a 
time window. With a total run length of 1.17 
seconds, the speed-up is approximately 24. 
Figure 4-1: Mandelbrot - One Future Case.


Since 2500 tasks were created and executed, this
speed-up is disappointing. Careful consideration of
program behavior shows that the problem is the rate
at which tasks are created. On average, a delay of
360 microseconds occurs between each task startup,
mostly due to loop overhead. Since each task takes
an average of 11,000 microseconds, only 30 tasks can
be started before some tasks complete. Variation in
the length of each task causes the variation in the
number of active tasks.


As the program is operating on a square grid, the
iteration can easily be set up as a loop computing each
point on a row, within a loop computing each row.
Since there will now be a task for each row, 50 tasks
will be starting tasks in parallel, and the average
number of running tasks should rise to 1500. Figure
4-2 shows the parallel execution profile of the Mandel-
brot program with two futures.
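
The figures quoted for the one-future and two-future cases follow from simple rate arithmetic, worked through below in Python. The variable names are introduced here, and reading the 1500 estimate as 50 row tasks times roughly 30 point tasks each is an interpretation of the text rather than a statement made in it.

```python
TASK_US = 11_000          # average task length quoted in the text
STARTUP_GAP_US = 360      # average delay between task startups

# Tasks that can be started before the first ones begin to complete.
tasks_in_flight = TASK_US / STARTUP_GAP_US        # about 30.6

# One-future case: sequential time over parallel run length.
one_future_speedup = 27.8 / 1.17                  # about 23.8

# Two-future case: each of the 50 row tasks spawns point tasks at the
# same rate, so the expected number running is about 50 x 30.
ROWS = 50
expected_running = ROWS * int(tasks_in_flight)    # 1500

print(round(tasks_in_flight, 1), round(one_future_speedup, 1), expected_running)
```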
Figure 4-2: Mandelbrot - 50x50 Two Futures Case.


While 1500 tasks do execute simultaneously, the
average speedup is approximately 850. The remaining
sequential startup time for the 50 rows and 50 points
in each column is responsible for average speedup
being lower than the peak number of tasks.


A fairly simple program for obtaining automatic
parallelism would find the two locations for inserting
futures that were used here. A more advanced
program might even use three or four levels of futures to
speed the startup further. As very substantial
amounts of speedup were obtained with minimal
overhead, we conclude that programs similar to the Man-
delbrot program can be automatically converted to
effective parallel execution.


TAK 


TAK is a six line program used in the Gabriel
benchmarks [3] intended to test the efficiency of Lisp
implementations in calling recursive functions. For
the purposes of this report, TAK provides opportunity
for rapidly increasing parallelism provided that
efficient primitives for fine grain task computation are
available. Each recursive call to TAK spawns four
more calls to TAK, three of which can execute in
parallel. Thus, the potential parallelism of TAK
grows exponentially with the call depth. Since the
benchmark requires a call depth of 18 to complete,
massive parallelism is available. However, a typical
recursive call has fewer than ten instructions not
counting those required by the deeper levels of
recursion. If the overhead of creating a separate task is
great, then the effective speedup will be poor.
Therefore, TAK requires an efficient implementation of the
parallelism primitives. TAK requires 3.05 seconds of
sequential time to execute without futures. Figure 4-3
shows the results of futures on each of the three argu-
ments to the recursive call to TAK.


The total processing time increases to 5.95 
seconds in the parallel execution of TAK, almost dou- 
bling. The overhead of creating tasks is almost as 
much as that of executing. However, most of these 
task creations occur in parallel, so an effective
speedup of approximately 85 is obtained. Indeed, if it 
were not for the long tail of low processor count, the 
speedup would have been greater. This tail is caused 
by tasks which block on their futures resuming and 
continuing the recursive computation. Therefore, 
even with very fine grain parallelism, significant 
speedup is possible. 


Figure 4-3: Tak with three futures.


EMY


EMY is an implementation of the inference
kernel of the EMYCIN expert system [1]. It represents
approximately 600 lines of Lisp code. It was discussed 
in detail in [6]. Briefly, it uses backward chaining rea- 
soning to answer queries against its database. During 
the process of making deductions, it updates a global 
database, to save conclusions for use by other rules. 
These frequent operations on a shared database or 
“blackboard” prevent much parallel operation without 
use of synchronization primitives. Note that the 
“blackboard” model of execution is used in other AI 
programs such as Hearsay. 


Nine locations were found for inserting futures in 
the EMY program. The effective speedup obtained
was negligible. The lack of speedup reflects the small
granularity of parallel tasks between operations on the
shared database. These results are particularly 
noteworthy, as the EMY program has been shown to 
allow significant speedup (by factors of ten to twenty) 
when synchronized operations are allowed on the glo- 
bal database. At the current time, it is not well 
understood how to insert automatically synchroniza- 
tion around the access to a shared data structure. 


REWRITE 


REWRITE is a simple theorem prover developed 
by Boyer and Moore to evaluate Lisp implementa- 
tions. It is also included in the Gabriel benchmarks. 
It represents approximately 100 lines of Lisp code and 
350 lines of data consisting of 104 lemmas. It was 
written with the intent of providing behavior that was 
representative of the much larger Boyer and Moore 
Theorem Prover. Theorem proving is an important 
part of the planning section of many AI programs. 
While the authors (Boyer and Moore) now have some 
doubts about the representative nature of REWRITE, 
it is the best available sample theorem prover. As it 
is shown in the Gabriel benchmarks, there are two 
places in the code which use global variables to pass 
extra values back from a function. To allow parallel 
execution, the code was changed to pass the extra 
values back as part of the function. After this 
change, most of the program is without side-effects, 
allowing automatic parallel execution.


The following theorem was selected for our measure- 
ment experiment: 


(IMPLIES 
(AND (AND P1 Q1) 
(AND (IMPLIES P1 P) 
(IMPLIES Q1 Q))) 
(OR P Q)) 
Figure 4-4: REWRITE with three futures.


Figure 4-4 shows the results of hand-insertion of 
three futures on the selected theorem. Sequential exe- 
cution requires 7.1 seconds. A speedup on the selected 
test case of approximately 48 was obtained. More 
complex theorems would yield more opportunities for 
parallel execution, so we would expect greater 
speedups. We expect that similar results could be 
obtained from programs similar to theorem provers. 


Summary of Results 


Table 4-1 summarizes the results of each experi- 
ment. The Mandelbrot program is listed in both the 
1-future and 2-future form. We have observed speed- 
ups based on automatic syntactic insertion of parallel 
constructs ranging from 1 to 850. We draw the
obvious inference that the amount of parallelism to be
obtained is strongly dependent on the application
selected.


Program        Sequential Time (secs.)   Parallel Time   Parallel Factor
Mandelbrot-1
Mandelbrot-2
[table entries not recoverable]


Table 4-1: Summary


CONCLUSIONS 


Style 


Programming style is a major determinant of the
effectiveness of these mechanical parallelizing tech- 
niques. The EMY program achieved no gain in per- 
formance due to frequent reading and writing of glo- 
bal data structures. The Mandelbrot program 
achieved massive parallelism because a well-defined 
separable unit of computation was encapsulated in a
function. TAK and REWRITE achieved their parallel 
execution by having recursive functions with several 
arguments, each requiring substantial computation. 
Other programming styles may be expected to aid or 
hinder automatic parallel execution in similar ways. 


Implications 


The approach adopted here -- (simulated)
automatic syntactic insertion of parallel constructs --
has potential both for research into parallel execution 
of some Lisp programs, and also for “real” Lisp appli- 
cation programming. We expect to continue investi- 
gation of Lisp applications to broaden our
understanding of the relationship between programming style
and achievable parallelism. 


Software engineering theorists have stressed that 
an applicative, non-side-effecting style of programming 
produces clearer and safer code. It can also be run in 
parallel with the same clarity and safety. 


The requirements of many applications force a 
“blackboard” model for efficient operation. To the 
degree that the blackboard permeates the program 
design, to that degree will automatic syntactic inser- 
tion of parallel constructs be hindered. This style of 
programming shows that the automatic parallelization 
by syntactic pre-processors is not a panacea for all
programs, but just a part of a total solution.


Approach to a Solution 


A more complete solution would include: (1) syn- 
tactic pre-processors for modifying existing programs, 
(2) interactive analyzers to guide programmers in 
modifying existing programs for parallel execution, 
and to guide them in the initial construction of new 
applications, and (3) research into semantic analyzers 
to allow automation of the parallel execution of a 
broader class of programming styles. 


In addition, it is imperative that a “cost 
analyzer” be incorporated into the pre-processor. The 
purpose of the analyzer is to predict the processing 
cost (time, memory references, bus delays, etc.) in 
order to estimate the benefits of simultaneous 
execution.
Abstract -- Present methods for exploiting parallelism in
Lisp programs perform poorly upon lists (long, flat s-expressions), 
as such structures must be both created and traversed sequen- 
tially. While such a serial operation may be masked by overlap- 
ping it with other computation (by virtue of process spawning, or 
by the use of a mechanism such as futures), it represents a lost 
(and potentially large) source of parallelism. In this paper we 
describe the representation of s-expressions employed in PARCEL 
(Project for the Automatic Restructuring and Concurrent Evalua-
tion of Lisp), which facilitates the creation and access of lists,
without compromising the performance of functions which mani- 
pulate s-expressions of a more general shape. Using this represen- 
tation, the PARCEL compiler translates Lisp programs written in 
a subset of the Scheme dialect (which allows for global variables 
and atom properties) into code for a large, tightly coupled shared 
memory multiprocessor. 


Introduction 


Conventional methods of evaluating Lisp programs on paral- 
lel architectures are heavily skewed in favor of s-expressions which 
assume the shape of (relatively balanced) trees. In particular, it is 
frequently pointed out that much parallelism may exist in a pro- 
gram, written in an applicative subset of Lisp, which makes a 
symmetrical traversal of such an s-expression. By the use of a 
construct such as the future (see [6]) the time to traverse and 
operate upon a long, flat s-expression may be masked to a degree
(if several processes are traversing the list, their operations upon it 
may be pipelined), but the essentially serial nature of such opera- 
tions cannot be overcome by such a mechanism. In this paper we 
present the details of the storage scheme employed by PARCEL 
(Project for the Automatic Restructuring and Concurrent Evalua- 
tion of Lisp). This representation is intended to allow the fast 
creation and access of s-expressions which have the shape of lists, 
while not detracting from the performance of code which traverses 
s-expressions of other shapes (such as the trees mentioned above). 
The representation has some properties which make it interesting 
apart from its implications for parallelism in Lisp. 

PARCEL is an investigation of the problem of compiling 
Lisp for evaluation on a large, tightly coupled shared memory 
multiprocessor. The project consists of a restructuring compiler, 
which produces from a program written in a subset of the Scheme 
dialect of Lisp an object code containing various directives for 
parallelism, and a run-time system, including parallel algorithms 
invoked by the compiled code, and a parallel garbage collection 
mechanism. The subset treated by PARCEL includes the side- 
effect causing mechanisms of free variables, atom properties, etc., 
as well as downward and upward funargs. 

The PARCEL compiler employs novel algorithms for the 
compile-time transformation of Lisp programs. Some of these are 
extensions of ideas developed in the PARAFRASE project (see [8], 
[12]), while others are specific to the problem of compiling Lisp. 
For instance, an algorithm for creating loops from a recursive 
function is employed, which requires neither that the function be 
tail-recursive, nor that it be side-effect free (the algorithm is 
described briefly in this paper). A complete description of the 
design of the compiler is presented in [7]. 

The machine for which we are compiling is assumed to 

Figure 1. The Cons Cell Employed in PARCEL


leaved memory via a fast interconnection network. Examples of
such machines are the NYU Ultracomputer (see [4]) and the
CEDAR machine (see [5]). We assume that the processors operate
asynchronously, but at nearly identical rates of processing. Delays
due to memory read conflicts will be ignored in estimating the
performance of the algorithms presented in this paper.


Pointers and Cons Cells 


The PARCEL compiler employs an unusual cons cell,
illustrated in Figure 1.(a) This cell consists, of course, of two pointers,
the car and the cdr. The p field within each is an address in
memory; for simplicity, let’s assume that the fundamental unit of
memory (machine word) is a single cons cell. The l field is an
integer whose use will be explained shortly.

Throughout this paper, the word pointer will indicate a pair
[p, l] such as the car and cdr in the illustrated cell. (Brackets
will be used when treating a pointer explicitly as an ordered
pair.) When we mean the location in memory of a data object, we
will use the word address; an address is but one of the two
components of a pointer. If z is a pointer, then we will use z.p to
mean the p component of z, and z.l to mean its l component.
If z points to a cons cell, then we will use z.p->car to indicate
the car pointer within the cell, and z.p->cdr to indicate its cdr
pointer. Accordingly, the components of the car pointer would be
z.p->car.p and z.p->car.l, and likewise for the cdr
pointer. (This is meant to mimic the conventions of a language
such as C or Pascal.)

Consider Figure 2. The list x = (a b c d e f) is illustrated.
Beside each arc (p component) in the figure is shown the
value of the corresponding l component. For “variable” pointers
(such as x) and car pointers, the l component indicates the
length of the pointed-to s-expression. The list x has length six;
beside the arc leaving x is shown a 6. A length of zero is
associated with pointers to atoms. The l component of a cdr pointer
has a very different meaning. Suppose that we take a double
vertical line separating a pair of adjacent cons cells to indicate that
the cells occupy adjacent locations in memory. In this example,
the sublist (a b c d) consists of a block of four cons cells in
contiguous memory locations (x.p through x.p + 3), and the
sublist (e f) occupies two consecutive memory locations. If y is


(a) For the purposes of illustration, we have omitted other tag 
bits used to indicate, for example, a numeric atom (integer), 
etc. 


(b) We will see shortly that it is also possible for a list to ter-
minate “within” a contiguous block, i.e., at a cell y such that
y.cdr.l ≠ 0.


a cons cell, then y.cdr.l indicates the number of cons cells “to
the right” of y in the contiguous block of cells containing y. In
other words, if one is traversing a list beginning at y, y.cdr.l is
the number of adjacent locations following y that will be visited
before a cdr pointer leads to a non-adjacent memory location, or
the end of the list is reached.(b) This is very similar to the
cdr-coding scheme for list storage; see [1], [10]. Cdr-coding is a
technique for reducing the average space consumed by a cons cell, and
is based on the observation that the cells in the top level of a list
frequently occupy contiguous memory locations. The storage
scheme presented here has properties not available from
cdr-coding; the motivation for its development is to facilitate the
parallel creation and access of lists, although we will see that it
allows a more flexible sharing of sublists than conventional list
representations, which may result in greater than usual efficiency
of storage. For the moment we will assume that, unlike in the
cdr-coding scheme, all cdr pointers are present and properly
assigned, even within contiguous blocks of cells.(c)


The Primitive Operations


Figure 3 shows our definitions of four primitive operations 
upon s-expressions: car, cdr, cons, and length. 

Let’s consider these definitions in turn. car(x) simply
returns the value of the cons cell to which x points. cdr(x) is a
bit more complex. Notice first that we do not use the reading of a
null pointer as an indication of list termination; if the length of x
is one, then cdr(x) returns nil, under the assumption, which
we will make throughout this paper, that all lists are
nil-terminated.(d) nil is, like all pointers, an ordered pair consisting
of the address of the atom nil and an l field of zero. If x.l >
1, then cdr(x) returns a list whose length is one less than that
of x.

Now consider the function cons(y, x). If the newly
allocated cell happens to immediately precede x in memory, then the
l component of the cdr pointer of this new cell is set to indicate
that it is part of the same contiguous block of cons cells as the

Figure 2. The List (a b c d e f)


(c) As the reader may have guessed, this is unnecessary, but the 
details of conditionally omitting these fields are not treated in 
this paper. This system is being implemented experimentally 
for the CEDAR machine [5], which has a 64 bit word; each of 
our pointers will occupy one machine word. 


(d) It is a simple matter to extend the representation to accommo-
date improper lists; see [7].


first cell in x; otherwise, this l component is set to zero, indicating
that the new cell is the rightmost in a new contiguous block of
cells. This is an unsatisfactory method of creating large contiguous
blocks, as it depends upon only one list at a time being
created by calls to cons. In other words, if calls made to cons
for the allocation of cells to be used in building a list x are
interspersed with calls for cells to be used in building y, then neither
x nor y will contain long contiguous blocks. We will return to
this matter shortly.

Finally, consider the trivial definition of length(x). It
goes without saying that if we associate the length of an
s-expression with every variable and car pointer, we may find
the length of a list in constant time.
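As an illustration only, the four primitives described above can be modeled in a few lines of Python. This is a sketch written from the textual description, not a transcription of Figure 3. The names Ptr, heap, next_free, and atom are hypothetical; atoms are represented by storing their name directly in the p field (an assumption made for brevity); and cells are allocated from high addresses downward so that a freshly allocated cell can immediately precede the list it is consed onto.

```python
from collections import namedtuple

# A pointer is the ordered pair [p, l] of the text.
Ptr = namedtuple("Ptr", ["p", "l"])
NIL = Ptr("nil", 0)                  # assumption: atoms carry their name in p

HEAP_SIZE = 64
heap = [None] * HEAP_SIZE            # heap[addr] = [car pointer, cdr pointer]
next_free = HEAP_SIZE - 1            # assumption: allocate downward

def atom(name):
    return Ptr(name, 0)              # pointers to atoms have length zero

def car(x):
    return heap[x.p][0]

def cdr(x):
    # The end of a list is found from its length, not from a null pointer.
    if x.l == 1:
        return NIL
    return Ptr(heap[x.p][1].p, x.l - 1)

def cons(y, x):
    global next_free
    addr = next_free
    next_free -= 1
    if x.l > 0 and x.p == addr + 1:
        run = heap[x.p][1].l + 1     # new cell extends x's contiguous block
    else:
        run = 0                      # new cell is rightmost in a new block
    heap[addr] = [y, Ptr(x.p, run)]
    return Ptr(addr, x.l + 1)

def length(x):
    return x.l                       # constant time

x = cons(atom("a"), cons(atom("b"), cons(atom("c"), NIL)))
print(length(x), car(x).p, heap[x.p][1].l)
```

Because the three calls to cons are not interleaved with any other allocation, the three cells land in adjacent locations and the cdr pointer of the first cell records two further cells in its block.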


More List Operations


Now let’s consider some slightly more complex list operations
as they might be defined to take advantage of our storage
scheme. Consider the operation nthcdr, as defined in Figure 4.
nthcdr(x, n) is equivalent to cddd...dr(x) (n d’s). It
works by first deciding if the cell for which it is searching is
among those in the contiguous block containing the first cell in x;
if so, it jumps immediately to that cell, and returns a list n cells
shorter than x. Otherwise, it skips directly to the end of this
contiguous block, and follows the last cdr pointer to the next
contiguous block, repeating the process. The important point is that any
cell within x may be reached within time proportional to the
number of contiguous blocks comprising x, and not proportional
to the length of x. A great deal of effort is spent by the PARCEL
compiler and run-time system in assuring that the contiguous
portions of a list are as long as possible; that is, that each list is
composed of as few contiguous segments as possible. We will
return to a discussion of some of the means of doing so.

The operation firstn(x, n) reveals an essential charac- 
teristic of our list representation. As its name indicates, firstn 
returns a list consisting of the first n cells of x. Given a conven- 
tional representation for lists, there are two ways of achieving this 
effect. The first (the usual definition of firstn) is simply to 
copy the first n cells of x, and return a pointer to this new list. 
This has no ill-effect upon the storage system as a whole, but 
necessitates the expenditure of n cells and O(n) time to copy 
them. The second means is to march out to the nth cell in x, 
and rplacd the value nil into the cell’s cdr field. This has 
the virtue of requiring no additional storage, but the disadvantage 
that the original list, along with every other list sharing the cell, is 
altered, and that it, too, takes O(n) time. 

Notice how easily this operation may be performed given our
representation. This is entirely due to the fact that the end of a
list is located, not by finding a null cdr pointer, but rather by
successively decrementing the list’s length as it is traversed. In
particular, notice that we may remove the last element of a list in
constant time, with the operation firstn(x, length(x) - 1),
which makes no examination of the list x at all. This allows us to
replace a destructive, side-effect causing operation (rplacding the
next-to-last cell of a list) with a functional one, and to reduce the
computation time by a factor of n at the same time.

Finally, consider the definition of lastn(x, n) which 
takes, like nthcdr, time proportional to (at most) the number of 
contiguous blocks of which x is comprised. 

The operations illustrated in this section are very useful
when performing traversals (especially parallel access) of an
s-expression which is essentially list-like; that is, one that is long in
the “cdr” direction. But notice that the speed of a non-linear
traversal (one which follows, say, car and cdr pointers in a
symmetrical way) has not been compromised; the functions car and
cdr have remained quite inexpensive. This is important, for, just
as it is not always the case that s-expressions describe balanced
trees, so it is not always the case that they describe long and flat
lists; we wish to use a representation which does not provide fast
access for the one shape at the expense of the other.


nthcdr(x, n) =
   if n = 0 then return(x)
   else
      dist := x.p->cdr.l;
      if dist >= n
         then return([x.p + n, x.l - n]);
         else return(nthcdr(
                 cdr([x.p + dist,
                      x.l - dist]),
                 n - dist - 1));


firstn(x, n) = return([x.p, n]);
lastn(x, n) = return(nthcdr(x, x.l - n));


Figure 4. Definitions of nthcdr, firstn, and lastn 
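The behavior of these three operations can be tried out on the list of Figure 2 with a small Python model. This is a sketch under stated assumptions, not the PARCEL implementation: Ptr, heap, lay_out, and to_list are hypothetical helpers, the addresses 10 and 20 are arbitrary, atoms are stored by name in the p field, and firstn follows the textual description (a pointer to the same first cell with length n).

```python
from collections import namedtuple

Ptr = namedtuple("Ptr", ["p", "l"])
NIL = Ptr("nil", 0)
heap = {}                            # address -> (car pointer, cdr pointer)

def lay_out(base, atoms, tail_addr):
    """Store atoms in consecutive cells from base; the last cdr leads to tail_addr."""
    n = len(atoms)
    for i, a in enumerate(atoms):
        nxt = tail_addr if i == n - 1 else base + i + 1
        heap[base + i] = (Ptr(a, 0), Ptr(nxt, n - 1 - i))

lay_out(20, "ef", "nil")             # contiguous block of two cells
lay_out(10, "abcd", 20)              # contiguous block of four cells
x = Ptr(10, 6)                       # the list (a b c d e f)

def cdr(x):
    return NIL if x.l == 1 else Ptr(heap[x.p][1].p, x.l - 1)

def nthcdr(x, n):
    # One loop iteration per contiguous block, not per cell.
    while n > 0:
        dist = heap[x.p][1].l        # cells to the right within this block
        if dist >= n:
            return Ptr(x.p + n, x.l - n)
        x = cdr(Ptr(x.p + dist, x.l - dist))   # hop to the next block
        n -= dist + 1
    return x

def firstn(x, n):
    return Ptr(x.p, n)               # no cell of x is examined

def lastn(x, n):
    return nthcdr(x, x.l - n)

def to_list(x):
    out = []
    while x.l > 0:
        out.append(heap[x.p][0].p)
        x = cdr(x)
    return out

print(to_list(nthcdr(x, 2)), to_list(firstn(x, 3)), to_list(lastn(x, 2)))
```

Here nthcdr(x, 5) touches only two cells (one per block), and firstn(x, x.l - 1) drops the last element without reading the heap at all.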


Figure 5. After Performing append (x, y) 


rplaca, rplacd, append, and nconc 


The operations rplaca and rplacd are forbidden in PAR- 
CEL. While the operation rplaca is no different to perform 
using our representation than when using a conventional one, the 
increased level of sublist sharing causes a concomitant increase in 
the likelihood of rplacaing a shared cell, and makes it practi- 
cally impossible to give a precise semantic definition of the con- 
struct. This combines with the fact that rplaca may be used to 
violate the call-by-value mechanism in Lisp to make for a very 
sticky problem. The operation rplacd is far more difficult to 
perform using this representation. The reason is that the usual 
semantics of rplacd cause it to affect the length of every list 
sharing the altered cell. Even if we could determine the location 
of every variable and car pointer to a list sharing each cell (a 
very expensive affair), we would have to know the exact position 
within the list of the cell to determine its effect on the list’s 
length. The philosophy of PARCEL is to forbid the rplac con- 
structs until (unless) they can be incorporated without a crippling 
effect on the parallelism of the resulting code. 

Two of the most important uses of rplacd are to remove 
elements from the end of a list (delete a suffix) and to add ele- 
ments to the end of a list, in both cases without causing any of the 
unaltered portion of the list to be copied. We have already seen 
that our representation provides a much faster means of doing the 
former (via firstn); let us now turn to the latter. Consider Fig- 
ure 2 again, and notice that the cdr pointer of the last cell of the 
list is never used; none of the primitives we have defined so far
examines or alters its contents.

Suppose that we wish to form the list z = append(x, y),
where x is as shown in Figure 2 and y = (g h i). We may
perform this operation by rplacding the last cell in x:(e) the p
component of this cdr pointer is set to the address of the first cell
in y, and its l component is set to zero. We return a pointer to
the first cell in x, with an l component equal to x.l + y.l.
Figure 5 shows the state of affairs after this operation. Notice
that this operation has the semantics of append, and not of
nconc; this is, as we have said, because the extent of x is
determined using its length and not by searching for a null cdr
pointer. The reader may wish to verify that the primitives defined
to this point still apply to the list x.

Figure 6 shows a definition of the operation append (x, 
y). It works as follows. First we find the last cell in x. If the 


append(x, y) =
   lastcell := last(x);
   /* lastcell is a pointer */
   if lastcell.p->cdr = nil then
      lastcell.p->cdr := [y.p, 0];
      return([x.p, x.l + y.l]);
   else return(old-append(x, y));


Figure 6. A Definition of append 


cdr pointer of this cell is nil, then we simply place the address
of y in the p component of this cdr pointer, and a zero in its l
component, indicating that y is not part of the contiguous block
containing the last cell in x. In this case, the entire operation
takes only time proportional to the number of contiguous blocks of
which x is composed. If the cdr pointer of the last cell in x is
non-null, then we must use the old (conventional) definition of
append (called old-append here), which copies x to a new
location. If, for instance, after performing z = append(x, y)
as above, we attempt to form q = append(x, y), we will find
that the cdr pointer of the last cell in x is non-null, forcing us to
perform the old version of append. Notice that if x is
represented in a small (constant) number of blocks,(f) it may
be copied in parallel with linear speedup to a new (contiguous)
block of cons cells. Even if this is done sequentially, by copying it
to a contiguous block, we guarantee that later accesses to x will
be made faster. (It is easy to allocate a contiguous block, since we
know x’s length.)
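The two cases of append can be exercised with the following Python sketch. It is an illustrative model only: Ptr, heap, allocate, make_list, and to_list are hypothetical helpers, atoms are stored by name in the p field, old_append is a plain sequential copy of the top level of x into one fresh contiguous block, and the guard for an empty x is an added assumption that the figure does not address.

```python
from collections import namedtuple

Ptr = namedtuple("Ptr", ["p", "l"])
NIL = Ptr("nil", 0)
heap = []                                  # heap[addr] = [car pointer, cdr pointer]

def allocate(n):
    base = len(heap)
    heap.extend([None, None] for _ in range(n))
    return base                            # n fresh cells at base .. base + n - 1

def make_list(atoms):
    n = len(atoms)
    base = allocate(n)
    for i, a in enumerate(atoms):
        nxt = Ptr(base + i + 1, n - 1 - i) if i < n - 1 else NIL
        heap[base + i] = [Ptr(a, 0), nxt]
    return Ptr(base, n) if n else NIL

def cdr(x):
    return NIL if x.l == 1 else Ptr(heap[x.p][1].p, x.l - 1)

def nthcdr(x, n):
    while n > 0:
        dist = heap[x.p][1].l
        if dist >= n:
            return Ptr(x.p + n, x.l - n)
        x = cdr(Ptr(x.p + dist, x.l - dist))
        n -= dist + 1
    return x

def old_append(x, y):
    base = allocate(x.l)                   # one contiguous block for the copy
    src = x
    for i in range(x.l):
        last = i == x.l - 1
        link = Ptr(y.p, 0) if last else Ptr(base + i + 1, x.l - 1 - i)
        heap[base + i] = [heap[src.p][0], link]
        src = cdr(src)
    return Ptr(base, x.l + y.l)

def append(x, y):
    if x.l == 0:                           # assumption: empty x returns y
        return y
    lastcell = nthcdr(x, x.l - 1)
    if heap[lastcell.p][1] == NIL:         # unused cdr: safe to overwrite
        heap[lastcell.p][1] = Ptr(y.p, 0)
        return Ptr(x.p, x.l + y.l)
    return old_append(x, y)                # already used: copy x instead

def to_list(x):
    out = []
    while x.l > 0:
        out.append(heap[x.p][0].p)
        x = cdr(x)
    return out

x = make_list("abcdef")
z = append(x, make_list("ghi"))            # shares every cell of x
q = append(x, make_list("jk"))             # last cdr of x is taken: x is copied
print(to_list(x), to_list(z), to_list(q))
```

After both calls, x still reads as (a b c d e f), z shares its first six cells with x, and q lives in a freshly allocated block.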

This definition of append affords many of the advantages of 
nconc, without the side-effects to other lists for which that opera- 
tion is well-known. Of course, there are occasions for wishing to 
alter several lists, using “back-door” side-effects, via such con- 
structs as rplacd and nconc; but, as we have said, our primary 


(e) Ironically, forbidding the use of rplacd has allowed us to 
make use of it in a way that is impossible when it is permit- 
ted. 


(f) A possibility worth exploring is to add to each car and cdr 
pointer a field indicating the number of contiguous blocks of 
which the referenced s-expression is composed. The ratio of 
the length of the list to this value is an estimate of the poten- 
tial speedup (number of useful processors) of a traversal of the 
list, such as that done during copying. It is easy to incor- 
porate this information into the return value from cons, 
append, etc. One difficulty comes when using firstn as we 
have defined it (there is no way to see the resulting number of 
blocks without examining the list), but if the number of blocks 
in a list is used for purposes of estimation, then we may 
assume that the ratio of length to number of blocks for a par- 
ticular list remains constant, and adjust the latter accordingly. 
Such a number might also be used to determine when garbage 
collection should be applied (our garbage collection algorithm 
puts every list in the system into contiguous form, space per- 
mitting). But we digress. 


motivation in developing this representation of s-expressions is to 
facilitate the parallel reading and writing of lists. It is hoped that 
the advantage of being easily able to do so will outweigh the 
advantage of permitting such aliased side-effects. 

The definition (of append) is not yet satisfactory, however;
as given in Figure 6, it decides dynamically whether its operation
can be done using “nconc”, or must be done by copying its first
argument. If, however, there is a sufficient number of processors
at hand, it may be better to copy both arguments to a single
contiguous block than to copy only one or none. In the extreme case,
suppose we have two lists of length n/2 (which are, say, in fully
contiguous form), and n processors with nothing to do. Clearly it
would be best to put these to use by copying these lists (in one
time step) to form a single contiguous list.(g) Experimentation is
needed in designing a well-behaved heuristic for deciding when
such copying should be done; for the moment, let us say that if
the number of available processors P is greater than n/k, where k
is some small constant and n is the sum of the lengths of the
arguments to append, then both arguments will be copied;
otherwise the definition of Figure 6 will be applied. (This seems an
ideal application for information concerning the number of blocks
composing a list; see footnote (f) above.)


Parallel Creation of Lists


In this section we will show, in an abstract setting, how some 
recurrences involving list operations may be solved in parallel; the 
examples will seem, no doubt, removed from the realities of Lisp 
programming. Afterwards, we will suggest briefly the way our 
compiler arranges for the detection and solution of such 
recurrences in the context of real Lisp programs. The example of 
Figure 5 suggests the possibilities for sublist sharing allowed by 
this representation. Notice that we have done something impossi- 
ble with ordinary representations: two lists may share a cell 
without sharing the entire subexpression rooted at that cell. It is 
this important property which will allow us to solve some (ordi- 
narily very costly) recurrences involving list operations with 
greater than linear speedup (when compared to conventional 
representations and solutions). 

Figure 7 shows a simple recurrence, consisting of a loop in
which x is traversed, cell by cell. When f(i) is true, x is
advanced to the next cell; otherwise it is unaltered. Let us
suppose that we wish to know the value of x after every iteration of
this loop (say, for input to another parallel loop). We may treat
x as an array x[] of length n (a technique called scalar
expansion; see [12], [7]). x[0] will be synonymous with the starting
value of x, and x[i] will be the value of x after iteration i of
the loop in Figure 7. In this case the code of Figure 8 will suffice.
(A doall loop is one in which all iterations may be executed
simultaneously; that is, given n processors, where n is the
number of iterations of the doall loop, the loop will complete in
the time required by its longest running iteration. See [2].) After
loop (2), d[i] will indicate the number of times cdr has been
applied to x in producing x[i]. This computation of the vector
d[] embodies a familiar technique called parallel prefix; see [8].
Assuming that x is contained in a small (constant) number of
contiguous blocks, the speedup obtained is $O(n / \lg n)$. We can do
much better than this if n is large relative to the number of
processors (P) allocated to this loop; in fact, if $n > P \lg P$, then we
can achieve constant efficiency (a running time of $O(n/P)$) by
operating on the vector d[] in subsequences of length $n/P$ on
each processor, before and after the “recursive doubling”. Figure
9 shows the improved version of the algorithm in loop (2), which
will now have a running time of $T_P = O(n/P + \lg P)$. The vector
D[] has length P + 1; the algorithm works in the following way.


(g) This has no effect upon the semantics of append; however, it
(once again) affects the amount of sublist sharing (this time,
by decreasing it), which, in turn, affects the use of rplaca.
do i = 1 to n
   if f(i) then x := cdr(x);


Figure 7. A Simple Recurrence Involving cdr 


(1) doall i = 1 to n
       if f(i) then d[i] := 1;
       else d[i] := 0;
    dist := 1;
(2) do j = 1 to ⌈lg n⌉
       doall i = 1 to n
          if i > dist then
             d[i] := d[i] + d[i-dist];
       dist := dist * 2;
(3) doall i = 1 to n
       x[i] := nthcdr(x[0], d[i]);


Figure 8. A Parallel Version of Figure 7 
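Loops (1) and (2) of Figure 8 can be simulated sequentially in Python to check that the recursive doubling really produces the running count of cdr applications. This is a sketch only: the function name prefix_doubling is hypothetical, and each doall step is modeled by reading from a snapshot of d taken before the step, which is the assumption that all iterations of a doall read their operands before any of them writes.

```python
import math
from itertools import accumulate

def prefix_doubling(flags):
    """Return d[1..n], where d[i] counts the true flags among f(1)..f(i)."""
    n = len(flags)
    d = [0] + [1 if f else 0 for f in flags]      # loop (1); d[0] is unused
    dist = 1
    steps = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(steps):                        # loop (2)
        old = d[:]                                # snapshot models the doall
        for i in range(1, n + 1):
            if i > dist:
                d[i] = old[i] + old[i - dist]
        dist *= 2
    return d[1:]

flags = [True, False, True, True, False]
d = prefix_doubling(flags)
print(d)
assert d == list(accumulate(int(f) for f in flags))
```

Loop (3) then needs only d[i] and the starting pointer: each x[i] is nthcdr(x[0], d[i]), and all n of these calls are independent of one another.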


doall i = 1 to P
   do j = (i-1)·⌈n/P⌉ + 2 to i·⌈n/P⌉
      d[j] := d[j] + d[j-1];
   D[i] := d[i·⌈n/P⌉];
dist := 1;
do j = 1 to ⌈lg P⌉
   doall i = 1 to P
      if i > dist then
         D[i] := D[i] + D[i-dist];
   dist := dist * 2;
D[0] := 0;
doall i = 1 to P
   do j = (i-1)·⌈n/P⌉ + 1 to i·⌈n/P⌉
      d[j] := d[j] + D[i-1];


Figure 9. An Efficient Algorithm for Parallel Prefix 


Processor i, $1 \le i \le P$, forms the sum
$d[(i-1)\lceil n/P \rceil + 1] + d[(i-1)\lceil n/P \rceil + 2] + \ldots + d[i \lceil n/P \rceil]$. Then,
in the middle loop in the figure, we form the partial sums
$D[i] = d[1] + d[2] + \ldots + d[i \lceil n/P \rceil]$. Finally, we add the
value D[i-1] into each of $d[(i-1)\lceil n/P \rceil + 1]$ through $d[i \lceil n/P \rceil]$
in the final doall loop. This is a well-known method for folding
the parallel prefix solution onto a limited number of processors; see [8].
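The three phases just described can be checked with a sequential Python simulation. This is a sketch, not the compiler's generated code: blocked_prefix is a hypothetical name, the per-processor loops are simply run one after another, the middle doall is modeled with a snapshot, and block bounds are clipped to n so that the case where P does not divide n (an assumption not spelled out in the text) is handled.

```python
import math
from itertools import accumulate

def blocked_prefix(values, P):
    """Prefix sums of values using P blocks of at most ceil(n/P) elements each."""
    n = len(values)
    b = math.ceil(n / P)
    d = [0] + list(values)                 # d[1..n]
    D = [0] * (P + 1)                      # D[0..P]

    # Phase 1: each "processor" i scans its own block sequentially.
    for i in range(1, P + 1):
        lo, hi = (i - 1) * b + 1, min(i * b, n)
        for j in range(lo + 1, hi + 1):
            d[j] += d[j - 1]
        D[i] = d[hi] if lo <= hi else 0    # block total (0 for an empty block)

    # Phase 2: recursive doubling over the P block totals.
    dist = 1
    while dist < P:
        old = D[:]                         # snapshot models the doall
        for i in range(1, P + 1):
            if i > dist:
                D[i] = old[i] + old[i - dist]
        dist *= 2
    D[0] = 0

    # Phase 3: add the total of all preceding blocks into each block.
    for i in range(1, P + 1):
        lo, hi = (i - 1) * b + 1, min(i * b, n)
        for j in range(lo, hi + 1):
            d[j] += D[i - 1]
    return d[1:]

v = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
print(blocked_prefix(v, 3))
assert blocked_prefix(v, 3) == list(accumulate(v))
```

Phases 1 and 3 each cost about n/P steps per processor and phase 2 costs about lg P steps, which is where the running time quoted for Figure 9 comes from.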

Now consider the example in Figure 10. Here, we are adding
the list x[i] to the end of y during each iteration of the
loop. (x[i] might, for example, be a value such as
f(car(nthcdr(z, i-1))), for some other list z, as computed
by a loop such as that in Figure 8.) Notice how the computation
has been broken into a series of parallel steps; this is the result of
loop distribution; see [8], [12], [7]. Figure 11 shows the parallel
version of this code, in which the value of y after every iteration
of the original loop (y[i] is the value after the ith iteration) is
produced. In loops (1) and (2), we compute d[i], $1 \le i \le n$,
which equals the number of cells added to (the original) y after
iteration i of the original loop. Next, we allocate a contiguous
block of cons cells of length d[n], i.e., of length equal to the total
number of cells added to y. This block is pointed to by z. In
loop (3), we copy each appended sublist to its destination within
z. The function copy_to takes as arguments a list (pointer) and
a destination address; it copies each cell in the top level of the list
into contiguous locations beginning at the destination address. In
addition, it causes the cdr pointer of the last cell in the duplicate
sublist to point to the next location in memory (the destination
address of the next sublist to be copied). After explicitly setting
the cdr pointer of the last cell in z to nil, we append z to y
(see discussion of append above). Finally, we compute the value
of y[i], for $1 \le i \le n$; this is simply a pointer to y (recall


do i = 1 to n
   y := append(y, x[i]);


Figure 10. A Simple Recurrence Involving append 


(1) doall i = 1 to n
       d[i] := x[i].l;
    dist := 1;
(2) do j = 1 to ⌈lg n⌉
       doall i = 1 to n
          if i > dist then
             d[i] := d[i] + d[i-dist];
       dist := dist * 2;
    z := allocate(d[n]);
    d[0] := 0;
(3) doall i = 1 to n
       copy_to(x[i], z.p + d[i-1]);
    (z.p + d[n] - 1)->cdr := nil;
    y := append(y[0], z);
(4) doall i = 1 to n
       y[i] := [y.p, y[0].l + d[i]];


Figure 11. A Parallel Version of Figure 10 


that we are adding cells to the end of y), with length equal to the 
sum of the lengths of (the original) y, and the number of cells 
added during iterations 1 through i of the original loop. 

To make this procedure more tangible, let’s suppose that n
= 4, that y (initially) equals (a b c), and that x[0] = (d
e), x[1] = (f g h), x[2] = nil, and x[3] = (i j).
The initial state of affairs is shown in Figure 12, and the state
after execution of the code in Figure 11 is shown in Figure 13.
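The same example can be traced with a short Python stand-in. This sketch deliberately replaces cons cells with ordinary Python lists, so it models only the offset arithmetic of Figure 11 and not the pointer representation: parallel_append is a hypothetical name, the prefix sums are computed with itertools.accumulate rather than by recursive doubling, and each intermediate y[i] is represented by its length over one shared buffer, mirroring the pointer [y.p, y[0].l + d[i]].

```python
from itertools import accumulate

def parallel_append(y0, xs):
    """Return the shared buffer and the length of every intermediate y[i]."""
    n = len(xs)
    # Loops (1) and (2): d[i] = number of cells added after iteration i.
    d = [0] + list(accumulate(len(x) for x in xs))
    # One contiguous block z holds every appended cell.
    z = [None] * d[n]
    # Loop (3): each copy writes a disjoint range, so all n are independent.
    for i in range(1, n + 1):
        z[d[i - 1]:d[i]] = xs[i - 1]
    shared = list(y0) + z                  # z is appended to y exactly once
    # Loop (4): y[i] is the same start with a different length.
    lengths = [len(y0) + d[i] for i in range(n + 1)]
    return shared, lengths

y0 = list("abc")
xs = [list("de"), list("fgh"), [], list("ij")]
shared, lengths = parallel_append(y0, xs)
print(shared, lengths)
```

Every intermediate value is a prefix of the one buffer; for instance, the value of y after the second append is shared[:lengths[2]].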

In the general case, let us assume that each of the x[i]’s to
be appended, along with the original y, has length of
approximately m, and that (the original) y is represented using a small
(relative to n) number of contiguous blocks. Then the time to
perform the operation of Figure 11 using P processors is

$$T_P = O\left(\frac{n \lg n}{P}\right) + O\left(\frac{m n}{P}\right).$$

With the improved version of parallel prefix described above, it is

$$T_P = O\left(\frac{n}{P} + \lg P\right) + O\left(\frac{m n}{P}\right).$$

The number of cells used in representing the resulting lists is $O(nm)$.

Using a conventional representation, and means of evaluation,
this operation would take $T_1 = O(mn^2)$.(h) Apparently in this
(arguably anomalous) case, we have achieved a speedup of $O(nP)$,
suggesting that considerable improvement could be made in the
usual (sequential) means of evaluating such a recurrence.

It is straightforward to extend this type of parallel list opera- 
tion to the operator cons, or to the case of appending onto the 
front of a list; in these cases, we must be satisfied with (at best) 
linear speedup. Notice that if, within a loop such as that in Fig- 
ure 7 or 10, we decrease the length of a list by a constant number 
of cells per iteration (via the operation cdr) or increase its length 
by a constant number of cells per iteration (with cons or 
append), that there is no need to use parallel prefix to compute 
the lengths of the intermediate values; in such a case, the length of 
the list behaves as an induction variable (i.e., it describes an arith-
metic sequence).


(h) Note that this is true even if nconc were used; however, as 
nconc is destructive, only the final value of y could be con- 
structed in this manner. It is arguably rare to repeatedly 
append to the end of a list; yet it may be that this is because 
it is ordinarily so expensive (for it seems a natural enough 
thing to do). 
Figure 13. The State After Performing Figure 11


It is an important property of the code generated by the
PARCEL compiler that every parallel list-creating operation
results in a single, contiguous block of cells, such as z in the
above example. This is because such operations make a single call
to the function allocate. As mentioned, the performance of
code in which the elements of a list must be accessed in parallel
depends upon the existence of large, contiguous blocks;
consequently, it may be useful to apply techniques which increase the
contiguity of lists created sequentially, by repeated calls to cons.
See [1], [7].


The Use of the Storage Scheme in Compiled Lisp Code


Consider the definition of union in Figure 14. The func- 
tion is recursive, but not (strictly) tail-recursive, as cons may be 
called after returning from a recursive invocation. The first step 
taken by our compiler is to build a representation of the function, 
which is, for the most part, a standard flow-of-control graph. See 
Figure 15. The variable r is a local temporary variable, used to 
hold the return value of an invocation of the function. The first 
optimization to be performed by the compiler in this case is called 
recursion splitting. Recursion splitting is a very general technique 
for forming loops from recursive functions. It does not require 
that the function be tail recursive, nor that it be side-effect free. 


The basic idea is to select a set of recursive nodes of the function
(nodes in which there is a recursive call to the function) such that
there is at most one member of the set along any path through the
graph of the function. Then the graph is “split” into two loops
(called the forward and backward loops), using these nodes as the
points of separation. See Figure 16, in which an abstract
function graph is shown. The recursive nodes are shown as double
circles, labeled 1, 2, 3, and 4. Suppose that we choose to split
using nodes 1 and 4;(i) the result would look something like Figure
17. The first loop in this figure performs the operations of the
function as recursive calls are made from members of the selected
set of recursive nodes (that is, the portion of the computation of
the original function which occurs as the stack becomes deeper due
to these calls); the second loop performs the operations of the
function as those calls unwind. Recursion splitting works hand- 
in-hand with our techniques for recurrence recognition and solu- 


(def union (lambda (a b)
  (cond [(null a) b]
        [(member (car a) b)
         (union (cdr a) b)]
        [t (cons (car a)
                 (union (cdr a) b))])))


Figure 14. The Function union 


[Flow-of-control graph of union. Nodes: null(a)?; member(car(a), b)?;
union(cdr(a), b) (twice, once per recursive node); cons(car(a), r).]


Figure 15. The Graph of union 


Figure 16. A Function Graph, with Recursive Nodes Indicated 




Figure 17. The Graph of Figure 16, After Recursion Splitting 


tion. For the transformation to be successful, we must be able to 
form a recurrence which is used to determine, prior to entering the
loops we have created, the number of iterations of each (they have 


(3) The choice of nodes with which to split a function depends 
upon the recurrences which will result among the variables in 
the created loops. We wish to find a set of nodes such that 
the resulting recurrences may be solved in parallel. (We speak 
of recurrences, but it may be that the variables describe a 
very simple pattern, such as an arithmetic sequence, or that 
their values from one iteration to the next are entirely 
independent; recurrences are the most general case treated by 
the technique.) 
the same number of iterations, just as, in the original function, we
must have as many returns from, as we have calls to, the func-
tion). This number is equal to the number of consecutive calls
made from the nodes used to split the graph. Expanded variables
(in the case of Lisp, single pointer variables turned into vectors of
pointers) simulate the action of the stack upon the local variables
of the function; that is, instead of using a stack to record the state
of variables before and after those recursive calls made from those
nodes used to split the graph, we represent each variable as a vec-
tor, and write the value of the variable before the ith such recur-
sive call into the ith position of the vector. As a consequence, it
is much easier to access and manipulate these values in parallel,
than if they were dispersed across the run-time stack.

Note that, in this abstract example, the recursive calls in
nodes 3 and 4 remain in the modified graph of the function. They
become recursive calls to the transformed function, which means 
that there will be dynamically nested parallel loops if the forward 
and backward loops are amenable to further optimization. If not, 
that is, if the bulk of the computation in these loops remains 
sequential, then we would probably be better off with a more 
direct translation of the original function definition. The purpose 
of recursion splitting is to introduce parallelism, and not to 
remove recursion per se. 

Figure 18 shows what union might look like after recursion 
splitting. Each of a, b, and r is turned into a vector of pointers 
(in the subsequent transformations, we will turn around and 
restore b and r to single pointers). The vector e[] is used to 
record, for each iteration of the forward loop, which of the recur- 
sive nodes is reached. This information is used by the correspond- 
ing iteration of the backward loop to allow the computation to 
resume at the correct point. In terms of the original function, this 
corresponds to the fact that, before a recursive call, a record is 
made on the stack of the point from which the call is being made, 
so that execution may proceed from that point when the stack 
frame is restored. Notice that, intuitively, the backward loop runs 
“in the opposite direction” of the forward loop; that is, increasing 
the iteration number of the forward loop corresponds to moving 
deeper on the run-time stack, while the opposite holds true for the 
backward loop. 
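
The forward and backward loops can be illustrated with a small sequential model. This is only a sketch, not the compiler's output: Python lists stand in for Lisp lists, the names union_recursive, union_split, av, and e are assumptions of the sketch, and r is kept as a single value rather than a vector.

```python
def union_recursive(a, b):
    """Direct transcription of the recursive definition of union."""
    if not a:
        return b
    if a[0] in b:
        return union_recursive(a[1:], b)
    return [a[0]] + union_recursive(a[1:], b)


def union_split(a, b):
    """Recursion-split form: a forward loop followed by a backward loop."""
    n = len(a)                    # number of consecutive recursive calls
    av = [None] * (n + 1)         # av[i-1]: value of `a` before the i-th call
    e = [0] * (n + 1)             # e[i]: recursive node reached in iteration i
    av[0] = a

    # Forward loop: the work done as the stack would grow deeper.
    for i in range(1, n + 1):
        prev = av[i - 1]
        e[i] = 1 if prev[0] in b else 2   # 1: member branch, 2: cons branch
        av[i] = prev[1:]

    # Backward loop: the work done as the calls would unwind.
    r = b                         # value returned by the deepest call
    for i in range(n, 0, -1):
        if e[i] == 2:
            r = [av[i - 1][0]] + r        # the cons that follows the call
    return r
```

For a = [1, 2, 3, 4] and b = [2, 4, 5], both functions return [1, 3, 2, 4, 5]. The vector e plays the role of the return addresses that the stack would otherwise hold.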

The code finally produced for union, after performing 


[Graph of union after recursion splitting. Legible node labels:
a[i] := cdr(a[i-1]); b[i] := b[i-1]; i' := n-i+1; e[i']?; r[i] := r[i-1].]


Figure 18. The Graph of union, After Recursion Splitting 


many other manipulations of its graph would look something like 
that shown in Figure 19. We will describe each of the doall
loops in turn, assuming that we have P < n processors with which 
to perform the function, where n is the length of (the starting 
value of) a. (As is frequently the case, no recurrence solution is 
needed to compute the number of iterations of these loops, as it is 
simply the length of a list; and because this length is recorded 
within the pointer to the list, the time to “compute” the loops’ 
upper bound is constant.) Loop (1) assigns all of the intermediate 
values of a in the loop. As we saw, if a is represented in a con-
stant number of blocks, this will take T_P = O(n/P). The compiler
recognizes that the length of a behaves as an induction variable, 
and knows therefore that its intermediate values may be computed 
without the use of a recurrence solution technique. Loop (2) 
assigns the value of e[i], which records the number of the recur-
sive node reached during iteration i of the forward loop. This is
also very straightforward for the compiler, as it sees that the value
of the variable e (which we have “expanded” into a vector) dur- 
ing each iteration is independent of its value in all others. Loops 
(1) and (2) were created by “loop distribution” from the forward 
loop (see [12], [7]). (In reality, these loops would be “fused” into a 
single doall loop; see [11].) Loops (3), (4), and (5) were created 
by distributing the backward loop. Loop (3) places an integer (1 
or 0) into each location in the vector d[], indicating whether a 
cons cell is added to r during a particular iteration of the back- 
ward loop. The compiler actually translates this in a more general 
way, as the length of the sublist appended to the front of r dur- 
ing each iteration. This allows the process of recurrence recogni-
tion to be made quite formal, as the compiler maps all associative
recurrences involving list operations onto arithmetic recurrences 
involving the intermediate values of the length of the recurrence 
variable. Loop (4) is the familiar and inefficient version of parallel 
prefix used to form the values d[1], d[1] + d[2], etc. In the 
code generated by the compiler, this would be a function call and 
not written explicitly into the code. Before entering loop (5), 
space is allocated for the cons cells added by union to the front 
of b. In loop (5), each added cell is copied into its final resting 
place within this contiguous block of cells. Again, the compiler 
translates this in a more general way; a doall loop is generated, 
in which every sublist appended to the front of the return value 
of the function is placed at its destination position within z; a 
sublist of length zero causes no such placement to occur. We 
append the block z to the front of b, and return the result. 
Using an efficient algorithm for parallel prefix, the whole process
will take T_P = O(n·T(member)/P) + O(lg P), where T(member) is
the time for a call to member. Now, member is also quite easily
made parallel; it, in turn, calls equal, which is as well. However,
in reality a function such as union is likely to appear near the
“bottom” of the calling graph of a program, meaning that unless
we have either a very large machine, or only procedures which we
cannot make parallel “above” union in the calling graph, union
is unlikely to receive a large number of processors. If it is always
evaluated sequentially, a translation closer to the original would
be better, as (for one thing) the code produced will be smaller
than for the transformed version.

Finally, let’s examine the definitions of quicksort and
split given in Figure 20. First consider the driving routine,
quicksort. Recursion splitting fails here, as the recurrence
defined by the operation split cannot be solved in parallel by
our methods. (In effect, the compiler sees a recurrence such as
l[i+1] = car(split(cdr(l[i]), car(l[i]), nil, nil)),
which must be solved sequentially.) That attempt failing,
the compiler inserts a forking directive, causing the two recursive
calls to quicksort to be executed simultaneously, resulting in a pro-
cess tree.

Now consider the definition of split. Recursion splitting
succeeds here (trivially), resulting in a function which takes
    n := a[0].l;
(1) doall i = 1 to n
        a[i] := nthcdr(a[0], i);
(2) doall i = 1 to n
        if member(a[i], b) then e[i] := 1;
        else e[i] := 2;
(3) doall i = 1 to n
        i' := n-i+1;
        if e[i'] = 1 then d[i] := 1;
        else d[i] := 0;
    dist := 1;
(4) do j = 1 to ⌈lg n⌉
        doall i = 1 to n
            if i > dist then
                d[i] := d[i] + d[i-dist];
        dist := dist * 2;
    z := allocate(d[n]);
    d[0] := 0;
(5) doall i = 1 to n
        i' := n-i+1;
        if e[i'] = 1 then
            (z.p + d[i'] - 1)->car :=
                car(a[i'-1]);
            (z.p + d[i'] - 1)->cdr :=
[2.p * @f[it]> 261 —-d[i'}]: 
r := append(z, b); 


return (r); 


Figure 19. A Parallel Translation of union 
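
Loop (4) of Figure 19 is the step-doubling form of parallel prefix. The sketch below models it sequentially in Python with 0-based indices; the name prefix_sums_by_doubling is an assumption of the sketch, as is the snapshot taken before each pass, which is added so that every iteration of the inner loop reads only values from the previous pass and the iterations are therefore independent.

```python
def prefix_sums_by_doubling(d):
    """Inclusive prefix sums in ceil(lg n) passes of independent updates."""
    n = len(d)
    d = list(d)
    dist = 1
    passes = (n - 1).bit_length() if n > 0 else 0   # equals ceil(lg n)
    for _ in range(passes):
        prev = d[:]                 # all reads come from the previous pass
        for i in range(n):          # conceptually a doall
            if i >= dist:
                d[i] = prev[i] + prev[i - dist]
        dist *= 2
    return d
```

Applied to the 0/1 vector produced by loop (3), the result gives each retained element its position within the newly allocated block, which is the information loop (5) relies on. Each pass performs O(n) additions, so the total work is O(n lg n); this is the sense in which this version is inefficient.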


(def quicksort (lambda (l)
  (cond
    [(null l) nil]
    [t (let ((l_and_r (split (cdr l) (car l)
                             nil nil)))
         (append (quicksort (car l_and_r))
                 (list (car l))
                 (quicksort (cadr l_and_r)))
       )])))


(def split (lambda (l partition left right)
  (cond [(null l) (list left right)]
        [(lessp (car l) partition)
         (split (cdr l) partition
                (cons (car l) left) right)]
        [t (split (cdr l) partition
                  left (cons (car l) right))]
        )))


Figure 20. The Function quicksort 


T_P = O(n/P + lg P) on P processors, where n is the length of
the first argument to split. The fact that split is tail recur-
sive means that the backward loop will be empty (non-existent).
Now let’s consider how the transformed versions of the func-
tions would work together. Assume that n processors are avail-
able to quicksort, where n is the length of the list to be
sorted; all participate in the operation split, which thus
takes T_n = O(lg n) given an input list represented in a constant
number of blocks. In the recursive calls to quicksort, assume
that the processors are again divided according to the lengths of
the sublists to be recursively sorted. Upon the return (join) of
these two recursive calls, all n processors are available to perform
the append operation; hence the two sorted sublists, along with
the list consisting of the single partitioning element, are copied to
a single contiguous block of storage (see the discussion above, con-
cerning parallel copying during append operations). Assuming
that we have input lists in random order, it can be shown that the
average time to perform the entire sort will be T_n = O(lg^2 n). In
the case in which there are P < n processors available, the aver-
age running time is T_P = O(n lg n / P) + O(lg P · lg n) - O(lg^2 n).


There is an interesting point here, concerning append. 
Each recursive call to quicksort returns a list which is in a single 
contiguous block. Because the complexity of split is 
O(n/P + lg P), the time to copy the two results returned by recur- 
sive calls into a single contiguous block (O(n/P)) will always be less 
than the time to split the original list which was passed to those 
calls, meaning that the complexity of the algorithm is not affected 
adversely by the use of the “copying” version of append. On the 
contrary, it is only possible to achieve this running time when the 
copying version of append is used, for otherwise the recursive 
calls will return lists consisting of (within a constant factor) the 
same number of blocks as elements, causing append to require 
linear time. 
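
The structure of the transformed program can be sketched sequentially as follows. This is an illustration under stated assumptions, not the generated code: Python lists stand in for contiguous blocks, split is written as the loop that its tail recursion becomes, and the two recursive calls, which the compiler would fork, are simply made one after the other.

```python
def split(l, partition):
    """The tail-recursive split written as a single (forward) loop."""
    left, right = [], []
    for x in l:
        if x < partition:
            left.insert(0, x)     # (cons (car l) left)
        else:
            right.insert(0, x)    # (cons (car l) right)
    return left, right


def quicksort(l):
    if not l:
        return []
    left, right = split(l[1:], l[0])
    # These two calls are independent of each other; they are the
    # calls a forking directive would run simultaneously.
    sorted_left = quicksort(left)
    sorted_right = quicksort(right)
    # "Copying" append: both results and the partitioning element are
    # copied into one freshly allocated contiguous block.
    k = len(sorted_left)
    out = [None] * (k + 1 + len(sorted_right))
    out[:k] = sorted_left
    out[k] = l[0]
    out[k + 1:] = sorted_right
    return out
```

Because every call returns a single contiguous block, the copy at each level touches each element once and can be divided evenly among the available processors.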


Conclusion 


We have presented a storage scheme for s-expressions which 
facilitates the parallel creation and use of lists. The representa- 
tion may be inexpensively used to represent s-expressions of more 
general shape, and allows a more flexible sharing of sublists than 
conventional representations. The PARCEL compiler, assuming 
this representation, produces code for a shared memory multipro- 
cessor from programs written in a subset of the Scheme dialect of 
Lisp. 
Abstract -- The Classifier System is a parallel computational
model that supports two important aspects of machine intelligence:
machine learning algorithms and operations on high-level
“knowledge” structures such as those frequently used in artificial
intelligence research. The Classifier System formalism is presented
and previous research using the Classifier System in conjunction
with two learning algorithms is reviewed. A Classifier System
implementation of the knowledge representation language KL-ONE
is described, and analytical results are described that show that the
implementation exploits the parallelism of the Classifier System.


Introduction 


Researchers in artificial intelligence (AI) are exploring various 
models of parallelism, both for the purpose of improving 
performance of AI programs and to express models of intelligent 
behavior. The possibility that parallel architectures could 
implement AI models directly has encouraged the development of 
parallel models that might have both efficient hardware 
implementations and plausibility as intelligent systems. 


Performance improvements can be obtained by using 
parallelism to improve the execution speed of current languages in 
which most AI programs are implemented. Projects to construct 
parallel implementations of Lisp [12], [10] and Prolog [4] fall into 
this category. Parallel hardware is also being designed to help 
traditional AI programs execute more efficiently. The Butterfly 
machine under development at Bolt Beranek and Newman, and the 
FAIM [6] and NON-VON [19] machines are three examples of such 
hardware projects. 


A second approach is to consider new intermediate 
representations out of which intelligent systems can be built.
Projects such as ABE [7] provide computational units that can be 
readily composed into large knowledge systems and are amenable 
to implementation on many different multi-processing systems. 


Yet another approach is to build machines that implement
parallel models of intelligent behavior more or less directly. The 
most prominent example of this is the Connection Machine 
architecture [13] being developed by Thinking Machines Inc. 
However, there are several other proposed computational models 
for which hardware implementations do not yet exist that claim to 
express some aspect of intelligence. These include the Classifier 
System [18], the Boltzmann machine [14], and the NETL machine 
[8]. These models all have some features in common, such as 
massive parallelism, processors with limited computational power, 
and a high degree of connectivity between processors. The 
Classifier System and the Boltzmann machine are notable in that 
the models have been designed expressly to support learning; 
however, it has not been clear how well these two “fine-grained”
models will be able to represent and process high-level symbolic 
structures. 


The research described in this paper explores the question of 
how well the Classifier System can represent and process high-level 
symbolic knowledge structures. To explore this issue, I selected a 
high-level knowledge representation language (KL-ONE) that has 
well-defined algorithms for accessing and manipulating information
stored in knowledge structures, and studied how well the Classifier
System could support its facilities. The remainder of the paper is
divided into five sections: The Classifier System, Methodology,
Results, Discussion, and Conclusions.


The Classifier System 


The Classifier System [18] is a massively parallel model 
developed by John Holland in which each computational unit 
(classifier) has very limited processing power, and in which there is 
no global supervisor. Every processor can communicate with every 
other processor, and, as will be shown, the extent of this 
communication can be controlled easily. Each processing element 
is highly standard, which makes the Classifier System a promising 
candidate for implementation in hardware. The Classifier System 
is well suited for problems in artificial intelligence because (1) it 
possesses a high degree of parallelism, (2) it allows processors to 
communicate with one another flexibly, (3) learning mechanisms 
can be applied at several levels within the system, and (4) high- 
level symbolic representations can be expressed within the 
formalism. 


The Classifier System is modeled on the idea of production 
systems [5]. It contains a data base of production rules, called 
classifiers. Each classifier can perform one action, that of adding 
messages (bit strings) to the short term memory (called the 
message list). At any instant the state of the message list 
determines which classifiers are eligible to write information to the 
message list at the next time step. 


Each classifier, or production rule, consists of a condition 
part and an action part. The action part specifies exactly one 
action, while the condition part may contain many conditions (pre- 
conditions of activation). Rules with more than one condition are 
referred to as “multiple-condition classifiers.” A multiple-condition
classifier must have each of its pre-conditions matched in a single
time step for it to be activated (although it is not necessary that
the same message match each condition). The conditions and
actions are fixed-length strings over the alphabet {1, 0, #}, where #
denotes “don’t care” and 1 and 0 are literals. The determination
of whether or not a specific message matches a condition is a
logical bit comparison on the defined (1 or 0) bits. If a “not”
condition is used, the condition is fulfilled just in the case that no
message on the message list matches it. The #’s in the condition
part designate “don’t care” positions in the sense that they match
either 1 or 0. The action part of the classifier determines the
message to be posted to the message list: if the i-th symbol, A_i, of
an action is 0 or 1, the i-th symbol of the posted message will be
A_i. Otherwise, A_i is #, and the i-th message bit will equal the bit
matched by the i-th bit of the distinguished condition.^a
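
The matching and posting rules just described can be written down in a few lines. The following is a minimal sketch in Python, assuming that conditions, actions, and messages are equal-length strings; the function names matches and post are chosen for illustration only.

```python
def matches(condition, message):
    """True if every defined (0 or 1) position of the condition
    equals the corresponding message bit; # matches either bit."""
    return all(c == "#" or c == m for c, m in zip(condition, message))


def post(action, matched_message):
    """Build the posted message: a 0 or 1 in the action is copied
    literally, and a # passes through the bit of the message that
    matched the distinguished condition."""
    return "".join(m if a == "#" else a
                   for a, m in zip(action, matched_message))
```

For instance, the condition #00# matches the message 0000 but not 1101, and an action of the form ##1# applied to the matched message 1101 posts 1111.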


Each classifier may be regarded (functionally) as a separate 
processor that takes messages as input and produces messages as 
output. The configuration of the individual classifier determines 
which messages are accepted as input and how accepted messages 
are transformed into output messages. A classifier is activated at 
a given time step if its pre-conditions are satisfied by at least one 
message which appeared on the message list at the previous time 
step. The message list is completely rewritten once per time step 
so that each message has a duration of exactly one time step. 
Thus, the primary action of the system is a loop in which all of the 
classifiers access the current message list, each determining if its 
pre-conditions have been met, and if so, posting its own output 
message(s) at the next time step. This process continues until the 
system has iterated a fixed number of times or, in some systems, 
until the message list remains unchanged (quiescent) for two 
successive time steps. All external communication (input and 
output) is via the message list. As a result, all internal control 
information and external communication reside in one data 
structure. 


As a simple example, consider the following four bit (n = 4) 
classifier system:


#00# => 1101 

#101 

###1 => ##1#

~1111 => 1111.


This classifier system has three classifiers. The second 
classifier illustrates multiple-conditions, and the third contains a 
negative condition. If an initial message, “0000,” is placed on the
message list at time T0, the pattern of activity shown in Figure 1
will be observed on the message list: 


^a For multiple-condition classifiers, this interpretation of an action is
ambiguous since it is not clear what it means to simultaneously perform “pass
through” on more than one condition. The ambiguity is resolved by
distinguishing one condition to be used for pass through. By convention, this
will always be the first condition. Another ambiguity arises if more than one
message matches the distinguished condition in one time step. Again by
convention, in my system I process all the messages that match this condition.
The example illustrates this procedure.
Time Step    Message List    Activating Classifier
T0:          0000            external
T1:          1101            first
             1111            third
T2:          1111            second
T3:          <empty>
T4:          1111            third


Figure 1: Example Classifier System Behavior 


The final two message lists (<empty> and 1111) will 
continue alternating until the system is turned off. At T1, one 
message (1101) matches the first (distinguished) condition and both 
messages match the second condition. Pass through is performed 
on the first condition, producing one output message for time T2. 


If the conditions had been reversed (###1 distinguished), the
message list at time T2 would have contained two identical
messages (1111).
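
The behavior in Figure 1 can be reproduced with a short simulation. This is a sketch under stated assumptions rather than the simulator used in this work: a classifier is a (conditions, action) pair, a leading ~ marks a negative condition, the action of the second classifier is taken to be ##1# (an assumption consistent with the behavior described in the text), and a classifier whose first condition is negative simply posts its action. All names are hypothetical.

```python
def matches(condition, message):
    return all(c == "#" or c == m for c, m in zip(condition, message))


def step(classifiers, messages):
    """One time step: every classifier reads the current message list,
    and a completely new message list is produced."""
    new_messages = []
    for conditions, action in classifiers:
        inputs = []
        satisfied = True
        for k, cond in enumerate(conditions):
            if cond.startswith("~"):
                # Negative condition: fulfilled when no message matches.
                hits = []
                ok = not any(matches(cond[1:], m) for m in messages)
            else:
                hits = [m for m in messages if matches(cond, m)]
                ok = bool(hits)
            if not ok:
                satisfied = False
                break
            if k == 0:
                inputs = hits       # the first condition is distinguished
        if not satisfied:
            continue
        if not inputs:              # negative first condition (assumption)
            inputs = [action]
        for m in inputs:            # process every matching message
            new_messages.append(
                "".join(b if a == "#" else a for a, b in zip(action, m)))
    return new_messages


def run(classifiers, initial, steps):
    history = [list(initial)]
    for _ in range(steps):
        history.append(step(classifiers, history[-1]))
    return history


CLASSIFIERS = [
    (["#00#"], "1101"),
    (["#101", "###1"], "##1#"),
    (["~1111"], "1111"),
]
```

Calling run(CLASSIFIERS, ["0000"], 4) yields the message lists for T0 through T4 in Figure 1, and reversing the two conditions of the second classifier produces two copies of 1111 at T2.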


One apparent drawback of the Classifier System is that each 
classifier reads (examines in its entirety) a potentially large 
message list on which most of the messages may not be relevant. 
Having each classifier read the entire message list introduces a 
time consuming search. However, it is possible to arrange the 
system in such a way that the messages are routed directly to the 
classifiers that they will activate. Because the format for 
expressing the condition parts of a classifier is so constrained, it is 
possible to sort the conditions of any given list of classifiers so that
messages are routed efficiently. If this were done, each classifier 
would only have to read the relevant messages. Messages that 
were relevant to many classifiers would still be effectively global, 
but messages that were only relevant to one classifier would only 
be read by that classifier. This data-flow approach would not 
change the overall behavior of the system although it would affect
the system’s efficiency. Since in most applications, the size of the 
message list remains small with respect to the total number of 
classifiers, the actual speedups might not be significant. 


In large scale parallel systems such as the Classifier System, 
the issue of design is central. Design issues arise in two ways for 
the Classifier System: in deciding which external classifiers are to 
be generated, and in deciding which external messages are to be 
placed on the message list and when. As the number of classifiers 
increases, it quickly becomes impossible to do this by hand. Two 
automatic approaches have been explored: “learning” and
“compiling.” The process of compiling can be viewed as mapping
high-level structures onto lower-level operations (“top down”).
Likewise, some kinds of learning (for example, genetic algorithms)
can be viewed as the gradual emergence of higher-level structures
from a random assortment of low-level processes; systems using
these kinds of learning organize themselves from the “bottom up.”


Learning algorithms have been used with Classifier Systems 
by Holland [18], Smith [20], [21], Booker [1], Wilson [22], and 
Goldberg [11]. Each of these systems has been designed for a 
specific purpose and relies on adaptive algorithms to control 
system behavior. In each case, the researcher was able to 
demonstrate learning for a complex task: Holland’s system learned 
control routines for operating in a two-dimensional environment, 
Smith’s system learned poker-playing strategies, Booker’s and 
Wilson’s systems simulated a hypothetical organism that learned 
to locate resources and avoid noxious stimuli in an uncertain 
environment, and Goldberg’s system learned to control gas pipeline 
operations. 

Access to the message list can be limited by choosing an 
upper bound for the number of active messages at any one time. 
In current systems this is typically a small number (for example, 
thirty-two). The classifiers that are potentially active (those 
whose conditions are matched by messages on the current message 
list) then bid to put their messages on the list, and those with the 
highest bids are allowed to do so. The bid of a classifier is 
dependent on at least two components: (1) the specificity of the 
classifier’s conditions and (2) the strength of the classifier.^b There
are two principal ways in which learning is used to control a
Classifier System whose message list is of limited size: (1) by
controlling write access to the message list and (2) by controlling
which classifiers are in the data base of rules. A different
algorithm is used for each of these functions.


The first, the “bucket brigade” [16], adjusts the strengths of
the classifiers over time, rewarding those classifiers that have
contributed to good solutions and punishing those that do not
prove useful. Classifier strength is increased when the system
produces a “good” external response. Thus, reward is ultimately
dependent on the system’s performance in its external 
environment. 
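
The competition for write access can be sketched as follows. The text states only that a bid depends on specificity and strength; taking the bid to be their product, measuring specificity as the fraction of non-# positions, and representing a classifier as a dictionary with "conditions" and "strength" fields are all assumptions made for this illustration.

```python
def specificity(conditions):
    """Fraction of condition positions that are defined (0 or 1)."""
    symbols = "".join(c.lstrip("~") for c in conditions)
    return sum(ch != "#" for ch in symbols) / len(symbols)


def bid(classifier):
    # Assumed form: the bid grows with both specificity and strength.
    return specificity(classifier["conditions"]) * classifier["strength"]


def select_winners(candidates, limit):
    """Of the classifiers whose conditions are matched, keep the
    `limit` highest bidders; only these post messages."""
    return sorted(candidates, key=bid, reverse=True)[:limit]
```

With a limit of two, a weak but fully specified classifier can lose to a strong general one, which is how strength adjustments made by the bucket brigade come to control access to the message list.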


The choice of which classifiers are in the data base is
controlled by the Genetic Algorithm [15]. The algorithm is used
periodically throughout the operation of the Classifier System to
evaluate which classifiers are doing well (contributing to useful
solutions) and which ones are not doing well. The evaluation is
based on the current strengths of each classifier. Based on this
evaluation, weak classifiers are eliminated from the data base,
strong classifiers are retained, and new classifiers are generated by
applying “genetic operators” to previously successful classifiers in
the hopes of generating even more successful recombinations.
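
One generation of this process might look like the sketch below. The particular operators (one-point crossover and per-symbol mutation), the truncation-style selection, and the rule that children start with the mean strength of the survivors are assumptions chosen for illustration; the text does not specify them. A classifier is represented here as a (rule string, strength) pair, where the rule string is the condition and action written end to end.

```python
import random

ALPHABET = "01#"


def crossover(parent_a, parent_b, point):
    """One-point crossover of two equal-length rule strings."""
    return parent_a[:point] + parent_b[point:]


def mutate(rule, rate, rng):
    """With probability `rate`, replace a symbol by a random one."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else s
                   for s in rule)


def next_generation(population, keep, rng, rate=0.05):
    """Retain the `keep` strongest classifiers (keep >= 2) and replace
    the rest with offspring of the survivors."""
    ranked = sorted(population, key=lambda p: p[1], reverse=True)
    survivors = ranked[:keep]
    mean_strength = sum(s for _, s in survivors) / len(survivors)
    children = []
    while len(survivors) + len(children) < len(population):
        (a, _), (b, _) = rng.sample(survivors, 2)
        point = rng.randrange(1, len(a))
        child = mutate(crossover(a, b, point), rate, rng)
        children.append((child, mean_strength))
    return survivors + children
```

The size of the rule base stays constant from one generation to the next, so the weak classifiers are the ones displaced by the new recombinations.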


Methodology 


In order to study the relation between the bit-level
operations of the Classifier System and higher-level symbolic 
operations, I selected one knowledge representation language, KL- 
ONE [3], and showed how both its data structures and accessing 
algorithms can be implemented using the Classifier System. 


^b Recently, a third component has been added. This is called “support,”
and it corresponds roughly to the number of previously active classifiers that
think the bidding classifier should be active now.
A knowledge representation system consists of data
structures (facts known to the system and the relations between
the facts) and interpretive procedures used to access them in an
orderly fashion. KL-ONE is a knowledge representation system in
which information is represented as a directed graph (called a
semantic network).


The implementation takes the form of a compiler, mapping 
“high-level” semantic network definitions onto the Classifier
System. In this context, the Classifier System is properly viewed 
either as a lower-level target language or as a specification for an 
abstract parallel machine. An external command processor runs 
the Classifier System, providing input (and reading output) from 
the “classifier program.” The parallel algorithms are formulated
as a sequence of queries to the Classifier System representation of 
a KL-ONE network. The queries are initiated by placing a set of 
messages on the message list, allowing the system to iterate for a 
fixed number of cycles, and then reading the new set of messages 
from the final message list. Figure 2 illustrates the organization of 
the implemented system. 


[Diagram labels: KL-ONE NETWORK; CLASSIFIER SYSTEM (Classifier Set,
Message List); COMMAND PROCESSOR. Example query: (Subsumes? A B), with
steps 1. Inherit Properties of A, 2. Inherit Properties of B,
3. Compare (1) and (2). Example answer: Yes.]


Figure 2: Classifier System Implementation of KL-ONE 


KL-ONE is one of the few knowledge representation systems 
for which both the data structures and the operations are well- 
defined. The central operation in KL-ONE is that of 
“classification.” In its most general formulation, classification is
the problem of how to relate new information to an existing 
knowledge base. In network-based systems, this becomes the 
problem of deciding which links to add between new and old nodes 
when incorporating new structures into the network. Of the 
various knowledge representation paradigms in use today, the KL-
ONE family has focused on the issue of classification most
precisely. 


KL-ONE organizes descriptive terms into a multi-level 
structure which allows properties of a general concept, such as 
“mammal,” to be inherited by more specific concepts, such as
“zebra.” This allows the system to store properties that pertain to
all mammals (such as “warm-blooded”) in one place but to retain
the capability of associating those properties with all concepts that 
are more specific than mammal (such as zebra). In KL-ONE, the 
multi-level structure is easily represented as a graph, where the 
nodes of the graph correspond to concepts and properties, and the 
links correspond to relations between nodes. 


In KL-ONE, classification is the process of deciding where a 
new term (a subgraph) should be located in an existing network. 
The term may be a single concept, or more likely, a complex 
description built out of many concepts. The classification 
procedure can be expressed as the procedure of deciding which 
concepts are “above” (more general than) and “below” (more
specific than) an incoming concept. If a concept A is more general
than another concept B (based on their definitions), then A is said to
subsume B. The procedure for deciding whether one concept 
subsumes another is computationally expensive; the procedure has 
been shown to be NP-Complete for a related knowledge 
representation formalism [2]. 
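
The three steps of a subsumption query shown in Figure 2 (inherit the properties of A, inherit the properties of B, compare) can be illustrated on a toy taxonomy. This sketch is a deliberate simplification: it treats a concept as nothing more than a set of inherited property names, which is far weaker than KL-ONE's structured definitions, and the dictionaries parents and properties and both function names are assumptions of the sketch.

```python
def inherited_properties(concept, parents, properties):
    """Collect the properties of a concept and of everything above it."""
    seen, stack, result = set(), [concept], set()
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        result |= properties.get(c, set())
        stack.extend(parents.get(c, ()))
    return result


def subsumes(a, b, parents, properties):
    """In this simplified model, A subsumes B when every property
    carried by A is also carried by B."""
    props_a = inherited_properties(a, parents, properties)   # step 1
    props_b = inherited_properties(b, parents, properties)   # step 2
    return props_a <= props_b                                # step 3


PARENTS = {"zebra": ["mammal"], "mammal": ["animal"]}
PROPERTIES = {
    "animal": {"alive"},
    "mammal": {"warm-blooded"},
    "zebra": {"striped"},
}
```

Here subsumes("mammal", "zebra", PARENTS, PROPERTIES) holds and the reverse query does not, since zebra carries the additional property "striped".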


Results 


The development of the system was divided into several 
phases: (1) implementing the Classifier System in software, (2) 
designing and implementing a program that translates KL-ONE 
networks into production rules for the Classifier System, (3) 
developing the parallel algorithms to determine subsumption, (4) 
validating the algorithms, and (5) analyzing the complexity of the 
system. The Classifier System simulation program, the compiler, 
and the command processor are all implemented in Franz Lisp, and 
run on a Vax 11-780 using the Unix Operating System. 


The algorithms that have been developed can be divided into 
two classes: high-level (with respect to the Classifier System) 
constructs that should be of general utility to other users of the 
Classifier System, and special-purpose algorithms for classification 
in KL-ONE. The first class provides a set of general-purpose 
operations that proved useful for developing KL-ONE-specific
algorithms. The general-purpose operations include boolean 
operations on sets of messages (intersection, complementation, 
etc.), stack operations (push and pop), and some numerical 
operations (find the maximum or minimum of a set of numbers, 
compare two numbers, and addition). For KL-ONE, algorithms 
for two major operations were developed: (1) a decision procedure 
for subsumption, and (2) a search procedure that finds the set of 
Most Specific Subsumers^c (MSS). The sequential version of these
algorithms is described in [17] and the parallel version in [9].


^c A is a Most Specific Subsumer of B iff
A SUBSUMES B, and 
there does not exist a C, such that 
A SUBSUMES C AND C SUBSUMES B. 
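
The definition above translates directly into a filter over the candidate concepts. The sketch below assumes that subsumption is supplied as a predicate subsumes(a, b), that C is understood to be distinct from both A and B, and that no two distinct concepts subsume each other; the function name is hypothetical.

```python
def most_specific_subsumers(b, concepts, subsumes):
    """Return the concepts that subsume b and have no other subsumer
    of b lying strictly between themselves and b."""
    above = [a for a in concepts if a != b and subsumes(a, b)]
    return [a for a in above
            if not any(c != a and subsumes(a, c) for c in above)]
```

For a chain in which thing subsumes animal, animal subsumes mammal, and mammal subsumes zebra, the only Most Specific Subsumer of zebra is mammal.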
The successful implementation of KL-ONE using the
Classifier System demonstrates that the Classifier System is 
capable of representing complex data structures and that complex 
computations can be performed within the Classifier System 
formalism. However, this alone does not demonstrate that the 
implementation is “reasonable.” The reasonableness of the
implementation has been evaluated using formal complexity
measures.


Four measures of complexity were considered: length of 
computation (in simulated time steps), the size of each processor, 
the number of processors, and the amount of inter-processor 
communication. For the Classifier System, these four measures are 
interpreted as the number of time steps required to complete a 
computation, the number and length of conditions for each 
classifier, the number of classifiers used to represent a KL-ONE 
network, and the maximum size of the global message list. A 
reasonable implementation is one in which data are not stored 
redundantly (the number of classifiers and the number and length 
of conditions grow linearly with the size of the network), the size 
of the global message list remains small with respect to the total 
number of classifiers (inter-processor communication should not 
increase disproportionately to the size of the network), and the 
parallel algorithms are computationally more efficient than their 
sequential counterparts (the speed-ups are expected to be related 
to the fan-out of the underlying structure). 


A detailed analysis of the parallel algorithms using these four 
measures of complexity appears in [9]; in summary, the analysis 
shows that the implementation meets the above criteria. The 
number of classifiers required to represent each KL-ONE network 
is directly proportional to the size of the network definition. The 
size of each classifier grows as the log of the number of nodes in 
the network. The maximum size of the message list is related to 
the topology of the KL-ONE network being processed; for 
networks in which the fan-out is greater than the fan-in (the
expected case), the maximum size of the message list grows with 
the depth of the network. The Subsumption algorithm and the 
search for Most Specific Subsumers each run in time proportional 
to the depth of the network.* Because the search for Most Specific
Subsumers invokes the Subsumption test, the overall complexity
for the algorithms taken together is proportional to (depth)².


* The implemented version of the search for Most Specific Subsumers may
run somewhat slower than this for some networks. However, it would be 
straightforward to add one field (a small number of bits) to each classifier and 
guarantee that the algorithm always ran in time proportional to the depth of 
the network (with a larger message list size). Since the performance of the
algorithms is highly dependent on the conformation of the networks they
process and there is very little data available about the conformation of
typical networks, it was not clear at the time of implementation whether this 
was a worthwhile optimization. 


Discussion 


In the current implementation, the algorithms rely to some 
extent on the host language (Lisp) in which the command 
processor is implemented. That is, some of the algorithms take 
advantage of the control structures and data management facilities 
of Lisp beyond using it to formulate queries to the Classifier 
System. This is an important issue as it would be easy to hide 
significant amounts of complexity in the processing of the host 
language and make it appear that the parallelism was 
accomplishing more than it really was. On the other hand, it 
seems unreasonable to force intrinsically sequential operations into 
the parallel formulation when there is an adequate sequential 
language immediately available. In the extreme, it would be 
possible to generate a set of classifiers that could answer the 
subsumption question with one query (with no interference from 
the host language),* but the number and length of classifiers and
the time to translate KL-ONE networks into the classifier 
representation would be unreasonably large. 


The parts of the algorithm that have been implemented in 
Lisp are either natural components of the host language (such as 
invoking the Classifier System) or they are operations that could 
be made parallel by adding a constant number of bits (to be used 
as tags) to each classifier. In particular, the embedding language
(Lisp) is used in four ways: (1) to translate symbolic queries such 
as “(Subsumes? A B)” into binary messages and translate them
back after the query, (2) to invoke the Classifier System, (3) to 
store the results of queries (for example, a list of messages) that 
will be needed again later, and (4) to control the highest-level 
sequences of queries through the use of conditional and iterative 
constructs. The first two of these are appropriate for the 
command processing role that is played by the embedding 
programs. The last two are only a matter of convenience for
the current simulation and could be implemented in the Classifier 
System. 


The implementation revealed a set of techniques that are 
useful for controlling the Classifier System. These are tagging, the 
use of negative conditions as triggers, computations that process
one bit position at a time, and synchronization. 


Tagging, in which one field of the classifier is used as a 
selector, is used to maintain groups of messages on the message list 
that are in distinct states. Tagging allows the use of specific 
operators that are defined for particular states. This specificity 
also allows additional “layers” of parallelism to be added by
processing more than one operation simultaneously. In these 
situations, the messages for each operation are kept distinct on the 
global message list by the unique values of their tags. 


Negative conditions activate and deactivate various 
subsystems of the Classifier System. Negative conditions are used 
to terminate computations and to explicitly change the state of a 
group of messages when a “trigger” message is added to the list.
When the trigger message appears, it violates the negative 
condition and that classifier is effectively turned off. 
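
The trigger mechanism can be sketched in a few lines. The sketch below is a simplified model, not the simulator described in this paper: it assumes fixed-length messages over {0, 1}, conditions over {0, 1, #} with # as a wildcard, and classifiers that post a fixed action message (no pass-through of message bits, no bidding). The tuple layout and the example bit patterns are hypothetical.

```python
def matches(condition, message):
    """A condition matches when every non-'#' position equals the message bit."""
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message))


def step(classifiers, messages):
    """One time step: return the messages posted by the active classifiers.

    Each classifier is (positive_conditions, negative_conditions, action).
    It is active when every positive condition matches some message and
    no message on the list matches any negative condition.
    """
    posted = set()
    for positive, negative, action in classifiers:
        positives_ok = all(any(matches(p, m) for m in messages) for p in positive)
        negatives_ok = not any(matches(n, m) for n in negative for m in messages)
        if positives_ok and negatives_ok:
            posted.add(action)
    return posted


# A classifier that keeps re-posting message 0100 until a message tagged
# "11" (the trigger) appears on the list.
KEEP_ALIVE = [(["0100"], ["11##"], "0100")]
```

With only `0100` on the list the classifier fires again on the next step; once the trigger `1100` is added, the negative condition is violated and nothing is posted.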


* It is straightforward to show that for any finite function, a Classifier
System can be constructed that computes it. 
Computations that proceed one bit at a time illustrate two
techniques: (1) using control messages to sequence the processing of
a computation, and (2) collecting and combining information
from multiple independent messages into one message. Sequencing
will always be useful when a computation is distributed over
multiple time steps instead of being performed in one step.
Collection is important because in the Classifier System it is easy
to “parallelize” information from one message into many messages
that can be operated on independently. This is most easily
accomplished by having many classifiers that match the same
message and operate on various fields within the message. The 
division of one message into its components takes one time step. 
However, the recombination of the new components back into one 
message (for example, an answer) is more difficult. The collection 
process must either be conducted serially (one bit position at a 
time) or one classifier must be allocated for each possible message 
combination (a potentially huge number). Intermediate solutions 
are also possible. 


Synchronization techniques allow one operation to be delayed 
until another has finished. Synchronization can be achieved by 
combining tagging with negative conditions. 


Conclusions 


The implementation of KL-ONE demonstrates that the 
Classifier System formalism is a powerful and flexible
representation system. This result combined with previous work 
that demonstrates how the Classifier System can learn suggests 
that the Classifier System is a promising low-level organization 
from which intelligent systems can be constructed. 


In particular, the implementation is one in which reasonable 
computation time improvements can be obtained without an 
unreasonable increase in the number and size of processing units or 
in the degree of inter-processor communication. The approach 
that was demonstrated for KL-ONE could easily be extended to 
other similar inheritance networks. 


This particular implementation does not include those 
features that are specific to the use of adaptive algorithms, such as 
bidding, support, etc. However, previous experiments that 
combined learning algorithms with the Classifier System worked 
with rather small sets of classifiers (usually fewer than two hundred
and often fewer than one hundred). Further, it is very difficult to
“decompile” these systems after significant learning has occurred
to determine what role is played by individual classifiers. Using a 
compilation technique allows the construction of much larger sets 
of classifiers and it allows complex behavior (such as 
synchronization) to be identified and studied. One promising 
direction for further research is to try combining the two 
approaches to see if the kind of complex behavior demonstrated by 
the KL-ONE implementation can be learned by the bucket brigade 
and genetic algorithms. 
ABSTRACT


This paper describes the implementation of a 
continuous speech recognition algorithm on the BBN
Butterfly™ Parallel Processor. The implementation
exploited the parallelism inherent in the recognition
algorithm to achieve good performance, as indicated by
execution time and processor utilization. The
implementation process was simplified by a programming 
methodology that complements the Butterfly 
architecture. The paper describes the architecture and 
methodology used and explains the speech recognition 
algorithm, detailing the computationally demanding area 
critical to an efficient parallel realization. The steps 
taken to first develop and then refine the parallel 
implementation are discussed, and the appropriateness 
of the architecture and programming methodology for 
such speech recognition applications is evaluated.


INTRODUCTION 


This paper describes research to investigate the 
uses of parallel computation for continuous speech word 
recognition. Our goal in this work is to determine the 
extent to which continuous speech recognition
algorithms can make use of parallel processing to 
achieve real time speeds. Our approach has been to 
develop parallel versions of an existing recognition 
algorithm on BBN's Butterfly Parallel Processor. 


BUTTERFLY 


The Butterfly Parallel Processor [1] is composed of 
multiple (up to 256) identical nodes interconnected by a 
high-performance switch. Each node contains a 
processor and memory. The switch allows each
processor to access the memory on all other nodes.
Collectively, these memories form the shared memory of
the machine, a single address space accessible to every
processor. All interprocessor communication is
performed using shared memory. Typical memory
referencing instructions accessing local memory take
about 2 microseconds to complete, whereas those
accessing remote memory take about 5 or 6
microseconds.
The Butterfly Parallel Processor is a multiple
instruction multiple data stream (MIMD) machine in 
which each processor node executes its own sequence 
of instructions, referencing data as specified by the 
instructions. Processor Nodes are tightly coupled by 
the Butterfly switch. Tight coupling permits efficient 
interprocessor communication and allows each processor 
to access all system memory efficiently. The Butterfly 
Parallel Processor is expandable to 256 Processor 
Nodes. Each Processor Node contains a Motorola 
MC68000 microprocessor, an optional floating-point
coprocessor, from 1 to 4 MBytes of main memory, a
coprocessor called the Processor Node Controller, memory
management hardware, an I/O bus, and an interface to
the Butterfly switch. The particular machine that was 
used for development in this project was a 16 processor 
machine with 1 MByte of memory on each Processor
Node. It did not have hardware support for floating 
point arithmetic. 


WORD RECOGNITION ALGORITHM 


The problem of speech recognition requires that 
we map an analog signal onto a sequence of words that 
comprise sentences. However, the identification of 
particular speech sounds (phonemes) from this signal is 
made difficult due to the variability that occurs in 
speech production. This variability is due to the effect 
of neighboring speech sounds (coarticulation), and a 
variety of other effects that all combine to make the 
same speech unit appear differently each time it occurs. 


Our recognition algorithm is based on the explicit 
modeling of variability in speech through the use of 
probabilistic hidden Markov models (HMMs) of speech
sounds (phonemes) in various phonetic contexts [2].
Each phoneme model consists of three states, which are
associated with acoustic events corresponding roughly
to the beginning, middle and end of the phoneme.
Associated with each pair of states is a
transition probability a(j|i), which is the probability of
going to state j given that the process is in state i.
Unlike a Markov chain, in which each state has
associated with it a single output, each HMM state has
an output probability density function (pdf) p(x|i) that
gives the probability of each possible output symbol x,
given that the process is in state i.


Rather than use actual segments of the speech
signal as output symbols, we can represent the speech
signal as a sequence of spectra that occur at discrete
time intervals. Furthermore, each of these spectra can
be approximately characterized as one of a small
number of spectral types (256 in our system), which are
determined using a clustering procedure. The spectral
characterizations, each a single number, then become
the possible output symbols. Words can be modeled as
concatenations of phoneme HMMs that have been
modified to take into account the contextual effects of
the word.


One approach to understanding HMMs is to imagine 
them in a synthesis role, where they are used to 
produce spectral sequences. Starting from the initial 
state of the model, we randomly choose the next state 
according to the transition probabilities on the arcs 
leaving the initial state. Whenever the process is in an 
acoustic state, we randomly pick an output symbol 
according to the state’s output pdf. As the process 
moves from state to state in this manner, it produces a 
sequence of symbols (speech spectra) as output. 
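
The synthesis view can be sketched as a short sampling routine. This is an illustration under stated assumptions, not the recognizer's code: the model layout (a non-emitting `"start"` state, three acoustic states, a non-emitting `"end"` state), the probabilities, and the four-symbol alphabet (the system described here uses 256 spectral types) are all hypothetical.

```python
import random


def synthesize(transitions, output_pdfs, rng, start="start", end="end"):
    """Generate a symbol sequence by a random walk through the model.

    transitions[s] maps each successor of state s to its transition
    probability; output_pdfs[s] lists the probability of each output
    symbol in acoustic state s.
    """
    state, symbols = start, []
    while True:
        successors = list(transitions[state])
        weights = [transitions[state][s] for s in successors]
        state = rng.choices(successors, weights=weights)[0]
        if state == end:
            return symbols
        pdf = output_pdfs[state]
        # Each visit to an acoustic state emits one symbol drawn from its pdf.
        symbols.append(rng.choices(range(len(pdf)), weights=pdf)[0])


# Hypothetical three-state phoneme model with self-loops.
TRANSITIONS = {
    "start": {0: 1.0},
    0: {0: 0.5, 1: 0.5},
    1: {1: 0.5, 2: 0.5},
    2: {2: 0.5, "end": 0.5},
}
OUTPUT_PDFS = {
    0: [0.7, 0.1, 0.1, 0.1],
    1: [0.1, 0.7, 0.1, 0.1],
    2: [0.1, 0.1, 0.1, 0.7],
}
```

Because every path passes through all three acoustic states, each generated sequence in this toy model contains at least three symbols; the self-loops account for the variable duration of the phoneme.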


The recognition problem can be viewed as the 
inverse of the synthesis problem: given a sequence of 
input spectra and a set of models of all possible words, 
we wish to find the sequence of models that is most 
likely to have produced the spectral sequence. An 
adaptation of the Viterbi algorithm solves just this 
problem, and is the basis for our recognizer. 


The basic recognition algorithm finds the path 
through the states that is most likely to have produced 
the spectral sequence to be recognized. The algorithm 
does this by finding the best path to every state at 
every time given that the path to the previous state 
was also “best”. Each step along these paths has
associated with it a “score”, which reflects the
probability of the step given the spectral sequence and 
the transition probabilities of the model. The scores 
are accumulated along the paths, so that, at every time, 
the best path to any state has a single score. The best 
path to a particular state, S, at time t, is determined 
by considering all possible predecessor states (states 
which have transitions to S) at time t-1 and the best 
path to each of these. The score for a path to S is 
then the combination of the path score to the best 
predecessor state and the score for the step from that 
predecessor state to S. 


The central computation in the algorithm is: for
each time interval, update the scores for all states.
Figure 1 schematically illustrates the scoring procedure
for a single state in a word. In this figure, the score
for state n at time t (the α_n(t) in the lower right
corner) is being computed based on the scores for
three states computed at time t-1 (the three circles on
the left of the figure), each multiplied by the
corresponding transition probability of going to state n.
The new score for state n is just the maximum of the
entering scores multiplied by the probability of the
spectrum x at time t, at state n, p(x|n).
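
As a minimal sketch of this update, assuming plain probabilities held in Python lists (the function and argument names are illustrative, not taken from the recognizer):

```python
def update_state_score(prev_scores, transition_probs, output_prob):
    """Score for one state at time t.

    prev_scores[i] is the score of predecessor i at time t-1,
    transition_probs[i] is the probability of going from predecessor i
    to this state, and output_prob is p(x|n) for the spectrum x
    observed at time t.
    """
    best_entering = max(s * a for s, a in zip(prev_scores, transition_probs))
    return best_entering * output_prob
```

For three predecessors with scores 0.2, 0.1 and 0.05, transition probabilities 0.5, 0.3 and 0.9, and p(x|n) = 0.4, the entering scores are 0.1, 0.03 and 0.045, so the new score is 0.1 × 0.4 = 0.04.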


This scoring procedure is applied to all states in
each word for all time frames in the utterance. The
scores computed for terminal states (ends of words) are
special. They are compared and only the maximum
terminal score is saved, as are the word that produced
it and the start time for the word. The largest terminal
score for a time frame is used as the score for all
word-initial states in the next time frame. In addition,
the current state score, α_n(t), is compared against the
largest state score, α_best(t), encountered so far for the
current time frame, and replaces it, if appropriate.
This is used to derive a normalization factor (NF =
1/α_best(t)), which is used to prevent arithmetic
underflow.


When all the time frames in the utterance have 
been processed in this way, the best sequence of words 
is determined. The maximum terminal score at the end 
of the utterance specifies the last word of the
utterance. The start time of this last word is the end
time for the previous word, so the maximum terminal
score at this time indicates which word should be
selected as the second-to-last word in the theory, and so
on, back to the beginning of the utterance. 
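
The backward pass can be sketched as follows. The data layout is an assumption made for illustration: `best_terminal[t]` holds the (word, start frame) pair saved with the maximum terminal score at frame t, the start frame of a word equals the end frame of its predecessor, and frame 0 marks the beginning of the utterance.

```python
def trace_back(best_terminal, last_frame):
    """Recover the word sequence from the per-frame best terminal records."""
    words, t = [], last_frame
    while t > 0:
        word, start = best_terminal[t]
        words.append(word)
        t = start  # the previous word ended where this one started
    return list(reversed(words))


# Hypothetical records; frame 6 holds a word that is not on the best path.
BEST_TERMINAL = {4: ("list", 0), 6: ("tall", 4), 7: ("all", 4), 10: ("ships", 7)}
```

Only the records reached from the final frame are visited, so competing hypotheses such as the one at frame 6 are ignored.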


SINGLE PROCESSOR IMPLEMENTATION 


The first step toward a parallel implementation 
was to bring up the speech recognition program on a 
single processor of the Butterfly Parallel Processor. 
The existing VAX implementation depended on the file 
system to store the large amounts of data which 
included the transition probabilities and the pdfs for 
the word models. This data totaled 1.5 Mbytes. Storing 
this amount of data on the Butterfly required using 
some parallel memory management techniques to
allocate shared memory on multiple nodes. 


The VAX (and the first Butterfly implementation)
used floating-point arithmetic. Because floating-point
arithmetic is performed in software in our Butterfly
Parallel Processor, it seemed likely that the time
required for the floating-point arithmetic would hide
any overhead and task granularity problems we
encountered. We decided to investigate using
fixed-point arithmetic.


For storage efficiency, most of the data had been 
represented as indices into a table of probabilities. 
This table contained 256 entries, ranging between zero 
and unity, quantized logarithmically. Therefore, the 
indices themselves were scaled log probabilities. 
Multiplication of probabilities in the original program 
could, of course, easily be converted to addition of 
corresponding log probabilities. The Viterbi algorithm 
finds the maximum of the path probabilities to a node, 
which is equivalent in both log and linear domains. We 
converted the program to use the indices directly as 
log probabilities, and obtained the same results as 
before, but the execution time remained disappointingly 
long. Measurement tools allowed us to discover that
most of the time was being squandered in two ways.
First, using double indexing into two-dimensional arrays
was quite slow because the 68000 compiler available at
the time used double precision multiplies for this kind
of address calculation. We converted to register
pointers to avoid this situation. Second, much time
was being spent in the call to the subroutine that
performed the word scoring. We simplified the calling
sequence by defining many of the arguments globally.
These modifications caused the execution time to drop
to about two minutes for a 3.5 second utterance (about
the same speed as our optimized VAX program). This
execution time seemed low enough for reasonable
parallelization measurements.
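
The equivalence of the linear and log domains for the maximization can be checked with a small sketch. The scale factor and the rounding to integers are assumptions chosen for illustration; they stand in for the 256-entry logarithmically quantized table, whose actual contents are not given here.

```python
import math

SCALE = 16  # hypothetical fixed-point scale for log probabilities


def to_fixed_log(p):
    """Quantize a probability to an integer scaled log probability."""
    return round(SCALE * math.log(p))


def best_predecessor_linear(prev, trans):
    """Index of the predecessor with the largest product of probabilities."""
    products = [p * a for p, a in zip(prev, trans)]
    return max(range(len(products)), key=products.__getitem__)


def best_predecessor_fixed_log(prev, trans):
    """Same choice using integer additions only."""
    sums = [to_fixed_log(p) + to_fixed_log(a) for p, a in zip(prev, trans)]
    return max(range(len(sums)), key=sums.__getitem__)
```

Because the logarithm is monotonic, both functions pick the same predecessor whenever the candidates are separated by more than the quantization error; near ties can be resolved differently after rounding.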


UNIFORM SYSTEM 


We used the Uniform System approach to obtain a
parallel implementation. The Uniform System is a
programming methodology supported by a library of
high-level functions [3]. It exploits the uniform
environment provided by the architecture of the
Butterfly Parallel Processor to simplify the problem of
load balancing for the memory as well as for the
processors. Memory accesses must be organized to
avoid memory contention. The load on the processors
is balanced when all processors are equally busy and
no processor is waiting for another to finish.


Balancing the load on memory is accomplished by
spreading out the data evenly across the different
physical memories in the machine, under the assumption
that this will also spread the accesses fairly evenly,
reducing the inefficiency that results when many
processors attempt to access the same memory
simultaneously. Functions for allocating storage in the
shared memory are included in the Uniform System, as
are functions that perform block transfers between
shared memory and local memory.


The philosophy behind the Uniform System 
processor management methodology views the processors 
as a uniform pool of workers, all of which know how to 
execute the same tasks. Using this methodology, the 
programmer is only required to supply code that
operates correctly when multiple processors execute it
at the same time. The processor management is
accomplished by first copying the program to each of
the processors. In most cases, the program will begin
with a section of serial code that is executed on a
single processor. To begin executing a section of code
on multiple processors (a FOR loop, for example),
the programmer can use a “task generator” to replace
the FOR statement and a “worker routine” to replace
the body of the FOR loop. The task generator makes a 
task descriptor available to all processors, which use it, 
as they become free, to generate calls to the worker 
routine. Processors, using this descriptor, execute the 
routine repeatedly for different index values, until the 
index has run its range. When all processors have 
finished, the program, once again serial, continues 
executing on a single processor. 
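
The task generator and worker routine pattern can be imitated with threads. The sketch below is a Python analogue written for illustration only; `generate_on_index` is a hypothetical name and is not the Uniform System library interface. Threads stand in for processors, and a lock-protected iterator stands in for the shared task descriptor.

```python
import threading


def generate_on_index(worker, index_range, num_processors):
    """Run worker(i) exactly once for every i in range(index_range)."""
    lock = threading.Lock()
    indices = iter(range(index_range))

    def processor():
        # Each "processor" claims the next unclaimed index as it becomes free.
        while True:
            with lock:
                i = next(indices, None)
            if i is None:
                return
            worker(i)

    threads = [threading.Thread(target=processor) for _ in range(num_processors)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()  # serial execution resumes once every worker has finished


# Replaces: FOR i in 0..7: squares[i] = i * i
squares = [0] * 8


def square_worker(i):
    squares[i] = i * i


generate_on_index(square_worker, len(squares), num_processors=4)
```

Because indices are claimed dynamically rather than assigned in advance, a processor that finishes a short task simply picks up the next one, which is how this style of generator balances the load.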


We decided to use the Uniform System for several 
reasons. First, the speech recognition algorithm is 
essentially a single task, executed many times. This fits 
the Uniform System paradigm very well. Second, being
novices, we were attracted by the simplicity of use of
the Uniform System. Third, functions in the Uniform
System allow automatic timing of the same program run
on various numbers of processors, and this provided an
easy way of evaluating the performance of the parallel
implementation. Finally, because the same program can
be run on one or many processors, we believed that
debugging the parallel implementation would be
simplified.


PARALLEL IMPLEMENTATION AND RESULTS 


In our system, both the training and spectral 
analysis tasks are performed “off-line” and the results 
are stored. The recognition system begins by reading 
in the word models for a speaker. Then, for each
utterance, the spectral parameters for the utterance 
are read and stored. At this point in the program, the 
actual recognition search task begins. This was the 
only portion considered for parallel implementation. 
Our execution time measurements began here and 
continued until the input utterance had been
recognized, that is, a theory for the complete utterance
had been obtained. The pertinent portion of the speech 
recognition program can be abstracted as follows: 


a) FOR all frames
b)     initialize frame
c)     FOR all words
d)         initialize word
e)         FOR all states
f)             compute state score
g)             IF (new max score)
h)                 replace max score
i)         IF (new max terminal)
j)             replace max terminal
k)     determine normalization
l)     FOR all words
m)         get terminal score
n) determine theory


The first parallel version combined lines d)
through f) and parts of g) and h) into a single task and 
used the generator GenOnIndex, which includes a 
prologue task and an epilogue task in addition to the 
main task. The prologue task is executed only once by 
each processor before that processor executes the main 
task for the first time. In this version, the prologue 
included line b). Similarly, the epilogue task is 
executed once by each processor after all main tasks 
have been completed by that processor. For this 
program, the central task determined the maximum 
state score and the maximum terminal score seen by 
each processor. The epilogue task compared these 
local maxima against global maxima, replacing the global 
maxima if necessary. The remainder of the program 
(lines k-m), including the second FOR loop, was
executed sequentially, on a single processor. 
Approximately 9 seconds was spent in the sequential 
portion of the program when it was run for a 3.5 
second utterance. When run on 15 processors, this
resulted in less than 50% utilization of the processors. 
The next step was to attempt to reduce the
sequential portion of the program. We noticed that the
second FOR loop (lines l and m), which propagates the
best word’s score in the current frame to word-initial
states for the next frame, could be incorporated into
the first FOR loop, effectively changing the program to:


a) FOR all frames
b)     initialize frame
c)     FOR all words
d)         get previous frame terminal score
e)         initialize word
f)         FOR all states
g)             compute state score
h)             IF (new max score)
i)                 replace max score
j)         IF (new max terminal)
k)             replace terminal
l)     determine normalization
m) determine utterance


This revision substantially reduced the time spent 
executing serial code. For 15 processors, the execution 
time dropped from 15 seconds to 11 seconds for a 3.5 
second utterance, and the effective number of
processors rose from 6.9 to 11.2, or approximately 75% 
utilization. The processor utilization is shown in Figure 
2.
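
The utilization figures quoted above follow from dividing the effective number of processors by the number of processors actually used. A short check, with an illustrative function name:

```python
def utilization(effective_processors, processors):
    """Fraction of the available processing power doing useful work."""
    return effective_processors / processors


first_version = utilization(6.9, 15)     # under 50%, as reported for the first version
revised_version = utilization(11.2, 15)  # approximately 75%
```

The first value is 0.46 and the second rounds to 0.75, matching the reported improvement from under 50% to about 75% on 15 processors.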


CONCLUSIONS AND FUTURE RESEARCH 


Our work on this project has shown that the 
Butterfly architecture is suitable for continuous speech 
word recognition. The decomposition of the algorithm 
into tasks that match one word to one frame of input 
speech provided a granularity that made efficient use of 
the processors. The memory and processor management 
functions of the Uniform System made parallelization of 
the algorithm surprisingly easy and rapid. 


In the near future, we hope to extend the current 
research to include a grammar and larger vocabulary 
tasks. The grammar will require search of a much 
larger space that is too large to search exhaustively. 
The search will have to be pruned, thus presenting a 
more challenging parallel implementation task. 
Abstract


In highly parallel message routing networks, it is 
sometimes desirable to concentrate relatively few mes- 
sages on many wires onto fewer wires. We have de- 
signed a VLSI chip for this purpose which is capa- 
ble of concentrating bit-serial messages quickly. This 
hyperconcentrator switch has a highly regular layout 
using ratioed nMOS and takes advantage of the rela- 
tively fast performance of large fan-in NOR gates in 
this technology. A signal incurs exactly 2 log₂ n gate
delays through the switch, where n is the number of
inputs to the circuit. The architecture generalizes to
domino CMOS as well.


1 Introduction 


The problem of concentrating relatively few commu- 
nications on many input lines onto a lesser num- 
ber of output lines must be solved in communication 
networks of all kinds. In many parallel computing 
systems, communications are packaged into messages 
which are routed among the processors. This paper
presents a design for a VLSI implementation of a fast 
concentrator switch suitable for routing bit-serial mes- 
sages in a parallel supercomputer. 


An n-by-m concentrator switch has n input
wires X1, X2, ..., Xn and m < n output wires
Y1, Y2, ..., Ym. The switch can establish m disjoint
electrical paths from any set of m input wires to the
m output wires. A concentrator switch always routes 
as many messages as possible. Specifically, whenever 
k out of the n input wires of an n-by-m concentrator 
switch carry messages, one of the following is true: 
• If k ≤ m, then an electrical path is established
from each input wire which contains a message
to an output wire.


• If k > m, then each output wire has an electrical
path established from an input wire which contains
a message.


When k > m, some messages cannot be successfully 
routed, in which case we say the switch is congested. 
Typical ways of handling unsuccessfully routed mes- 
sages in a routing network are to buffer them, to mis- 
route them, or to simply drop them and rely on a 
higher-level acknowledgment protocol to detect this 
situation and resend them. The switch design in this 
paper is compatible with any of these congestion con- 
trol methods. 

One way to create a concentrator switch is with a 
hyperconcentrator switch. An n-by-n hyperconcentrator
switch¹ has n input wires X1, X2, ..., Xn and n
output wires Y1, Y2, ..., Yn. The switch can establish
disjoint electrical paths from any set of k input
wires, for any 1 ≤ k ≤ n, to the first k output wires
Y1, Y2, ..., Yk. In other words, we route the k messages
to the first k output wires. We can make any
n-by-m concentrator switch from an n-by-n hyperconcentrator
switch by simply choosing the first m output
wires of the hyperconcentrator switch, Y1, Y2, ..., Ym,
as the m output wires of the concentrator switch.

We can use a sorting network to implement a hy- 
perconcentrator switch. The inputs to the sorting net- 
work are 1’s and 0’s, representing the presence or ab- 
sence of messages on the input wires to the switch. 
The sorting of the 1’s and 0’s, with 1’s before 0’s, 
causes the k input messages to occupy the first k out- 
puts. 
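
The routing outcome described in the last three paragraphs can be modeled in a few lines. This sketch captures only the input/output behavior, not the circuit: it assumes each input wire is represented by a message value or by `None` when the wire carries no message, and it relies on Python's stable sort, so it also preserves the relative order of messages, which the definition of a hyperconcentrator does not require.

```python
def hyperconcentrate(inputs):
    """Route the k valid messages to the first k of n outputs.

    Sorting on "is this wire empty?" places valid messages (key False)
    before empty wires (key True), mirroring a sort of 1's before 0's.
    """
    return sorted(inputs, key=lambda message: message is None)


def concentrate(inputs, m):
    """n-by-m concentrator: keep the first m hyperconcentrator outputs."""
    return hyperconcentrate(inputs)[:m]
```

With inputs `[None, "a", None, "b"]` the hyperconcentrator yields `["a", "b", None, None]`. When more than m wires carry messages, `concentrate` drops the excess, which corresponds to the congested case.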

Many sorting networks, such as Batcher’s bitonic 
sort [6], employ the technique of recursive merging. 


¹ The terminology is drawn from [11].




Figure 1: An nMOS layout of a 32-by-32 hyperconcentra- 
tor switch. The recursive nature of the switch can easily 
be seen. This implementation includes superbuffers where 
needed to provide enough drive for high fan-out signals. 


A problem of size n is divided into two problems of 
size n/2, which are recursively solved in parallel. The 
two sorted sets are then merged to produce the solu- 
tion to the original problem. The recursion requires 
⌈log n⌉ levels, and since each merge step can be performed
in O(log n) time in parallel, the total time to
sort n values is O(log² n). Sorting networks of depth
O(log n) are known [1], but they are impractical to use
as hyperconcentrator switches because of the large
associated constant.

The n-by-n hyperconcentrator switch presented in 
this paper also uses recursive merging, but by taking 
advantage of the relatively fast performance of high 
fan-in NOR gates in nMOS technology, each merge 
takes only 2 gate delays. A signal therefore incurs 
exactly 2⌈log₂ n⌉ gate delays in passing through the
switch. The switch has a simple design and a regu- 
lar layout in both ratioed nMOS and domino CMOS 
technologies. Unlike many concentrator switches in 
the literature [8,9,10], our switch sets itself up “on- 
line” when messages are presented to it. 
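
The two delay bounds quoted above follow from the same recursion with different merge costs. The derivation below is a sketch that assumes n is a power of 2 and that a problem of size 1 costs nothing; the constant c and the names T and D are introduced here for illustration.

```latex
% Recursive-merging sort, merge of size n costs c \log_2 n, T(1) = 0:
T(n) = T(n/2) + c \log_2 n
\;\Longrightarrow\;
T(n) = c \sum_{i=1}^{\log_2 n} i
     = \frac{c}{2}\,\log_2 n\,(\log_2 n + 1)
     = O(\log^2 n)

% Hyperconcentrator switch, each merge costs 2 gate delays, D(1) = 0:
D(n) = D(n/2) + 2
\;\Longrightarrow\;
D(n) = 2 \log_2 n,
\qquad \text{e.g. } D(32) = 10
```

For n that is not a power of 2 the number of levels is ⌈log₂ n⌉, which gives the 2⌈log₂ n⌉ figure stated in the text.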


The remainder of this paper is organized as follows. 
Section 2 covers some basic terminology and describes 
the message format and timing model upon which the 
switch is based. Section 3 discusses the merge box, and
Section 4 shows how to use merge boxes to implement the
hyperconcentrator switch and describes an nMOS implementation.
Section 5 covers some applications which benefit
from the switch. Finally, Section 6 contains further 
remarks about the switch. 


2 Preliminaries 


In this section, we define some basic terminology and 
notational conventions and present the message for- 
mat and timing model assumed by the hyperconcen- 
trator switch design. 

Bit and boolean values are denoted by “1” and “0”, 
or by “high” and “low”, for TRUE and FALSE respec- 
tively. 

We assume that the hyperconcentrator switch 
routes bit-serial messages. Each message is formed
by a stream of bits arriving at a wire at the rate of 
one bit per clock cycle. The first bit of each message 
that arrives at an input wire is the valid bit, indi- 
cating whether subsequent bits arriving on that wire 
form a valid message or an invalid message. The bit 
sequence following a valid bit of 1 forms a valid mes- 
sage, which we would like to be routed from an input 
wire to an output wire of the switch. From there it 
may pass through the remainder of the routing net- 
work. A valid bit of 0 indicates an invalid message, 
which does not need to be routed to an output wire. 
We assume that in an invalid message, not only is the 
valid bit 0, but so are all the remaining bits in the 
message.²
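
The all-zero rule for invalid messages can be enforced by ANDing the valid bit into every later bit. A minimal Python sketch follows; the function name `enforce_invalid_zero` and the list-of-bits representation are illustrative assumptions, not part of the switch design.

```python
def enforce_invalid_zero(message):
    """AND the valid bit (the first bit) into every subsequent bit.

    `message` is a list of 0/1 bits in arrival order. This representation
    is an assumption made for illustration only.
    """
    valid = message[0]
    return [valid] + [valid & bit for bit in message[1:]]


print(enforce_invalid_zero([1, 0, 1, 1]))  # [1, 0, 1, 1]: valid message unchanged
print(enforce_invalid_zero([0, 1, 0, 1]))  # [0, 0, 0, 0]: invalid message forced to 0's
```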

The valid bits all arrive at the input wires of the 
hyperconcentrator switch during the same clock cycle, 
which we call setup. An external control line signals 
setup. Message bits entering through input wires at 
cycles after setup follow the electrical paths in the 
switch which are established during setup. 

We shall adopt some notational conventions to ease 
the exposition in the following sections. Uppercase 
symbols denote wire names and lowercase symbols de- 
note integer values. We shall also use uppercase sym- 
bols to denote bit values on the wires they name when 
the usage is unambiguous. Wire names will usually 
have subscripts. 


3 The Merge Box 


This section presents the design of the merge box, the
key portion of the hyperconcentrator switch architecture.
The hyperconcentrator switch consists of many
merge boxes, of various sizes, connected as shown in
the next section. The design exploits the fast performance
of large fan-in NOR gates in nMOS technology,
much the same as does a PLA, to merge two
sets of messages of any size with two gate delays. The
merge box design presented in this section uses ratioed
nMOS technology and no pass transistors.


²This assumption is easy to enforce: just AND the valid
bit into each subsequent bit of the message.

A merge box merges two sets of messages, each
set sorted by their valid bits, into one sorted set of
messages. A merge box of size 2m, where m is a
power of 2, has two sets of input wires $A_1, A_2, \ldots, A_m$
and $B_1, B_2, \ldots, B_m$ and one set of output wires
$C_1, C_2, \ldots, C_{2m}$. We require that the lower-numbered
wires of both the A and B input sets carry valid messages
and that the higher-numbered wires of both the
A and B input sets carry invalid messages. That is,
if we let p and q be the number of valid messages
entering the A and B wire sets respectively, we require
that the valid bits appear on the input wires during
setup as follows:


$$
\begin{aligned}
A_1, A_2, \ldots, A_p &= 1 \\
A_{p+1}, A_{p+2}, \ldots, A_m &= 0 \\
B_1, B_2, \ldots, B_q &= 1 \\
B_{q+1}, B_{q+2}, \ldots, B_m &= 0
\end{aligned}
$$


During setup, the merge box establishes disjoint
electrical connections between the p + q input wires
with valid messages and the p + q lower-numbered output
wires $C_1, C_2, \ldots, C_{p+q}$ in a combinational fashion,
as shown in Figure 2. The connections $C_1 = A_1$, $C_2 = A_2$,
..., $C_p = A_p$, $C_{p+1} = B_1$, $C_{p+2} = B_2$, ..., $C_{p+q} = B_q$
are established, and valid bits appear on the output
wires as follows:


$$
\begin{aligned}
C_1, C_2, \ldots, C_{p+q} &= 1 \\
C_{p+q+1}, C_{p+q+2}, \ldots, C_{2m} &= 0
\end{aligned}
$$


These connections are maintained during subsequent 
cycles for the remaining bits in the message streams 
to follow. 

Figure 3 is a schematic diagram of a merge box
for which m = 4. This merge box includes
eight NOR gates, with diagonal output wires labeled
$\bar{C}_1, \bar{C}_2, \ldots, \bar{C}_8$. Each of these NOR gate outputs
is inverted to produce the merge box outputs
$C_1, C_2, \ldots, C_8$, so we may view the pulling down of a
diagonal wire $\bar{C}_i$ to be equivalent to the corresponding
output $C_i$ being 1. The NOR gates have fan-ins
ranging from just one pulldown circuit (e.g. the gate
with output $\bar{C}_8$) to 5 pulldown circuits (e.g. the gate
with output $\bar{C}_4$). In general, the NOR gates have fan-ins
of up to m + 1 pulldown circuits. Each pulldown
Figure 2: The paths taken by valid messages in a merge
box. Valid bits are shown as they enter and leave the
merge box. The p valid messages arriving at input wires
$A_1, A_2, \ldots, A_p$ are routed to output wires $C_1, C_2, \ldots, C_p$
respectively. Here, the only A wires with valid messages
are $A_1$ and $A_2$. These valid messages are routed to $C_1$ and
$C_2$ respectively. The q valid messages arriving at input
wires $B_1, B_2, \ldots, B_q$ head toward $C_1, C_2, \ldots, C_q$ but are
steered to $C_{p+1}, C_{p+2}, \ldots, C_{p+q}$. Here, the valid messages
entering through $B_1$, $B_2$, and $B_3$ are steered to output
wires $C_3$, $C_4$, and $C_5$ respectively.


circuit consists of just one or two transistors, regard- 
less of the size of the merge box, making for fast NOR 
gates and low-area pulldowns, even with minimum- 
sized pullups. As can be verified from Figure 3, a 
merge box of size 2m implements the following func- 
tion: 


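$$
C_i =
\begin{cases}
A_i \vee \bigvee_{j=1}^{i} \left( B_j \wedge S_{i-j+1} \right) & \text{if } 1 \le i \le m, \\[4pt]
\bigvee_{j=1}^{2m+1-i} \left( B_{m+1-j} \wedge S_{i+j-m} \right) & \text{if } m < i \le 2m.
\end{cases}
$$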


The switch settings $S_1, S_2, S_3, S_4, S_5$ are computed
and stored in registers during setup, based on the
valid bits appearing at the A and B input wires. In
general, a merge box of size 2m has switch settings
$S_1, S_2, \ldots, S_{m+1}$. These stored settings continue to
be used during subsequent cycles. These switch set- 
tings establish the electrical connections throughout 
the entire hyperconcentrator switch. Other than the 
storing of the switch settings, the operation of the 
merge box is purely combinational. 

Let us look at the operation of the merge box dur- 
ing setup. The lower-numbered A and B input wires 
have valid bit values of 1, and the higher-numbered 
A and B input wires have valid bit values of 0. If 


Figure 3: A merge box of size 8. The input wires are
$A_1, A_2, A_3, A_4$ and $B_1, B_2, B_3, B_4$. The output wires are
$C_1, C_2, \ldots, C_8$. The switch settings are stored during
setup in registers $S_1, S_2, S_3, S_4, S_5$. Here we have p = 2
and q = 3 during setup. The valid bit values on each A,
B, and C wire are shown, as are the S switch settings. All
conducting paths to ground are circled.


input $A_i$ is 1, then the NOR gate output $\bar{C}_i$ is pulled
down by the single transistor whose gate is $A_i$. The
inverter causes output $C_i$ to be 1. Having input
values $A_1, A_2, \ldots, A_p = 1$ thus causes the outputs
$C_1, C_2, \ldots, C_p$ to be 1.

The switch settings S act as steering signals, sending
the B values $B_1, B_2, \ldots, B_m$ to the output wires
$C_{p+1}, C_{p+2}, \ldots, C_{p+m}$. The S values are computed
and stored during setup so that only the setting $S_{p+1}$
is 1, corresponding to input $A_{p+1}$ being the
lowest-numbered A with a valid bit of 0. (If no input wire
$A_i$ is 0, then we have p = m, and only switch $S_{m+1}$ is
set to 1.) The S values are defined by the valid bits
on the A wires as follows:

$$
\begin{aligned}
S_1 &= \overline{A_1} \\
S_i &= A_{i-1} \wedge \overline{A_i} \quad \text{for } 1 < i \le m \\
S_{m+1} &= A_m
\end{aligned}
$$


Of the two-transistor pulldown circuits, only column
p + 1 may possibly pull a diagonal wire down to 0, since
only switch setting $S_{p+1}$ is high. Similarly, a diagonal
wire $\bar{C}_i$ may be pulled down only by input wire $A_i$
or the conjunction $B_{i-p} \wedge S_{p+1}$. The only NOR gate
which may be pulled down by input $B_1$ has output
wire $\bar{C}_{p+1}$, and in general the only NOR gate which
may be pulled down by input $B_j$ has output wire $\bar{C}_{p+j}$.
For example, suppose that, as in Figure 3, the input
wires have the following valid bits during setup:


$$
\begin{aligned}
A_1, A_2 &= 1 \\
A_3, A_4 &= 0 \\
B_1, B_2, B_3 &= 1 \\
B_4 &= 0
\end{aligned}
$$


Then we have p = 2, q = 3, $S_3 = 1$, and all other
$S_i = 0$. There are five valid messages passing through
the merge box, and there are exactly five conducting
paths to ground, circled in Figure 3, one for each
of the first five diagonal wires, $\bar{C}_1, \bar{C}_2, \bar{C}_3, \bar{C}_4, \bar{C}_5$.
These paths to ground cause output values of 1 on the
corresponding output wires $C_1, C_2, C_3, C_4, C_5$. The
remaining three diagonal wires, $\bar{C}_6, \bar{C}_7, \bar{C}_8$, are not
pulled down to ground by these input values, and the
output wires $C_6, C_7, C_8$ all have the value 0.

Now we look at the message bits that arrive af- 
ter setup. The switch settings S were computed and 
stored during setup, and they remain unchanged in 
their registers. Just as during setup, a bit with the 
value 1 which enters through input wire $A_i$ directly
pulls down the diagonal wire $\bar{C}_i$, regardless of the S
values. A bit with the value 1 which enters through
input wire $B_j$ may pull down only the diagonal wire
$\bar{C}_{p+j}$ because the only switch setting which is 1 is
$S_{p+1}$. The only difference in the merge box operation
between setup and later cycles is that the S values,
which are always used to steer the B values to the
appropriate C outputs, are computed and stored only
during setup. In the cycles following setup, the merge 
box is a combinational circuit, reading the registers 
holding the switch settings S. 

Recall that in Section 2 we required that all bits in
an invalid message must be 0. We now can see the
reason for this restriction. Suppose that in our above
example, in which we had $A_3 = 0$ and $S_3 = 1$ during
setup, we had that at some cycle following setup $A_3 = 1$
and $B_1 = 0$. We would expect $C_3$ to be 0 in this
case, since $B_1$, which is routed to $C_3$, is 0. Since $A_3$
is 1, however, $\bar{C}_3$ is pulled down, and $C_3$ becomes
1. The requirement that $A_3$ and $A_4$ be 0 after setup
eliminates spurious pulldowns.
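
The behavior described in this section can be checked with a small bit-level model. The Python sketch below is a simulation under stated assumptions, not the circuit itself: the function names `switch_settings` and `merge_outputs` and the 0-indexed list representation are hypothetical, and the stored registers are modeled as an ordinary list. It evaluates the switch settings and the output function for the example with p = 2 and q = 3, then a later cycle, then the spurious pulldown case.

```python
def switch_settings(a):
    """Setup only: derive S_1..S_{m+1} from the valid bits on A_1..A_m.

    Lists are 0-indexed, so s[i - 1] holds S_i.
    """
    m = len(a)
    s = [1 - a[0]]                                     # S_1 = NOT A_1
    s += [a[i - 1] & (1 - a[i]) for i in range(1, m)]  # S_i = A_{i-1} AND NOT A_i
    s.append(a[m - 1])                                 # S_{m+1} = A_m
    return s


def merge_outputs(a, b, s):
    """Evaluate C_1..C_{2m} for the current bits on A and B and stored S."""
    m = len(a)
    c = []
    for i in range(1, 2 * m + 1):                # 1-based output index
        pulled_down = a[i - 1] if i <= m else 0  # one-transistor pulldown on A_i
        for j in range(1, m + 1):                # two-transistor pulldowns
            k = i - j + 1                        # B_j is paired with S_{i-j+1}
            if 1 <= k <= m + 1:
                pulled_down |= b[j - 1] & s[k - 1]
        c.append(pulled_down)                    # inverter: C_i = 1 if pulled down
    return c


# Setup cycle of the example: p = 2, q = 3.
a_valid, b_valid = [1, 1, 0, 0], [1, 1, 1, 0]
s = switch_settings(a_valid)
print(s)                                   # [0, 0, 1, 0, 0]: only S_3 is 1
print(merge_outputs(a_valid, b_valid, s))  # [1, 1, 1, 1, 1, 0, 0, 0]

# A later cycle reuses the stored s: the bit on B_1 appears on C_3.
print(merge_outputs([0, 1, 0, 0], [1, 0, 0, 0], s))  # [0, 1, 1, 0, 0, 0, 0, 0]

# Violating the all-zero rule: A_3 = 1 with B_1 = 0 still forces C_3 to 1.
print(merge_outputs([0, 0, 1, 0], [0, 0, 0, 0], s)[2])  # 1
```

The single loop over j covers both cases of the output function, because the bound check on k selects exactly the B inputs that have a pulldown on diagonal wire i.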


4 The Hyperconcentrator Switch 


In this section, we give the recursive construction
for assembling merge boxes into a hyperconcentrator
switch. We also show that a signal incurs exactly
$2\lceil \log_2 n \rceil$ gate delays through the switch.

A hyperconcentrator switch with input wires
$X_1, X_2, \ldots, X_n$, of which k contain messages, and output
wires $Y_1, Y_2, \ldots, Y_n$ routes the k valid messages
to the first k output wires $Y_1, Y_2, \ldots, Y_k$. Since valid
messages are identified by a valid bit of 1 during setup, 
a hyperconcentrator switch may be viewed as a net- 
work that sorts 1’s and 0’s, with 1’s before 0’s in the 
output. The switch is set during setup, with subse- 
quent bits following these established electrical paths. 

We use recursive merging to sort the messages, solv- 
ing the subproblems at each level of the recursion in 
parallel. By knowing in advance the size of the prob- 
lem, we know in advance exactly how sets will be di- 
vided and merged. We can thus build the division pro- 
cess into the hardware and successively merge larger 
sets of bits through cascades of parallel merge boxes. 


Figure 4 shows the organization of a 16-by-16
hyperconcentrator switch. There are four stages through
which the bits cascade, from bottom to top in the figure.
By this construction, an n-by-n hyperconcentrator
switch, composed of $\lceil \log_2 n \rceil$ stages of combinational
merge boxes, is itself a combinational circuit.
Because each stage contributes the 2 gate delays of one
merge box, signals incur exactly $2\lceil \log_2 n \rceil$ gate delays
in an n-by-n hyperconcentrator switch. During setup, the S
switches in each merge box are computed and stored in 
registers. These switches establish electrical paths for 
messages in each merge box. Since there are no other 
switches between merge boxes, the S switches actually 
establish the paths through the entire hyperconcentra- 
tor switch. As in the individual merge boxes, message 
bits that enter after the valid bit follow the established 
paths through the hyperconcentrator switch. 
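
The recursive construction can be sketched behaviorally in Python. This is an illustration under assumptions: messages are modeled as labels with `None` for an invalid wire, `merge` models only the routing rule of a merge box (valid A messages first, then valid B messages), and the names `merge` and `hyperconcentrate` are hypothetical.

```python
import math


def merge(a, b):
    """Behavioral merge box: valid messages of A, then valid messages of B.

    Both inputs are assumed to be already concentrated (valid entries first).
    """
    valid = [w for w in a if w is not None] + [w for w in b if w is not None]
    return valid + [None] * (len(a) + len(b) - len(valid))


def hyperconcentrate(x):
    """Return (outputs, gate_delays) for inputs x, with len(x) a power of 2."""
    if len(x) == 1:
        return list(x), 0
    half = len(x) // 2
    left, d_left = hyperconcentrate(x[:half])
    right, d_right = hyperconcentrate(x[half:])
    # The two halves operate in parallel, so their delays overlap;
    # the merge box that joins them adds 2 gate delays.
    return merge(left, right), max(d_left, d_right) + 2


x = [None, "a", "b", None, None, None, "c", None]
y, delays = hyperconcentrate(x)
print(y)       # ['a', 'b', 'c', None, None, None, None, None]
print(delays)  # 6
print(2 * math.ceil(math.log2(len(x))))  # 6
```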

Figure 1 shows the layout of a 32-by-32 hyperconcentrator
switch, using 4 µm nMOS MOSIS design
rules. In order to provide enough drive for the pull- 
down transistors of the next stage, the inverters fol- 
lowing the NOR gates in each merge box are actu- 
ally inverting superbuffers. Timing simulations have 
shown that the propagation delay through this circuit 
is under 70 nanoseconds in the worst case. 


5 Applications 


This section discusses two applications of the hyper- 
concentrator switch. One application, the one for 
which the switch was designed, allows us to use the 
available clock period more efficiently in bit-serial 
routing networks. Another application is its use in 
building a superconcentrator switch.

We can replace small, simple switches in a bit-serial
routing network by concentrator switches to successfully
route more messages in a single clock cycle, thus
using the available clock period more efficiently. The
routing network switches we shall consider route valid
messages either left or right, based on an address bit
immediately following the valid bit. Such a routing
scheme is used, for example, in a butterfly network.
An address bit of 0 indicates that the valid message
should be routed to a left output of a switch, and
an address bit of 1 indicates that the valid message
should be routed to the right.


Figure 4: A 16-by-16 hyperconcentrator switch with four
stages of merge boxes. Individual merge boxes are oriented
as in Figure 2, with input wires A entering the bottom
left, input wires B entering the bottom right, and output
wires C leaving the top left and top right. Messages flow
from bottom to top. The output wires $C_1, C_2, \ldots, C_m$ of a
merge box of size m are the input wires of a merge box of
size 2m, either $A_1, A_2, \ldots, A_m$ or $B_1, B_2, \ldots, B_m$. Valid
bit values are shown entering the first stage and leaving the
last stage of the cascade. The electrical paths established
within the merge boxes and the switch during setup are
shown in heavy lines.


Consider the 2-input, 2-output butterfly node
shown in Figure 5. A single level of a routing network
such as a butterfly would typically have several
such nodes side-by-side. The node contains two simple
2-by-1 concentrator switches, depicted as trapezoids,
one with outputs going left and one with outputs going
right. Each simple concentrator switch is preceded
by a selector circuit that, given an input valid bit and
an address bit, produces a new valid bit which is 1 if
and only if the input valid bit is 1 and the address
bit matches the output direction of the concentrator
switch. If two valid messages with equal address bits
enter a butterfly node, only one is successfully routed.


Figure 5: A 2-input, 2-output butterfly node. The selectors
and 2-by-1 concentrator switches ensure that both
input messages reach output wires if their address bits
specify that they are going in different directions. If the
messages contend for the same output wire, the concentrator
switches ensure that one makes it through. With
randomly chosen address bits, we expect 3n/4 of the n
messages to be successfully routed through this node.


The problem with this scheme is that it does not 
use the available clock period efficiently. The sim- 
ple node uses only a few levels of logic, so the delay 
through it is only a few nanoseconds. But because 
of the time required to get signals on and off chips in 
current technologies, we cannot distribute a clock with 
a frequency high enough to match the short delay of 
this node. In fact, the clock period we can distribute is 
at least an order of magnitude greater than the delay 
through this node. This node therefore performs no
useful work in at least 90 percent of each clock cycle.


Now consider the generalized n-input, n-output 
butterfly node shown for n = 8 in Figure 6. Like four 
simple butterfly nodes of Figure 5 laid side-by-side, it 
has a total of 8 input wires and 8 output wires, with 
4 outputs going left and 4 outputs going right. But 
here we use two n-by-n/2 concentrator switches, one 
with outputs only going left and one with outputs only 
going right. 

Figure 6: A generalized butterfly node with n inputs
and n outputs, shown here for n = 8. There are two
n-by-n/2 concentrator switches. With randomly chosen
address bits, we expect $n - O(\sqrt{n})$ messages to be
successfully routed through this node.


The advantage held by the larger node is that at the
same clock speed as the simple nodes, it can successfully
route more valid messages in each clock cycle.
The clock speed remains the same because the additional
delay introduced by the larger concentrator
switches is just soaked up by the unused portion of the
clock period. These nodes use a larger portion of the
available clock period. Since the simple nodes leave
so much of the available clock period unused, we can
even scale these concentrator switches up considerably
before the delay introduced exceeds the original clock
period. To see that more valid messages are routed
by the larger node, assume that a valid message arrives
at each input wire of each switch and that the
address bit is 0 with probability 1/2, independent of
the address bits of other messages. A simple calculation
shows that we can expect the simple nodes to
successfully route 3n/4 of the n arriving messages. A
more complex calculation [2] shows that the expected
number of valid messages successfully routed by the
larger nodes with n input wires is $n - O(\sqrt{n})$. Intuitively,
the larger nodes successfully route more valid
messages because they have more freedom in mapping
inputs to outputs.
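
These expectations can be evaluated exactly for small n. The Python sketch below is an illustration, not the calculation of [2]: it assumes every input carries a valid message with an independent fair address bit, and that a node with two n-by-n/2 concentrator switches routes min(L, n/2) + min(n - L, n/2) messages when L of them are addressed left. The function names are hypothetical.

```python
from math import comb, sqrt


def expected_routed_large(n):
    """Exact expected number of messages routed by one n-input node."""
    half = n // 2
    total = 0
    for left in range(n + 1):  # number of messages addressed left
        routed = min(left, half) + min(n - left, half)
        total += comb(n, left) * routed
    return total / 2 ** n


def expected_routed_simple(n):
    """n/2 two-input nodes: each routes 2 messages when the address bits
    differ (probability 1/2) and 1 otherwise, so 3/2 on average."""
    return (n // 2) * 1.5


for n in (2, 8, 32, 128):
    large = expected_routed_large(n)
    # The shortfall n - large grows like a constant times sqrt(n).
    print(n, expected_routed_simple(n), large, (n - large) / sqrt(n))
```

For n = 2 the two functions agree, since a 2-input node with two 2-by-1 concentrator switches is the simple butterfly node.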

The approach of replacing many small rout- 
ing nodes by fewer nodes with larger concentrator 
switches is used by the cross-omega network [5]. Part 
of the cross-omega network is based on a truncated 
butterfly network. Single wires of the butterfly net- 
work are replaced by bundles of 32 wires, and the 
simple butterfly network nodes are replaced by nodes 
like that of Figure 6, but with 32 inputs, 32 outputs, 
and two 32-by-16 concentrator switches. 

Fat-trees serve as an example of a class of rout- 
ing networks that makes use of concentrator switches 
whose pin counts increase with the network size. The 
interested reader is referred to [4] and [7] for details. 
Figure 7: A superconcentrator switch built out of two
hyperconcentrator switches. The hyperconcentrator switch
$H_R$ is set up to connect the first $l$ reverse input wires
$Z_1, Z_2, \ldots, Z_l$ to the $l$ good reverse output wires, which
also serve as the output wires of the superconcentrator
switch. The k valid messages are then routed by the
hyperconcentrator switch $H_F$ to the wires $Z_1, Z_2, \ldots, Z_k$ and
through the switch $H_R$ to the first k good output wires.


Another application of hyperconcentrator switches 
is in building superconcentrator switches. An n- 
by-n superconcentrator switch has n input wires
$X_1, X_2, \ldots, X_n$ and n output wires $Y_1, Y_2, \ldots, Y_n$. For
any $1 \le k \le n$, disjoint electrical paths may be established
from any set of k input wires to any arbitrarily
chosen set of k output wires. Superconcentrator
switches are useful in fault-tolerant systems. If
some of the output wires of a concentrator switch may 
be faulty, we can use a superconcentrator switch that 
routes signals to only the good output wires. 


We can build a superconcentrator switch out of two
full-duplex hyperconcentrator switches $H_F$ and $H_R$,
as shown in Figure 7.³ (After setup in a full-duplex
hyperconcentrator switch, signals can travel along the
established paths simultaneously in both forward and
reverse directions. Extending the design of the
hyperconcentrator switch to make it full-duplex is
straightforward.) The output wires of the switch $H_F$ (a
“forward” hyperconcentrator switch) feed directly into the
reverse input wires of the full-duplex hyperconcentrator
switch $H_R$ (a “reverse” hyperconcentrator switch).
Suppose there are $l$ good output wires of the
superconcentrator switch. Before setup of the superconcentrator
switch, the switch $H_R$ sets up electrical paths
from its first $l$ reverse input wires $Z_1, Z_2, \ldots, Z_l$ to
the $l$ good reverse output wires. These paths are
established by assigning a 1 to each forward input wire
of the switch $H_R$ that corresponds to a good output
wire, assigning a 0 to the forward input wires
corresponding to faulty output wires, and running a setup
cycle of the switch $H_R$.


³This construction is shown in [11].

Setup of the superconcentrator switch is then just
setup of the hyperconcentrator switch $H_F$. The k
valid messages are routed through the switch $H_F$ to
the wires $Z_1, Z_2, \ldots, Z_k$ and then along the reverse
paths through the switch $H_R$ to the first k good output
wires.
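
The two-stage routing can be sketched as a composition of two path tables. The Python below is a behavioral illustration only. It assumes that each hyperconcentrator switch connects its j-th valid forward input to its j-th output, which is what the merge rule of Section 3 yields, and the names `concentrate_paths` and `superconcentrate` are hypothetical.

```python
def concentrate_paths(valid):
    """Paths set up by a hyperconcentrator switch: entry j is the forward
    input wire connected to output wire j (0-based)."""
    return [i for i, bit in enumerate(valid) if bit]


def superconcentrate(x_valid, good):
    """Return {input wire: output wire} for a superconcentrator switch.

    x_valid[i] is 1 if input wire i carries a valid message.
    good[j] is 1 if output wire j is good (not faulty).
    """
    # Earlier setup of H_R: a 1 on each forward input that corresponds to a
    # good output wire connects Z_1..Z_l to the l good output wires.
    z_to_output = concentrate_paths(good)
    # Setup of H_F: the k valid messages are routed to Z_1..Z_k.
    z_to_input = concentrate_paths(x_valid)
    if len(z_to_input) > len(z_to_output):
        raise ValueError("more valid messages than good output wires")
    return {src: z_to_output[z] for z, src in enumerate(z_to_input)}


routes = superconcentrate([0, 1, 1, 0, 0, 1, 0, 0], [1, 0, 1, 1, 0, 1, 1, 1])
print(routes)  # {1: 0, 2: 2, 5: 3}
```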


6 Concluding Remarks 


In this section, we describe a circuit containing the 
hyperconcentrator switch which has been fabricated. 
We also discuss how the switch can be implemented 
in domino CMOS, and how the switch can be used 
to build larger concentrators. Finally, we pose some 
open questions. 

We have implemented a 4 µm nMOS 16-by-16 hyperconcentrator
switch, which was fabricated by MOSIS.
The chip contains programmable selector circuitry 
preceding the hyperconcentrator switch so that an in- 
dependent routing decision can be made for each in- 
put, as in Figures 5 and 6. Each of the 16 selectors 
includes a UV write-enabled PROM cell, described in 
[3]. The bit value stored in each PROM cell is com- 
pared with an address bit in the input message to 
determine whether the message is going in the correct
direction. The device is currently under test. 

The hyperconcentrator switch can be implemented 
using precharged design methodologies, such as 
domino CMOS, instead of ratioed nMOS technology. 
Although the output inverters used in domino CMOS 
design are already present in the merge box circuit, 
the conversion from ratioed nMOS to domino CMOS 
design is not as straightforward as it may seem at first 
glance. The functions which define the switch settings 
S are not monotonic increasing, a violation of domino 
CMOS design principles. This problem occurs only 
during setup. 

One solution involves an alternative computation of 
the S values during setup, as follows: 


$$
\begin{aligned}
S_1 &= 1 \\
S_i &= A_{i-1} \quad \text{for } 2 \le i \le m + 1
\end{aligned}
$$


It can be verified that these S values produce the cor- 
rect valid bit values at the merge box output wires 
during setup. These S values, however, are not the 
switch settings stored in the registers. To ensure 
that subsequent bits are correctly routed through the 
merge box, we compute the switch settings as in the
ratioed nMOS design and store them in the registers
during setup, without connecting them to any pull- 
down circuits. After setup we connect the switch set- 
tings to the pulldowns, as in the ratioed nMOS case, 
and the routing of subsequent bits takes place just as 
in the ratioed nMOS circuit. 
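
The claim that the alternative S values give the correct valid bits during setup can be checked exhaustively for small merge boxes. The Python sketch below is a bit-level model under assumptions: the function names are hypothetical, lists are 0-indexed, and only setup inputs of the required form (p leading 1's on A, q leading 1's on B) are tested.

```python
def merge_outputs(a, b, s):
    """Evaluate C_1..C_{2m} from bits on A, B and settings S (s[i - 1] is S_i)."""
    m = len(a)
    c = []
    for i in range(1, 2 * m + 1):
        bit = a[i - 1] if i <= m else 0
        for j in range(1, m + 1):
            k = i - j + 1                # B_j is paired with S_{i-j+1}
            if 1 <= k <= m + 1:
                bit |= b[j - 1] & s[k - 1]
        c.append(bit)
    return c


def ratioed_settings(a):
    """Settings of the ratioed nMOS design: only S_{p+1} is 1."""
    m = len(a)
    return [1 - a[0]] + [a[i - 1] & (1 - a[i]) for i in range(1, m)] + [a[m - 1]]


def domino_setup_settings(a):
    """Alternative setup values: S_1 = 1 and S_i = A_{i-1} for 2 <= i <= m + 1.
    No input is complemented, so each value is monotonic in the A bits."""
    return [1] + list(a)


def check(m):
    """Compare both sets of settings on every legal setup input of size m."""
    for p in range(m + 1):
        for q in range(m + 1):
            a = [1] * p + [0] * (m - p)
            b = [1] * q + [0] * (m - q)
            expected = [1] * (p + q) + [0] * (2 * m - p - q)
            assert merge_outputs(a, b, ratioed_settings(a)) == expected
            assert merge_outputs(a, b, domino_setup_settings(a)) == expected
    return True


print(all(check(m) for m in (1, 2, 4, 8)))  # True
```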

The hyperconcentrator switch can be used as a
building block in large concentrators. For example,
replacing the comparators in an arbitrary sorting network
by n-by-n hyperconcentrator switches yields a
large hyperconcentrator. (Actually, only the first level
of comparators must be replaced by hyperconcentrator
switches; merge boxes suffice at all subsequent levels.)
We have also found that efficient partial concentrator
switches can be built from hyperconcentrator
chips. An $(n, m, \alpha)$ partial concentrator switch has n
inputs, m outputs, and a fraction $\alpha$ such that any
set of $k \le \alpha m$ valid messages is successfully routed
to output wires. Using $3\sqrt{n}$ hyperconcentrator chips,
each with $\sqrt{n}$ inputs, we can build an $(n, m, \alpha)$ partial
concentrator switch with $\alpha = 1 - O(n^{3/4}/m)$ [2].

It is natural to wonder whether a simple design for 
a concentrator switch exists when we relax the con- 
straint that all the valid messages arrive at the same 
time. A crossbar switch has the capability of allowing 
valid messages to come and go at any time, but switch 
setup can be expensive. It may be that a concentrator 
switch can be designed that allows new messages to be 
routed in batches while preserving old connections. 
Abstract


A hardware solution to low-level (semantic) concurrency 
extraction is considered. The focus is on the reduction of
data-flow inhibitors of concurrency in sequential instruction streams.
The reduced procedural dependency techniques of prior work are 
combined with new high-speed reduced data dependency 
techniques to yield a new machine model executing standard 
sequential code in a data-flow-like manner. Also considered is a 
conceptually simple form of branch prediction. The model is 
simulated both with and without the branch prediction technique, 
using a wide range of benchmarks as input. The results are 
presented and analyzed, showing the model's functionality and 
performance improvement. 


1 Introduction 


Computer system performance may be improved via 
application of concurrent or parallel processing techniques at 
many levels. This paper explores the use of low-level or semantic 
concurrency extraction techniques applied at the implementation 
stratum of processor design; the concurrency is extracted directly 
from the machine code, with no software pre-processing. It is 
assumed that the machine code is a sequential instruction stream, 
consisting solely of branches and assignment statements, in
which each instruction takes one cycle to execute. The specific 
problem addressed by this work is increasing the concurrency 
extractable from sequential instruction streams, thereby improving 
performance, while keeping the hardware extraction techniques 
used both high-speed and efficient. 

Low-level concurrent operation of a computer allows machine- 
level instructions to execute independently when the semantic 
restrictions between them allow; e.g., two assignment instructions 
may execute concurrently if they share no operand sources and 
sinks. Using low-level concurrency allows multiple instructions to 
be issued at once, typically more than one per execution cycle. 
Low-level concurrency was first exploited in the CDC 6600 [10] and 
the IBM 360/91 [1, 12], although neither machine issued more than 
one instruction per cycle. Also, neither computer employed 
significant techniques to reduce the ill effects of branches. Since 
these machines were built, some research progress has been 
made by Tjaden [11] in the quantity of low-level concurrency 
extractable. More recently, with the advent of lower-cost hardware, 
the utilization of low-level concurrency is being
re-examined [4, 5, 7,13, 16], resulting in new concurrency 
extraction techniques. In [15] Uht examines improved concurrency 
extraction techniques in which the negative effects of branches are 
minimized (this did not include branch prediction methods). In this 
paper the techniques are extended to include both reduced 
assignment statement constraints and a form of branch prediction. 
The prediction technique requires no backtracking or state
restoration upon discovery that a branch target prediction was 
wrong. 

The initial applications for this work are both supercomputers and
mainframe (large general-purpose) computers, aiding in the
speedup of both scientific and general-purpose instruction 
streams. Future applications are widespread. With the continuing 
decline of hardware costs, such techniques may be realizable in 
single-chip machines. 

In the next section the problem is further described and 
refined, previous work is reviewed, and the approach of the 
solution is outlined. In Section 3 details of the new approach are 
given. In Section 4 both simulation and hardware estimate data are 
presented and analyzed. Lastly, a summary is given in Section 5, 
overviewing the achievements of this work.


2 Previous Work and New Approach Outline 


2.1 Dependency Types 

Concurrency extraction starts as dependency detection. Two 
instructions are dependent if their execution must be ordered, due 
to either semantic or resource dependencies. 

A semantic dependency exists between two instructions if their 
execution must be serialized to ensure correct operation of the 
code. This type of dependency arises due to ordering relationships 
occurring in the code itself. There are two forms of semantic 
dependencies, data and procedural. 


2.1.1 Data Dependencies 

Data dependencies arise due to instructions sharing source 
and sink names in certain combinations. Examples of data 
dependencies are shown in Figure 1. Referring to Relation 1, a 
data dependency exists between instructions 1 and 2 since 
instruction 1 modifies A, an input to (source of) instruction 2. 
Therefore instruction 2 must not execute in a given iteration until 
instruction 1 has executed in that iteration. 

Examples of the three possible types of data dependency 
relations [2] are shown in Figure 1. Under normal circumstances 
(when only one copy of a variable exists in the machine at any 
given time) any one of the relations existing implies that a data 
dependency exists between the two instructions. But, as 
demonstrated by Tjaden [11] and Wedig [16], if multiple copies of a 
variable exist, then Relations 2 and 3 (collectively known as 
shadow effects) do not apply. As an example, consider the Relation 
2 situation in Figure 1. If two copies of A exist, then instruction 1
sources from the first copy and instruction 2 writes (sinks) into the 
second copy; therefore instructions 1 and 2 may execute 
concurrently. It is desired to have a machine execute such that 
shadow effects do not inhibit execution, consequently multiple 
copies of variables should be provided. 


     Relation 1:    Relation 2:    Relation 3:
1    A = B + 1      C = A * 2      A = B + 1
2    C = A * 2      A = B + 1      A = C * 2


(in all of the examples, the common use of
the variable A creates the data dependency)


Figure 1: Examples of the data dependency relations.
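
The three relations can be expressed as simple set tests on sink and source names. The Python sketch below is an illustration; the `Instr` type, the function names, and the `multiple_copies` flag are hypothetical and do not come from the machine model. It classifies the instruction pairs of Figure 1 and shows that Relations 2 and 3 stop forcing serialization when multiple copies of a variable are available.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    sink: str           # variable written by the instruction
    sources: frozenset  # variables read by the instruction


def data_relations(first, second):
    """Relation numbers (1, 2, 3) that hold between `first` and a later `second`."""
    relations = set()
    if first.sink in second.sources:
        relations.add(1)  # second reads what first writes
    if second.sink in first.sources:
        relations.add(2)  # second overwrites what first reads
    if first.sink == second.sink:
        relations.add(3)  # both write the same variable
    return relations


def must_serialize(first, second, multiple_copies=False):
    """True if a data dependency forces `second` to wait for `first`."""
    relations = data_relations(first, second)
    if multiple_copies:
        relations -= {2, 3}  # shadow effects do not apply
    return bool(relations)


a_from_b = Instr("A", frozenset({"B"}))  # A = B + 1
c_from_a = Instr("C", frozenset({"A"}))  # C = A * 2
a_from_c = Instr("A", frozenset({"C"}))  # A = C * 2

print(data_relations(a_from_b, c_from_a))  # {1}
print(data_relations(c_from_a, a_from_b))  # {2}
print(data_relations(a_from_b, a_from_c))  # {3}
print(must_serialize(c_from_a, a_from_b, multiple_copies=True))  # False
```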


2.1.2 Procedural Dependencies 

Procedural dependencies arise as the result of the presence of 
branches in the input code. Any type of instruction may be 
procedurally dependent on one or more branches, but only on 
branches, and they must be previously-occurring branches, since 
procedural dependencies are unidirectional (not reflexive) [11, 14]. 
For example, an instruction which is between a forward branch 
and its target is procedurally dependent on the forward branch, 
since its execution is conditional upon how the branch executes. 
An instruction after the target is not procedurally dependent on the 
forward branch, since it is guaranteed to execute regardless of 
how the branch executes, true or false [11]. The type of procedural 
dependency mentioned above is one of several types, descriptions 
of which are in [14]. 
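
The forward-branch case described above can be sketched in a few lines of Python. This illustration covers only that one type of procedural dependency; the other types described in [14] are not modeled, and the function name and index-based representation of the instruction stream are assumptions.

```python
def procedurally_dependent(branch, target, n_instructions):
    """Indices of instructions procedurally dependent on a forward branch.

    `branch` is the index of the branch and `target` the index it may jump
    to. Instructions strictly between them execute only if the branch is
    not taken. The target and everything after it always execute.
    """
    if target <= branch:
        raise ValueError("only forward branches are modeled here")
    return list(range(branch + 1, min(target, n_instructions)))


# Branch at index 2 with target index 6 in an 8-instruction stream.
print(procedurally_dependent(2, 6, 8))  # [3, 4, 5]
```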

For emphasis, we re-iterate: the union of data and procedural 
dependencies in code forms the set of semantic dependencies 
existing in the code, and exists independently of any and all 
hardware the code might run on. 


2.1.3 Resource Dependencies 

Resource dependencies exist solely due to hardware 
restrictions of a particular machine’s realization. For example, if in 
a given cycle, three add instructions are eligible for execution, and 
only two adders are available, then a resource dependency exists. 
In this work, unless otherwise noted, resource dependencies are 
assumed not to exist. 


2.1.4 Related Work 


Riseman and Foster [9], and Fisher [4] have shown that much 
concurrency is extractable if the procedural dependencies can be 
reduced. Tjaden made the first attempt to reduce procedural 
dependencies, allowing more concurrency. In [14, 15] procedural 
dependencies are further reduced, using a domain model of 
control flow. This model, along with the relevant hardware 
structures appearing in the machine paradigm of [15] (called
CONDEL-1), is used in the implementation described herein (called
CONDEL-2). (CONDEL stands for CONcurrent Directly Executed
Language machine, the emphasis being on CONcurrent.)

The data dependency models used traditionally enforce all 
three of the data dependency relations enumerated in Section
2.1.1. This has been necessary due to either a lack of duplicate
sinks or the use of restrictive algorithms. Less constrained 
techniques [11,16] result in either the partial or complete 
elimination of shadow effects, but suffer from unpleasant 
implementation features. Specifically, the algorithms for instruction 
execution are essentially sequential, requiring many steps per 
cycle, thereby negating any performance gain resulting from 
concurrency extraction; this is particularly true (many steps 
needed) for the execution of branches. The prior techniques also 
only allow one iteration of an instruction to execute per cycle, are 
potentially costly, and, more importantly, are incompatible with the 
reduced procedural dependency techniques of CONDEL-1. 

Classical machines frequently seek to increase performance 
via branch prediction, in which branches are assumed to execute 
in one direction before their conditionals have been fully 
evaluated [6]. This decreases pipeline breaks, improving 
performance. Another similar technique is to conditionally execute 
code along both control flow paths from a branch, eliminating the 
wrong branch when the branch conditional has been fully 
evaluated. Two problems with these techniques are: rarely can 
more than one branch prediction be active at any given time, and 
some state restoration (backtracking) must take place upon a bad 
prediction. To date, no known branch prediction scheme solving 
these problems has been implemented, particularly in low-level 
concurrent machines. In a form of CONDEL-2 a type of branch 
prediction is allowed which does not require state restoration, and 
allows multiple branches to execute ahead of time. It is described 
in Section 3.5. 

Other work in the area of hardware low-level concurrency 
extraction is being done by Torng [13], Patt’s group [5], and 
Cocke [3]. 


2.2 Instruction Streams 
There are two possible ways a CPU’s execution unit can
examine a program’s instruction stream during the program’s
execution, dynamically or statically. The dynamic instruction
stream is the code as seen in an order determined by the run-time
control flow of the code, i.e., as indicated by the contents of a
program counter (PC) in a traditional machine. The static 
instruction stream is the code as stored in memory, which is 
normally just the source code order of the program (see Figure 6). 
Tjaden’s and Wedig’s schemes [11, 16] both examine the static
instruction stream.

Executing the static representation of a program has two major 
advantages over executing its dynamic representation. If an entire 
loop fits in CPU storage, then memory accesses due to instruction 
fetches decrease dramatically. Also, both branches and other 
types of instructions can execute simultaneously without changing 
the program’s results, reducing the effects of procedural 
dependencies. The static instruction stream is used in this work. 


2.3 Problem Definition - Narrowed Version 


What is needed is a hardware algorithm for semantic 
concurrency extraction that exploits a maximal amount of
concurrency (not including branch prediction). The model must be 
high speed, in that the total critical path gate delays must remain 
low so as not to negate the effect of increased concurrency; the 
algorithm should also exhibit reasonable cost. It is also desired that 
some form of branch prediction be developed that does not require 
backtracking. These requirements specifically indicate a need for 
efficient representations of instruction execution state and 
semantic dependencies (partially developed in [11, 15, 16]), and
new parallel, high-speed, and affordable techniques to perform the 
basic concurrency calculations. 


2.4 Previously Developed Elements, Used in the New Model 

The requirements for efficient dependency and state
representations are satisfied by modifications of structures 
proposed by Tjaden and Wedig, as developed by Uht in [14, 15]. 

The original data dependency matrix, used in CONDEL-1, is 
upper right triangular, indicating all possible data dependencies 
between two instructions in a single element. Examples of the new 
data dependency matrices are shown in Figure 7, with their 
associated code in Figure 6, indicating the data dependencies 
between two instructions separately. The matrices are composed 
of binary elements, and are n X n square. A 1 in a matrix indicates a
data dependency between the instructions corresponding to the
row and column indices, e.g., matrix element DD_{1,2} = 1 indicates a
Relation 1 data dependency between instructions 1 and 2. Upper-right
triangular versions of the matrices in Figure 7 exist for
procedural dependencies [15]. Note that the static instruction
stream model is used. The dependencies are calculated at run-time,
as described in [14]; this eliminates extra storage and
memory bandwidth that would be necessitated with a compile-time 
calculation. 

Wedig [16] also uses the static instruction stream model. He 
employs an Instruction Queue (see Figure 2) to hold a portion of
the static instruction stream. Instructions enter at the bottom and 
are shifted up, into lower-numbered Instruction Queue rows, as the 
upper instructions finish executing. Any necessary decoding is 
performed relatively statically [14], one instruction at a time, as an 
instruction enters the Queue. Each row of the Instruction Queue 
holds the code data corresponding to instruction i, including 
instruction i’s opcode and operand identifiers, as well as a jump
destination address if the instruction is a branch. The instructions
in the Instruction Queue are accessed in parallel.

Wedig [16] also proposes the Advanced Execution (AE) matrix 
to hold the dynamic instruction state. Each row of the matrix 
(AE_{i,*}) contains the execution state of instruction i for a subset of
its total number of iterations. The nominal execution order of the
instructions corresponding to the Advanced Execution matrix rows
is shown in Figure 3. This enforces the basic sequentiality of the
serial input code when necessary (when dependencies are
present). AE_{i,j} is set either upon an actual execution of instruction i
in iteration j, or upon execution of one or more branches in the
Instruction Queue. Setting Advanced Execution elements as the
result of branch execution is called virtually executing the
corresponding instructions. A register called the b-element stores
an integer indicating the total number of iterations that each 
instruction in the Instruction Queue is to execute (really or 
virtually). The b-element is incremented when a backward branch 
executes true (enabling a new iteration for execution), and is 
decremented when an entire iteration has been executed. 


[Figure 2 diagram: the Instruction Queue holds Instruction 1 through
Instruction n; new instructions enter at the bottom, executed
instructions leave at the top, and the queue feeds the rest of the CPU.]

Figure 2: Basic Instruction Queue.


[Figure 3 diagram: a directed line through the elements of the
Advanced Execution matrix, running from "begin" to "end".]

Figure 3: Nominal instruction iteration execution order
in AE matrix.


An example of code execution with the AE matrix is shown in 
Figure 4. In the figure, AE_{1,2}, AE_{2,1}, and AE_{5,1} are set to indicate
actual or real execution of the corresponding instructions, i.e., an
assignment actually occurred or a branch was actually taken. The
instructions are all issued in the same cycle since there are no
unresolved dependencies between them. Since b = 2, instruction 1
is fully executed. AE_{3,1} and AE_{4,1} are set as a result of the forward
branch (instruction 2) executing true (the branch is taken) in
iteration 1. Since instructions 3 and 4 are in the domain of the
forward branch, they must be kept from really executing, and thus 
are virtually executed in their first iterations, i.e., their AE elements 
are set, but they are not issued for execution (no real execution 
takes place). Therefore, virtual execution of an instruction in a 
given iteration is the same as disabling the instruction from really 
executing in the iteration. 


                                  AE matrix      AE matrix
      Code                        (before 2      (after 2
      b = 2                       executes       executes
                                  true)          true)
      iteration #:                1 2 3 4        1 2 3 4

  1.  A = A + 1                   1 0 0 0        1 1 0 0
  2.  IF A > 0 GOTO 5             0 0 0 0        1 0 0 0
  3.  B = C + D                   0 0 0 0        1 0 0 0
  4.  E = F + G                   0 0 0 0        1 0 0 0
  5.  W = A                       0 0 0 0        1 0 0 0


Instructions 3 and 4 are procedurally
dependent on instruction 2.


Figure 4: Code execution and the AE matrix. 
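The transition shown in Figure 4 can be reproduced with a short sketch. This is a sequential illustration under stated assumptions, not the hardware algorithm: the function name `take_forward_branch` is invented here, and real executions of the other instructions are entered by hand.

```python
def take_forward_branch(ae, branch, target, iteration):
    """Record a taken forward branch (1-based indices) in the AE matrix.

    The branch itself really executes; every instruction between the branch
    and its target is virtually executed, i.e., its AE element is set but
    the instruction is never issued.
    """
    col = iteration - 1
    ae[branch - 1][col] = 1
    for instr in range(branch + 1, target):
        ae[instr - 1][col] = 1
    return ae


# AE matrix of Figure 4 before instruction 2 executes true (5 rows, 4 iterations).
ae = [[1, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0],
      [0, 0, 0, 0]]

ae[0][1] = 1  # instruction 1 really executes in iteration 2
ae[4][0] = 1  # instruction 5 really executes in iteration 1
take_forward_branch(ae, branch=2, target=5, iteration=1)

for row in ae:
    print(row)
```

Note that the AE matrix alone does not record which of the set elements were real and which were virtual executions.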
In Wedig’s shadow effects reduction model, a matrix of
registers called the Shadow Sink (SSI) array is used to store 
multiple copies of instruction sinks, or outputs. The matrix has the 
same dimensions as the AE matrix (nXm), and each element of SSI 
corresponds to the similarly located AE element; in other words, 
there is one sink register for each AE-represented iteration of each 
instruction held in the Instruction Queue. 


2.5 Basic Execution Algorithm 

During execution cycles, the dependency information is 
combined with the dynamic execution state held in the Advanced 
Execution matrix to generate a list of instructions to be executed in 
the current cycle. These instructions are issued for execution. 
This list of executing instructions and the domain information is 
used to update the Advanced Execution matrix, whence the cycle 
repeats. 


2.6 Approach of Solution 

Our approach is to develop a new machine model, CONDEL-2, 
that uses the best features of Tjaden’s and Wedig’s approaches, 
the reduced procedural dependency model of CONDEL-1 [15], and 
new ideas to solve the problem. Specifically, CONDEL-2 employs 
the static instruction stream model, the basic execution algorithm 
of [15] (see Section 2.5), and Wedig’s Shadow Sink matrix. 
Modifications of Wedig’s AE matrix [16] and the data dependency 
matrix [14,15] are also used. New parallel high-speed logic is 
described that both performs the concurrency calculations, and 
links instruction outputs (sinks) to other instructions’ inputs 
(sources). Other logic and structures are used to decouple sink 
storage from instruction execution, both improving performance, 
and allowing the implementation of a simple form of branch 
prediction, further improving performance. 


3 New Solution Description 

In this section the major elements of the new model CONDEL-2 
are overviewed; details may be found in [14]. In [14] the Relation 1 
data dependency in conjunction with a reduced set of procedural 
dependencies are shown to constitute a minimal set of semantic 
dependencies for virtually all sequential code. The reduced set of 
procedural dependencies was established via both exhaustive 
consideration of control flow types and a recursive proof. This
section describes the hardware necessary to implement the 
minimal semantic dependencies (with restrictive array accesses, 
described in Section 3.4). The solution description concludes with 
an example of the operation of the hardware. (The reader may find 
it helpful to refer to Figures 3 and 6 - 8 as the hardware is 
introduced.) 


3.1 Basic Elements of the Solution 

The basic problem is to determine which, if any, sink element 
of the instructions in the IQ should be used as the source element 
of an instruction in the IQ (it may be the same instruction). In any 
case the model must allow for independent writing of multiple 
instruction iterations’ sink operands. The SSI matrix of Wedig [16] 
satisfies this requirement. It is helpful to consider the model of 
program execution vis-a-vis the AE matrix presented by Wedig 
[16]; see Figure 3. (For the remainder of the discussion, one
source per instruction is assumed. The presentation is easily
generalized to accommodate multiple sources.) The directed line
shows the nominal or serial order of execution of the sequentially
biased code in the IQ. A specific iteration of an instruction uses as
a source a sink generated previously and residing in either memory 
or the SSI. Therefore the previous sink is somewhere serially 
previous to the specific iteration, back along the line. The 
particular SSI word to be used is indicated by both the data 
dependencies and the execution state of the relevant instructions. 

It has been determined [14] that knowing just the Advanced 
Execution state is insufficient, as the validity of an instruction’s 
sink cannot be determined from the AE state. What is needed is a 
separation of the real from the virtual execution state held in the


In the CONDEL models, branch domain indicators are stored in matrices
similar to those holding dependencies, see [14, 15].


Adding a source requires adding a like amount of linking logic, as well as
requiring an additional data dependency matrix.


AE matrix. Therefore two new matrices are used, the Real 
Execution (RE) and the Virtual Execution (VE) matrices, described 
in Section 3.2. Note that AE = RE + VE; this is a logic equation.
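Because AE = RE + VE is a logic equation, the "+" is an element-wise OR. A minimal sketch follows; the helper names are assumptions, and the RE and VE values are those of the cycle 5 example of Figure 8.

```python
def bits(rows):
    """Turn strings such as "10000" into lists of 0/1 integers."""
    return [[int(c) for c in row] for row in rows]


def advanced_execution(re, ve):
    """AE = RE + VE, with + read as logical OR on each element."""
    return [[r | v for r, v in zip(re_row, ve_row)]
            for re_row, ve_row in zip(re, ve)]


RE = bits(["10000", "11100", "11000", "10000", "00000", "11000"])
VE = bits(["01100", "00000", "00000", "00000", "00000", "00000"])

AE = advanced_execution(RE, VE)
for row in AE:
    print("".join(str(x) for x in row))
```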

Only one serial data dependency (that of Relation 1) is needed 
for all instructions, with the exception of array accesses (see 
Section 3.4). This data dependency occurs when the sink of one 
instruction’s iteration is the same as the source of a serially later 
instruction’s iteration. The data dependency information kept in 
the DD matrix of CONDEL-1 is separated into multiple square 
matrices in CONDEL-2, one per possible source-sink pair, to allow 
for individual access to an instruction’s Relation 1 data 
dependencies (those of an instruction’s source on other
instructions’ sinks). In the new model, an instruction in a given
iteration may be data dependent on the same instruction in a
previous iteration, e.g., an instruction of the form: I = I + 1.

It is both conceptually and notationally convenient in many 
cases to look at the AE, RE, and VE matrices (or any nXm matrix) 
as one-dimensional vectors with length nm, with the elements 
column ordered as indicated by the line in Figure 3. This is then 
just the serial ordering of the instructions’ iterations. For the 
remainder of this document, either serial or row and column 
indexing of an array may be used for the array; the default is the 
latter; if serial indexing is used, it will be noted. 
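The column-ordered serial indexing can be written out explicitly. This is a small sketch with 1-based indices; the function names are assumptions, and the checks use the n = 6 numbering tabulated with the example of Section 3.6.

```python
def serial_index(row, col, n):
    """Serial index of instruction `row` in iteration `col` for an n-row matrix."""
    return (col - 1) * n + row


def row_col(u, n):
    """Inverse mapping: serial index u back to (row, col)."""
    return (u - 1) % n + 1, (u - 1) // n + 1


print(serial_index(3, 4, 6))   # instruction 3, iteration 4
print(row_col(21, 6))
```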

The Virtual and Real Execution matrices and the separated 
Data Dependency (DD) matrices are the primary new structures 
needed to solve the problems. The key element tying them together 
is the new linking logic, described in Section 3.3.1. 


3.2 The Hardware Matrices 

With the exception of the dependency matrices previously 
discussed, all of the hardware matrices in CONDEL-2 are similar to 
the Advanced Execution matrix. There is a one-to-one
correspondence between each element in the AE matrix and each 
element in the other matrices, each element indicating some state 
corresponding to a particular instruction’s iteration. The matrices 
are of dimension nXm, as is the AE matrix. The new matrices are 
individually described as follows. 


3.2.1 Real Execution (RE) Matrix 

Each element of RE is binary. RE_{i,j} = 1 (is set to one) iff
instruction IQ_i has really executed in iteration j. An instruction
really executes if, for an assignment statement, an assignment has 
really occurred; or for a branch, a conditional has really been 
evaluated and a branch decision made. 


3.2.2 Virtual Execution (VE) Matrix 

Each element of VE is binary. VE_{i,j} = 1 (is set to one) iff
instruction IQ_i has virtually executed in iteration j. An instruction
virtually executes if it is to be disabled (for execution) by a branch’s 
true execution (the branch is taken). 


3.2.3 Shadow Sink (SSI) Matrix 

Created by Wedig [16], each element of SSI is typically the size 
of an architectural machine register, i.e., is large enough to hold 
variables’ values. SSI_{i,j} is loaded with the sink value (result) of an
assignment instruction IQ_i having executed in iteration j. Variables’
values are normally held in SSI at least until they have been copied 
to memory. Since there are multiple locations (in SSI) to store 
instruction results (i.e., there are multiple copies of variables), 
basic shadow effects need not occur. 


3.2.4 Instruction Sink Address (ISA) Matrix 


Each ISA element is large enough to hold addresses (e.g., main 
memory addresses) of variables. There is one ISA element for 
each SSI element. ISA_{i,j} holds the sink address of the sink value
held in SSI_{i,j}. For non-array write instructions, ISA_{i,j} = AA, where
AA is operand A’s address (held in the IQ). For array write
instructions, ISA is determined for each iteration at run-time. 

The previous model of execution (CONDEL-1) is modified so 
that now during every cycle each eligible SSI value is written into
memory at the location pointed to by the contents of the
corresponding ISA element. The determination of eligibility is
discussed later.


3.2.5 Advanced STorage (AST) Matrix

The AST matrix is a binary matrix with the same dimensions as 
the AE matrix. AST_{i,j} is set to 1 if either VE_{i,j} is set to 1 (instruction
i receives a virtual execution in iteration j), or SSI_{i,j} has been
written to memory. In other words, AST_{i,j} equals 1 iff it may be
considered that SSI_{i,j} has been stored; conceptually it is similar to
the AE matrix, in that either real or virtual writes cause the matrix 
elements to be set. 


3.3 Hardware Logic 


3.3.1 Basic Linking and Executable Independence Logic 

The fundamental state storage structures necessary to solve 
the problem have been described. The primary logic that generates 
the source-to-sink links and calculates executable independence 
is now outlined. The problem may be re-formulated to a more
concrete specification as follows. It is required to have nm sets of
less than nm output enabling lines each, one set per source per IQ
instruction iteration, each line of which potentially enables
(connects) a serially previous sink word to the instruction
iteration’s source input. These lines are designated


SEN^u_{z,t} (SEN stands for Sink ENable);
where:


- u is the serial index to the IQ instruction iteration
under consideration for execution;

- t indicates the serial SSI element under
consideration for linking to an input of u, it is always
serially previous to u;

- z indicates the source of instruction i that
corresponds to a particular SEN matrix, it is a source
index;

(i = row(u); j = col(u)).

The SEN elements are calculated simply from the RE, VE, and 
DD state. An instruction iteration u can only link (take a source 
from) an instruction iteration t if: t is both really executed and data 
dependent on u, and all instruction iterations serially between u 
and t are virtually executed (if they are data dependent on u). For 
each u and z combination, at most one SEN^u_{z,t} is equal to one. If
none equal one, then either the source is not in SSI, i.e., it is in
memory, or the source has not yet been produced. The rough
location (whether or not in memory) of a source is given by the
Source From Storage (SFS_{u,z}) indicators, u and z defined as
above. The SFS values are calculated from DD and VE. A source
should be taken from storage if for all serially previous iterations,
no valid sink exists in Shadow Sink storage; otherwise, the source
should be taken from an appropriate SSI element.

It follows that iteration j of instruction IQ_i (with serial index u) is
data executably independent (all data dependencies resolved) if 
either its source is in memory or one SEN element is set (i.e., a 
valid sink exists in SSI). Complete Semantic Executable 
Independence is determined from this and the procedurally 
executably independent conditions, given in [14]. 
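The linking rule for one source can be sketched sequentially. The hardware evaluates all links in parallel and the complete logic is in [14]; the loop below is only an illustrative reading of the rule stated above, and `link_source` and its return values are names assumed here. The state is that of the example of Section 3.6 (the source 1 dependency matrix of Figure 7 and the RE and VE matrices of Figure 8).

```python
def bits(rows):
    """Turn strings such as "010011" into lists of 0/1 integers."""
    return [[int(c) for c in row] for row in rows]


def row_col(u, n):
    """Map a 1-based serial index to 1-based (row, column), column ordered."""
    return (u - 1) % n + 1, (u - 1) // n + 1


def link_source(u, dd, re, ve, n):
    """Decide where serial iteration u obtains one source.

    Returns ("ssi", t) when the sink of serial iteration t should be linked,
    ("memory", None) when no valid sink exists in the Shadow Sinks, and
    ("wait", None) when the producing iteration has not yet executed.
    """
    i, _ = row_col(u, n)
    for t in range(u - 1, 0, -1):        # walk serially backward from u
        k, j = row_col(t, n)
        if not dd[k - 1][i - 1]:
            continue                      # instruction k does not produce this source
        if re[k - 1][j - 1]:
            return ("ssi", t)             # nearest really executed producer
        if not ve[k - 1][j - 1]:
            return ("wait", None)         # producer still pending
        # virtually executed producer: its sink is not valid, keep looking
    return ("memory", None)


DD1 = bits(["010011", "010011", "000100", "000000", "000000", "000000"])
RE = bits(["10000", "11100", "11000", "10000", "00000", "11000"])
VE = bits(["01100", "00000", "00000", "00000", "00000", "00000"])

print(link_source(5, DD1, RE, VE, n=6))    # instruction 5, iteration 1
print(link_source(8, DD1, RE, VE, n=6))    # instruction 2, iteration 2
print(link_source(16, DD1, RE, VE, n=6))   # instruction 4, iteration 3
```

In this sketch an iteration is data executably independent when the result is either a link or memory; the sketch does not model retired columns or the AST state.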


3.3.2 Decoupling of Sink Storage from Instruction 
Execution 


New logic and structures have been developed to allow 
memory updates to be decoupled from instruction execution. The 
write enable lines are called Write Sink Enable, and there is one 
per Shadow Sink element. The system works as follows. Basically, 
each WSE_u (Write Sink Enable for serial iteration u) equals one
during a cycle iff the corresponding SSI_u is to be written into
memory (at location ISA_u). The WSE logic is also used to update
the contents of the Advanced Storage matrix every cycle; if WSE_u
is true, then AST_u is set to indicate that Shadow Sink u has been
stored into memory. The combination of the WSE logic and the ISA
and AST matrices allows the decoupling of instruction execution
from memory updating. This in turn allows the form of branch 
prediction described in Section 3.5, as well as reducing the 


Complete versions of the logic, including the restrictive array accesses, may
be found in [14].


negative effects of certain array access sequences. The detailed 
derivation of the WSE logic may be found in [14]. 
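The effect of the write enables on memory and on the AST matrix can be sketched as follows. How the WSE values themselves are derived is given in [14] and is not modeled; here they are simply inputs. The function name, the dictionary used for memory, and the small 2 X 2 state are assumptions made for illustration.

```python
def write_sinks(memory, ssi, isa, wse, ve, ast):
    """Apply one cycle of sink writes and update the AST matrix in place."""
    for i, wse_row in enumerate(wse):
        for j, enabled in enumerate(wse_row):
            if enabled:
                memory[isa[i][j]] = ssi[i][j]  # copy the shadow sink to its address
            if enabled or ve[i][j]:
                ast[i][j] = 1                  # a real or a virtual store is recorded
    return memory, ast


memory = {}
ssi = [[5, 6], [None, None]]        # sink values; None means nothing produced
isa = [["A", "A"], ["B", "B"]]      # sink addresses, shown as variable names
wse = [[1, 0], [0, 0]]              # only the first sink is enabled this cycle
ve = [[0, 0], [1, 0]]               # instruction 2 was virtually executed
ast = [[0, 0], [0, 0]]

write_sinks(memory, ssi, isa, wse, ve, ast)
print(memory, ast)
```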


3.4 Less Restrictive Array Accesses 

As in CONDEL-1, array accesses are restrictive in CONDEL-2, 
but not to the same degree. In the implementation of CONDEL-2, 
data dependency Relation 3 (common sink) type array accesses 
may be executed concurrently, due to the presence of multiple sink 
copies (Shadow Sinks). However, since all array reads are made 
from memory, Relation 1 and 2 type array accesses may not
execute concurrently. In other words, any two array accesses 
involving both an array read and an array write to the same array 
must be sequentialized; otherwise (with only array writes or only 
array reads taking place) the accesses may proceed concurrently. 
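The array access rule reduces to a simple test on a pair of accesses. The sketch below assumes each access is described by an array name and a kind ("read" or "write"); both the representation and the function name are hypothetical.

```python
def may_proceed_concurrently(access_a, access_b):
    """Apply the CONDEL-2 array rule to two (array, kind) accesses."""
    array_a, kind_a = access_a
    array_b, kind_b = access_b
    if array_a != array_b:
        return True           # the rule only concerns accesses to the same array
    return kind_a == kind_b   # a read paired with a write must be sequentialized


print(may_proceed_concurrently(("A", "write"), ("A", "write")))
print(may_proceed_concurrently(("A", "read"), ("A", "write")))
```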


3.5 Implementation and Discussion of Branch Prediction 


Classically, branch prediction techniques have been used to 
conditionally execute code beyond branches in the dynamic 
instruction stream ordering. Since such execution is conditional, a 
certain amount of code-backtracking or state-restoration must be 
made if the prediction is subsequently found to be wrong. This 
complicates the hardware of machines using such techniques, and 
can reduce performance in branch intensive situations. Also recall 
that with the dynamic instruction stream ordering, and considering 
the degree of hardware complexity needed with multiple 
simultaneous branch predictions, usually only one branch is 
conditionally executed at a given time.

The model described thus far (CONDEL-2) is strictly 
deterministic, in that no branch prediction is used. A modification 
to CONDEL-2 is now described that allows a simple form of branch 
prediction; this form is called Super Advanced Execution (SAE). 
The basic rule of SAE is as follows: 


Instructions within an innermost loop assume that the 
Backward Branch comprising the loop will always 
execute true. 


This means that inner loops will be presumed to execute an infinite 
number of iterations (in reality as many as there are in the AE 
matrix [m]). Note that forward branches within an inner loop may 
therefore also execute ahead of time, increasing the degree of 
branch prediction possible. Inner loops are flagged by the 
hardware at run-time, using a Backward Branch Domain 
matrix [14]. The way SAE works in the hardware is now described. 

Referring to Figure 5, note that b=3, and therefore normally 
only those instructions’ iterations in columns 1-3 are allowed to 
execute. Indeed, they must execute for the output of the code to 
be correct. With SAE these instructions (indicated by X’s and T’s) 
execute as before, but in addition those instructions’ iterations 
(indicated by S’s) to the right of column 3 (to the right of the b 
pointer) and within the inner loop are also allowed to execute. This 
is possible, with slight modifications of the Sink ENable and 
Semantically Executably Independent logic, by considering the 
other instructions’ iterations (indicated by V’s) to be virtually
executed; thus, no links are made to these iterations, and the V
iterations allow S iterations to link to X or T iterations. The Write 
Sink Enable (WSE) logic is not modified, however, so that only 
those sinks whose instructions’ iterations are guaranteed to 
actually execute (i.e., those sinks to the left of and including the 
column pointed to by the b element) are eligible to be written into 
memory. This means that instructions’ iterations S can execute 
ahead of time, writing to their corresponding sink in the Shadow 
Sink matrix, but the sink is not copied into memory at least until the 
instruction’s iteration becomes an X instruction iteration.


This can be avoided if O(n^2 m^2) address comparators are provided to match
array sources with array sinks, the addresses of which are not known until execute
time; in this case the dependencies with previous array read instructions need not
hold. The technique of CONDEL-2 uses much less hardware and is more practical;
no comparators are used (for a similar execute-time function).


The instructions in the T region are also considered to be virtually executed by
instruction iterations in the S region. This is so that T sinks are not used as inputs to 
S instruction iterations. Otherwise, T instruction iterations are both viewed, and 
allowed to execute, as normal (X) instruction iterations. 


This can occur only upon the inner loop’s Backward Branch actually executing
true.


Therefore no state-restoration or backtracking is needed. See [14]
for implementation details of SAE.

The execution algorithm is the same as in the basic CONDEL-2
machine, with the following addendum. When an outer backward
branch executes true in a given iteration, it causes all future state
(SAE) to be reset. This is necessary to maintain potential
dependencies.

Summarizing: 


1. No state-restoration or backtracking is necessary for 
the branch prediction technique of Super Advanced 
Execution. 


2. In this technique, multiple branches, both forward and 
backward, may execute ahead of time. 
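The two column tests that make SAE safe can be sketched as follows. These are necessary conditions only: an iteration must in addition be semantically executably independent, and the region bookkeeping (X, T, V, S) is not modeled. The function names and arguments are assumptions.

```python
def may_execute(col, b, in_inner_loop, sae_enabled=True):
    """Column test for issuing: inside the b pointer, or SAE within an inner loop."""
    return col <= b or (sae_enabled and in_inner_loop)


def may_write_to_memory(col, b):
    """Only sinks at or to the left of column b may be copied to memory."""
    return col <= b


b = 3
print(may_execute(4, b, in_inner_loop=True))    # ahead of the b pointer, allowed by SAE
print(may_write_to_memory(4, b))                # but its sink stays in the SSI
```

Because a sink produced ahead of time is held in the Shadow Sink matrix and never reaches memory until its column is covered by b, discarding it requires no restoration of memory state.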


[Figure 5 diagram not reproduced: the AE matrix with b = 3, marking the
X, T, V, and S instruction iterations.]

Figure 5: Super Advanced Execution and the AE matrix.


3.6 Example of CONDEL-2’s Instruction Issuing Mechanism 


A brief, simple example is now presented illustrating the 
concepts and logic of CONDEL-2 including Super Advanced 
Execution. The major logic structures are shown after several 
cycles of the execution of a small piece of code. The code
segment used for the example is shown in Figure 6. The
associated dependency matrices are displayed in Figure 7. DD^1
holds source 1 data dependencies, DD^2 holds source 2 data
dependencies, and DD^3 holds the Relation 3 data dependencies
(only needed for array accesses). 

Array "A" has the values: 


A(1)=4 A(2)=3 A(3)=2 A(4)=1 A(5)=5 
The AE matrix iterations are numbered:

             column #:
             |   1    2    3    4    5
          ---+------------------------
           1 |   1    7   13   19   25
           2 |   2    8   14   20   26
  row #:   3 |   3    9   15   21   27
           4 |   4   10   16   22   28
           5 |   5   11   17   23   29
           6 |   6   12   18   24   30


The concurrency structures’ contents and logic values 
represent the machine state at the beginning of the cycle, with the 
following exceptions. The SSI values shown are those occurring 
after the SEI (Semantically Executably Independent) indicated 
instructions have executed. The SEI values shown in a cycle are 
those occurring just prior to the updating of AE, etc., in the same 
cycle. The SAEVE matrix indicates those iterations assumed to 
have virtually executed for Super Advanced Execution. Also note 
that only some of the SEN matrices are shown, due to space 
limitations. 

The steady-state instruction execution rate (five instructions 
executing per cycle) is reached in cycle 5 (see Figure 8). This is an 
instance of saturation, in which every instruction within a loop 


  I#   Instruction          Sink   Src. 1   Src. 2
  1.   I = 0                I      0        -
  2.   I = I + 1            I      I        1
  3.   B = A(I)             B      A        I
  4.   C = B * 2            C      B        2
  5.   D(I) = C             D      I        C
  6.   IF I < 6 GOTO 2      -      I        6

Instructions 2-6 comprise the inner loop.

Figure 6: Example code segment.

   DD^1       DD^2       DD^3
  010011     001000     110000
  010011     001000     110000
  000100     000000     001000
  000000     000010     000100
  000000     000000     000010
  000000     000000     000001

(the procedural dependency
matrices contain all 0's)

Figure 7: Example dependency matrices.


executes in one iteration every cycle. In this cycle, the first column 
of the structures will be retired (the storage matrices shifted left 
and b decremented) since the first column of AE is all 1’s. 
Therefore more state space is made in the storage matrices to 
continue to allow five instructions to execute per cycle, until the 
loop terminates. Also, the first SEN matrix indicates that instruction 
5 should use the first iteration copy of instruction 2’s sink as its
source 1. Note that the execution of instruction 2 in iteration 4 is an 
example of Super Advanced Execution. 
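Column retirement can be sketched as a shift of the storage matrices. This is an illustrative simplification: the vacated last column is filled with 0, the required memory writes are assumed to have been made already, and the function name is hypothetical. The AE state used is that of Figure 8 after the cycle 5 instructions indicated by SEI have executed.

```python
def bits(rows):
    """Turn strings such as "11100" into lists of 0/1 integers."""
    return [[int(c) for c in row] for row in rows]


def retire_first_column(ae, others, b):
    """Retire column 1 if every AE element in it is set; return the new b."""
    if not all(row[0] for row in ae):
        return b                      # some iteration in column 1 is still pending
    for matrix in [ae] + others:
        for row in matrix:
            row.pop(0)                # shift the row left by one column
            row.append(0)             # the vacated last column starts cleared
    return b - 1


ae = bits(["11100", "11110", "11100", "11000", "10000", "11100"])
b = retire_first_column(ae, others=[], b=3)
print(b)
for row in ae:
    print("".join(str(x) for x in row))
```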


4 Evaluation Data and its Analysis 


4.1 Introduction 


Measurement data evaluating the new solution follows. The 
solution is critiqued via both simulations and hardware estimates. 
In the following sections the benchmarks are briefly described, the 
performance measurements are given with analyses, and hardware 
estimates are presented.


4.2 Benchmark Descriptions 
Two sets of benchmarks were used, a General Purpose set and 
a Scientific set. The General Purpose benchmarks contain a 


relatively high degree of control flow complexity; the benchmarks 
perform a wide variety of tasks and are representative; 


    RE         VE         AE
  10000      01100      11100
  11100      00000      11100
  11000      00000      11000
  10000      00000      10000
  00000      00000      00000
  11000      00000      11000

    SEI       SAEVE      SEN (u = 5, source 1)
  00000      00011      0----
  00010      00000      1----
  00100      00000      0----
  01000      00000      0----
  10000      00000      -----
  00100      00000      -----


4.3 Performance Evaluation - Simulation Results 


4.3.1 Introduction 


Simulations were run both to verify functionality of the
hardware algorithms and to measure their performance, 
particularly with respect to standard sequential execution. To 
these ends the execution of a variety of benchmarks was simulated 
in software assuming both the new hardware algorithms previously 
described, and, for comparison purposes, a more restrictive 
algorithm. The simulator mimicked the hardware structures and 
logic present in the various machine models. All of the simulations 
produced the correct results, so functionality has been verified. 
The main performance measurement is the speedup S; for a given 
benchmark, S is defined as the number of cycles necessary to 
execute the benchmark purely sequentially, divided by the number 
of cycles necessary to execute the benchmark concurrently. For a 
set of benchmarks, S is the average of the component 
benchmarks’ speedups, all weighted evenly.
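The speedup definitions translate directly into code. The cycle counts below are invented purely to exercise the functions and are not measurements from the simulations.

```python
def speedup(sequential_cycles, concurrent_cycles):
    """S for one benchmark: sequential cycles divided by concurrent cycles."""
    return sequential_cycles / concurrent_cycles


def set_speedup(cycle_pairs):
    """S for a benchmark set: the evenly weighted mean of the individual speedups."""
    values = [speedup(seq, conc) for seq, conc in cycle_pairs]
    return sum(values) / len(values)


# Hypothetical (sequential, concurrent) cycle counts for two benchmarks.
print(speedup(100, 25))
print(set_speedup([(100, 25), (90, 45)]))
```

Averaging the per-benchmark ratios, rather than dividing total sequential cycles by total concurrent cycles, keeps a long benchmark from dominating the set figure.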

The machine model independent variable is data concurrency
type dct. The data dependency and branch prediction models
implemented in the simulator are indicated by the dct value,
defined as:


• 1 - maximally restrictive data dependencies; all data
dependency relations shown in Figure 1 hold; note that
this model of data dependencies, as implemented in
CONDEL-1, precludes the use of Super Advanced
Execution (SAE)

• 2 - reduced data dependencies without SAE; only data
dependency Relation 1 holds for non-array accesses,
and no branch prediction is used

• 3 - reduced data dependencies with SAE; same as 2,
but using the branch prediction technique of Section
3.5


In the remainder of this section the simulation results of the 
machine model of Section 3, CONDEL-2, are presented for both 
sets of benchmarks. The very reduced procedural dependency
model of [15], i.e., procedural concurrency type (pct) C, is
assumed throughout; the results of the Scientific set are 
independent of pct. For all of the simulations, the Instruction 
Queue length n is fixed at 16 or 32. 


4.3.2 General Purpose Set Simulation Results 


The simulation results are summarized in the plot of Figure 9. 
Within the plot, the three curves correspond to the different data 


Figure 8: Example concurrency structures and logic: cycle 5.


Dhrystone [17] is included. The Scientific benchmarks exhibit a
very low degree of control flow complexity, perhaps too low to be
realistic. They consist of the 14 standard Lawrence Livermore
Loops [8], with the number of loop iterations of each loop reduced
by about a factor of 10 to reduce simulation time. All of the
benchmarks were hand-coded into an intermediate assembly code;
unless noted otherwise, the coder mimicked a typical compiler.


If anything, reducing the number of loop iterations makes the simulation
results more conservative, since better performance of the models typically
occurs as the number of loop iterations increases (reducing the effects of
startup execution transients).


The following simple compiler features were assumed: mapping of assignment 
expressions to low-height binary execution trees, little re-use of registers, and HLL 
statement translation as described in [14]. 


concurrency types (dct values). The data corresponding to the
restrictive data dependency model of CONDEL-1 (dct = 1) is
included in the plot for comparison purposes. The range of the
speedups for individual benchmarks with m = 8 is from 1.59 to 2.77
for dct = 2, and from 1.61 to 4.79 for dct = 3. In the plot, it is seen
that reducing data dependencies improves performance, as does
the use of Super Advanced Execution. Increasing the Advanced
Execution matrix width m improves performance when SAE is
used, as it allows more iterations to be active at any given time, and
with the branch prediction technique of SAE more iterations are
likely to be used. Note that no prior model implements either of the
most advanced concurrency types: very reduced procedural
dependencies, or reduced data dependencies with Super
Advanced Execution.


4.3.3 Scientific Set Simulation Results 


The Scientific benchmark set simulation results are shown in
Figures 10 and 11. Figure 10 includes the results of all of the loops.
Figure 11 includes the results for just those loops for which loop
capture occurred (Loops 1-6, 11, and 12). The plot of Figure
11 demonstrates the benefits of loop capture. In both plots the
performance improvements of the reduced data dependency
techniques are seen to be significant, particularly with Super
Advanced Execution. Also note that a wider Advanced Execution
matrix (larger m) gives greater performance in the same situations
as with the General Purpose benchmark set results. The individual
speedups for each of the loops ranged from 1.66 to 9.55 for dct = 2,
and from 1.66 to 11.69 for dct = 3. From the data, it was seen
that many of the benchmarks are resistant to performance
enhancements. Six performance inhibiting mechanisms have been
identified [14]. With the elimination of these mechanisms via
relatively simple software preprocessing techniques (extant in a
compiler), 9-12 of the loops should execute in saturation (see
Section 3.6); this situation currently occurs with only 3 of the loops.


4.4 CONDEL-2 Hardware Estimates 


4.4.1 Hardware Cost

The hardware cost of the reduced data dependency hardware
is $O(Kn^2m^2)$, where K is about 7. (We assume unlimited fan-out,
fan-in of 10 [except for some wired-OR connections], and 6 gates 
per storage element.) The main contributors to the cost are the 
Sink ENable and Write Sink Enable logic. Being conservative 


Figure 9: GP performance summary, pct = C. (Plot of speedup S versus
Advanced Execution matrix width m for the General Purpose benchmark set,
pct = C [very reduced procedural dependencies], Instruction Queue length
n = 16, with one curve each for dct = 1 [restrictive data dependencies],
dct = 2 [reduced data dependencies without SAE] and dct = 3 [reduced data
dependencies with SAE].)


Loop capture occurs when the loop is completely contained in the Instruction
Queue.




Figure 10: Scientific set performance summary, all loops included. (Plot
versus Advanced Execution matrix width m for the Livermore Loops benchmark
set, all loops included, Instruction Queue length n = 32, with one curve
each for dct = 1 [restrictive data dependencies], dct = 2 [reduced data
dependencies without SAE] and dct = 3 [reduced data dependencies with SAE].)


Figure 11: Scientific set performance summary, with loop capture. (Plot
versus Advanced Execution matrix width m for the Livermore Loops benchmark
set, only loops fitting in the IQ included, Instruction Queue length n = 32,
with one curve each for dct = 1 [restrictive data dependencies], dct = 2
[reduced data dependencies without SAE] and dct = 3 [reduced data
dependencies with SAE].)


(pessimistic), and taking n = 32 and m = 8 (large enough to extract
most of the concurrency of the Scientific benchmark set), the cost 
is about 1 Mgates. If General Purpose code is the primary 
application of a target machine, then n and m may be reduced 
(while the machine will still capture most of the concurrency), 
giving a vastly reduced hardware cost. 

The prior techniques have the following costs. 


• Tjaden's model: $O(n^2R)$, where R is the size of reassignable
storage

• Wedig's model: $O(n^2m + R\log_2 n)$, where R is the size of
reassignable storage


The costs of these models may or may not be greater than 
CONDEL-2, depending on the choice of parameters; however, 
neither model is as functionally advanced as CONDEL-2.


4.4.2 Hardware Delay 


The hardware delay is low because CONDEL-2’s logic and 
update algorithms are inherently parallel; limited fan-in and fan-out 
should not increase the delay excessively, depending on the logic 
family used and the choices of n and m. This compares very
favorably to the prior models of Tjaden and Wedig, in which the
algorithms were basically sequential, leading to extremely large
delays. In [14,15] CONDEL-1, a similar kind of machine, is
estimated to have a critical path delay of 8 gate-delays; this is the 
same as a Cray-1, and less than any other machine examined. A 
preliminary analysis shows that CONDEL-2’s critical path delay is 
about the same as CONDEL-1’s, given similar amounts of hardware 
resources (Processing Elements, etc.). 


5 Summary 

The major concepts presented in this document are 
functionally sound and provide significant performance gains over 
standard sequential execution techniques. There are indications 
that performance may be further improved via software
concurrency enhancement. 

A significant speedup of sequential imperative code has been 
demonstrated through the use of low-level concurrency extraction. 
Reduced data dependencies enhance the concurrency detectable. 
Although a complete set of restricted resource measurements 
must yet be made on CONDEL-2, it has been found that the 
efficiency approaches 1 when execution saturation occurs, where 
efficiency is the speedup divided by the number of peak resources 
(Processing Elements) used. CONDEL-2 exhibits improved 
performance over both sequential machines and the concurrent 
models of Tjaden and Wedig with reduced hardware delay 
compared to their models; the hardware cost may or may not be 
comparable, depending on machine parameters. With the 
previously mentioned constraints, and disregarding branch 
prediction, CONDEL-2 extracts all semantic concurrency from an 
input imperative instruction stream; a parallel logic algorithm is 
used. A simple branch prediction technique has been implemented 
which requires no backtracking and allows multiple branches to 
execute ahead of time. 

CONDEL-2, with or without Super Advanced Execution,
achieves data-flow execution of typical sequential code (consisting
of branches and assignment statements, and assuming restrictive
array accesses) with minimal deterministic control-flow
constraints.

Planned future work includes:

• increasing the degree of branch prediction
implemented in the model;

• making a detailed design study of CONDEL-2,
including investigating reductions in its hardware cost;

• investigating compile-time techniques to increase the
concurrency extractable (as mentioned in Section
4.3.3);

• examining the performance effects of reduced
resources (limited processing elements);

• and performing simulations of more benchmark
programs.


FAULT LOCATION IN $\Omega\Omega^{-1}$ NETWORKS


Wei Young 
Department of Computer and Information Sciences 
University of Alabama 
Birmingham, AL 35294 


Abstract -- A fault detection and location procedure for $\Omega\Omega^{-1}$
networks is presented in this paper. Test vectors, with length
independent of the size of the network, are first generated; then
a single fault is detected and various strategies, depending on the type
of the fault, are used to locate it. Using other test vectors with
length proportional to the size of the network is also discussed.


1. Introduction 


In this paper, we present a fault diagnostic procedure for
rearrangeable $\Omega\Omega^{-1}$ networks with $N (= 2^n)$ input/output terminals.
This procedure also applies to Benes networks, which are deducible
from $\Omega\Omega^{-1}$ networks as discussed in [1]. It is much simpler and
faster than the method in [2], which involves complicated
computation using the looping algorithm.


We assume individual switching elements are not accessible
and use the following notation throughout this paper. An $N = 2^n$
input/output $\Omega\Omega^{-1}$ network is described by
$sE^0 sE^1 \cdots sE^{n-1} uE^n \cdots uE^{2n-2}u$, where $s$ and $u$ are the
shuffle and the unshuffle respectively and the $E^j$'s are switching
element stages indexed from 0 to $2n-2$. Input/output lines are
indexed from 0 to $N-1$, and $(i)_j = (i_{n-1} i_{n-2} \cdots i_0)_j$ is the $i$th
input/output of the $j$th stage, if the index of the stage should be
emphasized, where $i_{n-1} i_{n-2} \cdots i_0$ is the binary representation of
$i$. Both indexing schemes are viewed from the top down in figures.
A path in a network, under a certain set of switching element
settings, is a route from an input terminal to an output terminal,
through intermediate links and connections of switching elements
under that set of settings. We also use $\bar{x}$ to denote the
complement of $x$. We shall use the fault model and similar fault
classifications in [3]. A brief review is given below.


A fault can occur at either a link or a switching element. A
faulty link is one being stuck-at-zero (s-a-0) or stuck-at-one (s-a-1).
On the other hand, we have sixteen possible states for a 2x2 cross
point switching element, as shown in Table 1. Only states $S_{10}$ and
$S_5$ are considered to be valid. Let $S$ be the set of all sixteen possible
states in Table 1. We define the cartesian product $S \times S$ as the set of
functional states. Assuming that the first valid state is $S_{10}$ and the
second $S_5$, a functional state $(S_i, S_j)$ of a switching element is
interpreted as follows: the switching element responds in states $S_i$
and $S_j$ while it should be in states $S_{10}$ and $S_5$ respectively. In states
$S_0$, $S_1$, $S_2$, $S_4$, $S_6$, $S_8$ and $S_9$ we have logically unidentified outputs
denoted by "-" and in $S_6$, $S_7$, $S_9$, $S_{11}$, $S_{13}$, $S_{14}$ and $S_{15}$ we have
logically erroneous outputs denoted by "$\phi$". The output values of "-"
and "$\phi$" depend on the circuit implementation. However, an arbitrary
assignment of 0 or 1 to them will not affect our capability of
differentiating normal outputs from faulty outputs. It is clear that
the cardinality of $S \times S$ is 256; only $(S_{10}, S_5)$ is considered normal and
the remaining 255 states are all considered faulty. Here we only
consider the case of a single fault, i.e. either a link stuck fault or a
switching element fault, but not both.


2. Fault Detection 

Let $\oplus i = i_{n-1} \oplus i_{n-2} \oplus \cdots \oplus i_0$, where $\oplus$ is the exclusive-or
operator. Also let $a = (a_0, a_1, \ldots, a_{N-1})$, where $a_i = \oplus i$ and $0 \le i < N$,
and $b = (b_0, b_1, \ldots, b_{N-1})$, where $b_i = 1$ if $\oplus i = 0$ and $b_i = 0$ if
$\oplus i = 1$, and $0 \le i < N$, be the two test vectors in our diagnostic
procedure. Notice that $a = \bar{b}$, or $a_i = \bar{b}_i$ for all $i$. It is easy to see
that each switching element receives both 0 and 1 as its inputs when
the input vector of the network is $a$ or $b$, if all the switching
elements are set to $S_{10}$ or $S_5$ respectively.

A switching element in states $S_{10}$ and $S_5$ is shown in Figure 1.
Table 2 and Table 3 show its normal and faulty output patterns
when tested in $S_{10}$ and $S_5$ respectively. It is clear from the tables
that we need only to apply both (0,1) and (1,0) to each switching
element in order to detect a fault. There are two test phases; in
phase-1, all switching elements are set to $S_{10}$ and in phase-2, to $S_5$.
Figure 2 and Figure 3 show the normal inputs and outputs of a
network in the phase-1 test and the phase-2 test with $a$ and $b$ applied
when $N = 8$. Notice that each input vector to a switching element is
either (1,0) or (0,1) and so is every output vector. It is easy to see that
the output vector is the same as the input vector of a network in a
phase-1 test. But in a phase-2 test, the output vector is the
complement of the input vector, because input $i$ is routed to $\bar{i}$ after
the middle stage and then to $i^0$ ($i$ with the $i_0$ bit complemented) at
the output side, and the values applied at inputs $i$ and $i^0$ are
complements of each other. This shows that both input and output
vectors are very easy to generate.
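The behaviour of the two test phases can be checked with a small simulation. The sketch below is not part of the original procedure; it assumes the network is a shuffle before each of stages 0 to n-1, an unshuffle before each of stages n to 2n-2, and one final unshuffle (consistent with "the last unshuffle" mentioned later in the text). The function names are illustrative.

```python
def rotl(x, n):
    """Shuffle: left circular shift of an n-bit line index."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

def rotr(x, n):
    """Unshuffle: right circular shift of an n-bit line index."""
    return (x >> 1) | ((x & 1) << (n - 1))

def parity(x):
    """Exclusive-or of all bits of x."""
    return bin(x).count("1") & 1

def route(vec, n, exchange):
    """Route vec through the assumed network with every element straight
    (exchange=False, phase-1) or exchanged (exchange=True, phase-2).
    Returns the output vector and the input vector seen by each stage."""
    N = 1 << n
    lines, stage_inputs = list(vec), []
    for stage in range(2 * n - 1):
        perm = rotl if stage < n else rotr
        moved = [0] * N
        for pos, val in enumerate(lines):
            moved[perm(pos, n)] = val
        stage_inputs.append(moved)
        # An exchanged element swaps its two lines 2k and 2k+1.
        lines = [moved[p ^ 1] for p in range(N)] if exchange else moved
    out = [0] * N
    for pos, val in enumerate(lines):      # final unshuffle
        out[rotr(pos, n)] = val
    return out, stage_inputs

def check(n):
    """True if a and b behave as described for both phases."""
    N = 1 << n
    a = [parity(i) for i in range(N)]
    b = [1 - x for x in a]
    ok = True
    for vec in (a, b):
        for exchange in (False, True):
            out, stage_inputs = route(vec, n, exchange)
            expected = [1 - x for x in vec] if exchange else vec
            ok = ok and out == expected
            # Every switching element must see (0,1) or (1,0).
            ok = ok and all(s[2 * k] != s[2 * k + 1]
                            for s in stage_inputs for k in range(N // 2))
    return ok
```

Under these assumptions, `check(3)` confirms for N = 8 that a phase-1 test reproduces the input vector, a phase-2 test produces its complement, and every switching element receives two different bits.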


It is obvious that we only need to apply a and b in one of the 
two phases to detect a link stuck fault. However, we may have to 
test both phases to detect a switching element fault. In any case, 
four tests are enough for detection. 


3. Fault Location 


Since a switching element can produce none, one or two faulty
outputs in either phase of the test, there are several combinations.
We divide them into three categories, similar to the cases in [3]:
(1) one faulty output appears in the four tests; (2) two faulty
outputs appear in either the phase-1 or the phase-2 test; (3) two
faulty outputs appear in the four tests, one from phase-1 and the
other from phase-2.
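The three categories can be enumerated mechanically once a crosspoint encoding is fixed. Table 1 is not reproduced in this text, so the sketch below assumes an encoding: state $S_k$ is the 4-bit number $b_3 b_2 b_1 b_0$, with $b_3$ connecting input 0 to output 0, $b_2$ input 0 to output 1, $b_1$ input 1 to output 1 and $b_0$ input 1 to output 0. This makes $S_{10}$ the straight state and $S_5$ the exchange state; everything else in the code is illustrative.

```python
def outputs(state, in0, in1):
    """Outputs of a 2x2 element under the assumed crosspoint encoding.
    '-' marks an output driven by no input, 'phi' one driven by both."""
    b3, b2, b1, b0 = [(state >> k) & 1 for k in (3, 2, 1, 0)]

    def drive(from_in0, from_in1):
        if from_in0 and from_in1:
            return "phi"
        if from_in0:
            return in0
        if from_in1:
            return in1
        return "-"

    return drive(b3, b0), drive(b2, b1)

def faulty_outputs(state, valid):
    """Number of output lines (0, 1 or 2) that differ from the valid state
    under the two element tests (0,1) and (1,0)."""
    bad = set()
    for ins in ((0, 1), (1, 0)):
        got, want = outputs(state, *ins), outputs(valid, *ins)
        bad |= {k for k in (0, 1) if got[k] != want[k]}
    return len(bad)

# States with one or two faulty outputs in phase-1 (valid state 10)
# and in phase-2 (valid state 5).
ONE1 = [s for s in range(16) if faulty_outputs(s, 10) == 1]
TWO1 = [s for s in range(16) if faulty_outputs(s, 10) == 2]
ONE2 = [s for s in range(16) if faulty_outputs(s, 5) == 1]
TWO2 = [s for s in range(16) if faulty_outputs(s, 5) == 2]

def category(si, sj):
    """Category (1, 2 or 3) of functional state (S_si, S_sj); 0 if normal."""
    f1, f2 = faulty_outputs(si, 10), faulty_outputs(sj, 5)
    if f1 == 0 and f2 == 0:
        return 0
    if f1 == 2 or f2 == 2:
        return 2
    return 1 if f1 + f2 == 1 else 3

COUNTS = [sum(category(si, sj) == c for si in range(16) for sj in range(16))
          for c in (1, 2, 3)]
```

With this assumed encoding the enumeration gives twelve functional states in category 1, nine two-fault states and six one-fault states per phase, and 255 faulty functional states in total, in line with the counts quoted in this section.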


There are twelve faulty functional states in category 1.
Table 4 shows each faulty functional state, the position and pattern
of the faulty output, and the phase of test in which a fault occurs.
It is clear that every switching element on the path leading to the
faulty output could be faulty. A process, similar to the one in [3],
using the principle of binary search, can pinpoint the fault in
$\lceil \log(2n-1) \rceil$ steps. Then, by supplying the faulty switching element
identical inputs, we are able to distinguish the $\phi\phi$ case from the --
case. In summary, for category 1, the location and the functional
state of a faulty switching element can be determined within
$6 + 2\lceil \log(2n-1) \rceil$ tests.
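As a quick numerical illustration of the category-1 bound, the sketch below evaluates it for a few network sizes. It assumes the logarithm is base 2 (a binary search over the 2n-1 stages) and that the brackets denote a ceiling; the function name is hypothetical.

```python
def category1_test_bound(n):
    """Evaluate 6 + 2*ceil(log2(2n - 1)) for a network with N = 2**n terminals."""
    stages = 2 * n - 1
    # (m - 1).bit_length() equals ceil(log2(m)) exactly for integers m >= 1.
    steps = (stages - 1).bit_length()
    return 6 + 2 * steps

# Bound for N = 8, 16, 32 and 1024 terminals (n = 3, 4, 5, 10).
bounds = {n: category1_test_bound(n) for n in (3, 4, 5, 10)}
```

Integer arithmetic is used instead of a floating-point logarithm so that the ceiling is exact for every n.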


For category 2, every state of $S_0$, $S_1$, $S_4$, $S_5$, $S_6$, $S_7$, $S_9$, $S_{13}$,
$S_{15}$ in phase-1 and $S_0$, $S_2$, $S_6$, $S_8$, $S_9$, $S_{10}$, $S_{11}$, $S_{14}$, $S_{15}$ in phase-2
produces two faulty outputs associated with two paths. In most
cases, except the one in which the faulty switching element is in the
middle stage, the two paths have two switching elements in
common, as the following lemma and theorem show.


Lemma 1: For a switching element in stage $j$, $0 \le j < n-1$, there
is another switching element in stage $2n-2-j$, which takes the two
outputs of the former as its inputs in either the phase-1 test or the
phase-2 test.


Proof: Let $i$ and $i^0$ be the two output links of a switching
element in stage $j$ where $0 \le j < n-1$. In the phase-1 test, they go to
$i_j \cdots i_0 i_{n-1} \cdots i_{j+1}$ and $i_j \cdots \bar{i}_0 i_{n-1} \cdots i_{j+1}$ on the output side of the
middle stage respectively after $n-1-j$ shuffles. Now $n-1-j$ unshuffles
take them to $i$ and $i^0$, which are the two inputs of a switching
element, on the input side of stage $2n-2-j$. In the phase-2 test, they
go to $i_j \cdots i_0 \bar{i}_{n-1} \cdots \bar{i}_{j+1}$ and $i_j \cdots \bar{i}_0 \bar{i}_{n-1} \cdots \bar{i}_{j+1}$ on the output side of
the middle stage respectively after $n-1-j$ shuffles and exchanges. Now
$n-1-j$ unshuffles and the intervening exchanges take them to
$i_{n-1} \cdots \bar{i}_{j+1} i_j \cdots i_0$ and $i_{n-1} \cdots \bar{i}_{j+1} i_j \cdots \bar{i}_0$ on the input side of stage
$2n-2-j$, which again differ only in the least significant bit. In either
case, the lemma is true. □


Theorem 2: In either the phase-1 test or the phase-2 test, two paths
can meet on at most two switching elements.


Proof: If two paths meet on a switching element in the middle
stage, whose input/output links have indices $i$ and $i^0$, then
they do not meet in any other switching element. This is because, in
the phase-1 test, the two paths take on the links
$i_{n-j-2} \cdots i_0 i_{n-1} \cdots i_{n-j-1}$ and $i_{n-j-2} \cdots \bar{i}_0 i_{n-1} \cdots i_{n-j-1}$ on the output
side of the $j$th stage and on the input side of the $(2n-2-j)$th stage,
where $0 \le j < n-1$. It is clear that in every stage, except the middle
one, the two $i_0$ bits are always different and they are never on the
least significant bit. Hence the paths only meet in the middle stage.
Similarly, the conclusion is proved for the phase-2 test.


If they meet in a switching element in the first half of the
network, then they also meet in the second half as proved in the last
lemma. We claim that they do not meet anywhere else. Notice
that, in the proof of the above lemma, the $i_0$ bits are always
different and not on the least significant position between stages $j$
and $2n-2-j$, hence they do not meet there. Beyond stages $j$ and
$2n-2-j$, the two paths do not meet either, because unshuffles only move
the $i_0$ bits away from the least significant bit position. Due to the
symmetry of the network, we have the same conclusion for the case
in which the two paths meet in the second half.


Therefore, the two paths, if they ever meet, can meet on
at most two switching elements. □
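Lemma 1 and Theorem 2 can be checked by brute force on small networks. The sketch below is an illustration under an assumed stage structure (a shuffle before each of stages 0 to n-1, an unshuffle before each of stages n to 2n-2); it is not the proof technique used in the text, and the helper names are hypothetical.

```python
from itertools import combinations

def rotl(x, n):
    """Shuffle: left circular shift of an n-bit line index."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

def rotr(x, n):
    """Unshuffle: right circular shift of an n-bit line index."""
    return (x >> 1) | ((x & 1) << (n - 1))

def path_elements(i, n, exchange):
    """(stage, element index) pairs visited by the path from input terminal i
    when every element is straight (phase-1) or exchanged (phase-2)."""
    pos, visited = i, []
    for stage in range(2 * n - 1):
        pos = rotl(pos, n) if stage < n else rotr(pos, n)
        visited.append((stage, pos >> 1))   # lines 2k and 2k+1 share element k
        if exchange:
            pos ^= 1
    return visited

def max_shared(n, exchange):
    """Largest number of switching elements shared by any two paths."""
    paths = [set(path_elements(i, n, exchange)) for i in range(1 << n)]
    return max(len(p & q) for p, q in combinations(paths, 2))

def meetings_are_symmetric(n, exchange):
    """True if every pair of paths sharing two elements meets in stages
    j and 2n-2-j, as Lemma 1 states."""
    paths = [set(path_elements(i, n, exchange)) for i in range(1 << n)]
    for p, q in combinations(paths, 2):
        stages = sorted(stage for stage, _ in p & q)
        if len(stages) == 2 and stages[0] + stages[1] != 2 * n - 2:
            return False
    return True
```

For n = 3 and n = 4, `max_shared` returns 2 in both phases, and `meetings_are_symmetric` confirms that the two shared elements always lie in mirror-image stages.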


Example: Figure 4 shows two faulty paths meeting in the first 
and the last stage in the phase-2 test. 


Now we describe a procedure to differentiate the two possible 
faulty switching elements in the following theorem, if the two paths 
have two switching elements in common. 


Theorem 3: The two switching elements common to the two
faulty paths can be differentiated in two tests.


Proof: The two faulty paths clearly pass two switching
elements in the middle stage. By the last theorem, they must be
different. If the fault occurs in the phase-1 test, set the two
switching elements in the middle stage to $S_5$ while leaving all others
in $S_{10}$ and test the network with $a$ and $b$. If the faulty switching
element is in the second half of the network, the faulty outputs
clearly will appear at the original faulty output terminals. Suppose
instead that the faulty switching element, with output links $i$ and
$i^0$, is in the first half, say in stage $j$, where $j < n-1$. The two output
links go to $i_j \cdots i_0 i_{n-1} \cdots \bar{i}_{j+1}$ and $i_j \cdots \bar{i}_0 i_{n-1} \cdots \bar{i}_{j+1}$ on the output
side of the middle stage and also on the output side of the network
through the $n$ unshuffles following the middle stage. However, in the
original phase-1 test, the two faulty outputs are on
$i_j \cdots i_0 i_{n-1} \cdots i_{j+1}$ and $i_j \cdots \bar{i}_0 i_{n-1} \cdots i_{j+1}$, which are clearly different
from the two faulty outputs obtained from the above modified
phase-1 test. Therefore, we only have to observe the positions of the
faulty outputs in the modified test to determine which switching
element is really at fault. If the fault is in the phase-2 test, we set
the two switching elements in the middle stage to $S_{10}$, leaving all the
rest in $S_5$, and test the network with $a$ and $b$. A similar argument will
give us the same conclusion.


Therefore, in any case, we need only two tests with either of
the two modified control settings to locate the fault. □


Example: Figure 5 shows the two paths coming out from a
switching element in the first stage when the settings of the two
switching elements in the middle stage are complemented.
Comparing with Figure 4, we see that the switching element in the
first stage is the same, but the output terminals of the two paths
are different.


Theorem 4: In category 2, the location and the functional
state of a faulty switching element can be determined within ten
tests.


Proof: Conduct the first four basic tests. Then find the
common switching element(s) of the two faulty paths, using the last
theorem, if necessary, to obtain a single switching element. After
the above six tests, the faulty switching element should have been
located. If both the faulty outputs are binary vectors, the
functional state is obvious from Table 1. If not, we have to conduct
the test process discussed before to differentiate $\phi\phi$ from -- and after
these extra tests, the functional state should be obvious. Notice
that we may need two more tests for a fault in another test phase.
Therefore, we need six tests to locate the fault and two more for
each test phase to determine the functional state, which is a total of
ten tests. □


Category 3 consists of the cartesian product of
$\{S_2, S_3, S_8, S_{11}, S_{12}, S_{14}\}$ and $\{S_1, S_3, S_4, S_7, S_{12}, S_{13}\}$. Every member in
the first set produces one fault in phase-1 and every member in the
second set produces one fault in phase-2. Again we calculate two
faulty paths leading to two faulty outputs, but one from each
phase. In order to locate the faulty switching element, we need the
following theorem.


Theorem 5: The two faulty paths can have at most one link 
and two switching elements in common in category 3. 


Proof: Consider a faulty switching element in category 3. It 
is clear that if it produces one fault in the phase-1 test and the other 
in the phase-2 test but both on one of its output links, then the 
switching element, having this faulty link as one of its inputs, in the 
next stage, if any, is also on both faulty paths. If the faulty 
switching element produces faults on different output links in 


different phases, the two faulty paths must join together on one of 
its input links. Therefore, the switching element, having this link as 
one of its outputs, in the previous stage, if any, is also on both 
faulty paths. Accordingly, two faulty paths pass at least one 
common link and may go through one, two or more common 
switching elements. Depending on the position of the link, we can 
prove, in each case, at most one common link and hence two 
common switching elements may appear as follows: 


(1) The link is in the first shuffle, say at position $i$ on the
input side of the 0th stage. It goes to $i$ in the phase-1 test and to $i^0$ in
the phase-2 test. Then $n-1$ shuffles (left circular shifts) and
switching stages take them to $i_0 i_{n-1} \cdots i_1$ and $\bar{i}_0 \bar{i}_{n-1} \cdots \bar{i}_1$ respectively
on the output side of stage $n-1$, while for a stage $j$ between 0 and
$n-1$, the two indices should be $i_{n-j-1} \cdots i_0 i_{n-1} \cdots i_{n-j}$ and
$i_{n-j-1} \cdots \bar{i}_0 \bar{i}_{n-1} \cdots \bar{i}_{n-j}$ respectively on the output side of that stage.
We can see that the two paths do not intersect in the first half of the
network after the first shuffle, because the $i_0$ bits are always
different. Continuing the traversal in the second half, we have $n-1$
unshuffles and switching stages, which take the two paths to
$i_{k-n+1} \cdots i_0 i_{n-1} \cdots i_{k-n+2}$ and
$i_{k-n+1} \cdots i_2 \bar{i}_1 \bar{i}_0 \bar{i}_{n-1} \cdots \bar{i}_{k-n+3} i_{k-n+2}$ respectively at the output
side of stage $k$, where $n-1 < k < 2n-2$, and to $i_{n-1} i_{n-2} \cdots i_0$ and
$i_{n-1} \cdots i_2 \bar{i}_1 i_0$ respectively on the output side of stage $2n-2$. Then,
followed by the last unshuffle, they go to $i_0 i_{n-1} \cdots i_1$ and
$i_0 i_{n-1} \cdots \bar{i}_1$ respectively. In any stage, the $i_1$ bits are always
different. Therefore, the two paths can join only at one link.


(2) The link is in the last unshuffle. Since an $\Omega\Omega^{-1}$ network is
symmetrical with respect to the middle stage, we obtain another
$\Omega\Omega^{-1}$ network if we view the former in the reversed direction, that
is, interchanging the roles of inputs and outputs. Therefore, this case
is the same as (1), if considered in the reversed direction.
Consequently, the claim is also true in this case.


(3) The link is in the first half of the network but not in the
first shuffle, say at input position $i$ of the $j$th stage where $0 < j < n$.
The two paths coming out from the $j$th stage are on $i$ and $i^0$
respectively. Shuffles and switching stages take them to
$i_{n-k+j-1} \cdots i_0 i_{n-1} \cdots i_{n-k+j}$ and $i_{n-k+j-1} \cdots \bar{i}_0 \bar{i}_{n-1} \cdots \bar{i}_{n-k+j}$ on the output
side of stage $k$, $j < k < n$. They differ from each other at least on the
$i_0$ bit. Now unshuffles and switching stages route them to different
positions in each stage of the second half, since the $i_{j+1}$ bits of the
two paths are always different from each other. Hence the two
paths do not join on the right side of the faulty link. For the left
side, $i$ goes to output position $i_0 i_{n-1} \cdots i_1$ of the $(j-1)$th stage; on the
input side of the same stage, the two paths take on positions
$i_0 i_{n-1} \cdots i_1$ and $i_0 i_{n-1} \cdots \bar{i}_1$ respectively. Then we have $j-1$ shuffles
and $j-1$ switching stages. The two paths do not join in this part
either, because the $i_1$ bits are different from each other. Therefore,
the two paths coincide only on the faulty link.


(4) The link is in the second half of the network but not in the
last unshuffle. Due to the symmetry of $\Omega\Omega^{-1}$ networks, this case
viewed in reverse is the same as (3). Hence the claim is also true
here. □


Example: Figure 6 shows two faulty paths in category 3. As 
we can see, there are two switching elements and one link in 
common. 


Excluding the one switching element case, we are guaranteed 
to locate two possible faulty switching elements by the above 
theorem. This category can further be divided into six cases defined 
in Table 5 and Table 6 as in [3]. Case F and link stuck fault are 
not distinguishable as suggested in the proof of the above theorem. 
Processes similar to those in [3] are used to derive the following 
result: the location and functional state of a faulty switching 
element can be determined within eight tests for case A, ten for 
cases B and C and twelve for cases D and E. 


4. An Alternative Procedure 


If we allow more bits for test vectors, some speedup may be
achieved. Instead of the previous $a$ and $b$, we shall use $n+1$ bits
for each input/output. For convenience, numbers between 0 and
$2N-1$ are used to refer to their $(n+1)$-bit binary representations.
Let $e = (e_0, e_1, \ldots, e_{N-1})$, where $e_i = i+1$ for $0 \le i < N$, be our test
vector; that is, $e_i$ will be fed into input terminal $i$ in a test.
Similarly we have two phases, again called the phase-1 test and the
phase-2 test. In phase-1, all switching elements are set to $S_{10}$
except those in the middle stage whose upper input/output links
have indices $i$ with $\odot i = 1$, which are set to $S_5$. In phase-2, the
settings are all complements of those of phase-1. In other words, all
switching elements are set to $S_5$ except those in the middle stage
whose upper input/output links have indices $i$ with $\odot i = 1$, which
are set to $S_{10}$. Figure 7 and Figure 8 show the normal inputs and
outputs in each phase of test when $N = 8$. In either phase, the test
vector is $e$. Outputs can easily be calculated as follows. Let
$\odot i = i_{n-1} \oplus i_{n-2} \oplus \cdots \oplus i_1$, i.e. the exclusive-or of all bits except the
0th bit. In phase-1, the input terminal $i$ goes to link $i$ on the input
side of the middle stage. If $\odot i = 0$, it goes to $i$ on the output side
and goes also to the output terminal $i$. If $\odot i = 1$, then it goes to
$i^0$ on the output side and goes to the output terminal $i^0$.
Therefore, on the output terminal side, position $i$ has $e_i = i+1$, the
number at input $i$, if $\odot i = 0$ and has $e_{i^0} = i^0+1$, the number at the
neighbor of input $i$, if $\odot i = 1$. However, in phase-2, it is more
complicated. Input terminal $i$ goes to $(\bar{i}_{n-1} \cdots \bar{i}_1 i_0) = i^*$ (this means
that every bit is complemented except the 0th bit) on the input side
of the middle stage. The value of $\odot i^*$ depends on $n$ and $\odot i$. We
have four cases: (1) $n$ even and $\odot i = 0$, then $\odot i^* = 1$; (2) $n$ even
and $\odot i = 1$, then $\odot i^* = 0$; (3) $n$ odd and $\odot i = 0$, then $\odot i^* = 0$; (4)
$n$ odd and $\odot i = 1$, then $\odot i^* = 1$. Hence, on the output side of the
middle stage, $i$ goes to $i^*$ in cases (1) and (4) and to $\bar{i}$ in cases (2)
and (3). Finally, on the output terminal side, $i$ goes to $i$ in cases (1)
and (4) and to $i^0$ in cases (2) and (3). Therefore, output terminal $i$
has $e_i = i+1$ in cases (1) and (4) and has $e_{i^0} = i^0+1$ in cases (2) and
(3). Thus both the control settings and the outputs of the network
in each phase can be calculated efficiently.
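The output rules of the alternative procedure can be compared against a direct simulation. The sketch below assumes the same stage structure as the rest of the paper (a shuffle before each of stages 0 to n-1, an unshuffle before each of stages n to 2n-2, and a final unshuffle). The helper names, including `odot` for the exclusive-or of all bits except bit 0, are illustrative.

```python
def rotl(x, n):
    """Shuffle: left circular shift of an n-bit line index."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

def rotr(x, n):
    """Unshuffle: right circular shift of an n-bit line index."""
    return (x >> 1) | ((x & 1) << (n - 1))

def odot(x):
    """Exclusive-or of all bits of x except bit 0."""
    return bin(x >> 1).count("1") & 1

def simulate(n, phase):
    """Output vector when e_i = i + 1 is applied in the given phase (1 or 2)."""
    N = 1 << n
    out = [None] * N
    for i in range(N):
        pos = i
        for stage in range(2 * n - 1):
            pos = rotl(pos, n) if stage < n else rotr(pos, n)
            exchanged = (phase == 2)          # default setting of the phase
            if stage == n - 1 and odot(pos) == 1:
                exchanged = not exchanged     # flagged middle-stage elements
            if exchanged:
                pos ^= 1
        out[rotr(pos, n)] = i + 1             # final unshuffle
    return out

def predicted(n, phase):
    """Output vector according to the case analysis in the text."""
    res = []
    for i in range(1 << n):
        if phase == 1:
            keeps_own_value = odot(i) == 0
        else:
            # Cases (1) and (4): n even with odot(i) = 0, or n odd with odot(i) = 1.
            keeps_own_value = (n % 2 == 0) != (odot(i) == 1)
        res.append(i + 1 if keeps_own_value else (i ^ 1) + 1)
    return res
```

Under these assumptions, `simulate(n, phase) == predicted(n, phase)` holds for both phases and for both even and odd n, which covers all four cases of the phase-2 analysis.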


It is obvious that each switching element in the tests receives
two different numbers as inputs under normal conditions. Any
switching element fault or link stuck fault will show eventually at
the output side of the network. Therefore, in this procedure, we
need only to perform two basic tests, phase-1 with $e$ and phase-2
with $e$, in order to detect any single fault. However, we should bear
in mind that there is a factor of $n+1$ left out in this count. Some
notations are modified; for example, $\phi\phi$ is replaced by a number
between 0 and $2N-1$, -- by -, 00 by 0 and 11 by $2N-1$. A basic
theorem without proof is given below.


Theorem 6: Two paths (in either phase) have at most one 
switching element in common. 


Example: Figure 9 shows two paths in the phase-1 test of the 
procedure. As we can see, they meet at exactly one switching 
element. 
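Since Theorem 6 is stated without proof, a brute-force check on small networks may be a useful sanity test. The sketch below assumes the same network structure and middle-stage settings as described for the alternative procedure; all names are illustrative, and the check is evidence for small n only, not a proof.

```python
from itertools import combinations

def rotl(x, n):
    """Shuffle: left circular shift of an n-bit line index."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

def rotr(x, n):
    """Unshuffle: right circular shift of an n-bit line index."""
    return (x >> 1) | ((x & 1) << (n - 1))

def odot(x):
    """Exclusive-or of all bits of x except bit 0."""
    return bin(x >> 1).count("1") & 1

def path_elements_alt(i, n, phase):
    """(stage, element index) pairs on the path from input terminal i under
    the settings of the alternative procedure (phase 1 or 2)."""
    pos, visited = i, []
    for stage in range(2 * n - 1):
        pos = rotl(pos, n) if stage < n else rotr(pos, n)
        visited.append((stage, pos >> 1))
        exchanged = (phase == 2)
        if stage == n - 1 and odot(pos) == 1:
            exchanged = not exchanged      # flagged middle-stage elements
        if exchanged:
            pos ^= 1
    return visited

def max_shared_alt(n, phase):
    """Largest number of switching elements shared by any two paths."""
    paths = [set(path_elements_alt(i, n, phase)) for i in range(1 << n)]
    return max(len(p & q) for p, q in combinations(paths, 2))
```

For n = 2, 3 and 4, `max_shared_alt` returns 1 in both phases, which agrees with the theorem for N = 4, 8 and 16.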


Based on the above theorem, a similar diagnostic procedure can
be developed. In many cases, the latter is faster due to more
information from more bits and special settings for switching
elements.


5. Conclusion 


A fault diagnostic procedure for rearrangeable $\Omega\Omega^{-1}$ networks
has been presented. The number of tests in most cases is a constant,
which makes this procedure very efficient compared to the one in [2].
Faster procedures are expected if more information is used to
conduct tests. However, the number of tests may vary with the size
of the network. Therefore, a mix of the above may achieve better
efficiency.


Figure 1: (a) $S_{10}$, (b) $S_5$.


Figure 9: Two Paths in phase-1 test


Figure 8: Phase-2 Test


Table 2: Faults, Test Inputs and Outputs in $S_{10}$.


Table 1: 16 States of a Switching Element.


Table 3: Faults, Test Inputs and Outputs in $S_5$.


Table 5: Six Cases in Category 3


Abstract


This paper investigates the effect of Cluster Interconnection
Network (CIN) failures on the performance of a Cluster-based
Supersystem [1]. Three different cases of memory request distribu- 
tion have been treated. The first case assumes uniform distribu- 
tion while the second employs the concept of favorite memory. 
Further generalization of the second case is made in the third case 
by adding a private memory and associating one with each proces- 
sor. System performance degradation due to the failure of a single 
Cluster IN and due to the failure of multiple Cluster INs has been 
evaluated analytically in terms of available Bandwidth. 


I Introduction 


It is a well known fact that the switching speed of logic has
almost attained its physical limit and only incremental speed
improvement can be expected in future.
Multiprocessor/Multicomputer based supersystems [2-8] offer a
great potential for substantial performance enhancement. One
important issue [9] is how to design an appropriate Interconnection
Network to cater to communication between various functional
modules in the system [10]. Several interconnection schemes
have been proposed [11] and there are relative advantages and
disadvantages of each one of them. A single shared bus scheme is
the simplest and most economical, while crossbar switches represent
the other end of the spectrum. The bus scheme suffers from the
contention problem, while the complexity and the cost of a crossbar
are proportional to its total number of cross-points. As a compromise
between the two, several multistage networks [10-12] and multibus
systems [13-14] have been proposed which could provide a
reasonable performance at an acceptable cost.


In [1] we described and analyzed Cluster-based Supersystems 
which have a hierarchy that reduces the network complexity. 


Conflict-free access within each cluster is secured by relatively 
smaller crossbar switches. To maintain the hierarchy and the regu- 
larity of the structure, clusters are connected by a network similar 
to the one used for interconnecting processors within each cluster. 
These systems take full advantage of program locality and they 
have been shown to perform very close to crossbar-based systems 
for higher rates of favorite requests. However, the analysis 
carried out in [1] assumed that the system operates in a fault-free 
mode, which may not be realistic. In reality, cluster-based 
supersystems are prone to faults that could at least degrade their 
performance. 

In this paper we deal with the situation where one or 
more cluster INs could fail. Basic assumptions and other essential 
background material are included in section II. Performance 
degradation due to the loss of one cluster IN is calculated in 
section III. In section IV, the performance degradation is determined 
in the event of total loss of all cluster INs. Concluding remarks 
are included in section V. 


II Background 


In the analysis of the cluster-based supersystems [1], we 
assume the following model to exist: 

(i) The multiprocessor system is synchronous with N processors 
and N memory modules, in which the processors issue their 
requests at the beginning of a memory cycle. 

(ii) Memory cycle time is the same as the processor cycle time 
and is constant. 

(iii) The requests generated in a cycle are random and are 
independent of one another. 

(iv) The requests issued in successive cycles are independent of 
the requests issued in the previous cycle. 

(v) The requests which are not accepted are discarded. 

(vi) Propagation delay of the interconnection network is 
neglected. 

(vii) ψ is the probability with which a processor generates a 
request in a cycle. 


The fourth assumption is unrealistic because a discarded 
request will indeed be resubmitted in the next cycle. However, this 
assumption simplifies the analysis and produces negligible 
discrepancies from the actual results [17]. 


It is also assumed that N = n·k [1]. This enables us to have 
k clusters, with each cluster consisting of n processors. Crossbar 
switches of size n×n are used within each cluster (cluster IN), while 
a k×k global crossbar switch is used to interconnect all k clusters 
(Fig. 2). All the processors forming an individual cluster 
have to pass through an associated single bus to gain access to the 
global IN. 


The system's performance is analyzed on the basis of the 
analysis of crossbar [15,14] and shared-bus schemes, both of which 
are included here for completeness. 


Analysis of a crossbar 


Consider an N×N crossbar, and let $p_0$ be the probability 
that a request is delivered to an input of the crossbar. The 
probability that this request is directed to a particular output of the N 
outputs is $p_0/N$. The probability that this particular output is 
not requested by any of the N inputs is $(1 - p_0/N)^N$. The 
probability that this particular output is requested by at least one input is 

$$1 - (1 - p_0/N)^N \qquad (1)$$ 



This is the well known formula for the crossbar [15]. 
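
As a quick numerical check of the crossbar formula, the following minimal Python sketch evaluates the request rate at one output. The function names and the bandwidth helper (the expected number of requested outputs, taken as N times the per-output rate) are illustrative choices, not notation from the paper.

```python
def crossbar_output_rate(p0: float, n: int) -> float:
    """Eq. (1): probability that a given output of an n x n crossbar
    is requested by at least one of its n inputs."""
    return 1.0 - (1.0 - p0 / n) ** n


def crossbar_bandwidth(p0: float, n: int) -> float:
    """Expected number of requested outputs per cycle (assumed helper)."""
    return n * crossbar_output_rate(p0, n)


if __name__ == "__main__":
    # Fully loaded inputs (p0 = 1) for a few crossbar sizes.
    for size in (2, 4, 8):
        print(size, round(crossbar_bandwidth(1.0, size), 3))
```

For a 2×2 crossbar with p0 = 1, each output is requested with probability 1 - (1/2)^2 = 0.75.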
Shared bus analysis 


We adopt the model for the shared bus introduced in [16], 
which views the shared bus as a cascade of an N×1 crossbar and a 
1×N crossbar. Let $p_0$ be the probability that the bus is requested 
by one of the N processors tied to the bus; then the 
probability that none of the N processors requests the bus is 
$(1 - p_0)^N$, and the probability that at least one processor requests 
the bus is $1 - (1 - p_0)^N$. 

Hence the rate of request at the bus 

$$= 1 - (1 - p_0)^N \qquad (2)$$ 

Rate of request at a particular output line out of the N 
output lines tied to the bus 

$$= \left(1 - (1 - p_0)^N\right)/N \qquad (3)$$ 

It may be noted that if there is only one output line tied to 
the bus then eq. (3) reduces to eq. (2). 
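
The shared-bus model can be sketched in Python as follows. This is a minimal illustration: the function names are hypothetical, and the separate `n_out` parameter (number of output lines) is an assumption added so that the reduction of eq. (3) to eq. (2) for a single output line can be checked directly.

```python
from typing import Optional


def bus_request_rate(p0: float, n_proc: int) -> float:
    """Eq. (2): probability that at least one of n_proc processors
    requests the bus in a cycle."""
    return 1.0 - (1.0 - p0) ** n_proc


def bus_output_rate(p0: float, n_proc: int, n_out: Optional[int] = None) -> float:
    """Eq. (3): request rate at one particular output line.
    n_out defaults to n_proc, as in the text (N inputs, N outputs)."""
    if n_out is None:
        n_out = n_proc
    return bus_request_rate(p0, n_proc) / n_out


if __name__ == "__main__":
    print(bus_request_rate(0.5, 2))    # rate of request at the bus
    print(bus_output_rate(0.5, 2))     # rate at one of two output lines
    print(bus_output_rate(0.5, 2, 1))  # single output line: same as eq. (2)
```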


Fault-free crossbar-based and cluster-based Multiprocessors 


The fault-free analysis of the crossbar-based and the cluster- 
based multiprocessors has been carried out in [1] for three different 
cases (based on the distribution of requests among the memory 
modules) namely: Equally likely , favorite memories , and favorite 
with private memories. To avoid the duplication, analysis for the 
fault-free system will not be included here. Some data is presented 
in the form of Tables (Tables I, II, III) and graphs (Figs. 5, 6, 7) so 
that a comparison between performance of the fault-free and the 
faulty systems could be done. 


It may be noted that minor modifications to the analysis given in 
[1] are needed to allow for the restriction imposed in this paper, 
namely, that a processor is internally connected only to its private 
memory. This change is deemed necessary for a fair comparison 
between the fault-free and faulty cases. 


Faulty cluster-based multiprocessors 


The analysis performed here, deals only with the failure of 
the cluster interconnection networks. There are k nxn CINs while 
there is only one kxk global IN and the global network is most 
important from the point of view of keeping all the clusters con- 
nected together. Hence, it is desirable to have a reliable global IN 
by designing it with appropriate care, or alternatively to duplicate 
the global IN and keep the second global IN as a standby spare. 
For our analysis, the global IN is assumed to be fault-tolerant and 
failures are assumed to occur only in the CINs. 


In the event of failure of a cluster IN, requests from the 
processors to the memory modules in that cluster are directed through 
the global IN. Although the system is still operational, its 
performance will be degraded. This degradation will be the least when a 
single cluster IN fails, and the most when the cluster INs in all 
the clusters fail. Bearing in mind that the performance of the 
system for any number of failed cluster INs will lie between these two 
limits, we concern ourselves only with the analysis of 
these two extreme cases. 


III Failure of a single cluster IN 


Failure of a single cluster IN results in a situation where two 
groups of memory modules are formed and each group has a 
different rate of request (Fig.3). The first group consists of those 
memory modules which are part of the cluster with failed IN , 
while the remaining memory modules residing in the remaining 
healthy clusters form the second group. 


Let 

$p_{ext}$ = the probability that a processor issues a request to a 
memory module external to its own cluster. 

$p_G$ = the rate at which an input of the global IN is 
requested. 

$p_0$ = the rate at which an input of the cluster IN is 
requested. 

$p_{fi}$ = the rate of request at a memory module due to 
requests from favorite processors directed through the cluster 
IN. 

$p_{fo}$ = the rate of request at a memory module due to 
requests from favorite processors directed through the global 
IN (in the event of failure of the cluster IN). 

$p_{nf}$ = the rate of requests at a memory module due to 
requests from nonfavorite processors through the global IN. 


3.1 Equally likely case 


This is the case where the requests generated by a processor 
are uniformly distributed among the memory modules. 


(a) First group analysis (group1) 


Let $p_{ext1}$ be the probability that a request generated by a 
processor of group1 will be directed through the global IN; hence 

$$p_{ext1} = \psi \qquad (4)$$ 

(since no requests can be made through the cluster IN, every 
request, generated with probability ψ per cycle, is external). 

Let $p_{G1}$ be the rate at which the input of the global IN 
(accessible by processors of group1) is requested; hence 

$$p_{G1} = 1 - (1 - p_{ext1})^n \qquad (5)$$ 
(1) 


Let p,;° be the rate of request at a memory module of 
groupl due to requests issued by processors in groupl. 


(1) _ 

Pi = Pain (6) 
Let pi) be the rate of request at a memory module of 
groupl due to requests issued by processors in the second group. 


(7) 


Hence the total rate of request at a memory module of group1 


1 


Second group analysis ( group2 ) 


1 1). (1). 
Pa — Pi Pa (8) 


(b) 


Let $p_{ext2}$ be the probability that a processor of group2 will 
issue a request to a memory module external to its own cluster; 
hence 

$$p_{ext2} = (k - 1)\psi/k \qquad (9)$$ 

Let $p_{G2}$ be the rate at which an input of the global IN 
(accessible by processors of group2) is requested; hence 

$$p_{G2} = 1 - (1 - p_{ext2})^n \qquad (10)$$ 


Let p?) be The rate of request at a memory module of 


group2 due to requests issued by the processors of group2 but not 
belonging to the same cluster. 


(11) 
k-1 


Let pi?) be The rate of request at a memory module of 
group2 due to requests issued by processors of groupl. 
(2) 


Py = Pay/N (12) 
$$p_0 = \psi/k \qquad (13)$$ 
Let pi?) be The rate of request at a memory module of 


group2 due to requests issued by processors of group2 and in the 
same cluster. 


pi?) =1—- (1 — po/n)” (14) 
Hence the total rate of request at a memory module of group2 
2 (2 2 2) (2 2) (2) 
ey aa Pia’ + Py + ps ) Pia Py ) Pia Ps 
2 2) (2) (2 

Bi Pa Oy By By (15) 
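The Bandwidth, with $p_1$ and $p_2$ denoting the total rates of 
request at a memory module of group1 and group2, respectively, 

$$BW = p_1 n + p_2 (N - n) \qquad (16)$$ 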


Table IV gives the degraded performance of the cluster-based mul- 
ticomputer system for the equally likely case. 


3.2 Favorite memories case 


In this case processors (favorite) communicate more often 
with memory modules (favorite) in the same cluster as the 
processors. Let s be the probability that a processor addresses a 
particular favorite memory given that it has generated a request. 


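(a) First group analysis 

$$p_{ext1} = \psi \qquad (17)$$ 
$$p_{G1} = 1 - (1 - p_{ext1})^n \qquad (18)$$ 
$$p_{fo} = s\,p_{G1} \qquad (19)$$ 

Let $p_{nf}^{(1)}$ be the rate of request at a memory module of 
group1 due to requests issued by nonfavorite processors. 

$$p_{nf}^{(1)} = \left[1 - \left(1 - \frac{p_{G2}}{k-1}\right)^{k-1}\right]/n \qquad (20)$$ 

Hence the rate of request at a memory module of group1 

$$p_1 = p_{fo} + p_{nf}^{(1)} - p_{fo}\,p_{nf}^{(1)} \qquad (21)$$ 
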
(b) Second group analysis 

$$p_{ext2} = (1 - ns)\psi \qquad (22)$$ 
$$p_{G2} = 1 - (1 - p_{ext2})^n \qquad (23)$$ 

Let $p_{nf2}^{(2)}$ be the rate of request at a memory module of 
group2 due to requests issued by the nonfavorite processors in 
group2. 

$$p_{nf2}^{(2)} = \left[1 - \left(1 - \frac{p_{G2}}{k-1}\right)^{k-2}\right]/n \qquad (24)$$ 

Let $p_{nf1}^{(2)}$ be the rate of request at a memory module of 
group2 due to requests issued by the nonfavorite processors in 
group1. 

$$p_{nf1}^{(2)} = \left[(1 - ns)\frac{p_{G1}}{k-1}\right]/n \qquad (25)$$ 
$$p_0 = ns\psi \qquad (26)$$ 
$$p_{fi} = 1 - (1 - p_0/n)^n \qquad (27)$$ 

Hence the total rate of request at a memory module of 
group2 

$$p_2 = p_{fi} + p_{nf1}^{(2)} + p_{nf2}^{(2)} - p_{fi}\,p_{nf1}^{(2)} - p_{fi}\,p_{nf2}^{(2)} - p_{nf1}^{(2)}p_{nf2}^{(2)} + p_{fi}\,p_{nf1}^{(2)}p_{nf2}^{(2)} \qquad (28)$$ 
$$BW = p_1 n + p_2 (N - n) \qquad (29)$$ 


The results for this case are given in Table V. 


3.3 Favorite with private memories case 


A private memory is added to each processor and each pro- 
cessor accesses its private memory through an internal bus connec- 
tion. 


Let m be the probability that a processor requests its private 
memory given that it has generated a request. 


(a) First group analysis 

$$p_{ext1} = (1 - m)\psi \qquad (30)$$ 
$$p_{G1} = 1 - (1 - p_{ext1})^n \qquad (31)$$ 
$$p_{fo} = s\,p_{G1} \qquad (32)$$ 

The rate of request at a memory module of group1 due to 
requests issued by nonfavorite processors 

$$p_{nf}^{(1)} = \left[1 - \left(1 - \frac{p_{G2}}{k-1}\right)^{k-1}\right]/n \qquad (33)$$ 

Hence the total rate of request at a memory module of 
group1 

$$p_1 = p_{fo} + p_{nf}^{(1)} - p_{fo}\,p_{nf}^{(1)} \qquad (34)$$ 

(b) Second group analysis 

$$p_{ext2} = (1 - ns - m)\psi \qquad (35)$$ 


The rate of request at a memory module of group2 due to 
requests issued by the nonfavorite processors in group2 

$$p_{nf2}^{(2)} = \left[1 - \left(1 - \frac{p_{G2}}{k-1}\right)^{k-2}\right]/n \qquad (37)$$ 

The rate of request at a memory module of group2 due to 
requests issued by the nonfavorite processors in group1 

$$p_{nf1}^{(2)} = \left[(1 - ns)\frac{p_{G1}}{k-1}\right]/n \qquad (38)$$ 
$$p_0 = ns\psi \qquad (39)$$ 
$$p_{fi} = 1 - (1 - p_0/n)^n \qquad (40)$$ 


Hence the total rate of request at a memory module of 
group2 

$$p_2 = p_{fi} + p_{nf1}^{(2)} + p_{nf2}^{(2)} - p_{fi}\,p_{nf1}^{(2)} - p_{fi}\,p_{nf2}^{(2)} - p_{nf1}^{(2)}p_{nf2}^{(2)} + p_{fi}\,p_{nf1}^{(2)}p_{nf2}^{(2)} \qquad (41)$$ 

$$BW = p_1 n + p_2 (N - n) \qquad (42)$$ 


Table VI indicates the performance degradation for this case. 
IV Failure of the INs in all the clusters 


In this situation, all requests generated by processors will be 
delivered to memory modules through the global IN (Fig.4). 


Hence, 

$$p_{ext} = \psi \qquad (43)$$ 
$$p_G = 1 - (1 - p_{ext})^n \qquad (44)$$ 


4.1 Equally likely case 


The total rate of request at a memory module 

$$p = \left[1 - (1 - p_G/k)^k\right]/n \qquad (45)$$ 
$$BW = pN \qquad (46)$$ 

Results for this case are given in Table VII. 
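
The chain of formulas for this case is short enough to evaluate directly. The Python sketch below is an illustration under stated assumptions: it reads eqs. (43)-(46) as p_ext = ψ, p_G = 1 - (1 - p_ext)^n, p = [1 - (1 - p_G/k)^k]/n and BW = pN with N = n·k, and the function name is hypothetical.

```python
def bw_all_failed_equally_likely(psi: float, n: int, k: int) -> float:
    """Bandwidth when every cluster IN has failed, equally likely case.

    psi: probability that a processor issues a request in a cycle
    n:   processors (and memory modules) per cluster
    k:   number of clusters, so N = n * k
    """
    p_ext = psi                           # eq. (43): all requests go external
    p_g = 1.0 - (1.0 - p_ext) ** n        # eq. (44): rate at a global IN input
    p = (1.0 - (1.0 - p_g / k) ** k) / n  # eq. (45): rate at one memory module
    return p * n * k                      # eq. (46): BW = p * N


if __name__ == "__main__":
    for clusters in (2, 4, 8):
        print(clusters, round(bw_all_failed_equally_likely(1.0, 4, clusters), 2))
```

With ψ = 1 the result does not depend on n, and the values for k = 2 and k = 8 round to 1.50 and 5.25, which agrees with the entries that survive in Table VII if those two columns are taken to be k = 2 and k = 8.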

4.2 Favorite memories case 


$$p_{fo} = s\,p_G \qquad (47)$$ 
$$p_{nf} = \left[1 - \left(1 - (1 - sn)\frac{p_G}{k-1}\right)^{k-1}\right]/n \qquad (48)$$ 

The total rate of request 

$$p = p_{fo} + p_{nf} - p_{fo}\,p_{nf} \qquad (49)$$ 
$$BW = pN \qquad (50)$$ 


Table VIII indicates the degradation of performance for this 
case. 
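
A hedged Python sketch of the favorite-memories computation follows. It assumes the readings p_fo = s·p_G, p_nf = [1 - (1 - (1 - sn)·p_G/(k-1))^(k-1)]/n and p = p_fo + p_nf - p_fo·p_nf for eqs. (47)-(49); the helper `union_rate` and the function names are illustrative. The helper uses the identity a + b - ab = 1 - (1 - a)(1 - b), which is the same inclusion-exclusion form that appears in the total-rate formulas of this paper.

```python
def union_rate(*rates: float) -> float:
    """Probability that at least one of several independent request
    streams is active: 1 - prod(1 - r)."""
    idle = 1.0
    for r in rates:
        idle *= 1.0 - r
    return 1.0 - idle


def bw_all_failed_favorite(psi: float, n: int, k: int, s: float) -> float:
    """Bandwidth with all cluster INs failed, favorite memories case.
    Requires k >= 2 and n * s <= 1."""
    p_g = 1.0 - (1.0 - psi) ** n                                  # eq. (44)
    p_fo = s * p_g                                                # eq. (47)
    p_nf = (1.0 - (1.0 - (1.0 - s * n) * p_g / (k - 1)) ** (k - 1)) / n  # eq. (48)
    p = union_rate(p_fo, p_nf)                                    # eq. (49)
    return p * n * k                                              # eq. (50)


if __name__ == "__main__":
    # All requests go to favorite memories (n * s = 1), two clusters.
    print(bw_all_failed_favorite(1.0, 4, 2, 0.25))
```

When n·s = 1 there is no nonfavorite traffic, so the bandwidth equals k, the number of global IN outputs.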


4.3 Favorite with private memories case 


Again, with the exception of requests to the private 
memories, all requests are delivered through the global IN. 


$$p_{ext} = (1 - m)\psi \qquad (51)$$ 
$$p_G = 1 - (1 - p_{ext})^n \qquad (52)$$ 
$$p_{fo} = s\,p_G \qquad (53)$$ 
$$p_{nf} = \left[1 - \left(1 - (1 - sn)\frac{p_G}{k-1}\right)^{k-1}\right]/n \qquad (54)$$ 

The total rate of request 

$$p = p_{fo} + p_{nf} - p_{fo}\,p_{nf} \qquad (55)$$ 
$$BW = pN \qquad (57)$$ 


The degraded performance for this case is given in Table IX. 


V Conclusion 


Performance degradation of cluster-based supersystems due 
to failure of cluster INs has been analytically determined. Two 
limiting situations, failure of only one cluster IN and failure of all 
the cluster INs, have been considered. The analysis has been carried 
out for three different cases: Equally likely, Favorite memories, 
and Favorite with Private memories. It has been concluded that 
the Favorite memories scheme is still superior to the other two 
schemes (Figs. 5, 6, 7), in spite of steep performance degradation, 
for the same values of the system parameters. 


Table VI. Degraded Performance of the Cluster-based 
Multiprocessor System with Private and Favorite 


Table VII. Degraded Performance of the Cluster-based 
Multiprocessor System Due to Failure of All Cluster INs 
(ψ = 1). 


BW 
k = 2   k = 4   k = 8 
1.50 5.25 
1.50 5.25 
1.50 5.25 
1.50 5.25 
1.50 5.25 
1.50 5.25 
1.50 5.25 


Table VIII. Degraded Performance of the Cluster-based 
Multiprocessor System with Favorite Memories Due 
to Failure of All Cluster INs (ψ = 1). 


BW 
k = 2   k = 4   k = 8 
1.99 3.94 
1.99 3.96 
1.99 3.97 
2.00 3.97 
2.00 3.97 
2.00 3.97 
2.00 3.98 


Table IX. Degraded Performance of the Cluster-based 
Multiprocessor System with Private and Favorite 
Memories Due to Failure of All Cluster INs (ψ = 1). 


Fig. 1. Crossbar-Based Multiprocessor System 


Fig. 2. Cluster-Based Multiprocessor System 


Fig. 3. Cluster-Based Multiprocessor System 
with a Failed Cluster IN 


Fig. 5. System Performance for the Equally Likely Case 


RESOURCE SHARING INTERCONNECTION NETWORKS 
IN MULTIPROCESSORS 


Jie-Yong Juang and Benjamin W. Wah’ 


ABSTRACT 

In this paper circuit-switched interconnection networks for 
resource sharing in multiprocessors, named resource sharing 
interconnection networks, are studied. Resource scheduling in 
systems with such an interconnection network entails the 
efficient search of a mapping from requesting processors to free 
resources such that circuit blockages in the network are minim- 
ized and resources are maximally used. The optimal mapping is 
obtained by transforming the scheduling problems into various 
network-flow problems for which existing algorithms can be 
applied. A distributed architecture to realize a maximum-flow 
algorithm using token propagations is also described. The pro- 
posed method is applicable to any general network configuration 
modeled as a digraph in which the requesting processors and free 
resources can be partitioned into two disjoint subsets. 


1. INTRODUCTION 

In this paper we investigate the problem of sharing 
computing resources in multiprocessors and the distributed 
scheduling of shared resources in a circuit-switched 
interconnection network. 

A resource is a processing element to carry out a designated 
function. Examples include a general purpose processor, a spe- 
cial functional unit, a VLSI systolic array, an input/output dev- 
ice, and a communication channel. A resource is accessible by 
any processor via an interconnection network. A request gen- 
erated by a processor can be directed to any one of a pool of free 
resources that are capable of executing the designated task. An 
interconnection network is an essential element of these systems 
as it interconnects processors and resources. Its function is to 
route requests initiated from one point to another point 
connected on the network. The network topology is dynamic, 
and the links can be reconfigured by setting the network's active 
switching elements. The notable characteristic of these networks 
is that they operate with address mapping. That is, a request is 
initiated with a specific destination or a set of destinations, and 
routing is done by examining the address bits. Routing of 
requests is usually done in parallel. As classified by Feng [8], 
these networks include the single or multistage networks and the 
crossbar switch. Examples are the banyan, indirect binary n- 
cube, cube, perfect shuffle, flip, Omega, data manipulator, aug- 
mented data manipulator, delta, baseline, Benes, and Clos. 
Examples of systems designed with interconnection networks are 
Trac, Staran, C.mmp, Illiac IV, Pluribus, Numerical Aero- 
dynamic Simulation Facility (NASF), the Ballistic Missile 
Defense testbed, MPP, and Connection Machine. The perfor- 
mance of resource sharing systems under address mapping has 
been studied by Rathi, Tripathi and Lipovski [21], Fung and 
Torng [10], and Marsan, et al [17]. 

Wah proposed a network with distributed scheduling 
intelligence, called resource sharing interconnection network 
(RSIN) [23,22]. Instead of using an address-mapping scheme, 
which requires a centralized scheduler to seek and give the 
address of a free resource to a request before it enters the net- 
work, the request is sent into the network without any destina- 
tion tags. It is the responsibility of the network to route the 
maximum number of requests to the free resources. In this way 
the scheduling intelligence is distributed in the network. Distri- 
buted resource scheduling avoids the bottleneck of a centralized 
scheduler [21]. The objective of a good scheduling scheme is to 
avoid network blockages and to maximize resource utilization, 
which requires an efficient algorithm at each switching node to 
collect the minimum amount of status information. 

The PUMPS architecture (see Figure 1) for image analysis 
and pictorial database management [3] is a typical example of 
resource sharing multiprocessors in which VLSI systolic arrays, 
each realizing an image processing function, are organized into a 
pool of resources. Most dataflow architectures can also be con- 
sidered as resource sharing systems. The processing units are the 
pool of resources, and a RSIN connects them to memory cells. In 
a resource sharing system with load balancing, processors are 
considered as resources, thus requests generated are queued at 
the processors as well as the resources. An imbalance of work- 
load at the processors will be balanced by a load balancing 
scheme. 

The design of a RSIN with optimal resource scheduling is 
studied in this paper. The results are derived with respect to 
multistage interconnection networks, called multistage resource 
sharing interconnection networks (MRSIN), and are applicable to 
any general network configuration modeled as a digraph in 
which the requesting processors and free resources are 
partitioned into two disjoint subsets. Central to the design of such an 
interconnection network is the development of an efficient 
distributed algorithm to disseminate status information through the 
complex interconnection structure. The algorithm to be 
presented is simple, efficient, and independent of the 
interconnection topology. 


Figure 1. PUMPS: An example of a multiprocessor with shared 
systolic arrays. (Diagram labels: interprocessor communication 
network, host processors, resource sharing interconnection network, 
VLSI systolic arrays.) 


2. RESOURCE SHARING INTERCONNECTION NETWORKS 

The assumptions of the RSIN used in this study are 
summarized as follows. 

(a) Circuit switching is assumed rather than packet switch- 
ing for the following reasons. First, packet switching is used in 
conventional networks with address mapping because it allows a 
network path to be shared by more than one request con- 
currently. Reducing the packet delay by balancing loads among 
alternate paths is less critical in a RSIN because a request can 
always search for another available resource if a path is blocked. 
Moreover, the overhead of rerouting a packet is higher than that 
of rerouting a resource request [16]. Second, owing to the 
resource characteristics, a task cannot be processed until it is 
completely received. The extra delay in breaking a task into 
multiple packets may decrease the utilization of resources, and 
hence increase the response time of the system. 

(b) One or more types of resources may exist in the system. 
A RSIN connecting only one type of resources is called a homo- 
geneous RSIN, while a RSIN connecting multiple types of 
resources is a heterogeneous one. 

(c) A priority level may be associated with a request to 
show the urgency of the request. A preference value may be 
associated with a resource to show the desirability of being used 
for service. The costs of allocation are inversely related to the 
priorities and preferences. 

(d) Each request needs one resource only. 

(e) A processor can transmit one task at a time to the 
resources. Other tasks arriving during the task transmission 
time are queued. The circuit between a processor and a resource 
can be released once the request has been transmitted. The pro- 
cessor can continue to make other requests, while the resource 
will be busy until the task is completed. 

We have not investigated the problem on the selection of 
the number of resources in each type and their placements in the 
output ports. This problem has been studied by Briggs et al., 
who have considered the problem of choosing the number of 
resources in each type in which one resource is connected to each 
output port and one resource is requested each time [2]. We 
have not considered the case in which more than one resource or 
multiple types of resources are requested by one request. Here, 
the scheduling algorithm is dependent on the number of 
resources in each type, the way that resources are distributed to 
the output ports, and the network characteristics. Further, 
deadlocks may occur, and distributed resolution of deadlock 
may have a high overhead. 

The goal of the scheduling algorithm is to find a request- 
resource mapping such that the total cost is minimized. In the 
special case in which all requests are of equal priorities and all 
resources have equal preferences, the scheduling problem 
becomes the mapping of the maximum number of requests to the 
free resources. 

The maximal request-resource mapping may be hampered 
by blockages in the system. In a conventional address-mapped 
interconnection network, blockages may be caused by conflicts in 
either the same resource being requested by more than one 
request or a network link requested by two circuits. In a RSIN, 
a resource conflict can be resolved by rerouting all but one 
request to other free resources. However, this may not always 
lead to better resource utilization because the allocation of one 
request to a resource may block one or more other requests from 
accessing free resources. A scheduling algorithm that schedules 
requests according to the state of the network and resources is, 
therefore, essential. As an example, consider an 8-by-8 Omega 
network** in Figure 2a with switch-boxes that can be individually 
set to either a straight or an exchange connection. Processors 
Pi, P3- Ps» P7 and pg are requesting one resource each, and 
resources fr}, 13, fs, '7 and rg are available. The circuits between 
P2 and rg and pg and rq have been established previously. pg is 
not making request, and r2 is busy. All free resources will be 
allocated if one of the following request-resource mappings is 
used: {(p.. r3), (p3. rs), (ps, Yq), (p7. T,), (pg. rs)} or {(p1. r3), 
(p3. rg), (ps. r7), (py. Tr). (ps. rs5)}. But if the mapping {(p,. r), 
(p3. rs), (ps. 13). (pz. r7). (pg, rg)} is used, a maximum of four out 
of five resources can be allocated, since the path from pg to rg is 
blocked. Simulation results showed that the average blocking 
probability can be as low as 2% for a MRSIN embedded in an 8- 
by-8 cube network [22,12]. If a heuristic routing algorithm is 
used, the average blocking probability increases to 20%. Further 
degradation occurs if the network is not completely free. 


3. OPTIMAL RESOURCE SCHEDULING IN MRSIN 

To bind a request to a resource in the system, a RSIN deter- 
mines a mapping from pending requests to free resources, and 
provides connections to as many request-resource pairs as possi- 
ble. In this section, methods to optimize request-resource map- 
pings are discussed. Exhaustive methods that examine all possi- 
ble ordered mappings have exponential complexity. In a homo- 
geneous MRSIN, suppose x processors are making requests, y 


Resources 


Processors Switching Boxes 


stage 


Figure 2a. A MRSIN embedded in an 8-by-8 Omega network 
(dark paths in the network show circuits that are 
already occupied; processors p,. p3, Ps, P7 and pg are 
making requests; resources 1, 13, T5, f7 and rg are 
available). 


Figure 2b. The flow network transformed from the MRSIN in 
Figure 2a using Transformation 1 (the number associ- 
ated with each arc is the amount of flow assigned to it 
by the maximum-flow algorithm; all arcs have unit 
capacity). 


resources are available, and the network is completely free. The 
scheduler has to try a maximum of $C^{x}_{y}\,y!$ (for $x \ge y$) or 
$C^{y}_{x}\,x!$ (for $y \ge x$) mappings to find the best one, where 
$C^{j}_{i}$ is the number of combinations of choosing i objects out of 
j objects [22, 12]. Suboptimal heuristics can be used but are only 
practical when x and y are small. 
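
To give a feel for how quickly this count grows, the short Python sketch below evaluates the bound. The function name is illustrative; the formula is the one quoted above, read as choosing the smaller set out of the larger and then permuting it.

```python
from math import comb, factorial


def max_mappings(x: int, y: int) -> int:
    """Upper bound on request-resource mappings an exhaustive scheduler
    must try when x processors request and y resources are free."""
    small, large = sorted((x, y))
    return comb(large, small) * factorial(small)


if __name__ == "__main__":
    for size in (2, 4, 8, 16):
        print(size, max_mappings(size, size))
```

With five requests and five free resources the bound is 5! = 120, while for 16 of each it already exceeds 2·10^13.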

In this section we transform the optimal request-resource 
mapping problem into various network-flow problems for which 
many efficient algorithms exist [11, 13]. The basic concepts of 
flow networks are briefly reviewed first. 


3.1. Flow Networks 

A flow network is usually represented by a digraph in 
which each arc is associated with a capacity and possibly a cost. 
Let D = (V, E) be a digraph with two distinct nodes s (source) 
and t (sink). A capacity function, c(e), is defined on every arc of 
the graph, where c(e) is a non-negative real number for all e€E. 
A flow function f assigns a real number f(e) to arc e such that 
the following conditions hold. 
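(1) Capacity limitation: for every arc $e \in E$, 

$$0 \le f(e) \le c(e) \qquad (1)$$ 

(2) Flow conservation: Let $\alpha(v)$ (resp. $\beta(v)$) be the set of 
incoming (resp. outgoing) arcs of vertex v. For every $v \in V$, 

$$\sum_{e \in \alpha(v)} f(e) - \sum_{e \in \beta(v)} f(e) = \begin{cases} -F & v = s \\ F & v = t \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$ 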


The capacity constraint restricts the amount of flow that can be 
assigned to a link, and flow conservation implies that an inter- 
mediate node in the network does not absorb or create flows. 
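
The two conditions can be checked mechanically. The following Python sketch is a minimal illustration, assuming integer flows, at most one arc per ordered node pair, and a dictionary representation chosen here for convenience; none of these names come from the paper.

```python
def is_legal_flow(capacity: dict, flow: dict, s, t) -> bool:
    """Check capacity limitation and flow conservation.

    capacity, flow: dicts mapping an arc (u, v) to a number.
    """
    # Capacity limitation: 0 <= f(e) <= c(e) on every arc.
    for arc, cap in capacity.items():
        if not 0 <= flow.get(arc, 0) <= cap:
            return False

    # net[v] = (flow on incoming arcs) - (flow on outgoing arcs).
    net = {}
    for (u, v) in capacity:
        f = flow.get((u, v), 0)
        net[v] = net.get(v, 0) + f
        net[u] = net.get(u, 0) - f

    # Flow conservation: -F at the source, +F at the sink, 0 elsewhere.
    total = net.get(t, 0)
    if net.get(s, 0) != -total:
        return False
    return all(val == 0 for node, val in net.items() if node not in (s, t))


if __name__ == "__main__":
    cap = {("s", "a"): 1, ("a", "t"): 1}
    print(is_legal_flow(cap, {("s", "a"): 1, ("a", "t"): 1}, "s", "t"))
    print(is_legal_flow(cap, {("s", "a"): 1, ("a", "t"): 0}, "s", "t"))
```

The second call fails because node a would absorb one unit of flow.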

A legal flow is a flow assignment that satisfies the capacity 
and flow-conservation constraints. In a network flow problem, it 
is necessary to find the legal flow that optimizes a given objective 
function. For example, in the maximum-flow problem, it is 
necessary to find the maximum amount of flow that can be 
advanced from source to sink in a flow network G(V,E,s, t, c) 
under the capacity and flow-conservation constraints. The prob- 
lem can be formulated as a linear program. 


Maximum-Flow Problem 
Maximize F 
subject to: 
(1) Capacity limitation (Eq. (1)); 
(2) Flow conservation (Eq. (2)). 


Many other examples of network flow problems, including the 
minimum-cost flow and the trans-shipment problems, can be 
found in the literature [11]. 

An s-t path is a directed path from s to t. Insertion of a 
dummy node in a path will increase the path length but will not 
affect the flow assignment. Increasing the length of s-t paths in 
this way such that all s-t paths are of equal length is called s-t 
path equalization in this paper. All s-t paths are assumed to be 
equalized when the network is loop-free. 


3.2. Optimal Resource Mapping in Homogeneous MRSIN 

A switch-box in a MRSIN is a crossbar switch without 
broadcast connections. The following theorem shows that the 
setting of a non-broadcasting switch is equivalent to a legal 
integral flow assignment in a flow network of unit capacity. 
Note that an integral flow is a flow assignment in which the 
amount of flow assigned to each link is of integral value. 

Theorem 1***: For any MRSIN, there exists a flow network for 
which a legal integral flow is equivalent to a valid 
request-resource mapping. 


** The input ports are numbered in a way different from Lawrie’s Omega net- 
work [15] because all resources are homogeneous, and the permutation of re- 
questing processors will not affect the resource utilization. Broadcast connection 
is not needed in the switch-boxes since each request needs one resource. 

To use existing algorithms to solve a flow problem, a 
MRSIN has to be transformed into a flow network such that the 
optimization of request-resource mappings is equivalent to the 
optimization of the corresponding objective function in the flow 
network. To this end, additional nodes may be introduced, the 
capacity of a link may be greater than one, and a cost may be 
associated with a link. 

The following transformation produces a flow network 
such that the optimal request-resource mapping can be derived 
from its maximum flow. 


Transformation 1: Generate a flow network G(V, E, s, t, c) 
from a homogeneous MRSIN. 

(T1) Create three node-sets P, X and R for processors, switch- 
boxes and resources, respectively. Introduce two additional 
nodes: source s and sink t. Let 

$$V' = \{s, t\} \cup P \cup X \cup R$$ 

(T2) Add an arc from the source to every node associated with a 
processor. Denote this set of arcs by S. 

$$S = \{(s, v) \mid v \in P\}$$ 

Add an arc between every node associated with a resource 
and the sink. This set of arcs is called T. 

$$T = \{(v, t) \mid v \in R\}$$ 

For each link in the MRSIN that connects two switch- 
boxes, or a processor to a switch-box, or a switch-box to a 
resource, add an arc between the corresponding nodes in the 
flow graph. Denote this set of arcs by B. 

$$B = \{(v, w) \mid v \in P \cup X,\ w \in X \cup R\}$$ 

Define $E' = S \cup T \cup B$. 

(T3) Assign link capacities according to the following function. 

For an arc e in B: 
$$c(e) = \begin{cases} 0 & \text{associated link is occupied or non-existent in the MRSIN} \\ 1 & \text{associated link is free} \end{cases}$$ 

For an arc e in S: 
$$c(e) = \begin{cases} 0 & \text{associated processor does not generate request} \\ 1 & \text{associated processor generates request} \end{cases}$$ 

For an arc e in T: 
$$c(e) = \begin{cases} 0 & \text{associated resource is unavailable} \\ 1 & \text{associated resource is available} \end{cases}$$ 

(T4) Obtain arc-set E by removing those arcs with zero capacity. 

$$E = E' - \{e \mid e \in E',\ c(e) = 0\}$$ 

Obtain node-set V by deleting those nodes that are not 
reachable from s. 

$$V = V' - \{v \mid \alpha(v) \cup \beta(v) = \emptyset,\ v \in V'\}$$ 


Applying the above transformation to the MRSIN in Figure 
2a results in the flow network in Figure 2b. The following 
theorem shows that Transformation 1 can be used to find the 
optimal request-resource mapping. 
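
The steps of Transformation 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `transformation_1`, its parameters, and the tiny one-switch-box example are all assumptions made for this sketch. Node removal follows the set expression in (T4), that is, a node is dropped when no surviving arc touches it.

```python
def transformation_1(processors, switch_boxes, resources, links,
                     requesting, available, occupied=frozenset()):
    """Return (V, E) of the unit-capacity flow network (hypothetical helper).

    links      -- MRSIN links as (v, w) pairs, processor side first
    requesting -- processors that generate a request
    available  -- resources that are free
    occupied   -- links already used by an established circuit
    """
    s, t = "s", "t"
    capacity = {}
    # (T2)/(T3): arc-set S, capacity 1 only for requesting processors
    for p in processors:
        capacity[(s, p)] = 1 if p in requesting else 0
    # (T2)/(T3): arc-set T, capacity 1 only for available resources
    for r in resources:
        capacity[(r, t)] = 1 if r in available else 0
    # (T2)/(T3): arc-set B, capacity 1 only for free links
    for link in links:
        capacity[link] = 0 if link in occupied else 1
    # (T4): drop zero-capacity arcs, then nodes with no incident arc
    arcs = {e for e, c in capacity.items() if c > 0}
    candidates = {s, t} | set(processors) | set(switch_boxes) | set(resources)
    nodes = {v for v in candidates if any(v in e for e in arcs)}
    return nodes, arcs


# Hypothetical 2-by-2 MRSIN with a single switch-box x1.
nodes, arcs = transformation_1(
    processors=["p1", "p2"], switch_boxes=["x1"], resources=["r1", "r2"],
    links=[("p1", "x1"), ("p2", "x1"), ("x1", "r1"), ("x1", "r2")],
    requesting={"p1"}, available={"r2"}, occupied={("x1", "r1")},
)
```

In this example only `p1` keeps its source arc and only `r2` keeps its sink arc, so any unit of flow from s to t corresponds to allocating `r2` to `p1`.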


Theorem 2: In a homogeneous MRSIN, the number of resources 
allocated by a mapping is equal to the amount of flow that can 
be advanced from the source to the sink in the flow network 
obtained by Transformation 1. 


From Theorem 2 and a known result that the maximum 
flow of a network with integral capacity is integral [11], we 
conclude that the optimal mapping can be derived from the max- 
imum flow in the transformed flow network. 

** All theorems are stated without proof due to space limitation. 

Many algorithms have been developed to obtain the maximum 
flow in a flow network. The algorithm by Ford and Fulkerson 
[9] is a primal-dual algorithm in which the flow value is 
increased by iteratively searching for flow augmenting paths 
until the minimum cut-set of the network is saturated. At this 
point, no more flow can be advanced since the minimum cut-set 
is the bottleneck. A flow augmenting path is an s-t path through 
which additional flow can be advanced from the source to the 
sink. When arc e on the s-t path points in the same direction as 
the s-t path, additional flow may be advanced through e if the 
current flow assigned to e is less than c(e). In contrast, if arc e 
points in the opposite direction, then additional flow may be 
pushed through the s-t path by canceling its current flow. 
Advancing flow through an augmenting path in this way will always 
increase the total amount of flow, and the flow-conservation and 
capacity limitations will not be violated. For example, in Figure 
3, an original flow f is assigned along the path s-a-d-t. Then 
path s-c-d-a-b-t is a flow augmenting path. Advancing one unit 
of flow through this augmenting path results in a new flow 
assignment f'. Two units of flow are pushed through two 
separate paths s-a-b-t and s-c-d-t according to this assignment. 

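
The Figure 3 example can be replayed with a short Python sketch. The arc set below is inferred from the paths named in the text (s-a-d-t, s-c-d-a-b-t, s-a-b-t, s-c-d-t) and is an assumption, since the figure itself is not reproduced here; the helper names are hypothetical. The search is a plain breadth-first search over the residual graph, which is one common way to find an augmenting path and is not necessarily the search order used in [9].

```python
from collections import deque


def find_augmenting_path(cap, flow, s, t):
    """Return a list of (a, b, sign) steps from s to t, or None.

    sign = +1: arc (a, b) is traversed forward (flow below capacity).
    sign = -1: arc (a, b) is traversed backward (its flow is cancelled).
    """
    parent = {s: None}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            break
        for (a, b), c in cap.items():
            if a == v and b not in parent and flow[(a, b)] < c:
                parent[b] = (a, b, +1)
                queue.append(b)
            elif b == v and a not in parent and flow[(a, b)] > 0:
                parent[a] = (a, b, -1)
                queue.append(a)
    if t not in parent:
        return None
    steps, v = [], t
    while parent[v] is not None:
        a, b, sign = parent[v]
        steps.append((a, b, sign))
        v = a if sign == +1 else b   # step back to the predecessor node
    return steps[::-1]


def augment(flow, steps):
    """Advance one unit of flow along the given steps."""
    for a, b, sign in steps:
        flow[(a, b)] += sign


arcs = [("s", "a"), ("s", "c"), ("a", "b"), ("a", "d"),
        ("c", "d"), ("b", "t"), ("d", "t")]
cap = {e: 1 for e in arcs}
flow = {e: 0 for e in arcs}
for e in [("s", "a"), ("a", "d"), ("d", "t")]:   # original flow f on s-a-d-t
    flow[e] = 1

steps = find_augmenting_path(cap, flow, "s", "t")
path = ["s"] + [b if sign == +1 else a for a, b, sign in steps]
augment(flow, steps)
total = flow[("s", "a")] + flow[("s", "c")]
```

Here `path` is s-c-d-a-b-t, the flow on arc (a, d) is cancelled, and the resulting assignment carries two units along s-a-b-t and s-c-d-t, as described above.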
In the MRSIN, advancing flow through an augmenting path 
is equivalent to a resource re-allocation, i.e., a permutation of the 
possible request-resource mappings. Consider the MRSIN in Figure 4, 
which is the counterpart of the flow network in Figure 
3.*** The original flow f is equivalent to the request-resource 
mapping {(p,.rqg). (p..rpJ}. The allocation of resource r, to 
request p, is blocked according to this mapping. The existence of 
the flow augmenting path s-c-d-a-b-t shows that this blockage 
can be removed. Advancing flow through this augmenting path 
results in a new mapping {(p,.r,). (p,.rg)} and the allocation of 
both resources. As another example, applying the maximum-flow 
algorithm to the flow network in Figure 2b, the flow 
assignment as shown in the figure is obtained, and the request- 
resource mapping {(p,.r3), (p3.r5). (ps. rg). (pz. 11). (pg. r7)} is 
derived. Note that the optimal mapping may not be unique, but 
suboptimal mappings are eliminated. 

Figure 3. An illustration of advancing flow through a flow 
augmenting path. (All arcs have unit capacity. The initial 
flow is assigned to path s-a-d-t. The flow augmenting 
path s-c-d-a-b-t is indicated as dashed lines. The final 
flow assignment is obtained after advancing a unit of 
flow through the flow augmenting path and is indicated 
as dotted lines.) 

Figure 4. Resource re-allocation corresponding to flow 
augmentation in Figure 3. (Initially, one resource is allocated 
(dark lines). The flow augmenting path is indicated as 
dotted lines. Two resources are allocated after 
re-allocation (dotted lines).) 

*** The switch-boxes are combined with the processors or resources in the flow 
network in Figure 3, but this will not affect the discussion. 

Finding a flow augmenting path from the source to the sink 
in a flow network is the central idea in most maximum-flow 
algorithms. The improvement lies in the efficient search of the 
flow augmenting paths [5, 4]. For example, in Dinic’s algorithm, 
the shortest augmenting path is always advanced first with the 
aid of an auxiliary layered network, and hence the computa- 
tional complexity is bounded by O(1E1!*) for general networks. 
In our case, the links have unit capacity, and the time 
complexity is reduced to $O(|V|^{2/3}|E|)$ [11]. 


3.3. Homogeneous MRSIN with Request Priority and Resource Preference 

In a homogeneous MRSIN with request priority and 
resource preference, each request is associated with a priority 
level, and each resource is assigned a preference value. Many 
application-dependent attributes, such as workload, execution 
speed, utilization and capability, can be encoded into request 
priorities and resource preferences. The objective of resource 
scheduling here is to maximize the number of resources allo- 
cated, while allowing requests of higher priority to be allocated 
and resources of higher preference to be chosen. However, it is 
not necessary for requests and resources to be allocated in order 
of their priorities and preferences. The allocation of a resource 
to a request may be blocked by requests of higher priority, and 
the resource may be allocated to a request of lower priority. A 
similar argument applies to resources. 

With respect to a flow network, the request priority can be 
considered as the cost of carrying a flow through the path associ- 
ated with this request. The resource preference can be con- 
sidered similarly. As a result, the request-resource mapping 
problem in this class of MRSIN can be transformed into finding a 
flow assignment to minimize the total cost of flows in the net- 
work. 

Consider a flow network G(V, E, s, t,c, w) in which w(e), 
the cost per unit flow, is associated with arc $e \in E$. In the 
minimum-cost flow problem, a legal s-t flow assignment is 
sought that allows a given amount of flow F to be circulated 
from source to sink with the minimum cost. The objective is to 
determine the set of least expensive s-t paths through which the 
fixed flow F can be advanced. The constraints in this problem 
are the same as those in the maximum-flow problem. The prob- 
lem may be defined in a linear programming formulation. 


Minimum-Cost Flow Problem 

$$\text{Minimize } \sum_{e \in E} w(e)\, f(e)$$

subject to: 


(1) Flow conservation (Eq. (1)); 
(2) Capacity limitation (Eq. (2)). 


In allocating resources, the objective is to find a correspond- 
ing flow network whose optimal flow leads to an optimal 
request-resource mapping. The main idea behind the transfor- 
mation is to embed priority and preference information into the 
objective function by proper cost assignments on links. The 
amount of flow to be circulated can be considered as the number 
of requests pending for allocation. However, this amount may 
exceed the capacity of the flow network or the number of avail- 
able resources, and additional paths have to be introduced to 
prevent overflow. A possible transformation is given as follows. 


Transformation 2: Generate a flow network G(V, E, s, t, c, w) 
from a homogeneous MRSIN with request priorities and resource 
preferences. 

(T1) Create node-sets P, X and R for processors, switch-boxes 
and resources, respectively, and introduce special nodes: 
source, s, sink, t, and a bypass node, u. Let 

$$V' = \{s, t, u\} \cup P \cup X \cup R$$

(T2) Create arc-sets S, T and B as in Step (T2) of Transformation 1. 
Add an arc from the node associated with a processor 
to the bypass node, and connect the bypass node to the 
sink. This set of arcs is denoted as L. 

$$L = \{(v, u) \mid v \in P\} \cup \{(u, t)\}$$

Define $E' = S \cup T \cup B \cup L$. 

(T3) Define capacity function c as in Step (T3) of Transformation 1. 
In addition, for $e \in L$ define 

$$c(e) = \begin{cases} 1 & e \neq (u, t) \\ |\alpha(u)| & e = (u, t) \end{cases}$$

(T4) Define cost function w that represents the cost of advancing 
one unit of flow through a link as follows. 

$$w(e) = \begin{cases} 0 & \text{for } e \in B \\ \max(y_{max} + 1,\ q_{max} + 1) & \text{for } e \in L \\ y_{max} - y_p & \text{for } e \in S,\ p \in P \\ q_{max} - q_w & \text{for } e \in T,\ w \in R \end{cases}$$

where $y_{max}$ is the highest priority level, $y_p$ is the priority of 
request from processor p, $q_{max}$ is the highest preference 
level, and $q_w$ is the preference of resource w. Note that 
any cost function that is inversely related to priorities and 
preferences can be used. 

(T5) Create arc-set E and node-set V as in Step (T4) of 
Transformation 1. 

(T6) Set the total flow F to the number of requests. 

Figure 5a. A MRSIN with request priority and resource preference 
(highest priority is 10; highest preference is 10; 
dark paths in the network are already occupied; processors 
$p_3$, $p_5$ and $p_8$ are making requests; resources $r_1$, 
$r_3$, $r_5$, $r_7$ and $r_8$ are available). 

Figure 5b. The flow network transformed from the MRSIN in 
Figure 5a using Transformation 2 (non-zero flows 
assigned by the out-of-kilter algorithm are shown as 
dashed lines in the figure; return flow from t to s is 
added by the out-of-kilter algorithm; all arcs have 
unit capacity; cost of arc is zero except where indicated). 


As an example, in the MRSIN in Figure 5a, each request is 
attributed a priority level, and an available resource is given a 
preference value. The preference and priority levels range from 
1 to 10. A minimum-cost flow network obtained from Transfor- 
mation 2 is shown in Figure 5b. The following theorem shows 
the correctness of Transformation 2. 
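
The cost assignment in Step (T4) can be illustrated with a short Python sketch. The function name `arc_cost` and the sample priority and preference values are assumptions for illustration only; the arithmetic follows the cost function as stated in Transformation 2 with priorities and preferences ranging from 1 to 10.

```python
def arc_cost(kind, y_max, q_max, priority=None, preference=None):
    """Cost per unit flow for an arc of the given kind (hypothetical helper).

    kind -- "B" (network link), "L" (bypass arc),
            "S" (source-to-processor), "T" (resource-to-sink)
    """
    if kind == "B":
        return 0
    if kind == "L":
        return max(y_max + 1, q_max + 1)
    if kind == "S":
        return y_max - priority
    if kind == "T":
        return q_max - preference
    raise ValueError("unknown arc kind: %r" % kind)


Y_MAX = Q_MAX = 10

# A request of priority 3 routed to a resource of preference 8:
# source arc + network links (zero cost) + sink arc.
allocated = (arc_cost("S", Y_MAX, Q_MAX, priority=3)
             + arc_cost("B", Y_MAX, Q_MAX)
             + arc_cost("T", Y_MAX, Q_MAX, preference=8))

# The same request diverted through the bypass node u.
bypassed = (arc_cost("S", Y_MAX, Q_MAX, priority=3)
            + arc_cost("L", Y_MAX, Q_MAX))
```

Because a bypass arc costs more than any resource-to-sink arc can, a minimum-cost flow uses the bypass node only for requests that cannot reach a free resource.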


Theorem 3: The optimal request-resource mapping on a homo- 
geneous MRSIN with request priority and resource preference 
can be derived from the minimum-cost integral flow of the flow 
network, which is obtained by Transformation 2. 


Edmonds and Karp have developed a scaled out-of-kilter 
algorithm to obtain the minimum-cost flow of a general flow 
network in polynomial time [5]. For a flow network of 0-1 
capacity, the time complexity is bounded by O(| Vi*lE!/?). 
Furthermore, in the minimum-cost flow assignment obtained, the 
flow assigned to a link is integral if the links have integral 
capacities. As an example, applying the minimum-cost flow 
algorithm on the flow network in Figure 5b results in the request- 
resource mapping $\{(p_3, r_5), (p_5, r_1), (p_8, r_7)\}$. The selected paths 
are shown as dashed lines in Figure 5b. Note that the 
minimum-cost flow obtained may not be unique, and alternate 
minimum-cost flows are possible. However, alternative map- 
pings will not improve the cost of allocation. 


3.4. Optimal Resource Scheduling in Heterogeneous MRSIN 

A heterogeneous MRSIN consists of multiple types of 
resources, and a processor may generate a request of a given type 
of resource. Such a MRSIN is equivalent to a flow network car- 
rying different types of commodities. A multicommodity flow 
network has multiple source-sink pairs, each of which is associ- 
ated with one type of commodity. A flow coming out of a 
source of a given commodity can only be absorbed by the sink of 
the same type of commodity. Flows of different commodities 
may share a link as long as the total flow does not exceed the 
capacity of the link. 

For a flow network with k types of commodities, there are 
k source-sink pairs, $(s^i, t^i)$, for i = 1 to k. Let $F^i$ be the total flow 
of the ith commodity and $f^i(e)$ be the flow of the ith commodity 
on edge e. The search for the maximum flow can be formulated 
as follows [1]. 

Multicommodity Maximum-Flow Problem 

$$\text{Maximize } \sum_{i=1}^{k} F^i$$

subject to: 

(1) Flow conservation: For i = 1, ..., k 

$$\sum_{e \in \alpha(v)} f^i(e) - \sum_{e \in \beta(v)} f^i(e) = \begin{cases} -F^i & v = s^i \\ F^i & v = t^i \\ 0 & \text{otherwise} \end{cases}$$

(2) Capacity limitation: 

$$0 \le \sum_{i=1}^{k} f^i(e) \le c(e) \quad \text{for all } e \in E$$

A multicommodity flow network may be visualized as the 
superposition of k single-commodity flow networks. Each layer 
in the superposition represents a single-commodity flow. To 
obtain the optimal request-resource mapping in a heterogeneous 
MRSIN without priority and preference, a transformation similar 
to Transformation 1 can be applied to obtain a layer for each 
type of resource, and the layers are superposed to form a 
multicommodity flow network. 

The optimal mapping for a heterogeneous MRSIN with 
request priorities and resource preferences can be obtained by 
transforming the problem into the multicommodity minimum-cost 
flow problem. Let $w^i(e)$ be the cost per unit flow for the ith 
commodity on edge e. The problem can be formulated as follows. 

Multicommodity Minimum-Cost Flow Problem 

$$\text{Minimize } \sum_{i=1}^{k} \sum_{e \in E} w^i(e)\, f^i(e)$$

subject to: 

(1) Flow conservation: For i = 1, ..., k 

$$\sum_{e \in \alpha(v)} f^i(e) - \sum_{e \in \beta(v)} f^i(e) = \begin{cases} -F^i & v = s^i \\ F^i & v = t^i \\ 0 & \text{otherwise} \end{cases}$$

(2) Capacity limitation: 

$$0 \le \sum_{i=1}^{k} f^i(e) \le c(e) \quad \text{for all } e \in E$$

The equivalent flow network consists of k source-sink pairs and 
k bypass nodes, where k is the number of types of requested 
resources. As for requests without priority, the flow network 
can be regarded as the superposition of k single-commodity flow 
networks, and Transformation 2 can be applied to each of them. 

The problem of finding the maximum integral flow in a 
multicommodity flow network of general topology has been 
shown to be NP-hard. Fortunately, circuit-switched loop-free 
interconnection networks have transformations that belong to a 
restricted class of multicommodity flow networks in which the 
optimal flow values are always integral [6]. For this class of 
flow networks, the integral multicommodity optimal flows can 
be obtained efficiently by the Simplex Method, which has been 
shown empirically to be a linear-time algorithm [18]. 


4. ARCHITECTURE OF MRSIN TO SUPPORT OPTIMAL 
SCHEDULING 

Two architectures to carry out the optimal resource 
scheduling algorithms have been studied. In the first approach, a 
dedicated monitor is responsible for resource scheduling (see Fig- 
ure 6). It maintains the status of the interconnection network 
and resources. The monitor enters a scheduling cycle when there 
are pending requests. Requests received during a scheduling 
cycle will not be processed until the next cycle. In a scheduling 
cycle, a flow network is generated, and the optimal request- 
resource mapping is derived. Then the monitor sends an 
acknowledgement to each requesting processor that has been 
allocated a resource, notifies resources that are allocated, and 
establishes paths in the network. The implementation is sequential, 
and the overhead is measured by the number of instructions 
executed in the algorithm. 

Figure 6. A monitor architecture to carry out optimal resource 
scheduling in a RSIN. 

A distributed architecture, on the other hand, distributes 
the scheduling intelligence in the switch-boxes of the 
interconnection network. Optimal scheduling is achieved through 
cooperation among processes in the switch-boxes. No 
transformation to a network-flow problem is necessary because the 
network-flow algorithm is carried out in a distributed fashion in 
the switch-boxes. The complexity of the process in each 
switch-box is central to the design of the distributed 
architecture. Our previous study shows that the maximum-flow 
algorithm for homogeneous MRSIN without priority and preference 
can be efficiently implemented in a distributed fashion [14]. For 
systems with heterogeneous resources or with priorities and 
preferences, there is no improvement over a monitor except for 
reasons such as fault tolerance and modularity. 

In the following sections, we describe a distributed realiza- 
tion of Dinic’s maximum-flow algorithm to obtain the optimal 
request-resource mapping. 


4.1. Dinic’s Maximum-Flow Algorithm 

Dinic’s algorithm is based on the flow augmentation method 
described in Section 3.2. It improves over Ford and Fulkerson’s 
algorithm by advancing flow through the shortest augmenting 
path, which can be found from a layered network derived from 
the original flow graph. A procedure summarizing Dinic’s algo- 
rithm is as follows. 


procedure dinic (V, E); 

/* The algorithm alternates between two phases until no more 
flow can be augmented. */ 

(1) Initialization: Flow network with an initial flow assignment; 

(2) Layered Network Construction Phase: 

(2a) Include s in the first layer; 

(2b) Construct the next layer; 

(2c) If layer is empty, then go to Step (4); 

(2d) If t is not in this layer, then go to Step (2b); 

(3) Maximal-Flow Search Phase: 

/* Determine an increment to the flow assignment by finding 
the maximal flow in the layered network. */ 

(3a) Search for s-t paths; 

(3b) If an s-t path does not exist, then go to Step (2a); 

(3c) Advance flow through this path; Go to Step (3a); 

(4) Stop: Maximum flow assignment has been obtained. 


In the layered network, nodes of the original flow network 
are organized into stages. The first stage consists of the source 
node(s) of the network, and the remaining stages are constructed 
iteratively. A stage consists of nodes that are not included in the 
previous stages and have either an unsaturated arc or an arc with 
non-zero flow originating from nodes in the previous stage. 
These two types of arcs, called useful links, are transformed to 
arcs in the layered network. Depending on the direction of the 
associated useful link, its capacity in the layered network can be 
either the remaining capacity or its current flow. As a result, 
nodes in a layered network are arranged into disjoint subsets, 
$V_0, \ldots, V_\ell$, such that no arc points from $V_j$ to $V_i$ ($i < j$). 

A legal flow in a layered network is said to be maximal if 
every (s,t)-directed path in the layered network is saturated. 
Note that a maximal flow in a layered network is not necessarily 
the maximum flow because flows in opposite directions in the 
same arc are not considered. Moreover, computing the maximal 
flow is easier than computing the maximum flow. In Dinic’s 
algorithm, the maximal flow is obtained by a depth-first search. 
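
The two alternating phases can be sketched in Python as follows. This is a compact textbook-style rendering under stated assumptions, not the procedure of the paper verbatim: the class name `Dinic` and its methods are hypothetical, residual arcs stand in for the "useful links", and the example network is the seven-arc unit-capacity network inferred from the Figure 3 discussion.

```python
from collections import deque


class Dinic:
    """Layered-network maximum flow on a directed graph (illustrative)."""

    def __init__(self):
        self.adj = {}      # node -> indices of arcs leaving it
        self.to = []       # head node of arc i
        self.cap = []      # residual capacity of arc i; arc i ^ 1 is its reverse

    def add_arc(self, v, w, c=1):
        for a, b, k in ((v, w, c), (w, v, 0)):
            self.adj.setdefault(a, []).append(len(self.to))
            self.to.append(b)
            self.cap.append(k)

    def _layers(self, s, t):
        """Phase (2): breadth-first construction of the layered network."""
        level = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for i in self.adj.get(v, []):
                w = self.to[i]
                if self.cap[i] > 0 and w not in level:
                    level[w] = level[v] + 1
                    queue.append(w)
        return level if t in level else None

    def _push(self, v, t, limit, level, it):
        """Phase (3): depth-first search for one s-t path in the layers."""
        if v == t:
            return limit
        edges = self.adj.get(v, [])
        while it[v] < len(edges):
            i = edges[it[v]]
            w = self.to[i]
            if self.cap[i] > 0 and level.get(w) == level[v] + 1:
                pushed = self._push(w, t, min(limit, self.cap[i]), level, it)
                if pushed:
                    self.cap[i] -= pushed
                    self.cap[i ^ 1] += pushed
                    return pushed
            it[v] += 1          # this arc is exhausted in the current layering
        return 0

    def max_flow(self, s, t):
        total = 0
        while True:
            level = self._layers(s, t)
            if level is None:   # t unreachable: Step (4)
                return total
            it = {v: 0 for v in self.adj}
            while True:
                pushed = self._push(s, t, float("inf"), level, it)
                if not pushed:  # layered network saturated: rebuild layers
                    break
                total += pushed


network = Dinic()
for v, w in [("s", "a"), ("s", "c"), ("a", "b"), ("a", "d"),
             ("c", "d"), ("b", "t"), ("d", "t")]:
    network.add_arc(v, w)
value = network.max_flow("s", "t")
```

Starting from zero flow, a single layered network suffices for this example and the search returns a flow of two units.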

Since the amount of flow that can be advanced through an 
arc in the layered network is the net increase of flow to the 
associated arc in the original network, the maximal flow obtained in 
the layered network is a net increment to the existing flow. 
Further, the maximum flow of a flow network is finite. Hence, 
the maximum flow can be obtained in a finite number of 
iterations in constructing the layered network. 

An example illustrating the construction of a layered 
network is shown in Figure 7. Figure 7a is a flow network 
associated with a MRSIN in which three processors, $p_1$, $p_2$ and $p_4$, are 
making requests and three resources, $r_1$, $r_3$ and $r_4$, are available. 
The flow assignment shown by darkened arcs in Figure 7a results 
in a mapping such that $p_1$ is mapped to $r_4$ and $p_4$ is mapped to $r_1$. 
The request generated by $p_2$ is blocked. Figure 7b is a layered 
network constructed from the flow network in Figure 7a. The 
layered network shows that there is a flow augmenting path 
from $p_2$ to $r_3$. This path includes the arc leading from node 6 to 
node 5, which is associated with the arc leading from node 5 to 
node 6 in the original network (see Figure 7a). This flow 
augmenting path shows that all three resources can be allocated if $p_4$ 
is re-allocated to $r_3$ and $p_2$ is re-allocated to $r_1$. 


4.2. A Distributed Architecture for Homogeneous MRSIN without priority 

A distributed MRSIN embedded in an 8-by-8 Omega 
network is shown in Figure 8. The scheduling intelligence is 
distributed in the switch-boxes (NS). A processor is connected to the 
network through a request server (RQ), and a resource is 
monitored by a resource server (RS). A common status bus connects 
these components together. An autonomous process in each 
component communicates with other processes by passing tokens via 
direct links, and the processes are synchronized by exchanging 
status via the status bus. 

Figure 8. A distributed MRSIN embedded in an 8-by-8 Omega 
network. 

A scheduling cycle begins when there are pending requests 
and ready resources. A request generated in the middle of a 
scheduling cycle has to wait until the next cycle. The procedure 
indicating the flow of phase transitions in a scheduling cycle is 
shown as follows. 


procedure mrsin (P, X, R); 

/* During a scheduling cycle, the network alternates between 
two phases until all request tokens are blocked. */ 

(1) Initialization: MRSIN with pending requests and ready 
resources; 

(2) Request-Propagation Phase: 
/* In this phase, each requesting RQ sends a request token to 
the network, which will eventually arrive at the resources if 
there are free resources to allocate. */ 
(2a) Requesting RQs send tokens; 
(2b) Propagate tokens to next NSs; 
(2c) If all request tokens are blocked, then go to Step (4); 
(2d) If request tokens are not received by the RSs associated 
with free resources, then go to Step (2b); 

(3) Resource-Acknowledgment Phase: 
/* The RSs associated with free resources and receiving 
request tokens will send resource tokens to the network to 
acknowledge the acceptance of requests. When a resource 
token is received by a requesting RQ, the RQ and RS will 
form a matched pair, and the path connecting them is 
registered. The network returns to the request-propagation 
phase when all resource tokens are blocked. */ 
(3a) Propagate resource tokens to RQs; 
(3b) If all resource tokens are blocked, then go to Step (2a); 
(3c) Register path; Go to Step (3a); 

(4) Stop: Optimal request-resource mapping has been obtained. 

Note that the flow of this procedure is exactly the same as the 
control flow of Dinic's algorithm. 

To carry out Dinic's algorithm by token propagations, a 
switching element has to follow several token-propagation rules. 
If a link is free, a switch-box will deliver any request token 
received from the processor side to the resource side. These 
directions are reversed if the link is already registered. If there 
is more than one path to send the request token, the token is 
duplicated and sent along all paths. A resource token is expected 
from where a request token was delivered. However, no 
duplication of resource token is done, and one path is chosen at 
random. It can be verified that these token-propagation rules 
correctly implement the layered-network construction process 
and the search for the maximal flow. 

(a) A flow network (transformed from a 4-by-4 MRSIN) in 
which flow is advanced through paths $s$-$p_1$-4-7-$r_4$-$t$ and 
$s$-$p_4$-5-6-$r_1$-$t$. 

(b) The layered network derived from the flow network in (a). 
The (s,t)-path $s$-$p_2$-4-6-5-7-$r_3$-$t$ is equivalent to an 
augmenting path in the original flow network. 

Figure 7. An illustration of a layered-network construction (all 
arcs have unit capacity). 

To avoid chaos in token propagations, processes residing in 
different switch-boxes have to be synchronized on phase 
transitions. This synchronization can be achieved by broadcasting the 
status of each process on the status bus. The status of the 
network can be one of four independent possibilities: request- 
pending, resource-ready, request-token-propagation, and 
resource-token-propagation, and can be represented as a Boolean 
vector (RP, SR, RTP, STP). Each process will synchronize the 
broadcast of its status on the bus with other processes. The 
result is a wired-OR of the statuses broadcast by all processes. 
The occurrence of events that cause a phase transition can be 
detected by observing the status bus. For example, the 
transition from a request-propagation phase to a 
resource-acknowledgment phase is known to every process when each 
observes a change of the status vector from (1,1,1,0) to either 
(1,1,1,1) or (1,1,0,1). 
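
The status-bus mechanism can be modelled in a few lines of Python. This is only a software sketch of the behaviour described above, with hypothetical function names; the actual design is a logic circuit, and the only transition encoded here is the one example given in the text.

```python
def wired_or(statuses):
    """Combine per-process (RP, SR, RTP, STP) vectors as a wired-OR bus.

    Each bus line reads 1 if at least one process drives it to 1.
    """
    return tuple(int(any(bits)) for bits in zip(*statuses))


def enters_acknowledgment_phase(previous, current):
    """Detect the request-propagation to resource-acknowledgment transition."""
    return previous == (1, 1, 1, 0) and current in {(1, 1, 1, 1), (1, 1, 0, 1)}


# Three processes broadcasting their local status vectors.
before = wired_or([(1, 0, 1, 0), (0, 1, 0, 0), (0, 0, 1, 0)])
after = wired_or([(1, 0, 0, 0), (0, 1, 0, 1), (0, 0, 0, 0)])
changed = enters_acknowledgment_phase(before, after)
```

Every process observes the same combined vector, so all of them detect the change from (1,1,1,0) to (1,1,0,1) in the same bus cycle.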

We have shown that every process involved can be carried 
out by a simple sequential machine and is realizable in logic cir- 
cuits [14]. With this design, there are two factors that contri- 
bute to a significant speedup as compared to a monitor architec- 
ture: (a) the augmenting paths are searched in parallel, and (b) 
the time complexity is measured in gate delays instead of 
instruction execution cycles. As a result, the scheduling algo- 
rithm will run at least 100 times faster than a software imple- 
mentation of the network flow algorithm. 


5. CONCLUSIONS 

A RSIN is suitable to support resource sharing in multipro- 
cessors. Optimal request-resource mapping in RSIN is obtained 
by maximizing the number of communication paths that inter- 
connect pairs of processors and resources. In this paper, we have 
transformed various request-resource mapping problems into 
network flow problems for which efficient algorithms exist. 
Table 1 is a summary of the results we have obtained. The pro- 
posed method is independent of the interconnection structure 
and is applicable to all configurations in which the requesting 
processors and free resources are partitioned into two disjoint 
subsets. In particular, the method is applicable to networks with 
multiple paths between source-destination pairs, such as the data 
manipulator [7], the augmented data manipulator [19], and the 
Gamma network [20]. The resource utilization, however, will 
depend on the network configuration, the resources available, the 
permutation of the various types of resources, and the permuta- 
tion of the requesting processors. 


Table 1. Summary of optimal resource scheduling schemes for 
resource sharing interconnection networks. [Table body lost in 
extraction; surviving headings: Discipline, Equivalent Flow Problem, 
Optimal Scheduling Algorithm, Distributed Architecture, Implementation.] 


ABSTRACT 


We consider linear, data-driven arrays in which each cell 
is augmented with the capability of recognizing and skipping 
operations that involve zero operands. This type of network is 
shown to be efficient for highly sparse matrices, despite the 
potential data conflict that may result from irregular zero 
distributions. Two different approaches, which take advantage of 
the sparsity of the matrix, are considered: the first is aimed at 
the reduction of the execution time, and the second is aimed at 
the reduction of the number of cells in the network. 


1. INTRODUCTION 


Many techniques have been suggested for the efficient 
solution of sparse linear systems through clever, but highly 
irregular, storage and manipulation of the non-zero coefficients 
in the system [1]. These techniques, however, are not suitable 
for parallel processing, which requires, in general, a rather 
regular pattern of computation. In this paper, we consider a 
fundamental operation in iterative solution schemes for linear 
systems, namely, the multiplication of a matrix by a vector. 


The computation involved in the matrix/vector multiplication 
is quite regular, and thus appropriate for regular VLSI 
networks, that is, systolic [2] and data-driven [3] arrays. 
Obviously, the benefits of this type of network may become 
visible only when the size of the matrix is large. However, 
large systems that appear in practice are usually sparse, and 
hence seem inefficient for solution on regular VLSI networks 
due to the potential loss of resources in operations which 
involve zero operands. 


In order to avoid the waste of resources caused by trivial 
operations and, at the same time, retain the advantages of fast 
specialized cells and efficient local communications, we suggest 
adding to each cell in the network the capability of recognizing 
and skipping trivial operations. With data-driven 
synchronization, this results in a shorter average cycle for each cell, which 
may reduce the execution time of the entire network. 


Two different approaches for the multiplication of sparse 
matrices by vectors are introduced. In the first approach, the 
same number of cells which would be needed if the matrix 
were dense is used, but a considerable speed-up is obtained by 
skipping trivial operations. In the second approach, the 
non-zero elements of the matrix are grouped in a few stripes which 
are almost parallel to the diagonal of the matrix, and a specific 
cell is assigned to perform the operations associated with the 
elements of a particular stripe. This approach is aimed at the 
reduction of the number of cells without slowing down the 
computation. 


2. COMPUTATIONAL CELLS WITH 
DATA DEPENDENT OPERATIONS 


Consider a data driven version of the systolic network 
given in [2] for the multiplication of an $n \times n$ banded matrix A 
by a vector x. For simplicity, we assume that the number of 
upper diagonals of A is equal to the number of lower diagonals, 
namely $B_u$, and hence the band-width of A is 
$B = 2B_u + 1$. The network, called from now on $MV_1$, is shown 
in Fig. 1. 


Fig. 1 - The network $MV_1$ with $B_u = 2$. 


Every cell in $MV_1$ has three input ports and two output 
ports, namely $I_1$, $I_2$, $I_3$, $O_1$ and $O_2$, respectively. Its operation 
may be described by the following cycle which is repeated 
indefinitely (here, $[I]$ denotes the content of port $I$, and $O \leftarrow \alpha$ 
means that $\alpha$ is written on port $O$). 


CYCLE 1: 

(1) Wait until data is available on $I_1$, $I_2$ and $I_3$; 
$\alpha = [I_1]$; $\beta = [I_2]$; $\gamma = [I_3]$ 

(2) $p = \alpha + \beta * \gamma$ 

(3) Wait until previous data on $O_1$ and $O_2$ is consumed; 
$O_2 \leftarrow \beta$; $O_1 \leftarrow p$. 

Each internal communication link directed from cell g to 
cell k may be regarded as a queue. Only cell g may write on 
this queue and only cell k may read (and delete) its front 
element. The maximum capacity of this queue will be denoted by 
d. In addition to internal links, the network has external links 
for communication with a host system. More specifically, the 
host supplies the elements of the k-th diagonal of A, 
$-B_u \le k \le B_u$, on port $I_3$ of cell k, the elements of the vector x 
on port $I_2$ of cell $B_u$, and the elements of the result vector y 
(initialized to zero) on port $I_1$ of cell $-B_u$. With this it is easy 
to see that the elements of the product vector y = Ax are 
produced on port $O_1$ of cell $B_u$. 


Now assuming that 6% of the elements in the band of A 
are zeroes, then it is clear that €% of the resources in MV, are 
wasted in the execution of trivial operations in step 2 of 
CYCLE 1. In order to reduce this waste, we may attempt to 
skip the multiply/add operation whenever $[I_3] = 0$. More 
specifically, consider a network $MV_2$, which is identical to 
$MV_1$ except that each cell executes the following cycle instead 
of CYCLE 1. 


CYCLE 2: 

(1) As in step 1 of CYCLE 1 

(2) If ($\gamma = 0$) Then 2.1) $p = \alpha$ 
Else 2.2) $p = \alpha + \beta * \gamma$ 

(3) As in step 3 of CYCLE 1 

An execution of CYCLE 2 which goes through step 2.1 is 
called a trivial execution of the cycle; otherwise the execution 
is called non-trivial. Trivial executions of CYCLE 2 may, or 
may not, be shorter than non-trivial executions, depending on 
the waiting time spent in steps 1 and 3. Hence, the total 
execution time $T_2$ of $MV_2$ depends primarily on the effect of data 
conflict on the execution of individual cells. 
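
The arithmetic performed in step 2 of CYCLE 2 can be sketched in Python. This is a sequential, functional model only: it ignores ports, queues and timing, and the names `cycle_2` and `banded_matvec` as well as the diagonal storage layout are assumptions made for illustration. It shows which executions are trivial and that skipping them does not change the product.

```python
def cycle_2(alpha, beta, gamma):
    """One execution of CYCLE 2, returning (p, beta, trivial)."""
    if gamma == 0:
        return alpha, beta, True              # step 2.1: skip multiply/add
    return alpha + beta * gamma, beta, False  # step 2.2


def banded_matvec(diagonals, x, b_u):
    """Compute y = A x for a banded A, counting trivial executions.

    diagonals[k][i] holds a[i][i + k] for -b_u <= k <= b_u; entries
    whose column index falls outside the matrix are never read.
    """
    n = len(x)
    y = [0] * n
    counts = {"trivial": 0, "non_trivial": 0}
    for k in range(-b_u, b_u + 1):
        for i in range(n):
            j = i + k
            if 0 <= j < n:
                y[i], _, trivial = cycle_2(y[i], x[j], diagonals[k][i])
                counts["trivial" if trivial else "non_trivial"] += 1
    return y, counts


# A = [[2, 0, 0], [1, 3, 0], [0, 0, 4]] stored by diagonals, b_u = 1.
diagonals = {-1: [0, 1, 0], 0: [2, 3, 4], 1: [0, 0, 0]}
y, counts = banded_matvec(diagonals, [1, 2, 3], b_u=1)
```

In this small example three of the seven in-band operations are trivial, and the result is the same vector that a dense multiply/add would produce.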


One way of obtaining an upper bound on the execution 
time of data driven networks of the type discussed above is to 
force some hypothetical synchronization on the computation, 
such that execution alternates between two phases, namely a
communication phase and a processing phase. We call the
resulting computation a pseudo systolic computation and we 
call a communication phase followed by a processing phase a 
global cycle. Given that the additional synchronization may 
only slow down execution, then it is clear that the execution 
time of a data driven computation is bounded by the execution 
time of the corresponding pseudo systolic computation. 


For example, a pseudo systolic version of MV_1, called
MV_1p, may be obtained by replacing step 2 in CYCLE 1 with

(2) Wait for a synchronization signal SYNC
    ρ = α + β * γ

where we assume that all the cells in MV_1p are connected to a
hypothetical controller that issues the signal SYNC after it
detects the termination of a communication phase, that is,
when all the cells are waiting in step 2. With this it is easy to
see that the execution time T_1 of MV_1 is bounded by the
execution time T_1p of its pseudo systolic version, namely

$$T_1 \le T_{1p} = 2\,(B_l + n)\,(\tau_c + \tau_m)$$

where τ_m is the time for a floating point multiply/add
operation (step 2) and τ_c is the time for the transmission of one data
item between two cells.


Tracing the execution of MV_2 is more complicated than
MV_1 because of possible trivial executions of CYCLE 2. A
pseudo systolic version of MV_2, namely MV_2p, may be
obtained if CYCLE 2 is replaced by the following cycle:

CYCLE 2P: /* Pseudo Systolic Version of CYCLE 2 */

(1) Wait until data is available on I_1, I_2 and I_3;
    α = [I_1] ; β = [I_2] ; γ = [I_3]

(2) If (γ = 0) then
    2.1) Wait until previous data on O_1 and O_2 are consumed
    2.2) O_1 ← α ; O_2 ← β
    2.3) Go To step 1

(3) Wait for a synchronization signal SYNC

(4) ρ = α + β * γ

(5) Wait until previous data on O_1 and O_2 are consumed;
    O_1 ← ρ ; O_2 ← β.


In other words, trivial operations are first skipped until a
non-trivial operand is found in [I_3], then the multiplication is
performed. As in MV_1p, we assume that all the cells are
connected to a controller that issues the signal SYNC in such a way
that execution alternates between communication and
processing phases. During the communication phase, the data is
moving in the network until each cell is either blocked due to
lack of data (step 1, 2.1 or 5), or is blocked in step 3 (with
[I_3] ≠ 0). When this state is reached, the controller issues
SYNC and all the cells which are blocked at step 3 execute the
multiplication (step 4) simultaneously, while the other cells
remain idle until the end of the global cycle.


In order to isolate the effect of internal data conflict from
any delay caused by slow communication with the host, we
assume that external inputs on I_3 of cells -B_l,...,B_u as well as
on I_1 of cell -B_l and I_2 of cell B_u are available when needed.
We also assume that MV_2p terminates execution, for a specific
input matrix A, in N_2 global cycles. With this, we may define
the function α: [-B_l, B_u] × [1, N_2] → A such that α(k,t) is the
element of A that appears on port I_3 of cell k at the beginning
of the processing phase of the t-th global cycle.


Although a data item is always available on I_3 of any cell
during a specific processing phase, only those cells which
receive the corresponding elements of the vectors y and x on
I_1 and I_2, respectively, perform a multiply/add operation,
while the other cells remain idle. Let M_t be the subset of cells
which are not idle during the t-th processing phase and define
the t-th computation front as the set of elements of A which
are operated upon during this phase. More precisely,

$$CF_t = \{\, \alpha(k,t) \mid k \in M_t \,\}$$

Note that the members of CF_t, for any t, are non-zero elements
of A.


The succession of computation fronts represents the pro- 
gress in the execution of the pseudo systolic computation. 
More specifically, given a certain matrix, we may connect the 
elements of each front by a piece-wise linear curve and thus 
obtain a visual picture that describes the propagation of the 
computation. For example, we show in Fig. 2 the computation
fronts corresponding to the specific given matrix pattern. Note 
that the concept of computation fronts is the same as that sug- 
gested in [6]. However, by allowing irregular fronts, we are 
able to model computations that depend on the value of the 
input as well as its availability. 


In [5], we give a method for the automatic construction of 
computation fronts for pseudo systolic computations. Clearly, 
the number of such fronts is equal to the number of global 
cycles needed to complete the computation. 
Fig. 2 - The Computation Fronts for MV_2p


3. A NETWORK FOR MATRICES WITH 
NON-ZERO DIAGONAL ELEMENTS 


Most of the matrices arising in practical applications
have non-zero diagonal elements, and it may be shown [5] that
MV_2 is not suitable for this type of matrix because it does
not allow computation fronts to be parallel to the diagonal of
the matrix. In this section, we introduce a network, MV_3,
which allows the fronts to be parallel to the diagonal.


For simplicity, we introduce MV_3 for dense, banded
matrices. It is composed of B cells (see Fig 3). Each cell has
two input ports and two output ports, and is equipped with a
counter "Ct" and an accumulator "Acc". The cycle of a cell may
be described as follows:


Fig. 3 - MV_3 after two cycles (B = 5)


CYCLE 3: /* Initially, Ct = B_l + 1. This allows data
            to fill-in during the first B_l cycles */

(1) Wait until data is available on I_1 and I_2;
    α = [I_1] ; β = [I_2]

(2) ρ = [Acc] + α * β

(3) Wait until previous data on O_1 and O_2 are consumed

(4) If Ct = B THEN O_1 ← α ; O_2 ← ρ ; Acc ← 0 ; Ct = 1
              ELSE O_1 ← α ; Acc ← ρ ; Ct = Ct + 1


The elements of the vector x are applied to port I_1 of cell
B, and the elements in the rows k, k+B, k+2B, ..., of the
matrix A are concatenated and applied to port I_2 of cell k (see
Fig 3). Given this input, and assuming pseudo systolic
synchronization, it may be easily seen that the elements y_1,...,y_B of
the result vector are produced on ports O_2 of cells 1,...,B,
respectively, after B_l+B global cycles. The elements
y_{B+1},...,y_{2B} are produced on the same ports at the end of cycle
B_l+2B, and in general, y_{rB+1},...,y_{(r+1)B}, r = 0,1,..., are produced
at the end of cycle B_l+(r+1)B. In other words,

$$T_{3p} = (B_l + \beta B)\,(\tau_c + \tau_m)$$

where β = ((n-1) ÷ B) + 1. Similar to the case of MV_1, the
execution time of MV_3 may be reduced for sparse matrices if step
2 in CYCLE 3 is replaced by a conditional statement that skips
trivial operations. We call the resulting network MV_4. The
construction of the computation fronts for the pseudo systolic
version of MV_4 provides a means for the estimation of its
speed-up over MV_3, obtained by skipping trivial operations.
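
As a rough check of the cycle count, the sketch below evaluates the number of global cycles under one reading of the formula above: a fill-in delay plus B cycles for each block of B rows, with the number of blocks taken as the integer quotient of (n-1) by B, plus one. The function name and the fill-in value of 19 are assumptions (19 is half of B-1 for B = 39, which presumes a symmetric band). With the values n = 270 and B = 39 used in Section 5, this reading gives 292 global cycles, the figure quoted there, and 292/60 reproduces the quoted speed-up of 4.867.

```python
def mv3_global_cycles(n, band, fill_in):
    """Global cycles for the pseudo systolic MV_3 under the assumed reading:
    fill_in + blocks * band, with blocks = ((n - 1) // band) + 1."""
    blocks = (n - 1) // band + 1
    return fill_in + blocks * band


# Values from Section 5; fill_in = 19 is an assumption (symmetric band).
cycles = mv3_global_cycles(n=270, band=39, fill_in=19)
speed_up = round(cycles / 60, 3)   # 60 global cycles is the figure quoted in Section 5
```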


4. A NETWORK WITH REDUCED NUMBER OF CELLS 


Numerical techniques for the solution of partial
differential equations are major sources of large sparse
matrices. The matrices resulting from these techniques have a
band with B = O(√n), and hence the number of cells needed
in MV_3 and MV_4 grows with the size of the matrix.


In order to obtain a network with a number of cells
independent of n, we use the property of these matrices that
the number of non-zero elements in each row is bounded by a
constant which is independent of n. This property permits the
inclusion of all the non-zero elements of the matrix into a stripe
structure that contains few stripes which are almost parallel to the
diagonal. A specific cell is then assigned to perform the
operations associated with each stripe. First, we define a stripe
structure of a matrix.


Definition: A Q-stripe S of an n × n matrix A is a set of
positions S = {(i, σ(i)) | i = 1,...,n} where σ is a non-decreasing
function. The prefix Q (which stands for "Quasi") is used to
distinguish Q-stripes from a more restrictive definition of stripes that
is given in [4]. However, only Q-stripes are referred to in this
paper, and hence the prefix Q will be omitted.


Definition: A stripe structure of A is a sequence of stripes
S_1 = {(i, σ_1(i))}, ..., S_r = {(i, σ_r(i))} such that

$$\sigma_k(i) < \sigma_{k+1}(i), \qquad i = 1,\ldots,n, \quad k = 1,\ldots,r-1$$

and S_1 ∪ ... ∪ S_r contains all the positions of nonzero elements in
A. That is, a_{i,j} ≠ 0 only if (i,j) ∈ S_1 ∪ ... ∪ S_r. For example, we
show in Fig. 4 the stripe structure of a specific matrix. Note
that some zero elements are included in each stripe because, by
definition, a stripe should contain a position for each row.
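
The two definitions can be checked mechanically. The following Python sketch is one possible test, not a construction method: it uses 0-based indices, represents each stripe by the list of its column positions, and, as an assumption, allows a stripe position to fall just outside the matrix (for example the sub-diagonal in the first row), treating it as padding.

```python
def is_stripe(sigma):
    """A stripe is given by a non-decreasing column function sigma(i)."""
    return all(a <= b for a, b in zip(sigma, sigma[1:]))


def is_stripe_structure(matrix, sigmas):
    """Check the stripe-structure conditions for a square matrix.

    sigmas[k][i] is the column of stripe k in row i (0-based).
    """
    n = len(matrix)
    if any(len(s) != n or not is_stripe(s) for s in sigmas):
        return False
    # Consecutive stripes must be strictly ordered in every row.
    for lower, upper in zip(sigmas, sigmas[1:]):
        if any(a >= b for a, b in zip(lower, upper)):
            return False
    # Every non-zero element must lie on some stripe.
    covered = {(i, s[i]) for s in sigmas for i in range(n)}
    return all(matrix[i][j] == 0 or (i, j) in covered
               for i in range(n) for j in range(n))


# Hypothetical 4 x 4 tridiagonal matrix covered by three stripes.
A = [[2, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 2]]
tridiagonal = [[-1, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
```

For this matrix, `is_stripe_structure(A, tridiagonal)` holds, while dropping the last stripe leaves the super-diagonal uncovered.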
Fig 4 - A stripe structure


Given a stripe structure of a matrix A, we may use the
network MV_5 shown in Fig. 5 for the multiplication of A by a
vector. The elements of the vector x are fed to port I_1 of cell r
and the elements of the result vector y, initialized to zero, are
fed to port I_2 of cell 1. Successive elements in a particular
stripe S_k are fed to port I_3 of cell k, and along with each
element a_{i,σ_k(i)} ∈ S_k supplied on I_3, the value of σ_k(i) is supplied
on port I_4.

Fig 5 - The network MV_5 (r=4).

In order to provide the flexibility needed to deal with
sparse structures, we assume that each cell contains a counter
CX that keeps track of the index of the data received on I_1.
Noting that data on I_3 and I_4 are always available, the
operation of each cell may be described by


CYCLE 5: /* Initially CX = 0 */

(1) Wait until data is available on I_2;
    a = [I_3] ; j = [I_4] ; η = [I_2]

(2) While (CX < j) Do
        Wait until data is available on I_1;
        ξ = [I_1] ; CX = CX + 1 ; O_1 ← ξ

(3) η = η + a * ξ


More descriptively, after cell k receives a_{i,σ_k(i)}, it
continues to transmit the components of x from I_1 to O_1 until it
finds x_{σ_k(i)}. At this time the inner product is computed (step
3) and the result is written out. It is easy to see that the
elements of the result vector y = Ax are produced on O_2 of cell r.
However, the network may operate correctly only if
σ_k(i) < σ_{k+1}(i) and σ_k(i) ≤ σ_k(i+1), which are satisfied
by the definition of a stripe structure. Note that we assumed,
implicitly, that the queues on the communication links which
carry the x-data stream in MV_5 do not overflow. In [4], the
maximum size, d, of these queues is found to be equal to the
maximum separation between the stripes of the matrix.


A pseudo systolic version MV_5p of MV_5 may be obtained
by inserting a "wait for SYNC" in step 3. The conditions for
two elements of A to be in the same computation front may be
easily derived [4] and used for the systematic construction of
the fronts.
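
What each cell of MV_5 computes can be sketched sequentially in Python. This is only a functional illustration under stated assumptions: it ignores queue capacities, timing and the direction in which the x stream travels between cells, gives every cell the whole vector x, uses 1-based column indices as in CYCLE 5, and assumes that stripe positions falling outside the matrix carry a zero value. The helper names are hypothetical.

```python
def stripe_values(matrix, sigma):
    """Values a_{i,sigma(i)} of one stripe; out-of-range positions are zero."""
    n = len(matrix)
    return [matrix[i][j - 1] if 1 <= j <= n else 0
            for i, j in enumerate(sigma)]


def cell(values, sigma, x, y_in):
    """One cell following CYCLE 5: the x pointer only moves forward."""
    cx = 0        # counter CX: number of x elements consumed so far
    xi = 0        # register holding the last x element read
    y_out = []
    for a, j, eta in zip(values, sigma, y_in):
        while cx < j and cx < len(x):   # step 2: advance until x_j is found
            xi = x[cx]
            cx += 1
        y_out.append(eta + a * xi)      # step 3: multiply/add
    return y_out


def stripe_matvec(matrix, sigmas, x):
    """Pass the partial results through the cells, one stripe per cell."""
    y = [0] * len(matrix)
    for sigma in sigmas:
        y = cell(stripe_values(matrix, sigma), sigma, x, y)
    return y


# Hypothetical tridiagonal example with three stripes (1-based columns).
A = [[2, -1, 0, 0],
     [-1, 2, -1, 0],
     [0, -1, 2, -1],
     [0, 0, -1, 2]]
sigmas = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
y = stripe_matvec(A, sigmas, [1, 2, 3, 4])
```

Because each stripe function is non-decreasing, a cell never needs an element of x that it has already passed, which is why a forward-only counter is sufficient.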

5. AN EXAMPLE 


In order to compare the different networks presented in
this paper, we consider the matrix that is generated from the
discretization of a second order partial differential equation on
the grid shown in Fig 6. For this matrix, n=270 and B=39,
but only 16% of the elements within the band are non-zeroes.
The diagonal elements of the matrix are non-zeroes, and seven
non-overlapping stripes may cover all the non-zero elements.

Fig 6 - A finite element grid


In Table 1, we compare the execution time of the different 
networks for this specific matrix. The number of global cycles 
is estimated by the construction of the computation fronts. 
The execution time of each global cycle, however, depends on 
the execution time of its communication and processing phases. 
The time of a processing phase is, roughly, τ_m. However, the
time of a communication phase depends on the communication 
activities during that phase. Although we did not discuss a 
way for the estimation of the communication activities, we 
give in Table 1 an estimate for these activities obtained by an 
analysis of the data profiles which are associated with the com- 
putation fronts. The details of this type of analysis may be 
found in [4] and [5]. 


For MV_4p and queue size d=1, execution terminates in 60
global cycles instead of 292 for MV_3p. That is, a speed up of
4.867 is obtained (the increase in the communication time is
small). Noting that only 16% of the operations in MV_3p
involve non-zero operands, and thus that only one out of
6.36 operations is useful, it is clear that, by skipping
trivial operations in MV_3p, we can retrieve 4.867/6.36 = 76% of the
inefficiency in MV_3p. Of course, internal data conflict accounts
for the unretrieved 24%.


Columns: number of cells; size of comm. queues; number of
global cycles; execution time.

Execution time entries, in row order:

    559 τ_m + 559 τ_c
    270 τ_m + 305 τ_c
    292 τ_m + 292 τ_c
     60 τ_m + 425 τ_c
     54 τ_m + ...
    279 τ_m + 305 τ_c
    ... + 492 τ_c

Table 1 - Performance of the different networks


The result for MV_5 is obtained assuming that each
y-stream communication link may buffer only one element, and
that each x-stream communication link may buffer at least 12
data items. Clearly, the merit of MV_5 lies less in the speed
up it achieves than in its small number of cells.


6. CONCLUSION 


Pseudo systolic computations and irregular computation 
fronts are two tools that are used effectively in the analysis of 
the networks introduced in this paper. The alternative for this 
type of analysis is the simulation of the computations. Such 
simulation, however, requires explicit assumptions about the 
parameters τ_m and τ_c, which are strongly dependent on
technology and architectural details.


Our primary interest is in sparse systems that result from
the solution of partial differential equations, which is one of
the major sources of sparse matrices. For this reason, the
networks suggested in Sections 3 and 4 take advantage of some
specific properties of these matrices. However, the concept that
we introduce is quite general, namely, the inclusion of the
non-zero elements of a sparse matrix in a pattern which is
regular enough to allow for the efficient manipulation of the
matrix on VLSI networks. The results reported in Section 5,
and other similar results [4,5], support our argument.
ON THE SYSTOLIC DETECTION OF SHORTEST ROUTES


Uwe Schwiegelshohn and Lothar Thiele 
Technical University Munich 
Institute of Network and Circuit Theory 
D-8000 Munich 2, Arcisstr.21, GERMANY 


Abstract -- This paper presents a parallel algorithm
for the solution of the single-pair, single-source
and all-pairs shortest path problems. For the
first time, all essential properties of the shortest
paths can be computed on a purely systolic parallel
processor in O(n) time and on O(n²) area.
If O(n³) area is spent, all single-pair shortest
paths can be stored within the same time
bound.


I. Introduction 


In this paper we are concerned with the well
known shortest path problem. The single-pair,
single-source and all-pairs problems entail finding
shortest paths between two specified vertices,
from one vertex to all others and between
all pairs of vertices, respectively.

We concentrate on the systolic model of computation.
In order to conform to the basic VLSI
restrictions the parallel computer consists of
single processing elements which have some local
memory and are able to compute a simple set of
instructions. Global communication and a severe
system overhead are avoided as no shared memory
is allowed.

In order to solve the all-pairs shortest path 
problem on a parallel computer the use of the Ford- 
Bellman-Moore, the matrix multiplication and the 
Floyd-Warshall algorithms have been proposed [1-6]. 
Lakhani [3] uses the Ford-Bellman-Moore algorithm 
to solve the all-pairs problem in O(n?) time. How- 
ever, the complex architecture requires a compli- 
cated communication scheme and area consuming 
interconnections between processing cells. In
[1,2,4,5,6] it is shown that the distances of the
all-pairs shortest paths can be computed in O(n)
time and on O(n²) area if a parallel version of the
Floyd-Warshall algorithm (FWA) is used.

However, in almost any application of the shortest
path (SP) problem the detection of the optimal
route itself is indispensable.

The new approach uses the same square arrangement
of n×n identical processing cells as [1,2,4]
or [5,6]. The following extensions are essential:
- In comparison to [7], within the same kind of
  loops of the FWA the predecessor and depth of
  each vertex w.r.t. any SP are determined.
- In order to deal with the trees of SPs a conversion
  to an edge-based description is performed.
  Sorting operations are used to compute the
  desired path functions.

For the first time, the following quantities can be
computed on a systolic processor in O(n) time and
on O(n²) area: distance matrix, selection of a shortest
route with minimal depth, predecessors and
depth of any vertex w.r.t. any SP, single-source
trees of SPs as ordered lists of edges, adjacency
matrix of a tree of SPs, single-pair SP as a list
of edges, all-pairs SPs as n² lists of edges
(if O(n³) area is spent).

II. The Extended Floyd-Warshall Algorithm


Let us first clarify the notation used in the
remainder of the paper. We assume that each edge (i,j)∈E
has an associated integer valued length l(i,j).
A path from vertex r to t can be regarded as a tuple
P^k = (i_0, i_1, ..., i_k) of vertices i_j with k≥1, i_0=r,
i_k=t and (i_{j-1}, i_j)∈E for all 1≤j≤k. The path P^k
has the depth k, which means that it consists of
k edges and k+1 vertices. The length l of a path
is the summation of the lengths of its edges.
In order to clarify the extended FWA the following
definitions can be used:

$$\Pi_m(r,t) = \{\, P^k \mid k \ge 1,\ i_0 = r,\ i_k = t,\ i_j \le m \text{ for all } 1 \le j < k \,\}$$
   = set of paths at stage m from vertex r to t
     where all intermediate vertices are ≤ m

$$d_m(r,t) = \min\{\, l(P^k) \mid P^k \in \Pi_m(r,t) \,\}$$
   = shortest distance at stage m from vertex r to t

$$\Lambda_m(r,t) = \{\, P^k \mid P^k \in \Pi_m(r,t) \text{ and } l(P^k) = d_m(r,t) \,\}$$
   = set of shortest paths from vertex r to t at
     stage m

$$s_m(r,t) = \min\{\, k \mid P^k \in \Lambda_m(r,t) \,\}$$
   = minimum depth at stage m from vertex r to t

$$\Sigma_m(r,t) = \{\, P^k \mid P^k \in \Lambda_m(r,t) \text{ and } k = s_m(r,t) \,\}$$
   = set of shortest paths at stage m from r to t
     with minimum depth

Assuming that one path P^k ∈ Σ_m(r,t) has been selected
it is possible to define the following value:

$$p_m(r,t) = i_{k-1}$$
   = predecessor of t in the selected path from
     vertex r to t

In extension of e.g. [7] the algorithm described
in Fig.1 has the following properties:
- Concurrent determination of the m-stage distance,
  predecessor and depth matrices
- Concurrent selection of the all-pairs SPs at stage
  m with minimum number of edges.
It will be seen in the next section that these
improvements enable the computation of the shortest
route between two specified vertices. Now, we can
formulate the subsequent Lemma:


Lemma: If no negative cycle exists then
Σ_m(m,j) = Σ_{m-1}(m,j) and Σ_m(j,m) = Σ_{m-1}(j,m) for each j.
Proof: We restrict the proof to Σ_m(m,j) = Σ_{m-1}(m,j).
Let us assume that P^k ∈ Σ_m(m,j) \ Σ_{m-1}(m,j). Therefore
P^k could be partitioned into a cycle through m of
depth b and P^a = (m,...,j) ∈ Π_{m-1}(m,j), where the length
of the cycle is l ≥ 0 and b > 0. It would follow that
k = a+b > a ≥ s_{m-1}(m,j) and l(P^k) = l + l(P^a) ≥ l(P^a) ≥ d_{m-1}(m,j),
which contradicts the assumption. □


Consequently it is possible to dispense with the
modification of the elements ·(m,j) and ·(j,m) at
stage m of the algorithm of Fig.1.

Theorem: At stage m the modified FWA detects a path
P^k from vertex r to t with P^k ∈ Σ_m(r,t).

Proof: We assume by an inductive hypothesis that the
theorem holds at stage m-1. The path P^k is the
resulting path from r to t of the extended FWA. We
distinguish between two cases a) and b):

a) Σ_{m-1}(r,t) ∩ Σ_m(r,t) ≠ ∅
   ⇒ s_{m-1}(r,t) = s_m(r,t) and d_{m-1}(r,t) = d_m(r,t)
   ⇒ P^k ∈ Σ_{m-1}(r,t) ⊆ Σ_m(r,t)
b) Σ_{m-1}(r,t) ∩ Σ_m(r,t) = ∅
   ⇒ d_{m-1}(r,t) > d_m(r,t) or
     ( d_{m-1}(r,t) = d_m(r,t) and s_{m-1}(r,t) > s_m(r,t) )
   ⇒ P^k consists of paths P^a ∈ Σ_{m-1}(r,m) and
     P^b ∈ Σ_{m-1}(m,t) selected by the FWA at stage m-1

   We assume that there exists P^κ ∈ Σ_m(r,t) which
   consists of P^α ∈ Π_{m-1}(r,m) and P^β ∈ Π_{m-1}(m,t)
   ⇒ d_m(r,t) = l(P^κ) = l(P^α)+l(P^β) ≥ l(P^a)+l(P^b) = l(P^k)
     and s_m(r,t) = κ = α+β ≥ a+b = k if
     P^α ∈ Λ_{m-1}(r,m) and P^β ∈ Λ_{m-1}(m,t).

Now it has been proved that the values d_m, p_m and
s_m are evaluated correctly by the extended FWA. □

For the all-pairs problem, the extended FWA detects
an SP with minimum depth, just as the Ford-Bellman-Moore
algorithm does for the single-source problem.
In contrast, the common FWA can only assure that
P^k ∈ Λ(r,t).
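
The extended update rule can be sketched in sequential Python as follows. This is an illustration of the lexicographic comparison on (distance, depth), not of the systolic implementation. It assumes 0-based vertex numbers, a dictionary of edge lengths, no negative cycles, and an in-place update, which relies on the Lemma (row and column m do not change at stage m). The function names and the example graph are hypothetical.

```python
import math


def extended_floyd_warshall(n, edges):
    """Return distance d, depth s and predecessor p matrices.

    edges maps (i, j) to the length l(i, j); vertices are 0..n-1.
    """
    d = [[math.inf] * n for _ in range(n)]
    s = [[math.inf] * n for _ in range(n)]
    p = [[None] * n for _ in range(n)]
    for (i, j), length in edges.items():
        d[i][j], s[i][j], p[i][j] = length, 1, i
    for k in range(n):
        for i in range(n):
            for j in range(n):
                candidate = (d[i][k] + d[k][j], s[i][k] + s[k][j])
                # Shorter path, or equally short path with fewer edges.
                if (d[i][j], s[i][j]) > candidate:
                    d[i][j], s[i][j] = candidate
                    p[i][j] = p[k][j]
    return d, s, p


def route(p, r, t):
    """Follow predecessors back from t to r; None if t is unreachable."""
    if r != t and p[r][t] is None:
        return None
    vertices = [t]
    while vertices[-1] != r:
        vertices.append(p[r][vertices[-1]])
    return vertices[::-1]


# Two routes of length 3 from 0 to 4: 0-1-2-4 (depth 3) and 0-3-4 (depth 2).
edges = {(0, 1): 1, (1, 2): 1, (2, 4): 1, (0, 3): 1, (3, 4): 2}
d, s, p = extended_floyd_warshall(5, edges)
```

In this example the deeper route is found first (at the stage of vertex 2) and is later replaced by the route through vertex 3, which has the same length but fewer edges; a comparison on distances alone would keep the deeper route.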


III. Determination of a Single-Pair Shortest Path 

The detection of an SP from vertex r to t can be
divided into two different steps:
- Determination of a tree which contains the SP
- Finding of the SP between the vertices r and t

Obviously, the tree of SPs with the root vertex
r is an appropriate one for the first part. The
tree is determined by its edges (p(r,j),j) evaluated
by the FWA. These values can be stored in a
linear array of length n to compute the desired
path according to the second step.

But problems arise, as data sharing in a systolic
array is restricted to neighboring processors only
and the necessary data may be far apart. Up
to n-1 sequential operations, each of which requires
O(n) steps in the worst case, are needed if no
suitable order of edges exists. In contrast to [8],
a much simpler sorting scheme than the Euler circuit
can be used to obtain a time bound of O(n), as the
depth s(r,j) of each node j is available from the FWA.
The following solution strategy is chosen:
- Sorting of the edges (p(r,j),j) according to the
  depths s(r,j). The following chain is obtained:

  (p(r,j_1),j_1) (p(r,j_2),j_2) ... (p(r,j_n),j_n)
  with s(r,j_i) ≤ s(r,j_{i+1}) for all 1 ≤ i < n

- Scanning of the sorted chain in descending order
  of depths and selection of the SP from r to t
  according to the following procedure:


procedure path
for i=n down to 1 do begin
   if j_i=t then begin
      (p(r,j_i),j_i) ∈ 'path from r to t'
      t:=p(r,j_i)
   end
end

IV. Systolic Implementation 


At first, the solution to the first subproblem
is described, where Fig.2 shows the interconnection
scheme of the array apart from the outputs T1 to T4
and the inputs φ, In4. As this part is similar to [5],
only the internal cells are described. The following
notations are used: R(i,j)=[f_1,f_2,...] denotes a
register within cell C(i,j) where the data f_1,...
are stored in a common record. R(i,j).k denotes f_k.
The sign ← denotes an assignment between cells which
needs one time step. Corresponding data are transferred.
The sign := denotes an assignment and :=: an
exchange of data within one cell and one instant.


procedure C(1,m) { 2sl,msn ; R,,,=[dos,p] } 
R, « R,(1,m-1) 
R, « R,(1-1,m) 
R, ¢ R,(1+1,m+1) 
if R,.1>R,.1+R,.1 then R,:=[R,.1+R,.1 , Ro] 


The operation of this cell corresponds to the
'if'-statement of the FWA. The execution of the
'and'- and 'or'-operations can be avoided by the
concatenation d(i,j)∘s(i,j), where the bits of
s(i,j) are linked less significantly to those of
d(i,j). The sequence of input and output operations
of the systolic FW-array is similar to [5,6].
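
The effect of the concatenation can be illustrated with a short Python sketch. The field width `S_BITS`, the function names and the restriction to non-negative distances are assumptions made for the illustration; the point is only that one addition and one comparison on the packed words replace the compound condition, provided the sum of two depths still fits in the low-order field.

```python
S_BITS = 8                      # assumed width of the depth field


def pack(d, s):
    """Concatenate d and s with s in the less significant bits."""
    return (d << S_BITS) | s


def unpack(word):
    return word >> S_BITS, word & ((1 << S_BITS) - 1)


def relax(r1, r2, r3):
    """Single compare on packed words: shorter, or equal and shallower."""
    candidate = r2 + r3         # adds distances and depths at once
    return candidate if r1 > candidate else r1


shallower = unpack(relax(pack(3, 3), pack(1, 1), pack(2, 1)))
unchanged = unpack(relax(pack(3, 1), pack(1, 1), pack(2, 1)))
shorter = unpack(relax(pack(5, 1), pack(1, 1), pack(2, 1)))
```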


1. Solution of the single-source SP problem 


The first row of the array according to Fig.2
contains, in addition to the cells C(1,m), the sorting
cells S(1,m). Moreover, all registers of the FW cells
contain, besides d(i,j), s(i,j), p(i,j), the corresponding
node j with R=[d∘s,p,j]. The output T1
successively carries the edges (p(r,j),j) of the
tree of SPs rooted at vertex r in descending
order of depths s(r,j).


procedure S(1,1) — { In,R,=[¢] ; Rs=[s,p,j,6] |} 


R, «In ; R, + R,(2,2) 
if R,=1 then R,.4:-1 
else RR. :=[~,,~,0] 

procedure S(1,m) { 2smsn ; Rese7=Ls,p,j,o] } 

R, « R,(1,m-1) ; R, « R,(2,m+1 

if R,.4=1 then R,:=R, ; R,.4:=0 

if R,.1<R,.1 then R,:=:R, 

if R,.4=1 then R,.4:=1 ; R,.4:=0 
procedure S(1,n+1) { T1,R,=[s,p,j,¢] |} 

R, «¢ Rg(1,n) 3; £Tl:=R, 


Cell S(1,1) has to be initialized with φ=1 at
the time where the elements -(r,r) are read in. 
This correct timing can be achieved also by detecting 
that R..3=r in cell S(1,1). This initialization 
propagates from m=1 to m=n+1. 


2. Solution of the single-pair SP problem 


In contrast to the foregoing array, cell
S(1,n+1) marks all edges which belong to the SP
from r to t. To this end, the vertex number t has
to be read into this cell before the first edge
of the single-source tree enters. The variable ψ
is set to ψ=1 if the edge is part of the SP from r to t.


procedure S(1,n+1) { T1,Rs=[s,p,j,wv] ; In4,R,=[p] } 
R, « R,(1,n) 
if ‘input operation' 
if R,.3=R,.1 


then R,:=In4 

then R,:=[R,.3] ; R,.4:=[1] 
else R,.4:=[0] 

T1:=R. 


If the output T1 propagates to the left through
the cells S(1,m), it is possible to delete the
edges with ψ=0 such that the desired edges
can be read out from S(1,1) concurrently.


3. Solution of the all-pairs SP problem 


According to Fig.2 each row contains a copy
of the linear sorting array just described. The
initialization has to take place when ·(l,l) enters
S(l,1). After that, the initialization propagates
down and to the right in the form of a wave front.
The outputs T1 concurrently carry the edges of the
trees of SPs rooted at the vertices l.

If the outputs T1 are propagated to the left,
it is possible to store in S(l,m) the SP from l to m
in the form of a list of edges (O(n³) area is needed).


V. Concluding Remarks 


It is noted that the above algorithms can be
used to solve other combinatorial optimization
problems which require the determination of a
shortest-path or breadth-first spanning tree.
The determination of the adjacency matrix of
the single-source tree of SPs is also possible in
O(n) time if the cells in the first row and last
column of the systolic array have additional
functions.

Recently, it has been shown by the authors [9]
that algorithms of the form shown in Fig.1 can
be carried out on a linear systolic array in O(n²)
time with O(n) operating cells. The data at the
input and output occur in lexicographical order
and pipelined arithmetic units can be used.
procedure initialization
for all pairs of vertices i,j do begin
   d_0(i,j) = l(i,j) if (i,j)∈E, ∞ otherwise
   s_0(i,j) = 1 if (i,j)∈E, ∞ otherwise
   p_0(i,j) = i if (i,j)∈E, … otherwise
end


procedure modification
for k:=1 to n do begin
   for all pairs of vertices i,j do begin
      if ( d_{k-1}(i,j) > d_{k-1}(i,k)+d_{k-1}(k,j) ) or
         ( d_{k-1}(i,j) = d_{k-1}(i,k)+d_{k-1}(k,j) and
           s_{k-1}(i,j) > s_{k-1}(i,k)+s_{k-1}(k,j) )
      then begin
         d_k(i,j) := d_{k-1}(i,k)+d_{k-1}(k,j)
         s_k(i,j) := s_{k-1}(i,k)+s_{k-1}(k,j)
         p_k(i,j) := p_{k-1}(k,j)
      end
      else begin
         d_k(i,j) := d_{k-1}(i,j)
         s_k(i,j) := s_{k-1}(i,j)
         p_k(i,j) := p_{k-1}(i,j)
      end
   end
end


Fig.2: Systolic array for the all-pairs SP problem 
Abstract -- In this paper we propose a method to derive
systolic designs with non-uniform data flow. One of the major 
difficulties in systematic design is in transforming the original 
sequential specification of a problem into a form suitable to 
VLSI implementation. Our approach to automatically restructur- 
ing a problem is based on a subset of the data dependencies 
extracted from the original problem specification. By using such 
dependencies we are able to identify chains of dependent compu- 
tations which are then converted into recurrence equations. The 
mapping of the new specification into hardware is also based on 
data dependencies. We illustrate the methodology by applying 
it to algorithms using dynamic programming. 


1. Introduction 


The development of efficient VLSI algorithms is relevant 
to many applications including signal and image processing, 
which justifies the considerable interest for this topic. In recent 
years, the problem of synthesizing VLSI design - and systolic 
design, in particular - has also received much attention. Sys- 
tematic methodologies for the derivation of systolic algorithms 
have proved useful both in finding new designs and in verifying 
the correctness of old ones. Moreover, the possibility of automat- 
ically generating a number of viable algorithms for the solution 
of a given problem enables the selection of an optimal algorithm 
among a wider set of candidates. The optimality can be based on 
such parameters as completion time 7, number of processors P,, 
chip area, etc.[9]. Most of the existing synthesis methods tend to 
minimize the completion time. 

Systolic systems generally are characterized by simple processing
elements and regular and local communication patterns.
However, the definition of a systolic system changes according 
to different authors. Here we shall refer to synchronous systems, 
i.e. systems with a global clock for the timing of all processing 
elements, and assume that all computations are data independent. 


* This work was in part supported under ONR Contract N00014-85-K-0339 
The first and probably most challenging part of the systematic
design process is in the transformation of the high-level
problem specification into a form better suited to VLSI imple- 
mentation. Such a form can be, for example, a system of first- 
order recurrence equations, a uniform recurrence equation, or 
nested loops with constant data dependencies. Often, this 
transformation is obtained by using techniques similar to those 
used for software compilers: buffering of variables, addition of 
new variables, etc. However, it is not well understood how to 
select a good transformation especially for some complex non- 
numerical problems. Some authors assume that the problem is 
already given in the required form, and concentrate on how to 
map it into hardware. 


A fundamental distinction between different approaches to 
the mapping problem is whether they use data dependence 
based transformations [1, 3, 8-13] or delay operators applied to 
the mathematical expression of the algorithm [5]. For a survey 
of the various methods see [2]. 


The transformational approach based on data dependencies 
has been successfully applied to a variety of algorithms charac- 
terized by constant data dependencies. For such algorithms linear 
time and space transformations of the index set into VLSI arrays 
have been derived [11-13]. The resulting designs have constant 
and regular data flow. However, a number of more complex 
problems with non-constant data dependencies require non 
linear transformations. A typical example is the optimal 
parenthesization problem, which uses dynamic programming. 


A non-linear transformation for dynamic programming was 
indicated in [12]. However, the paper does not precisely 
describe how the transformation is obtained from the algorithm 
specification. Chen [1] presents a methodology for mapping 
algorithms expressed as systems of first order recurrence rela- 
tions into systolic arrays. The synthesis procedure allows designs 
with non-uniform communication patterns. The mapping is done 
inductively, starting at the boundary conditions. By using this 
technique, various designs, corresponding to different communication
patterns in the systolic array, are derived for dynamic
programming. The procedure, based on a point-by-point mapping,
appears to be quite lengthy. In [10] a mathematical model,
based on a sequence notation, is used to represent index transfor- 
mations. The model provides a precise specification of systolic 
designs. It allows arbitrary interconnections in systolic networks 
and is not restricted to linear transformations. The paper does not 
give a constructive methodology to find the transformations. 


This paper presents a systematic method for the design of 
VLSI algorithms. The method consists of transforming the 
high-level problem specification into a set of mutually dependent 
recurrence equations. To select the appropriate transformations 
we propose a two-step refinement procedure. The procedure first 
determines a coarse timing function of the computations in the 
original specification of the problem, based on a maximal set of
constant data dependencies; it then uses this function to guide the
search for an index transformation. Ordered chains of dependent
computations are identified in the algorithm, according to this
function, and an index transformation is applied to each single
chain. The index transformation must be compatible with the 
timing function, i.e. all the computations whose operands are 
available first will be performed first. Subsequently, the mapping 
of the obtained system of recurrence equations into a VLSI net- 
work is performed by applying linear time and space transforma- 
tions separately to each recurrence subject to global constraints. 
The resulting design may have non-uniform data flow. 


This paper is organized as follows. Section 2 describes a
method for deriving a set of mutually dependent recurrence
equations from an abstract algorithm. Section 3 illustrates
the method by applying it to dynamic programming. In section 4 
time-space transformations are used to map the new specification 
into a VLSI array. 


2. Deriving a systolic algorithm from the high-level specification 


In this section we discuss how to transform the high-level
problem specification to adapt it to a VLSI architecture. The task
of enhancing pipelining and local communication in an algorithm
is generally accomplished by index transformations which
involve adding indices to existing variables in the algorithm, or
by renaming variables, or by introducing new variables. The
aim of these transformations is to produce a new specification
which is in canonic form.


Definition. A canonic form consists of a structured set of
computations written as a recurrence relation or a nested loop. The set
includes input statements, output statements, assignment statements,
and conditional assignment statements. The recurrence
relation or nested loop defines an index set
$I^n = \{(i_1,\ldots,i_n) \mid l_1^1 \le i_1 \le l_1^2, \ldots, l_n^1 \le i_n \le l_n^2\}$,
a subset of the set $Z^n$ of the n-tuples of integers. We assume that
the following conditions are satisfied:

CA1 - each variable in the algorithm is associated with an index
vector $i=(i_1,i_2,\ldots,i_n)$, i.e. is an element of an n-dimensional
array. In other words, there is a one-to-one correspondence
between the n-tuples in $I^n$ and the dimensions of any array used
in the algorithm.


CA2 - if S is an assignment statement indexed by $(i_1,i_2,\ldots,i_n)$,
and a variable on the right side of S is indexed by $(j_1,j_2,\ldots,j_n)$,
then each $i_k$ may depend only on $j_k$.


CA3 - a variable is used exactly once after it is generated. 


It is not always possible to transform a given problem into
canonic form. For problems which do not have such a representation
we attempt to rewrite them as a set of modules, each being in
canonic form, i.e. satisfying conditions CA1-CA3.


The canonic form does not explicitly specify any ordering
among the computations; the lexicographical ordering in $I^n$ is
arbitrary and therefore irrelevant to our purposes. Rather, an
implicit partial ordering of the computations is given by the data
dependencies. The dependence vector of a variable is defined as
the difference of the index vectors of the computations where the
variable is used and generated [7]. The data dependence vectors
$d_1,\ldots,d_m$ associated with the variables of a recurrence can be
represented as a matrix $D=[d_1 \cdots d_m]$, whose columns are
labelled by the variable names. A variable may have many
dependencies, each corresponding to a different index vector; in
this case there will be a column in the D matrix for each pair
(variable, index vector), labelled by the pair itself. A partial
ordering $>_D$ defined in $I^n$ by the data dependencies is such that
$i >_D i'$ if $i = i' + d_j$ for some $d_j \in D$.

Transformations applied to derive a canonic form do not change 
the algorithm fundamentally, but only the data dependencies 
among the variables. The new data dependencies in the resulting 
specification of the problem are therefore not characteristic of 
the problem itself but of its parallel implementation. Indeed, as 
is well known, there are in general several ways to transform a 
given problem specification, according to the previous rules, 
each way producing a different set of data dependencies. How- 
ever, not all the possible transformations lead to a feasible VLSI 
design. Moreover, different transformations may result in dif- 
ferent performances. Consider the two following examples: 


Example 1. The convolution problem is defined by: given a
sequence $x_1,\ldots,x_n$ and a set of weights $w_1,\ldots,w_s$, determine
the sequence $y_1,\ldots,y_n$ such that:

$$y_i = \sum_{k=1}^{s} w_k \, x_{i-k}$$

Broadcasting of x and w can be eliminated by adding one more
index to all variables x, w, and y. At least two different index
transformations can be applied which produce the two
recurrences in canonic form below: the first is a backward
recurrence, where the variable $y_{i,k}$ is defined in terms of $y_{i,k-1}$,
and the second is a forward recurrence, where the variable $y_{i,k}$ is
defined in terms of $y_{i,k+1}$.

$0 \le i \le n;\ 0 \le k \le s$

$$\begin{aligned}
w_{i,k} &= w_{i-1,k} \\
x_{i,k} &= x_{i-1,k-1} \\
y_{i,k} &= y_{i,k-1} + w_{i,k} \, x_{i,k}
\end{aligned} \qquad (1)$$

$0 \le i \le n;\ 0 \le k \le s$

$$\begin{aligned}
w_{i,k} &= w_{i-1,k} \\
x_{i,k} &= x_{i-1,k-1} \\
y_{i,k} &= y_{i,k+1} + w_{i,k} \, x_{i,k}
\end{aligned} \qquad (2)$$

Data dependencies corresponding to the two recurrences introduce
two different partial orderings in
$I^2=\{(i,k) \mid 0 \le i \le n;\ 0 \le k \le s\}$, which translate into two
different schedules for the computations and consequently into
different systolic designs.
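
The effect of the two accumulation orders can be illustrated with a small sketch. It is not the systolic design itself: it only accumulates the partial sums $y_{i,k}$ along k in the two directions and does not model the propagation of w and x between cells. The function names and the boundary convention $x_m = 0$ for $m < 1$ are assumptions made for the illustration.

```python
def convolve_direct(x, w):
    """y_i = sum_{k=1..s} w_k * x_{i-k} (1-based); x_m = 0 for m < 1 is assumed."""
    n, s = len(x), len(w)
    xval = lambda m: x[m - 1] if 1 <= m <= n else 0
    return [sum(w[k - 1] * xval(i - k) for k in range(1, s + 1))
            for i in range(1, n + 1)]


def convolve_recurrence(x, w, forward):
    """Accumulate the partial sums y[i][k] along k.

    forward=False: y[i][k] = y[i][k-1] + w_k * x_{i-k}   (k ascending)
    forward=True:  y[i][k] = y[i][k+1] + w_k * x_{i-k}   (k descending)
    """
    n, s = len(x), len(w)
    xval = lambda m: x[m - 1] if 1 <= m <= n else 0
    result = []
    for i in range(1, n + 1):
        # y[k] holds y_{i,k}; slots 0 and s+1 act as zero boundary values.
        y = [0] * (s + 2)
        ks = range(s, 0, -1) if forward else range(1, s + 1)
        for k in ks:
            prev = y[k + 1] if forward else y[k - 1]
            y[k] = prev + w[k - 1] * xval(i - k)
        result.append(y[1] if forward else y[s])
    return result
```

Both orders produce the same sequence y; what changes is the order in which the partial sums become available, which is what the two partial orderings capture.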


Example 2. Recursive convolution


This problem can be expressed as: given a sequence of weights
$w_1,\ldots,w_s$, determine the sequence $y_1,\ldots,y_n$ such that:

$$y_i = \sum_{k=1}^{s} w_k \, y_{i-k}$$

Of the two recurrences which can be derived by using
transformations similar to those used for the convolution problem, only
the forward recurrence has to be considered for a systolic
implementation. The backward recurrence uses data before they are
generated, and therefore cannot result in any meaningful design.


For more complex problems, such as optimal parenthesiza- 
tion using dynamic programming, selecting a good initial 
transformation is not an obvious task and in fact straightforward 
transformations such as those applied to the convolution prob- 
lem fail. It is clear that this part of the synthesis procedure 
sometimes requires creativity and programming experience. We 
suggest a way to assist the human designer in this crucial task. 
The method we propose here relies on identifying a set of con- 
stant data dependencies in the original formulation of the prob- 
lem. 


A two-step procedure 

We assume that the high level specification of a problem is
written in the form of a nested loop with the index set $I^n$ defined
by:

$$I^n = \{(i_1,\ldots,i_n) \mid l_1^1 \le i_1 \le l_1^2, \ldots, l_n^1 \le i_n \le l_n^2\}.$$

We assume that, unlike the canonic form, the loop contains a
variable which is an s-dimensional array. More specifically, we
let $s=n-1$ and assume that the loop contains an assignment
statement of the form:

$$c(i^s) = f\big(c(i^s - d_1^s), \ldots, c(i^s - d_m^s)\big) \qquad (3)$$

where $i^s=(i_1, i_2, \ldots, i_s)$ is an s-tuple of indices of the loop;

$$d_j^s=(a_{j,1}, \ldots, a_{j,t-1},\ i_t - i_n,\ a_{j,t+1}, \ldots, a_{j,s}); \quad j=1,\ldots,m$$

and where $a_{j,k}$ are integer constants. In other words, the index $i_t$
on the left side of (3) is replaced on the right side by the index
$i_n$.

Each vector $d_j^s$ ($j=1,\ldots,m$) represents a non-constant data
dependence for variable c, since its t-th component is a function of the
two indices $i_t$ and $i_n$. We can expand $d_j^s$ into a number of data
dependence vectors, each corresponding to a specific value of
index $i_n$ in the t-th component. Then, we associate to each pair
(variable, index vector) $=(c,i^s)$ the set $D^s_{i^s}$ of all the data
dependence vectors obtained by expanding $d_1^s,\ldots,d_m^s$.


Our strategy to select the initial index transformation consists
of a two-step procedure. Let
$I^s = \{(i_1,\ldots,i_s) \mid l_1^1 \le i_1 \le l_1^2, \ldots, l_s^1 \le i_s \le l_s^2\}$.
First, determine from the high-level specification of the problem a
coarse timing function $T: I^s \to Z$. This function will be used in the
subsequent step of the procedure to guide the search for a schedule of
computations indexed by $I^n$. We derive T from a subset of the
data dependencies of the algorithm, namely, the subset of constant
data dependencies; thus T provides only a lower bound for
an actual timing function for $I^s$. Furthermore, T depends only on
the implicit dependencies of the problem since it is derived
before any arbitrary ordering of the computations is introduced.
The set $D^s_{i^s}$ of data dependencies is non-constant in the
computation space. However, the set $D^c$ defined as the intersection of
$D^s_{i^s}$ for all $i^s \in I^s$ contains only constant data dependencies.
17 

For a nested loop with constant data dependencies, a linear (or
quasi-affine) timing function can be determined by applying the
transformational method described in [11,13]. Thus, if $D^c$ is not
empty we can derive a linear timing function $T : I^s \to Z$ which
is compatible with the set $D^c$, i.e. $i^s >_{D^c} i'^s$ implies
$T(i^s) > T(i'^s)$. In [11, 13] necessary and sufficient conditions
for the existence of a linear timing function are given. In
particular, it must be:

$$T(d_j) > 0 \quad \text{for each } d_j \in D^c \qquad (4)$$

System (4) may have no solution or several solutions. In the
latter case, the one which minimizes the total execution time
(defined as the difference between the maximum and minimum
values of T) is chosen.


It is obvious that if $\tau$ is an actual timing function for $I^s$ then
it must be $\tau(i^s) \ge T(i^s)$ for each $i^s \in I^s$. Moreover, because of
the monotonicity of data dependencies in $D^s_{i^s}$, it must be
$\tau(i^s) > \tau(i'^s)$ if $T(i^s) > T(i'^s)$.

The partial ordering defined by T on the set of computations
$I^s$ defines a partial ordering in a subset of $I^n$ based on the
availability of operands. Consider the set $J^n \subset I^n$ consisting of
the n-tuples $(i^s, i_n)$ for any given $i^s \in I^s$ and for
$l_n^1 \le i_n \le l_n^2$. For any two such n-tuples, we introduce the
relation $>_T$ defined by:

$$(i^s, i_n) >_T (i^s, i'_n) \iff
\max\{T(i^s - d_1), \ldots, T(i^s - d_m)\} > \max\{T(i^s - d'_1), \ldots, T(i^s - d'_m)\}$$

where $d_j$ and $d'_j$ ($j=1,\ldots,m$) are the data dependence vectors in
(3) corresponding to values $i_n$ and $i'_n$, respectively, of index $i_n$.

Obviously, the relation $>_T$ is a partial ordering in $J^n$. Thus, $J^n$
can be decomposed into a number of chains, i.e. subsets whose
elements are linearly ordered (relative to each other). Of course,
there will be only one chain if the relation $>_T$ is linear to start
with. Each chain consists of indexed computations which have
to be carried out one after the other in a specific linear ordering.
Computations belonging to different chains can be carried out
independently. Among the many chain decompositions of the
partially ordered set $J^n$, we choose the one in which the
computations in a chain are also sorted (in either increasing or
decreasing order) according to the index $i_n$. Let us denote by s the
number of such chains.


At this point, we are ready to restructure the given algorithm.
We partition the computations indexed by $J^n$ into s
separate recurrences, each corresponding to a distinct chain. In 
each recurrence the ordering for index i, is chosen according to 
its ordering in the chain. Then, we transform each recurrence 
into canonic form using the three previous rules: 1) adding 
missing indices, 2) adding new variables, and 3) renaming old 
variables. In addition we also add statements between 
recurrences to correlate variables in distinct recurrences. In fact, 
a recurrence may use a variable generated in another recurrence. 
This last step may introduce non-constant data dependencies in 
the system. 

In conclusion, the obtained new specification is expressed 
as a system of s modules, each module being a recurrence 
equation in canonic form. Non-constant data dependencies may 
occur between variables of different modules. 


3. An application to dynamic programming


Consider the dynamic programming technique applied to 
such problems as optimal parenthesization, and shortest path. 
Both problems can be expressed by the recurrence: 

$1 \le i < n;\ i < j \le n$

$$c_{i,j} = \min_{i<k<j} f(c_{i,k},\ c_{k,j}) \qquad (5)$$

with initial conditions:

$$c_{i,i+1} = c_i, \qquad 1 \le i < n$$

for some function f. Note that no explicit ordering for the
indices i, j, and k is specified. If we choose the normal
lexicographical ordering for the three indices and we apply some index
transformation we can derive a canonic form from the above
recurrence; however, the systolic implementation of it will have
execution time O(n) since it does not overlap the computations
of $c_{i,j}$ for different k.

Let $I^3=\{(i,j,k) \mid 1 \le i,j \le n;\ i<k<j\}$ be the index set of the
recurrence above; and let $I^2=\{(i,j) \mid 1 \le i,j \le n\}$ be the index set
defined by variable c. Each pair (c, (i, j)) is associated with a
distinct set of data dependencies (here represented in matrix
form):

$$D^2_{(i,j)} = \begin{bmatrix} 0 & i-k \\ j-k & 0 \end{bmatrix}$$

or in expanded form:

$$D^2_{(i,j)} = \begin{bmatrix}
0 & \cdots & 0 & 0 & -1 & -2 & \cdots & i-j+1 \\
j-i-1 & \cdots & 2 & 1 & 0 & 0 & \cdots & 0
\end{bmatrix}$$

The intersection $D^c$ of all the sets $D^2_{(i,j)}$ ($1 \le i,j \le n$) is
non-empty and is given by:

$$D^c = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$


We derive a timing function $T : I^2 \to Z$ compatible with $D^c$.
Since $D^c$ contains only constant data dependencies, we can
apply the methodology discussed in [11] to determine a linear
timing function $T(i,j) = T_1 i + T_2 j$. Conditions (4) applied to the
data dependence vectors in $D^c$ give:

$$T_2 > 0 \quad \text{and} \quad T_1 < 0,$$

that is, $T_2 \ge 1$ and $T_1 \le -1$ for integer coefficients. The integer
values of least magnitude that satisfy the above inequalities are:
$T_1=-1$ and $T_2=1$.
Thus, the optimal time transformation is:

$$T(i,j)=j-i.$$
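
The derivation of the constant dependencies and of the coarse timing function can be checked numerically on a small instance. The sketch below is only illustrative: the function names are hypothetical, and it assumes that the intersection is taken over the pairs (i, j) with j - i >= 2, i.e. those for which at least one k with i < k < j exists.

```python
def expanded_dependencies(i, j):
    """Dependence vectors of c[i][j] on c[i][k] and c[k][j] for i < k < j."""
    deps = set()
    for k in range(i + 1, j):
        deps.add((0, j - k))   # (i, j) - (i, k)
        deps.add((i - k, 0))   # (i, j) - (k, j)
    return deps


def constant_dependencies(n):
    """Intersection of the expanded sets over all (i, j) with j - i >= 2."""
    pairs = [(i, j) for i in range(1, n + 1) for j in range(i + 2, n + 1)]
    common = expanded_dependencies(*pairs[0])
    for i, j in pairs[1:]:
        common &= expanded_dependencies(i, j)
    return common


def T(i, j):
    """Coarse linear timing function with T_1 = -1 and T_2 = 1."""
    return j - i


D_c = constant_dependencies(6)
# Condition (4): the linear function T must be positive on every vector of D_c.
condition_4_holds = all(T(d1, d2) > 0 for d1, d2 in D_c)
```

Since T is linear with no constant term, applying it to a dependence vector is the same as evaluating T at that vector, which is what the last line does.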


Next we consider the partial ordering $>_T$ in
$J^3=\{(i,j,k) \mid i,j \text{ are fixed and } i<k<j\}$ defined by:

$$(i,j,k) >_T (i,j,k') \iff
\max\{T(i,k),\ T(k,j)\} > \max\{T(i,k'),\ T(k',j)\}$$

Notice that the minimal elements with respect to $>_T$ are:
$(i, j, (i+j)/2)$ if i+j is even, or
$(i, j, (i+j-1)/2)$ and $(i, j, (i+j+1)/2)$ if i+j is odd.


By repeatedly finding minimal elements after removing the
previous minimal elements from the set we obtain a decomposition
of $J^3$ in the two chains below (here we only write the third
component of the index vectors):

{if (i+j) is even}
    (i+j)/2, (i+j)/2-1, ..., i+1;
    (i+j)/2+1, (i+j)/2+2, ..., j-1.
{if (i+j) is odd}
    (i+j-1)/2, (i+j-1)/2-1, ..., i+1;
    (i+j+1)/2, (i+j+1)/2+1, ..., j-1.

We can now rewrite (5) into a form where the execution
ordering of index k is specified according to the above chains.
The new specification of the problem consists of two
recurrences: the first recurrence is a forward recurrence where
the index k varies from (i+j)/2 to i+1 (or from (i+j-1)/2 to i+1
if i+j is odd); and the second is a backward recurrence where k
varies from (i+j)/2 to j-1 (or from (i+j+1)/2 to j-1 if i+j is
odd). To transform each recurrence into canonic form some
further manipulation is necessary. We first add missing indices
to variables on both sides of (5) and then introduce new
recurrence variables to eliminate broadcast of a variable to
different destinations. We use different sets of variables for the two
recurrences. Each recurrence initializes some of its variables
to values generated by the other recurrence. Now equations (5)
can be converted into the following system of mutually
dependent recurrence equations:

for i := 1 to n-1 do c_{i,i+1,i+1} := c_{i,i+1};

for l := 2 to n-1 do
  for i := 1 to n-l do begin
    j := i+l;
    if (i+j = even) then begin
A1:   k := (i+j)/2;  a'_{i,j,k} := a_{i,j-1,k};
A2:   if k = i+1 then b'_{i,j,k} := c_{i+1,j,j} else b'_{i,j,k} := b'_{i+1,j,k};
      c'_{i,j,k} := f(a'_{i,j,k}, b'_{i,j,k});  c_{i,j,k} := c'_{i,j,k}
    end
    else begin {i+j = odd}
      k := (i+j-1)/2;  a'_{i,j,k} := a'_{i,j-1,k};
      if k = i+1 then b'_{i,j,k} := c_{i+1,j,j} else b'_{i,j,k} := b'_{i+1,j,k};
      c'_{i,j,k} := f(a'_{i,j,k}, b'_{i,j,k});
      k := (i+j+1)/2;
A3:   if k = j-1 then a_{i,j,k} := c_{i,j-1,j-1} else a_{i,j,k} := a_{i,j-1,k};
A4:   b_{i,j,k} := b'_{i+1,j,k};
      c_{i,j,k} := f(a_{i,j,k}, b_{i,j,k});
    end;

    {module 1}
    for k := ceil((i+j-1)/2 - 1) downto i+1 do begin
      a'_{i,j,k} := a'_{i,j-1,k};
      if k = i+1 then b'_{i,j,i+1} := c_{i+1,j,j} else b'_{i,j,k} := b'_{i+1,j,k};
      c'_{i,j,k} := h(c'_{i,j,k+1}, f(a'_{i,j,k}, b'_{i,j,k}));
    end;

    {module 2}
    for k := floor((i+j+1)/2 + 1) to j-1 do begin
      if k = j-1 then a_{i,j,k} := c_{i,j-1,j-1} else a_{i,j,k} := a_{i,j-1,k};
      b_{i,j,k} := b_{i+1,j,k};
      c_{i,j,k} := h(c_{i,j,k-1}, f(a_{i,j,k}, b_{i,j,k}));
    end;

A5: c_{i,j,j} := h(c'_{i,j,i+1}, c_{i,j,j-1});
  end;
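
The chain decomposition used to order index k can be reproduced with a short sketch. It assumes the coarse timing function T(i,j) = j - i and, rather than literally peeling off minimal elements one level at a time, it sorts the values of k on each side of the midpoint by the time max{T(i,k), T(k,j)} at which both operands are available; for this problem the two procedures give the same chains. The names `ready_time` and `chains` are hypothetical.

```python
def T(i, j):
    """Coarse timing function assumed for the dynamic programming recurrence."""
    return j - i


def ready_time(i, j, k):
    """Earliest time at which both operands c[i][k] and c[k][j] are available."""
    return max(T(i, k), T(k, j))


def chains(i, j):
    """Split the indices k, i < k < j, into two chains ordered by ready time."""
    ks = range(i + 1, j)
    # First chain: k at or below the midpoint, visited from the middle down to i+1.
    lower = sorted((k for k in ks if 2 * k <= i + j),
                   key=lambda k: ready_time(i, j, k))
    # Second chain: k above the midpoint, visited from the middle up to j-1.
    upper = sorted((k for k in ks if 2 * k > i + j),
                   key=lambda k: ready_time(i, j, k))
    return lower, upper
```

For example, `chains(1, 7)` yields the chains 4, 3, 2 and 5, 6, while `chains(1, 6)` yields 3, 2 and 4, 5, matching the even and odd cases listed in the text.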
From the above specification we extract distinct sets $D_1$ and $D_2$
of local data dependencies for the two modules. The columns of
$D_1$ are labelled by the variables c', a', b', and those of $D_2$ by
the variables c, a, b:

$$D_1 = \begin{bmatrix} 0 & 0 & -1 \\ 0 & 1 & 0 \\ -1 & 0 & 0 \end{bmatrix}
\qquad
D_2 = \begin{bmatrix} 0 & 0 & -1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}$$

Non-constant dependencies between variables of distinct
recurrences are defined by statements A1 to A5 in the algorithm.
We will refer to such dependencies as global dependencies.


4. Mapping the algorithm into hardware


Once the algorithm has been converted into the new form, 
the mapping into a VLSI array can be accomplished in two 
steps: 

1) Finding for each individual module in the algorithm 
representation a separate timing function which is compatible 
with the local data dependencies and also satisfies the constraints 
imposed by the global dependencies. 


2) Finding a space mapping for each module into the pro- 
cessors of a physical network which is compatible with the tim- 
ing function and satisfies global constraints. 


Assume that each processor of a VLSI array is assigned a
label $l \in L^{n-1} \subset Z^{n-1}$. Furthermore, the connection pattern of the
network is described by a set of vectors $\Delta=[\delta_1,\delta_2,\ldots,\delta_r]$ which
specify the interconnections among the processors. Precisely, $\delta_j$
is the difference of the integer labels of adjacent cells in the
network.


The space mapping is a mapping from the set of computations
to the set of processors

$$S : I^n \to L^{n-1}$$

which maps simultaneous computations into distinct processors.
Determining a solution for S is equivalent to finding a solution to
the diophantine equations:

$$SD = \Delta K \qquad (6)$$

for which the matrix $\begin{bmatrix} \tau \\ S \end{bmatrix}$, $\tau$ being the time function, is
non-singular. In (6) K is an integer matrix with positive elements.
The equations may have no solution or several solutions.
If no feasible solution is found, the design procedure is
repeated by starting with a different timing function or else a
different physical network. If several solutions are possible, the
one which is optimal according to some criterion is chosen. A
criterion can be, for example, the minimum number of required
processors. These issues are discussed in [3].

Referring again to the dynamic programming algorithm,
we first seek linear time transformations $\lambda$ and $\mu$ for modules 1
and 2, respectively. Furthermore, let $\sigma$ be the timing function for
the computation in A5 outside the two modules. The linearity
requires that for the dependence vectors of the modules it must
be:

$$\lambda(d) > 0 \ \text{ for } d \in D_1 \quad \text{and} \quad \mu(d) > 0 \ \text{ for } d \in D_2$$

that is:

$$\lambda_1 \le -1 \qquad \lambda_2 \ge 1 \qquad \lambda_3 \le -1$$
$$\mu_1 \le -1 \qquad \mu_2 \ge 1 \qquad \mu_3 \ge 1.$$

Constraints specified by A1-A5 lead to the following inequalities:

$$\begin{aligned}
\lambda(i, j, (i+j)/2) &> \mu(i, j-1, (i+j)/2) \\
\lambda(i, j, i+1) &> \sigma(i+1, j, j) \\
\mu(i, j, j-1) &> \sigma(i, j-1, j-1) \\
\mu(i, j, (i+j+1)/2) &> \lambda(i+1, j, (i+j+1)/2) \\
\sigma(i, j, j) &\ge \max[\lambda(i, j, i+1),\ \mu(i, j, j-1)]
\end{aligned}$$

It is easy to check that an optimal solution to the above system is
given by:

$$\lambda_1 = -1 \qquad \lambda_2 = 2 \qquad \lambda_3 = -1$$
$$\mu_1 = -2 \qquad \mu_2 = 1 \qquad \mu_3 = 1$$
$$\sigma_1 = -2 \qquad \sigma_2 = 1 \qquad \sigma_3 = 1.$$

Hence:

$$\begin{aligned}
\lambda(i,j,k) &= -i+2j-k \\
\mu(i,j,k) &= -2i+j+k \\
\sigma(i,j,j) &= -2i+2j.
\end{aligned}$$
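
The stated solution can be checked mechanically against both the local and the global constraints. The sketch below is a verification aid only, not part of the synthesis procedure; the dependence vectors are the columns of $D_1$ and $D_2$ as read from the text, and the function names are hypothetical.

```python
def lam(i, j, k):
    """Timing function of module 1."""
    return -i + 2 * j - k


def mu(i, j, k):
    """Timing function of module 2."""
    return -2 * i + j + k


def sigma(i, j):
    """Timing of statement A5, which computes c[i, j, j]."""
    return -2 * i + 2 * j


# Columns of the local dependence matrices D1 and D2.
D1 = [(0, 0, -1), (0, 1, 0), (-1, 0, 0)]
D2 = [(0, 0, 1), (0, 1, 0), (-1, 0, 0)]

# lam and mu are linear, so applying them to a vector is a plain evaluation.
local_ok = all(lam(*d) > 0 for d in D1) and all(mu(*d) > 0 for d in D2)


def global_constraints_hold(n):
    """Check the inequalities induced by A1-A5 for all 1 <= i, i + 2 <= j <= n."""
    for i in range(1, n + 1):
        for j in range(i + 2, n + 1):
            if (i + j) % 2 == 0:
                m = (i + j) // 2
                if not lam(i, j, m) > mu(i, j - 1, m):        # A1
                    return False
            else:
                m = (i + j + 1) // 2
                if not mu(i, j, m) > lam(i + 1, j, m):        # A4
                    return False
            if not lam(i, j, i + 1) > sigma(i + 1, j):        # A2
                return False
            if not mu(i, j, j - 1) > sigma(i, j - 1):         # A3
                return False
            if not sigma(i, j) >= max(lam(i, j, i + 1), mu(i, j, j - 1)):  # A5
                return False
    return True
```

Note that $\sigma(i,j,j) = -2i+2j$ is twice the coarse timing function $T(i,j) = j-i$, so the final schedule respects the lower bound given by T.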


The next step is to find a mapping of the computations
indexed by $I^3$ into the processors of a VLSI array. We consider
a triangular array labelled by $L^2 = \{(x,y)\} \subset Z^2$. The
interconnections among cells are specified by the matrix:


The mapping function $S : I^3 \to L^2$ of the computations in the
triangular array is obtained with a construction similar to the one
for the timing functions. The same space function is determined
for the two modules and is given by:

$$S(i,j,k) = (j, i).$$

The resulting design is identical to the one first introduced in [4].
The systolic array and the action of a cell are depicted in figure
1.


5. Conclusions
We have suggested a methodology to assist a human
designer in the difficult task of designing a VLSI algorithm
starting from a sequential specification of a problem. Among the
many possible transformations which can make parallelism
explicit in a program, the method helps select a feasible and
optimal sequence of transformations. The methodology appears
useful for complex non-numerical problems for which standard
restructuring techniques seem inadequate. The class of obtainable
designs is not restricted to designs having constant data
flow.
Abstract


We illustrate that the verification of systolic 
architectures can be carried out using techniques 
developed in the context of verification of programs. 
This is achieved by partitioning the original problem into 
three parts, namely proving the correctness of the data 
representation, of the individual processing elements, 
and of a composition of a number of such processors. 
By expressing a processing element as a function on a 
stream of data we are able to utilize standard proof 
techniques from programming language theory. This 
decomposition leads to relatively straightforward proofs 
of the properties of the systolic architecture. We 
illustrate the techniques via a substantial example, 
namely the proof of the correctness of a linear-time 
systolic architecture for computing the gcd of 
polynomials. Although this architecture was designed
a few years ago, a formal proof of correctness
has not hitherto appeared in the literature. 


I Introduction 
In the last few years there has been much interest in 
the use of formal techniques for the design and analysis 
of VLSI circuits. Special formalisms have been invented 
for expressing transformations and specifications of 
such circuits. One important aspect of the analysis of 
circuits is the correctness problem. In this paper we 
address the correctness problem for an important class 
of parallel architectures, namely systolic arrays. This 
problem is decomposed in such a manner that formal 
techniques from programming language theory can be 
used in a straightforward manner. The particular 
techniques we use were invented in the context of 
proving the correctness of data representations [3] and
semantics of functional languages [4].


The individual processors in systolic arrays act
synchronously (at least they can be conceptually viewed
thus, even though the actual implementation may be
asynchronous) on streams of data tokens. As a result
their operation can be divided into a three-step
read-execute-write cycle, where the execute phase does
not branch on the availability of data values. It has
been shown [4] that under such conditions the action of
such a process can be described completely by a stream
function. The action of the entire network is therefore
merely the composition of these individual functions.
The verification problem is thus reduced to the problem
of verifying that the composition of a certain number
of copies of a function computes as specified. We shall
illustrate the technique by proving the correctness of a 
linear systolic architecture for computing the greatest 
common divisor (GCD) of two polynomials [1]. Due to 
lack of space, we present here only a brief outline of the 
proof. The interested reader is referred to [8] for full 
details. 


The rest of this paper is organized as follows. In the 
following section (Sec II) we briefly describe the 
architecture of the chip. We also express each processor 
in the array as a stream function, provide a
representation mapping from the domain on which the
chip operates (i.e. streams of numbers) to the domain of
polynomials, and give a representation invariant. The
following section (Sec III) then formulates the actual
verification problem as three subproblems and addresses
each in turn. Finally in Sec IV we compare our 
technique with other existing techniques for verifying 
systolic architectures. 


II Description of the gcd chip 
The basis of the gcd-chip is Euclid's algorithm, which
depends on the following lemma. If A and B are two
non-zero polynomials as shown below,

$$A = a_i x^i + \cdots + a_1 x + a_0$$
$$B = b_j x^j + \cdots + b_1 x + b_0$$

then their gcd satisfies the following invariant.

$$\gcd(A, B) = \begin{cases}
\gcd\big(A - [a_i/b_j]\,x^{i-j} B,\ B\big) & \text{if } i \ge j \\
\gcd\big(A,\ B - [b_j/a_i]\,x^{j-i} A\big) & \text{if } j \ge i
\end{cases}$$


This serves as the basis for a gcd-preserving
transformation that also reduces the degree of one of the
polynomials by at least one. It is successively applied to
the original pair of polynomials until one of them is
reduced to the zero polynomial, at which point the other
one is the desired gcd. Let us view a polynomial as a
sequence of coefficient-power pairs in decreasing order
of power. The first term has a nonzero coefficient, and
thus its power is the degree of the polynomial. If p_1 and
p_2 are two such sequences we can express Euclid's
algorithm as in Fig II-1. In our proof of correctness we
shall show that the gcd architecture computes exactly
this function.


gcd(p_1, p_2) =
    if p_1 = [0, 0] then p_2
    elseif p_2 = [0, 0] then p_1
    elseif deg(p_1) ≥ deg(p_2) then
        gcd((p_1 - [Q(p_1, p_2), (deg(p_1)-deg(p_2))]*p_2), p_2)
    else
        gcd(p_1, (p_2 - [Q(p_2, p_1), (deg(p_2)-deg(p_1))]*p_1))

where
    deg(p) = second(first(p)) is the degree of p;
    Q(p_1, p_2) = first(first(p_1)) / first(first(p_2))
        is the ratio of the leading coefs of p_1 and p_2;
    and '-' and '*' are operations on polynomials;
    first (and second) is a function that
        returns the first (second) element of a list.


Figure II-1: Euclid's algorithm
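
As an executable companion to the figure, the following sketch implements the same recursion iteratively on lists of (coefficient, power) pairs. The helper names `poly`, `lead` and `sub_scaled`, and the use of exact rational arithmetic, are assumptions made for the illustration; the zero polynomial is represented by the single pair (0, 0), as in the figure.

```python
from fractions import Fraction

ZERO = [(Fraction(0), 0)]  # the zero polynomial, written [0, 0] in the figure


def poly(*pairs):
    """Build a polynomial from (coefficient, power) pairs in decreasing power."""
    return [(Fraction(c), p) for c, p in pairs]


def deg(p):
    return p[0][1]          # second(first(p))


def lead(p):
    return p[0][0]          # first(first(p))


def sub_scaled(p1, q, shift, p2):
    """Return p1 - [q, shift] * p2, i.e. p1 - q * x**shift * p2."""
    terms = {}
    for c, e in p1:
        terms[e] = terms.get(e, Fraction(0)) + c
    for c, e in p2:
        terms[e + shift] = terms.get(e + shift, Fraction(0)) - q * c
    kept = sorted(((c, e) for e, c in terms.items() if c != 0),
                  key=lambda t: -t[1])
    return kept or ZERO


def gcd(p1, p2):
    """Euclid's algorithm as in the figure; the result is not made monic."""
    while True:
        if p1 == ZERO:
            return p2
        if p2 == ZERO:
            return p1
        if deg(p1) >= deg(p2):
            p1 = sub_scaled(p1, lead(p1) / lead(p2), deg(p1) - deg(p2), p2)
        else:
            p2 = sub_scaled(p2, lead(p2) / lead(p1), deg(p2) - deg(p1), p1)
```

For instance, `gcd(poly((1, 2), (-3, 1), (2, 0)), poly((1, 2), (-1, 0)))` reduces (x-1)(x-2) and (x-1)(x+1) to the common factor x - 1. Each step cancels the leading term of one polynomial, so its degree strictly decreases, which is the property the termination argument relies on.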


The Kung-Brent architecture consists of 2*m-1
processors cascaded linearly. Each processor has three
data input (and output) channels, and one control line,
start. Two of the data channels, a and b, carry streams
of numbers, and the third, d, carries a single integer
(aligned with the signal on the start line). The
processor begins a new cycle when it receives a start bit
on the start line. Depending on whether or not d is
non-negative, the processor determines whether stream
A or stream B is to be reduced (say A). It then computes
the coefficient q (= a_1/b_1, the ratio of the leading
elements of the two streams), while delaying the other
stream B by one time unit. After this it cycles through
the rest of the input stream (i.e. until a new start bit
appears), replacing each element a_k by a_k - q*b_k, and
passing stream B unaltered (except for the delay). We
shall consider the processor operating on only one pair
of polynomials. From the description of each processor,
the equivalent stream function (called reduce, see Fig
II-2) can easily be determined, using the technique
described in [7].


reduce(s_1, s_2, d) =
    if first(s_1)=0 or (first(s_2)≠0 ∧ d≥0) then
        [f(rest(s_1), rest(s_2), first(s_1)/first(s_2))^0, s_2, d-1]
    else
        [s_1, f(rest(s_2), rest(s_1), first(s_2)/first(s_1))^0, d+1]

where f is defined as
    f(s_1, s_2, q) = if s_1 = [] then []
        else (first(s_1) - q*first(s_2)) ^ f(rest(s_1), rest(s_2), q)
Note: ^ is an infix cons operator


Figure II-2: The function reduce


In order to prove the correctness of the gcd chip we
must now demonstrate that the function reduce, applied
2*m-1 times to input streams of length m, yields the gcd
of the polynomials that the input streams represent.
Before formulating this problem we must set up some
basics. Firstly, we must give a mapping, rep, from the
data domain on which the chip (i.e. reduce) operates
(streams) to the abstract domain of interest (polynomials).
Secondly, we must specify a representation invariant, R,
that must be satisfied by elements of the data domain.


This is done as follows. We define rep in terms of
another function prep which takes an integer d and a
single stream of numbers s (which has no leading
zeroes) and yields a polynomial.


prep(s, d) =
    if d = 0 then
        if s=[] then [0, 0] else [first(s), 0]
    elseif s=[] then [0, d]^prep(s, d-1)
    else [first(s), d]^prep(rest(s), d-1)
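
A direct transcription of prep makes its behaviour easy to try out. The sketch assumes Python lists for streams and tuples for coefficient-power pairs, and it returns a one-element list in the d = 0 case so that every branch yields a list of pairs; that choice is an assumption about the intended typing.

```python
def prep(s, d):
    """Pair the elements of stream s with the powers d, d-1, ..., 0.

    Missing elements (s shorter than d + 1) are treated as zero coefficients.
    """
    if d == 0:
        return [(s[0] if s else 0, 0)]
    if not s:
        return [(0, d)] + prep(s, d - 1)
    return [(s[0], d)] + prep(s[1:], d - 1)
```

For example, `prep([3, 0, 2], 2)` gives the pairs for 3x^2 + 2, and `prep([1], 2)` pads the stream with zero coefficients to give x^2.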


Using this we now define rep (Fig II-3). AllZero(s) is
a predicate asserting that all elements in s are zero,
StripTrailingZeroes(s) is a function that removes all
trailing zeroes from a stream s (provided AllZero(s) is
false) and Final(i, s) is the stream consisting of the last i
elements of a stream s.


rep(s_1, s_2, d) =
    if AllZero(s_1) then
        [[0,0], prep(StripTrailingZeroes(s_2),
                     length(StripTrailingZeroes(s_2))-1)]
    elseif AllZero(s_2) then
        [prep(StripTrailingZeroes(s_1),
              length(StripTrailingZeroes(s_1))-1), [0,0]]
    elseif first(s_1)=0 then rep(rest(s_1), s_2, d-1)
    elseif first(s_2)=0 then rep(s_1, rest(s_2), d+1)
    else
        rep'(StripTrailingZeroes(s_1),
             StripTrailingZeroes(s_2), d)

where
    rep'(s_1, s_2, d) =
        if d > length(s_1) - length(s_2) then
            [prep(s_1, length(s_2)+d-1),
             prep(s_2, length(s_2)-1)]
        elseif d = length(s_1) - length(s_2) then
            [prep(s_1, length(s_1)-1), prep(s_2, length(s_2)-1)]
        elseif d < length(s_1) - length(s_2) then
            [prep(s_1, length(s_1)-1),
             prep(s_2, length(s_1)-d-1)]


Figure II-3: The representation mapping, rep


The representation invariant consists of four
conjuncts R_1, R_2, R_3 and R_4 which are defined as
follows.

R_1 ≡ length(s_1) = length(s_2) (= n, say)
R_2 ≡ first(s_…) ≠ 0 …
R_3 ≡ d≥0 ⇒
      (n≥d ∧ AllZero(Final(d, s_2))) ∨ AllZero(s_1)
R_4 ≡ d<0 ⇒
      (n≥-d ∧ AllZero(Final(-d, s_1))) ∨ AllZero(s_2)


III Problem decomposition and proof
With this machinery in place, the actual proof of
correctness is achieved by demonstrating that reduce
satisfies the following conditions.
• It must preserve the invariant R on every
  invocation, i.e.
      R(s_1, s_2, d) ⇒ R(reduce(s_1, s_2, d))                    (1)

• It must preserve the gcd of the polynomials
  obtained by applying rep to its input and
  output streams, i.e.
      gcd[rep(reduce(s_1, s_2, d))] = gcd[rep(s_1, s_2, d)]      (2)

• The computation defined by reduce
  "terminates", i.e. if the length of the original
  streams is m then exactly 2*m-1 applications
  of reduce (denoted by reduce^{2*m-1}(s_1, s_2, d))
  cause one of the polynomials to become the
  zero polynomial¹. This means that
      first(rep(reduce^{2*m-1}(s_1, s_2, d))) = [0,0]
      ∨ second(rep(reduce^{2*m-1}(s_1, s_2, d))) = [0,0]         (3)


The proof of Theorem 1 involves fairly straightforward
manipulation after substituting the definition of reduce
in the equation. Next, proving that the function reduce
preserves the gcd of the polynomials involves a case by
case analysis on the first elements in the streams s_1 and
s_2. Although the proof is somewhat long it is fairly simple
conceptually. Finally, to prove Theorem 3 we also
perform a similar casewise analysis. We first show that
if one of the polynomials that [s_1, s_2, d] represent is the
zero polynomial, then further application of reduce
leaves the streams (and hence the polynomials they
represent) unchanged. Then we show that associated
with each [s_1, s_2, d] is an integer² N which decreases by
exactly one on every application of reduce. When N
becomes zero one of the polynomials is the zero
polynomial. The interested reader is referred to [8] for
full details.


IV Conclusions 

We have presented by means of an example how well- 
known techniques from programming language theory 
can be profitably applied to the verification of systolic 
architectures. An additional contribution of this paper 
is the proof of correctness of a hitherto unproved systolic 
array. Other researchers have addressed the problem of 
reasoning about systolic architectures [2,5, 6]. In 
Chen’s approach space-time recursion equations are set 
up and their fixed point [9] yields the function describing 
the network. By using our approach we do not need to
explicitly reason about the time component. Successive
time instants are represented by successive appearances
of tokens on a stream. Melhem [5] describes a
formalism involving graphs with colored arcs which is
very similar to ours. The processing elements are also
modelled as (essentially) stream functions, though they
are referred to as operators. The overall network
behavior is obtained by solving a system of difference
equations. Purushothaman [6] has a similar
approach, where the verification problem is also reduced
to the solution of recurrence equations. In both these
methods the equations are set up using the token-wise
behavior of the individual processing elements, and
standard techniques for manipulating functions are not
exploited. Their use of difference/recurrence equations
implies that both the data representation and the processor
behavior issues have to be handled simultaneously.
These issues are cleanly separated in our approach.


¹ Note that this formulation of termination is slightly different from
termination in the classical sense. This is because our hardware
architecture has a constant number of processors, and thus applies the
function reduce exactly 2m+1 times, regardless of whether it is redundant
or not.

² N represents the sum of the number of leading zeroes in one of the
streams and the degrees of the two polynomials.
Abstract. In this paper, a design methodology for synthesizing
efficient parallel algorithms and architectures from prob-
lem definitions specified in the language Crystal is presented. 
First, transformations on a problem definition to programs with 
bounded order and bounded degree are introduced. This stage 
of transformations aims to reduce the dominant costs in the 
underlying hardware which might be incurred by a direct par- 
allel implementation of the problem definition. At the second 
stage, pipelining is automatically incorporated into an algo- 
rithm so that the hardware resources can be most effectively 
utilized. The methodology allows the derivation of systolic al- 
gorithms and architectures in a unified framework based on 
Crystal. Since Crystal is a general purpose language for
parallel programming, new synthesis techniques and insights into
problems can be integrated readily within the existing frame- 
work. 


1. Introduction 


VLSI technology provides a natural medium for parallel 
processing, from the level of switching elements, to logic gates, 
to functional and control units, and to architectures and
algorithms. To exploit its potential fully, parallelism must be used
at increasingly higher levels. Design automation, consequently, 
must go beyond the level of taking as givens architectural level 
specifications, and should head towards compiling or synthe- 
sizing efficient parallel algorithms and architectures. The tech- 
nology, on the other hand, constrains the way in which parallel 
computation can be organized. Dominant costs at the
technological level often have profound implications in the designs
at algorithmic and architectural levels. Clearly, to achieve an 
efficient design, one must take advantage of what the tech- 
nology can offer, and minimize the costs associated with the 
constraints it imposes upon the design. 

In this paper, we describe a design methodology that takes 
into account the dominant costs in the underlying technology 
and minimizes such costs to yield efficient parallel algorithms 
and architectures. The design process starts with a problem 
definition specified in the language Crystal [4]. The defini- 
tion is interpreted as a parallel algorithm, however naive and 
inefficient it might be. From this naive algorithm, the domi- 
nating costs it might incur in the underlying parallel hardware, 
such as that of communications and fan-ins and fan-outs, are 
examined. We then improve the algorithm by a series of
transformations that result in a bounded order and bounded
degree program which has reduced hardware costs. Once such a
program is obtained, it goes through another stage of trans- 
formation, called space-time mapping, by which pipelining is 
automatically incorporated into the algorithm and the hard- 
ware resources are fully utilized. Algorithms are classified as 
either uniform or non-uniform because space-time mapping for 
uniform algorithms can be obtained by an expedient procedure. 

Throughout this paper, the methodology is applied to the 
dynamic programming problem, which turns out to be a uni- 
form algorithm. The well-known systolic algorithm of Guibas, 
Kung and Thompson [7] and others are generated as a result. 
The rest of this paper is organized as follows: in Section 2, a 
mathematical definition of dynamic programming is given. In 
Section 3, its interpretation as a parallel algorithm is described. 
In Section 4, transformations for reducing the number of fan- 
ins and fan-outs are presented. In Section 5, transformations 
that reduce long-range communications are illustrated. In
Section 6, space-time mapping for uniform algorithms is described.
A general inductive method for finding space-time mapping for 
non-uniform algorithms can be found in [5]. In Section 7, opti- 
mization of control signals is illustrated. Finally, some related 
work is discussed in Section 8. 


2. A General Definition of Dynamic Programming 

A large class of optimization problems can be solved by
dynamic programming. The definition of such problems can
be posed in a general form where $C(i,j)$ is some cost function
which is to be minimized:

\[
C(i,j) =
\begin{cases}
s > 0 \rightarrow \min_{i<k<j} \{F(C(i,k), C(k,j))\} & \text{where $F$ is some function on the costs} \\
s = 0 \rightarrow C_i & \text{for some individual cost}
\end{cases}
\tag{2.1}
\]

where $s = j - i - 1$, and $i$ and $j$ are integers in the range
$0 < i < j \le n$, for some constant $n$.
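As a point of reference for the transformations that follow, Equation (2.1) can be evaluated directly by a sequential program. The sketch below is only an illustration: the function name `dynamic_programming`, the list `base_costs` (holding the individual costs $C_i$, with index 0 unused), and the choice of `F` passed by the caller are assumptions made here and do not appear in the original text.

```python
from functools import lru_cache


def dynamic_programming(base_costs, F):
    """Evaluate Equation (2.1) for every pair 0 < i < j <= n.

    base_costs[i] plays the role of the individual cost C_i used when
    s = j - i - 1 = 0; index 0 is unused so that indices match the text.
    F is the (problem specific) function on two costs.
    """
    n = len(base_costs)  # i ranges over 1..n-1, j over i+1..n

    @lru_cache(maxsize=None)
    def C(i, j):
        s = j - i - 1
        if s == 0:
            return base_costs[i]
        # min over all k strictly between i and j (s candidates)
        return min(F(C(i, k), C(k, j)) for k in range(i + 1, j))

    return {(i, j): C(i, j) for i in range(1, n) for j in range(i + 1, n + 1)}


# Hypothetical usage with F chosen as addition.
table = dynamic_programming([None, 3, 1, 4, 1], lambda x, y: x + y)
```

With addition as `F`, every split of an interval yields the sum of its base costs, which makes the result easy to check by hand.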


3. Parallel Interpretation of Algorithms 
The straightforward definition given in Equation (2.1) is
in the form of a Crystal recursion equation with recursion
variables $i$ and $j$. It can be interpreted as a parallel algorithm as
follows: Each pair $(i,j)$ in Equation (2.1) is interpreted as a
process in an ensemble of parallel processes denoted by the set
$P_1 \stackrel{\mathrm{def}}{=} \{(i,j) : 0 < i < j \le n\}$. The local processing at each
process $(i,j)$ is, in this example, the function $\min_{i<k<j}\{F(x_k, y_k)\}$
that takes $j - i - 1$ pairs of arguments. The communication
between processes is specified, for instance, by the pairs $(i,k)$
and $(k,j)$ appearing on the right-hand side of Equation (2.1) to
indicate that the computation of $C(i,j)$ at process $(i,j)$ needs
the results from processes $(i,k)$ and $(k,j)$.

The ensemble of processes and its data flow can be
depicted by a DAG (Directed Acyclic Graph) as shown in Figure
1. It consists of nodes, where each node corresponds to a
process $(i,j)$ in $P_1$, and directed edges, each of which comes out of
a node whose corresponding process appears on the right-hand
side of Equation (2.1), and goes into a node whose
corresponding process appears on the left-hand side of the equation. The
directed edges of the DAG define the data dependency relation
of the algorithm. We say that a process $u$ immediately precedes
$v$ ($u \prec v$), or $v$ immediately depends on $u$ ($v \succ u$), if there is
a directed edge from $u$ to $v$. The transitive closure “$\preceq$” of this
relation, called “precede”, is a partial order, and there is no
infinite decreasing chain from any node $v$, nor is there an infinite
number of processes that immediately precede $v$. Those nodes
that have no incoming edges are called sources.

Processes are parallel in nature. A computation starts 
at the sources which are properly initialized, and is followed 
by other processes each of which starts execution when all of 
its required inputs, or dependent data become available. The 
parallel execution of the naive dynamic programming definition 
starts with all processes such that s = 0, followed by processes 
with increasing s as illustrated in Figure 1.


Figure 1: The DAG describing the data dependency of
dynamic programming.
An immediate, very naive implementation of the parallel
interpretation uses one processor for each process, and
altogether $O(n^2)$ parallel processors are needed. The number of
time steps needed to compute the result is at best $n - 1$ since
any $C(i,j)$ cannot be computed until $C(i, j-1)$ and $C(i+1, j)$
are computed.
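The step count quoted above can be checked on small instances by computing, for every process of $P_1$, the earliest step at which all of its operands are available. This is a minimal sketch under the assumption that each process takes exactly one step once its inputs exist; the name `earliest_steps` is introduced here purely for illustration.

```python
from functools import lru_cache


def earliest_steps(n):
    """Earliest step at which each process (i, j), 0 < i < j <= n, can fire.

    Sources (s = j - i - 1 = 0) fire at step 0; every other process fires
    one step after the latest of its operands (i, k) and (k, j), i < k < j.
    """

    @lru_cache(maxsize=None)
    def step(i, j):
        if j - i - 1 == 0:
            return 0
        return 1 + max(max(step(i, k), step(k, j)) for k in range(i + 1, j))

    return {(i, j): step(i, j) for i in range(1, n) for j in range(i + 1, n + 1)}


steps = earliest_steps(6)
```

For every process the earliest step equals $s$, so the wavefront visits $s = 0, 1, \ldots, n-2$, which is $n - 1$ steps in total.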


4. Problems with Large Number of Fan-ins and Fan-outs 


Due to the inherent physical constraints imposed by the 
driving capability of communication channels, power
consumption, heat dissipation, memory bandwidth, etc., it is
reasonable to assume that data can only be sent or received
simultaneously to or from a small number of destinations or sources.
Putting such constraints in algorithmic terms: the number of 
“fan-ins” and “fan-outs” of a datum must be small, where fan- 
in degree of a datum is defined to be the number of data items 
on which it depends, and fan-out degree is the number of data 
items dependent upon it. If what we are interested in are prac- 
tical and efficient algorithms, the fan-in and fan-out degrees 
should be taken into account in measuring the complexity. 

It can be easily seen that the fan-in degree of $C(i,j)$, which
appears on the left-hand side of Equation (2.1), is $2s$, where
$s \stackrel{\mathrm{def}}{=} j - i - 1$. Conversely, for any value appearing as $C(i,k)$
on the right-hand side of Equation (2.1), that value is needed by
$C(i,j)$ for $j = k+1, \ldots, n$ (there are $n - k$ such terms). The
same value also appears as $C(k,j)$ in the equation and it is
needed by processes $(i,j)$ for $i = 1, \ldots, k - 1$ (there are $k - 1$
such terms). Since both $s$ and $k$ grow with the problem size
$n$, the number of fan-ins and fan-outs therefore grows linearly
with the size of the problem.
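The linear growth of both degrees can be confirmed by enumerating the edges of the dependency DAG of Equation (2.1). The sketch below is an illustration only; `fan_degrees` is a name made up here, and the closed form checked for the fan-out, $(n - q) + (p - 1)$ for the value at process $(p, q)$, is simply the sum of the two counts given in the paragraph above, written with explicit indices.

```python
def fan_degrees(n):
    """Count fan-in and fan-out of every C(i, j), 0 < i < j <= n, in Eq. (2.1)."""
    processes = [(i, j) for i in range(1, n) for j in range(i + 1, n + 1)]
    fan_in = {}
    fan_out = {p: 0 for p in processes}
    for (i, j) in processes:
        # operands on the right-hand side: C(i, k) and C(k, j) for i < k < j
        operands = [(i, k) for k in range(i + 1, j)]
        operands += [(k, j) for k in range(i + 1, j)]
        fan_in[(i, j)] = len(operands)
        for u in operands:
            fan_out[u] += 1
    return fan_in, fan_out


fan_in, fan_out = fan_degrees(6)
```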

To avoid the high cost incurred at the hardware level, the 
large fan-in and fan-out degrees must be reduced. In the fol- 
lowing transformations, the original fan-in and fan-out degrees 
are reduced from O(n) to a constant while the complexity of 
the original algorithm is maintained the same as before. 


4.1. Reducing fan-out degrees 
The idea used in reducing the fan-out degree is quite sim- 
ple: 


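Proposition 4.1. Let $z(l)$ for $u \le l \le v$, where $u$ and $v$ are
integers and $u \le v$, be $v - u + 1$ variables to which a value $x$
shall be assigned, i.e., $z(l) = x$ for $u \le l \le v$ (where the
fan-out degree of $x$ is $v - u + 1$). Then the assignments of these
variables can be performed instead by

\[
z(l) =
\begin{cases}
l = u \rightarrow x \\
l > u \rightarrow z(l-1)
\end{cases}
\tag{4.1}
\]

or by

\[
z(l) =
\begin{cases}
l = v \rightarrow x \\
l < v \rightarrow z(l+1)
\end{cases}
\tag{4.2}
\]

for $u \le l \le v$,
where the fan-out degrees of $x$ and $z(l)$ for $u \le l \le v$ are 1.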
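The effect of Proposition 4.1 can be seen in a few lines of code: a broadcast writes $x$ into every variable directly, whereas the two chains pass it from neighbor to neighbor. The function names `broadcast`, `chain_up` and `chain_down` are assumptions made for this sketch.

```python
def broadcast(x, u, v):
    """Direct assignment z(l) = x for u <= l <= v (fan-out v - u + 1)."""
    return {l: x for l in range(u, v + 1)}


def chain_up(x, u, v):
    """Equation (4.1): z(u) = x, and z(l) = z(l - 1) for l > u."""
    z = {}
    for l in range(u, v + 1):
        z[l] = x if l == u else z[l - 1]
    return z


def chain_down(x, u, v):
    """Equation (4.2): z(v) = x, and z(l) = z(l + 1) for l < v."""
    z = {}
    for l in range(v, u - 1, -1):
        z[l] = x if l == v else z[l + 1]
    return z
```

All three produce the same assignment; the chains trade a fan-out of $v - u + 1$ for $v - u$ sequential hops.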


To decrease the fan-out degree of $C(i,k)$, let the variables
at processes $(i,j)$ for $k < j \le n$ to which $C(i,k)$ will be assigned
be denoted by $a(i,j,k)$. Similarly, let the variables at processes
$(i,j)$ for $1 \le i < k$ to which $C(k,j)$ is to be assigned be denoted
by $b(i,j,k)$. Applying Equation (4.1) of the proposition to
$a(i,j,k)$ and Equation (4.2) to $b(i,j,k)$, we obtain

\[
a(i,j,k) =
\begin{cases}
j = k \rightarrow C(i,k) \\
j > k \rightarrow a(i,j-1,k)
\end{cases}
\qquad
b(i,j,k) =
\begin{cases}
i = k \rightarrow C(k,j) \\
i < k \rightarrow b(i+1,j,k)
\end{cases}
\tag{4.3}
\]

for $1 \le i \le k \le j \le n$.

Now the $n - k$ fan-out degree of $C(i,k)$ to all processes $(i,j)$,
where $k < j \le n$, is reduced to 1, and so is the $k - 1$ fan-out
degree of $C(k,j)$ to all processes $(i,j)$ for $1 \le i < k$. We may
now replace the high fan-out degree values $C(i,k)$ and $C(k,j)$
on the right-hand side of Equation (2.1) by the low fan-out
degree values $a(i,j,k)$ and $b(i,j,k)$:

\[
C(i,j) =
\begin{cases}
s > 0 \rightarrow \min_{i<k<j} \{F(a(i,j,k), b(i,j,k))\} \\
s = 0 \rightarrow C_i
\end{cases}
\tag{4.4}
\]


4.2. Reducing the fan-in degrees 

The reduction of fan-in degree of a value computed by a
process relies on the associativity of the operation being
computed.


Definition 4.1. An operation “$\oplus$”, where $z = \bigoplus_{u<l<v} x(l)$, on
$v - u - 1$ arguments, is associative if it can be replaced by
the composition of the following sequence of binary associative
operations “$\oplus$” on variables $z(l)$, $u < l < v$:

\[
z(l) =
\begin{cases}
l = u + 1 \rightarrow x(l) \\
u + 1 < l < v \rightarrow z(l-1) \oplus x(l)
\end{cases}
\tag{4.5}
\]

and $z = z(v-1)$.


Similarly, the same operation can be achieved by a
different sequence using both binary and ternary associative
operations.


Proposition 4.2. If “$\oplus$” is an associative operation on $v - u - 1$
arguments $x(l)$ for $u < l < v$, where $z = \bigoplus_{u<l<v} x(l)$, then it
can be replaced by a sequence of binary and ternary operations
on variables $z(l)$, for $m \le l \le v$, of the following form:

\[
z(l) =
\begin{cases}
l = m \rightarrow x(m) \oplus x(u+v-m) \\
m < l < v \rightarrow z(l-1) \oplus x(l) \oplus x(u+v-l) \\
l = v \rightarrow z(l-1)
\end{cases}
\tag{4.6}
\]

where $m = \lceil (u+v)/2 \rceil$
and $z = z(v)$.
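The evaluation order of Equation (4.6) can be tried out on concrete data. The sketch below is an illustration under stated assumptions: the helper name `middle_out_reduce`, the dictionary `x` holding the arguments $x(l)$ for $u < l < v$, and the requirement $v - u \ge 2$ are all introduced here. Note that when $u + v$ is even the middle argument $x(m)$ is combined with itself in the first step, which is harmless for an idempotent operation such as min.

```python
def middle_out_reduce(x, u, v, op):
    """Combine x(l), u < l < v, in the order prescribed by Equation (4.6).

    x  : dict mapping l -> x(l) for u < l < v (assumes v - u >= 2)
    op : binary associative operation; the ternary step is two uses of op
    """
    m = (u + v + 1) // 2  # ceil((u + v) / 2)
    z = {}
    for l in range(m, v + 1):
        if l == v:
            z[l] = z[l - 1]                       # l = v: pass the result on
        elif l == m:
            z[l] = op(x[m], x[u + v - m])         # innermost pair first
        else:
            z[l] = op(op(z[l - 1], x[l]), x[u + v - l])  # then move outward
    return z[v]


values = {l: (l * 7) % 11 for l in range(3, 10)}   # arguments x(3)..x(9)
result = middle_out_reduce(values, 2, 10, min)
```

The pair nearest the middle is consumed first and the outermost pair last, which matches the order of increasing delay described in the text.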


Since the operation “min” for computing $C(i,j)$ is
associative, let the high fan-in degree value $C(i,j)$ at process $(i,j)$
be computed instead by the sequence $c(i,j,k)$ for $m \le k \le j$,
where $m \stackrel{\mathrm{def}}{=} \lceil (i+j)/2 \rceil$. Now apply Proposition 4.2 to Equation
(4.4), and it is transformed to

\[
c(i,j,k) =
\begin{cases}
s = 0 \rightarrow C_i \\
s > 0 \rightarrow
\begin{cases}
k = m \rightarrow \min\bigl(F(a(i,j,k), b(i,j,k)),\; F(a(i,j,i+j-k), b(i,j,i+j-k))\bigr) \\
m < k < j \rightarrow \min\bigl(c(i,j,k-1),\; F(a(i,j,k), b(i,j,k)),\; F(a(i,j,i+j-k), b(i,j,i+j-k))\bigr) \\
k = j \rightarrow c(i,j,k-1)
\end{cases}
\end{cases}
\tag{4.7}
\]

and $C(i,j) = c(i,j,j)$ for all $(i,j) \in P_1$.

The application of Proposition 4.2 is motivated by the
following: Due to the data dependency, the value $F(a(i,j,k),
b(i,j,k))$ for $i < k < j$ cannot be computed before at least
$\max(k - i, j - k)$ time steps. A pair with the least delay is the
one with $k = \lfloor (i+j)/2 \rfloor$ and/or $k = \lceil (i+j)/2 \rceil$, where the delay is about
half of the interval size. Consequently, the most efficient
sequence of binary or ternary operations will start with this pair,
and will be followed by other pairs in the order of increasing
amount of delay.


5. Communications and Bounded-order Recursion Equations


The following two equations

\[
a(i,j,k) =
\begin{cases}
j = k \rightarrow c(i,k,k) \\
j > k \rightarrow a(i,j-1,k)
\end{cases}
\qquad
b(i,j,k) =
\begin{cases}
i = k \rightarrow c(k,j,j) \\
i < k \rightarrow b(i+1,j,k)
\end{cases}
\tag{5.1}
\]

for $1 \le i \le k \le j \le n$,
are obtained from Equation (4.3) by substituting $C(i,k)$ with
$c(i,k,k)$, and substituting $C(k,j)$ with $c(k,j,j)$. The system of
recursion equations consisting of Equations (5.1) and Equation
(4.7) is a new algorithm for dynamic programming. The new
ensemble of processes $P_2 \stackrel{\mathrm{def}}{=} \{(i,j,k) : 0 < i \le k \le j \le n\}$ is
now three dimensional, as shown in Figure 2. The added
dimension and the increased number of processes are for the
purpose of decreasing fan-in/fan-out degrees. A remaining
criterion for a good parallel algorithm is that all communications
between processes defined by a system of recursion equations
should also be local. A few definitions are called for:


Figure 2: Process structure $P_2$ in which the fan-in and
fan-out degrees are constant, but in which there are still long
range communications.


Definition 5.1. The path length between two processes
$v \stackrel{\mathrm{def}}{=} (v_1, \ldots, v_m)$ and $u \stackrel{\mathrm{def}}{=} (u_1, \ldots, u_m)$ is defined to be
$|v_1 - u_1| + \cdots + |v_m - u_m|$.


Definition 5.2. A communication between two processes $u$ and
$v$, where $u \prec v$, is local if their path length is bounded by a
fixed constant.


Definition 5.3. The order of a system of recursion equations is
defined to be the maximum path length between any pair of
processes $u$ and $v$, where $u \prec v$.
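Definitions 5.1 and 5.3 translate directly into code. In the sketch below, `path_length` and `order` are names chosen for this illustration, and a system of equations is represented, as an assumption, simply by a list of `(u, v)` pairs with `u` immediately preceding `v`.

```python
def path_length(u, v):
    """Definition 5.1: sum of coordinate-wise absolute differences."""
    if len(u) != len(v):
        raise ValueError("processes must have the same dimension")
    return sum(abs(vi - ui) for ui, vi in zip(u, v))


def order(dependencies):
    """Definition 5.3: maximum path length over all pairs u -> v."""
    return max(path_length(u, v) for u, v in dependencies)


# A nearest-neighbor dependency versus a reflected one such as
# (i, j, i + j - k) -> (i, j, k), whose length grows with the interval.
local_edge = ((1, 8, 4), (1, 9, 4))
reflected_edge = ((1, 9, 8), (1, 9, 2))
```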


Similar to the high fan-in and fan-out degrees, the
extra delay introduced by the communications of the high-order
terms in the equations can undermine the efficiency of an
algorithm. We therefore require that the order of a system of
equations be either constant or grow very slowly with respect
to the problem size.


5.1. Localizing communications by domain contraction 

Note that Equation (4.7) contains terms $a(i,j,i+j-k)$
and $b(i,j,i+j-k)$, which in turn immediately depend on
processes $a(i,j-1,i+j-1-k)$ and $b(i+1,j,i+1+j-k)$
by Equations (4.3). The order of the system of equations is
therefore $2(|k + 1 - \frac{i+j}{2}|)$ for $i < k < j$, which grows with
the problem size $n$. To eliminate the high order terms, further
transformations are required.

The path length $2(|k + 1 - \frac{i+j}{2}|)$ between two
communicating processes is a linear function of $i$, $j$ and $k$. The goal
is to transform this linear function into a constant one. The
most obvious choice is to set $k = \frac{i+j}{2}$, in which case the path
length becomes 2. Geometrically, we can see that the
coordinates of the high order terms above are symmetric to those of
the low-order terms $a(i,j,k)$ and $b(i,j,k)$, with respect to the
plane $k = (i+j)/2$, shown in Figure 2 as a shaded plane. The
algebraic transformation now corresponds to the contraction of
the original domain of processes by one half along the plane, so
that $(i,j,k)$ and $(i,j,i+j-k)$ become a single process $(i,j,k)$.

Let processes $(i,j,k)$, such that $m \le k \le j$, be referred to
as being in the upper half of the process structure $P_2$. Similarly,
let those processes such that $i \le k \le m'$ be referred to as being
in the lower half of $P_2$.

Let the new process structure be the upper half of $P_2$, i.e.,
$P_3 \stackrel{\mathrm{def}}{=} \{(i,j,k) : 0 < i < j \le n \text{ and } m \le k \le j\}$, as shown in
Figure 3. The data streams $a$ and $b$ of those processes originally
residing in the lower half must be given new names so that they
can be differentiated from those of the processes originally in
the upper half. To achieve this, let $d$ and $e$ be new data streams
to replace $a$ and $b$ for the processes of the lower half as shown
in Figure 3.
Figure 3: Process structure $P_3$, in which all data
dependencies are local.


5.2. Algebraic manipulation to obtain low order equations 
First, Equation (5.1), defined for $(i,j,k)$, $i \le k \le j$, is now
split into two sets of equations, one for the case
$m \stackrel{\mathrm{def}}{=} \lceil (i+j)/2 \rceil \le k \le j$ where $(i,j,k)$ resides in the upper half of $P_2$ and the
other for the case $i \le k \le m' \stackrel{\mathrm{def}}{=} \lfloor (i+j)/2 \rfloor$ where it resides in the
lower half. Since, for those processes $(i,j,k)$ that reside in the
lower half, $a$ and $b$ will be renamed by $d$ and $e$, we must be
aware of the case when a process $(i,j,k)$ is in the lower half of
$P_2$ while process $(i,j-1,k)$ might be in the upper half, such
as in the case when $k = m'$ and $i+j$ is even (or $s \stackrel{\mathrm{def}}{=} j-i-1$ is
odd). Another case to watch for is when $(i,j,k)$ is in the upper
half while $(i+1,j,k)$ might be in the lower half; it occurs when
$k = m$ and $i+j$ is even (or $s$ is odd). The two equations in (5.1)
are now split into four equations, where the above two cases
are singled out:


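\[
a(i,j,k) =
\begin{cases}
\text{upper half:} \\
\quad j = k \rightarrow c(i,k,k) \\
\quad m \le k < j \rightarrow a(i,j-1,k) \\
\text{lower half:} \\
\quad (k = m') \land \mathrm{odd}(s) \rightarrow a(i,j-1,k) \quad \text{(upper half)} \\
\quad (k = m') \land \mathrm{even}(s) \rightarrow a(i,j-1,k) \\
\quad i \le k < m' \rightarrow a(i,j-1,k)
\end{cases}
\]

\[
b(i,j,k) =
\begin{cases}
\text{upper half:} \\
\quad (k = m) \land \mathrm{odd}(s) \rightarrow b(i+1,j,k) \quad \text{(lower half)} \\
\quad (k = m) \land \mathrm{even}(s) \rightarrow b(i+1,j,k) \\
\quad m < k \le j \rightarrow b(i+1,j,k) \\
\text{lower half:} \\
\quad k = i \rightarrow c(k,j,j) \\
\quad i < k \le m' \rightarrow b(i+1,j,k)
\end{cases}
\tag{5.2}
\]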

Next, every $a(i,j,k)$ in the lower half is replaced by
$d(i,j,i+j-k)$, where $(i,j,i+j-k)$ is now in the upper half. Similarly,
every $b(i,j,k)$ in the lower half is replaced by $e(i,j,i+j-k)$.
For $(i,j,k)$ in the lower half:

\[
d(i,j,i+j-k) =
\begin{cases}
(k = m') \land \mathrm{odd}(s) \rightarrow a(i,j-1,k) \quad \text{(upper half)} \\
(k = m') \land \mathrm{even}(s) \rightarrow d(i,j-1,i+(j-1)-k) \\
i \le k < m' \rightarrow d(i,j-1,i+(j-1)-k)
\end{cases}
\]

\[
e(i,j,i+j-k) =
\begin{cases}
k = i \rightarrow c(k,j,j) \\
i < k \le m' \rightarrow e(i+1,j,(i+1)+j-k)
\end{cases}
\tag{5.3}
\]
Finally, a “substitution of variable” is performed on the
above two equations, which replaces $i + j - k$ by $k'$:

\[
d(i,j,k') =
\begin{cases}
(k' = m) \land \mathrm{odd}(s) \rightarrow a(i,j-1,i+j-k') \\
(k' = m) \land \mathrm{even}(s) \rightarrow d(i,j-1,k'-1) \\
m < k' \le j \rightarrow d(i,j-1,k'-1)
\end{cases}
\]

\[
e(i,j,k') =
\begin{cases}
k' = j \rightarrow c(i,j,j) \\
m \le k' < j \rightarrow e(i+1,j,k'+1)
\end{cases}
\tag{5.4}
\]

Since $k'$ is a bound variable, replace it by $k$ throughout
Equations (5.4).
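5.3. Resulting System of Equations
The resulting algorithm is the following system of
recursion equations:

\[
\begin{aligned}
a(i,j,k) &=
\begin{cases}
j = k \rightarrow c(i,k,k) \\
m \le k < j \rightarrow a(i,j-1,k)
\end{cases}
\\
d(i,j,k) &=
\begin{cases}
(k = m) \land \mathrm{odd}(s) \rightarrow a(i,j-1,k-1) \\
(k = m) \land \mathrm{even}(s) \rightarrow d(i,j-1,k-1) \\
m < k \le j \rightarrow d(i,j-1,k-1)
\end{cases}
\\
b(i,j,k) &=
\begin{cases}
(k = m) \land \mathrm{odd}(s) \rightarrow e(i+1,j,k+1) \\
(k = m) \land \mathrm{even}(s) \rightarrow b(i+1,j,k) \\
m < k \le j \rightarrow b(i+1,j,k)
\end{cases}
\\
e(i,j,k) &=
\begin{cases}
k = j \rightarrow c(i,j,j) \\
m \le k < j \rightarrow e(i+1,j,k+1)
\end{cases}
\\
c(i,j,k) &=
\begin{cases}
s = 0 \rightarrow C_i \\
s > 0 \rightarrow
\begin{cases}
k = m \rightarrow \min\bigl(F(a(i,j,k), b(i,j,k)),\; F(d(i,j,k), e(i,j,k))\bigr) \\
m < k < j \rightarrow \min\bigl(c(i,j,k-1),\; F(a(i,j,k), b(i,j,k)),\; F(d(i,j,k), e(i,j,k))\bigr) \\
k = j \rightarrow c(i,j,k-1)
\end{cases}
\end{cases}
\end{aligned}
\tag{5.5}
\]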
6. Space-time Mapping to Incorporate Pipelining


From the algorithm (5.5) derived above, we further
improve the efficiency of the algorithm by incorporating
pipelining automatically. From the standpoint of implementation, a
process $v$ in a system of recursion equations will be mapped to
some physical processor $s$ during execution, and once the
process is terminated, another process can be mapped to the same
processor. In fact, such re-use of the resource is the essence
of pipelining. We call each execution of a process by a
processor an invocation of the processor. Let $t$ be an index for
labeling the invocations so that the processes executed in the
same processor can be differentiated, and let these invocations
be labeled by strictly increasing non-negative integers. Then
for a given implementation of a program, each process $v$ has an
alias $[s, t] \stackrel{\mathrm{def}}{=} f(v)$, telling when (which invocation) and where
(in which processor) it is executed. The key to an efficient
parallel implementation of an algorithm is to find an appropriate
one-to-one function $f$ that maps a process $v$ to its alias $[s, t]$
such that $t$ will be non-negative and $t_2 > t_1$ if $v_1 \prec v_2$, where
$[s_1, t_1] \stackrel{\mathrm{def}}{=} f(v_1)$ and $[s_2, t_2] \stackrel{\mathrm{def}}{=} f(v_2)$.
6.1. Data dependency vectors 


To find such a mapping, we require that the domains of the
recursion variables of a Crystal program be vector spaces, and
that the appropriate vector addition and scalar-vector product
be defined. Though each of the recursion variables $i$ and $j$
in Equation (2.1) assumes an integer value, its domain can be
extended to the set of rationals. For instance, an $m$-dimensional
process structure is now embedded in an $m$-dimensional vector
space over the rationals. From now on, we may refer to the
$m$-tuple of values of recursion variables identified with a process as
a vector. As suggested by [12], a data dependency vector plays
an important role in mapping algorithms to parallel processors.
Since the vector addition is defined and the data dependency
of two vectors takes on an exact meaning, a data dependency
vector can be formally defined:


Definition 6.1. A data-dependency vector is the difference $v - u$
of vector $v$ and vector $u$, where $u \prec v$ ($u$ immediately precedes
$v$).


6.2. Basis communication vectors 
As described above, each program defines a set of data
dependency vectors. On the other hand, each network topology
defines a set of basis communication vectors. For instance, in an
$n$-dimensional hypercube, a processor has $n$ connections to its
nearest neighboring processors. Each of the $n$ communication
vectors (one for each connection) has $n + 1$ components. The
first $n$ indicate the movement in space and the last one
indicates movement in time, which is always positive (counting
invocations). These $n$ communication vectors, together with
the communication vector $[0, 0, \ldots, 0, 1]$, representing the
processor's communication of its current state to its next state,
form the basis communication vectors. In an $n$-dimensional
network, there can be more than one set of basis
communication vectors. Taking a two-dimensional hexagonal network as
an example, a diagonal connection has a communication
vector $[1,1,1]$. The set of vectors $\{[1,0,1], [0,1,1], [1,1,1]\}$ serves
as a basis, as does the set $\{[1,0,1], [0,1,1], [0,0,1]\}$. For any
$n$-dimensional network which has nearest neighbor connections
and is regularly connected and indefinitely extensible, all
possible sets of basis communication vectors can be obtained by
the enumeration of its symmetry groups (Lin and Mead [10]).


6.3. Uniformity of a parallel algorithm 

The concept of uniformity is introduced to characterize
parallel algorithms so that an expedient procedure can be
applied to a uniform algorithm to find the space-time mapping
from processes to processors. Let $d_i$, $1 \le i \le k$, denote the data
dependency vectors appearing in a program.


Definition 6.2. A uniform algorithm is one in which a single
set of basis vectors $b_j$, $1 \le j \le m$, can be chosen so as to
satisfy the mapping condition that every data dependency
vector $d_i$ appearing in the algorithm can be expressed as a linear
combination of the chosen basis vectors with non-negative
coordinates; in other words, there exist $b_j$ and $a_j \ge 0$, $1 \le j \le m$,
such that

\[ d_i = a_1 b_1 + a_2 b_2 + \cdots + a_m b_m, \quad \text{for } 1 \le i \le k. \]


6.4. Motivation for the mapping condition 

The mapping condition is motivated by the possibility
of using as the space-time mapping a linear transform from
the basis data dependency vectors to the basis
communication vectors. Each basis communication vector $e_j$ corresponds
to a nearest neighbor communication on a network of
processors, and its time component is always 1. When the
mapping between the two sets of basis vectors is determined, then
the communication vector $c_i$ to which a data dependency
vector $d_i$ corresponds is also determined, i.e., $c_i = \sum_{j=1}^{m} a_j e_j$ if
$d_i = \sum_{j=1}^{m} a_j b_j$.

If a data dependency vector has a term with a negative
coordinate $a_j$ in its linear combination, then $a_j e_j$ represents
a communication that takes negative time steps, a situation
that is not feasible in any physical implementation. In the case
of a non-uniform program, more than one set of basis data
dependency vectors must be chosen so as to satisfy the mapping
condition, and the space-time mapping may become non-linear.
In this case, the inductive mapping procedure becomes
necessary.

Examples of using such a simple procedure to find linear 
mappings of processes to parallel architectures for matrix prod- 
ucts, LU decomposition, array multipliers, etc., can be found 
in [2, 3]. Most of the systolic algorithms reported in the lit- 
erature can be obtained this way, and new systolic algorithms 
are discovered due to the ability to generate systematically all 
possible sets of basis communication vectors. 



6.5. A uniform dynamic programming algorithm 
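Let $d_i$, $i = 1, 2, \ldots, 5$, denote the data dependency vectors
of System (5.5):

\[
d_1 \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\quad
d_2 \stackrel{\mathrm{def}}{=} \begin{pmatrix} -1 \\ 0 \\ -1 \end{pmatrix},\quad
d_3 \stackrel{\mathrm{def}}{=} \begin{pmatrix} -1 \\ 0 \\ 0 \end{pmatrix},\quad
d_4 \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix},\quad
d_5 \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.
\tag{6.1}
\]

Let the following vectors be the chosen basis data dependency
vectors:

\[
z_1 = \begin{pmatrix} -1 \\ 0 \\ -1 \end{pmatrix},\quad
z_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\quad
z_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix},
\]

and let matrix $Z$ be defined as $Z = (z_1, z_2, z_3)$, then
$d_i = Z a_i$, where

\[
a_1 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix},\quad
a_2 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix},\quad
a_3 = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix},\quad
a_4 = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix},\quad
a_5 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.
\]

Note that every component of $a_i$ is non-negative, thus System
(5.5) of dynamic programming is a uniform algorithm. Choose
the set of basis communication vectors as

\[
c_1 \stackrel{\mathrm{def}}{=} \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix},\quad
c_2 \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix},\quad
c_3 \stackrel{\mathrm{def}}{=} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix},\quad
\text{and } C \stackrel{\mathrm{def}}{=} (c_1, c_2, c_3).
\tag{6.2}
\]

The linear mapping $T$ from the basis dependency vectors to the
basis communication vectors is then

\[
T = C Z^{-1} =
\begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ -2 & 1 & 1 \end{pmatrix}.
\]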
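The computation of $T$ and the uniformity check can be reproduced with exact rational arithmetic. The following is a sketch rather than part of the original design flow: the helpers `inverse`, `mat_mul` and `mat_vec` are written here for illustration, and the matrices are entered row by row from the column vectors $z_1, z_2, z_3$ and $c_1, c_2, c_3$ of this section.

```python
from fractions import Fraction


def inverse(M):
    """Gauss-Jordan inverse of a small square matrix over the rationals."""
    n = len(M)
    A = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
         for r, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# Columns of Z are z1, z2, z3; columns of C are c1, c2, c3.
Z = [[-1, 0, 0], [0, 1, 0], [-1, 0, 1]]
C = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
D = [(0, 1, 0), (-1, 0, -1), (-1, 0, 0), (0, 1, 1), (0, 0, 1)]  # d1..d5

Z_inv = inverse(Z)
T = mat_mul(C, Z_inv)
coords = [mat_vec(Z_inv, d) for d in D]   # a_i with d_i = Z a_i
images = [mat_vec(T, d) for d in D]       # communication vector of each d_i
```

Every coordinate vector is non-negative, and the last (time) component of each image is at least 1, so each dependency is served by a communication that moves forward in time.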

Proposition 6.1. The space-time mapping from each process
$(i,j,k)$ in $P_3$ of System (5.5) to an invocation of a processor
$[x, y, t]$ is a linear mapping $T$, in the matrix notation where
each process and invocation is written as a column vector, or
written in a functional notation as

\[ [x, y, t] = f(i,j,k) = [-i,\; j,\; -2i + j + k], \quad \text{where } -n < x \le -1,\; 1 < y \le n,\; 2 \le t \le 2(n-1). \]

The inverse mapping from the image of $f$ to the process
structure, denoted by $f^{-1}$, is

\[ (i,j,k) = f^{-1}(x,y,t) = (-x,\; y,\; t - 2x - y), \]

specifying which process is being executed at a particular time
step $t$ of processor $(x,y)$.
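The properties required of a space-time mapping in Section 6 can be checked by brute force on a small instance. In this sketch the function names `f`, `f_inv` and `P3` are chosen here for illustration; the inverse is obtained by solving the three components of $f$ for $(i,j,k)$, and $P_3$ is enumerated with $m = \lceil (i+j)/2 \rceil \le k \le j$ as in Section 5.1.

```python
def f(i, j, k):
    """Space-time mapping of Proposition 6.1: process -> [x, y, t]."""
    return (-i, j, -2 * i + j + k)


def f_inv(x, y, t):
    """Inverse of f, solved component by component."""
    return (-x, y, t - 2 * x - y)


def P3(n):
    """Processes (i, j, k) with 0 < i < j <= n and ceil((i+j)/2) <= k <= j."""
    for i in range(1, n):
        for j in range(i + 1, n + 1):
            for k in range((i + j + 1) // 2, j + 1):
                yield (i, j, k)


n = 6
procs = list(P3(n))
aliases = [f(*p) for p in procs]
times = [t for (_, _, t) in aliases]
```

Distinct processes receive distinct aliases, so no processor is asked to execute two processes in the same invocation.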


Similarly, other space-time mappings can be derived from
other sets of basis communication vectors, which are given
below in a matrix notation with each basis communication vector
written as a column vector:

\[
\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},\quad
\begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ 1 & 1 & 1 \end{pmatrix},\quad
\begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}.
\]

The set of basis communication vectors in Equation (6.2), or in
each of the above sets of vectors, results in a different systolic
architecture, as demonstrated below.


6.6. Space-time Recursion Equations (STREQ) 

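A system of space-time recursion equations (STREQ), which
describes a target systolic architecture or algorithm, can be
obtained from an original program and a space-time mapping
by algebraic manipulation. The following STREQ that
describe the resulting design are obtained by substituting the
inverse mapping $f^{-1}$ for $i$, $j$, and $k$, and the mappings $[x_l, y_l, t_l]$,
$l = 1, 2, \ldots, 5$, of dependency vectors $d_l$, into the original system
of recursion equations, and by renaming $\bar{a}(x,y,t) = a(i,j,k)$,
$\bar{b}(x,y,t) = b(i,j,k)$, $\bar{c}(x,y,t) = c(i,j,k)$, $\bar{d}(x,y,t) = d(i,j,k)$,
and $\bar{e}(x,y,t) = e(i,j,k)$. Note that as a result of the
substitution, the predicate $s = 0$ becomes $z = 1$, where $z \stackrel{\mathrm{def}}{=} x + y$ (that is, $j - i$ under $x = -i$, $y = j$),
$k = m$ becomes $t = \lceil 3z/2 \rceil$, $k = j$ becomes $t = 2z$,
$m < k < j$ becomes $\lceil 3z/2 \rceil < t < 2z$, etc.

\[
\begin{aligned}
\bar{a}(x,y,t) &=
\begin{cases}
t = 2z \rightarrow \bar{c}(x,y,t) \\
\lceil 3z/2 \rceil \le t < 2z \rightarrow \bar{a}(x,y-1,t-1)
\end{cases}
\\
\bar{d}(x,y,t) &=
\begin{cases}
(t = \lceil 3z/2 \rceil) \land \mathrm{even}(z) \rightarrow \bar{a}(x,y-1,t-2) \\
(t = \lceil 3z/2 \rceil) \land \mathrm{odd}(z) \rightarrow \bar{d}(x,y-1,t-2) \\
\lceil 3z/2 \rceil < t \le 2z \rightarrow \bar{d}(x,y-1,t-2)
\end{cases}
\\
\bar{b}(x,y,t) &=
\begin{cases}
(t = \lceil 3z/2 \rceil) \land \mathrm{even}(z) \rightarrow \bar{e}(x-1,y,t-1) \\
(t = \lceil 3z/2 \rceil) \land \mathrm{odd}(z) \rightarrow \bar{b}(x-1,y,t-2) \\
\lceil 3z/2 \rceil < t \le 2z \rightarrow \bar{b}(x-1,y,t-2)
\end{cases}
\\
\bar{e}(x,y,t) &=
\begin{cases}
t = 2z \rightarrow \bar{c}(x,y,t) \\
\lceil 3z/2 \rceil \le t < 2z \rightarrow \bar{e}(x-1,y,t-1)
\end{cases}
\\
\bar{c}(x,y,t) &=
\begin{cases}
z = 1 \rightarrow C_i \quad (i = -x) \\
z > 1 \rightarrow
\begin{cases}
t = \lceil 3z/2 \rceil \rightarrow \min\bigl(F(\bar{a}(x,y,t), \bar{b}(x,y,t)),\; F(\bar{d}(x,y,t), \bar{e}(x,y,t))\bigr) \\
\lceil 3z/2 \rceil < t < 2z \rightarrow \min\bigl(\bar{c}(x,y,t-1),\; F(\bar{a}(x,y,t), \bar{b}(x,y,t)),\; F(\bar{d}(x,y,t), \bar{e}(x,y,t))\bigr) \\
t = 2z \rightarrow \bar{c}(x,y,t-1)
\end{cases}
\end{cases}
\end{aligned}
\tag{6.3}
\]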


7. Target Systolic Architecture 


7.1. Processor and time requirements 

The design consists of a triangular array of $(n-1)(n-2)/2$
processors which complete the computation in $2(n-2)$ time
steps, as indicated by the range of $x$, $y$ and $t$ in Proposition 6.1.
Notice that the space-time mapping has achieved the goal of
decreasing the complexity of the processor requirement from $O(n^3)$
in the naive interpretation of System (5.5) to $O(n^2)$ in the
new algorithm, while increasing the time complexity only by
a constant factor, in this case 2. Thus we see pipelining in
the resulting algorithm and the successful sharing of a single
processor among $O(n)$ processes.


7.2. I/O and storage requirements 

In the resulting algorithm, if any of the space components
of a process on the left-hand side of an equation, which is the
identifier of a processor, differs from that on its right-hand side
(the identifier of another processor), then the corresponding
datum is an input to the processor, such as $\bar{a}[x, y-1, t-1]$
in Equation (6.3), which is from the processor below it and
which takes 1 time step to arrive. If the processor identifiers on
both sides of an equation are the same, then the corresponding
datum is a stored value, such as $\bar{c}[x, y, t-1]$ in Equation (6.3).

The space-time mapping in Proposition 6.1 yields the
well-known systolic architecture for dynamic programming [7]. It
can be seen in Figure 4 that a process $(i,j,k)$ is mapped to
processor $(-i, j)$ at time step $t = -2i + j + k$. Such an
invocation may have inputs from invocations both at time $t - 1$ and
$t - 2$, corresponding to the fast registers and the slow registers
described in [7]. All signals controlling the loading and
unloading between registers of different speed can be systematically
derived by techniques of compiler optimization as described
below.


Figure 4: A series of parallel planes indicates the time steps
of the space-time mapping. Processes on the same plane are
mapped to the same invocation number $t$. (The processor
number of a process in this case is its projection onto the
bottom plane with the sign of $x$ changed).


7.3. Control requirements

The predicates in the STREQ are implemented differently
depending on the target implementation medium. In a
multiprocessor machine, the “processor id” (x, y) can be stored and
an “invocation counter” keeps track of t until a condition such
as t = 2x is satisfied. In a VLSI implementation, however, it is
too costly to store these integers and dedicate hardware to
perform the tests. The following optimization of STREQ becomes
necessary.


7.4. Optimization performed on STREQ 

A better design can be obtained by replacing the expensive
computations of the predicates with the transfer of a one-bit control
signal, as illustrated by [7]. For a predicate that is independent
of t, there is no concern, since it can be hardwired into the
design. Any predicate that is dependent on t must be
replaced by one that is independent of t. Since a communication
always both moves in space and takes time to complete,
it can be used to “compute” expressions of the space-time
recursion variables. For instance, the expression t = x − 1 can be
computed by a data stream q which carries the values “true” and
“false”:


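\[
q(x, t) =
\begin{cases}
(t = 0) \land (x = 1) \rightarrow \text{true} & \text{(establish the truth initially for the test)} \\
(t > 0) \land (x = 1) \rightarrow \text{false} \\
(t > 0) \land (x > 1) \rightarrow q(x - 1, t - 1)
\end{cases}
\]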
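As a rough check of the idea, the one-bit signal q(x, t) defined above can be simulated and compared against the predicate t = x − 1 that it is meant to replace. The sketch below is only illustrative: the function names are hypothetical, and the value at cells x > 1 for t = 0 is not given by the definition, so it is assumed here to be false.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def q(x: int, t: int) -> bool:
    """One-bit control signal seen at cell x (x >= 1) at time step t (t >= 0)."""
    if x == 1:
        # "true" is injected at cell 1 at time 0 and "false" at every later step.
        return t == 0
    if t == 0:
        # Assumption (not stated in the definition): cells x > 1 start out false.
        return False
    # The bit moves one cell to the right per time step.
    return q(x - 1, t - 1)


def signal_matches_predicate(max_x: int, max_t: int) -> bool:
    """Compare the travelling signal with a direct evaluation of t = x - 1."""
    return all(
        q(x, t) == (t == x - 1)
        for x in range(1, max_x + 1)
        for t in range(max_t + 1)
    )
```

No cell stores x or counts t here; each cell only forwards the bit received from its neighbour, which is the trade of computation for communication that the text describes.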


It can easily be seen that such a signal can be used to compute
any predicate that is linear with respect to the recursion
variables. Of all the control signals, the one to implement the test
of predicate t = [%] is of particular interest. The predicate 
contains a piece-wise linear expression, and the resulting signal
travels through those processors with even s at half the speed
at which it travels through those with odd s. This particular
type of signal is described incorrectly in [7] as a linear signal that
moves “two cells every three time units”, as opposed to
the above-mentioned piece-wise linear signal.

Central to an optimizing compiler for parallel systems is 
the making of trade-offs between communications and com- 
putations, which can be carried out symbolically in quite an 
elegant fashion, as illustrated here. 


8. Related Work 


Many attempts at systematic methods for designing
systolic computations have appeared in the literature.
Classifying systolic designs from the geometric point
of view, Lin and Mead [10] use symmetry groups,
but make no attempt to obtain new designs.
Cappello and Steiglitz [1] view systolic designs as affine
transforms of combinational designs; however, their method does
not give clues to finding the appropriate transforms.

Moldovan [12, 13] notes that data dependency vectors are
what suggest clues to potential linear transforms. Miranker
and Winkler [11] describe a similar method, and they have
observed that all 2-dimensional regularly connected arrays can be
embedded in a hexagonal array, called a universal array.
Quinton [14] starts with a specification called a system of uniform
recurrence equations and applies affine and linear transforms to
obtain a systolic array. The class of systolic computations that
can be described by uniform recurrence equations is limited to
those that have only one of their data streams dependent upon
other data streams. Li and Wah [8] describe a method that
first finds an appropriate placement of one of the input streams
and then performs the linear transforms. Delosme and Ipsen [6]
describe how to implement a hyperbolic Cholesky solver by a
linear transformation from a system of recurrences.

All of the above four methods use (quasi-)linear or affine
mappings from a restricted class of algorithms to a new design.
The class of designs that can be obtained by these methods is
thus subject to restrictions such as being uniform recurrent (a more
restrictive definition than the uniformity defined in this paper),
synchronous, regular, and having uniform data flow. The
capabilities of these methods are, at best, equal to the mapping
procedure for the uniform algorithms described above. However,
instead of using a simple, constant-time procedure consisting
of enumerating the basis communication vectors by subgroup
enumerations and finding mappings between the two sets of basis
vectors, they find linear mappings by searching through the
space of possible linear transforms or input placements. The
search requires time at least polynomial in the size of the
problem [8]. None of these methods is able to deal with the
problem of the large number of fan-ins and fan-outs.

Li and Wah [9] have classified the problems that can be
solved by dynamic programming into several classes and have 
proposed systolic processors for solving them. The formulation 
given in this paper corresponds to the most general class — 
in their terminology, the polyadic-nonserial class. In their 
treatment, however, the issue of synchronization and timing, 
which makes the systolic design in [7] particularly interesting, 
is not addressed. 


9. Concluding Remarks 


The objective of this work is to seek parallel programs
that effectively utilize underlying technologies. Due to
the complexity that might arise in dealing with hundreds of
thousands of autonomous parallel processing elements, we find
a design methodology for this process necessary in order to take
the enormous burden off the designers. Program synthesis, at
the very basic level, relies on a formal notation for describing
the problem, and thence on manipulating the descriptions
to yield efficient parallel programs. This paper describes two
stages of transformations of programs described in the parallel
language Crystal. The first stage proceeds from a definition
of a problem to a program that has fixed fan-in and fan-out
degrees and uses only local communications. The second stage
incorporates pipelining into this program by space-time
mapping. Optimization of control signals is also illustrated.

It is essential that the language Crystal be amenable to
algebraic manipulation; hence, techniques developed for
algorithm transformation can be carried out formally, and
potentially by an automatic process. The procedure of space-time
mapping for uniform algorithms and the test for the existence
of a linear mapping are computationally attractive, and their
complexities are independent of the problem sizes.


Abstract


As an alternative to using special-purpose hardware to 
accelerate the fault simulation process for digital logic design, a 
new approach is proposed to perform fault simulation using 
multiple workstations or processors in parallel. By dividing 
the faults equally among those processors, it is possible to 
achieve a linear performance improvement for circuits of 
9,000 gates and up. The empirical results obtained from
implementation of this approach on ELXSI-6400
super-minicomputers and Apollo workstations corroborate the
expectation.


1. Overview 


The increasing complexity of digital systems has forced digital 
designers to use sophisticated computer-aided design (CAD) 
tools. One of the most widely used ([1]) classes of CAD tools is
digital logic simulators which are used both for verifying the 
behavior of conceptual designs of digital systems and for 
developing tests for digital systems. These simulators are 
commonly referred to as verification simulators or fault 
simulators ([2]) depending upon the intended application. 


There has been an increasing trend ([3][4][5]) towards applying 
parallel processing to accelerate logic simulators. Usually, a 
special hardware engine is proposed and built to implement the 
simulator algorithm. Hardwiring or microcoding one specific 
algorithm may increase the performance from ten to one 
hundred times. The major drawbacks to this approach are the 
extra hardware cost and the difficulty of interfacing with other 
tools. Sometimes the performance gained through one step may 
be lost in the whole design procedure. In this paper, we shall 
describe one method of applying parallel processing to achieve 
acceleration of fault simulation. Our method permits us to use a
large number of low cost general purpose computers in a
parallel processing mode to achieve very high speed fault
simulation at very low cost compared to the use of hardware
accelerators or expensive mainframe computers. As
distinguished from the application of other parallel processing
techniques ([3][4][5]) applied to digital simulation, our
technique exploits general purpose computers rather than
special purpose hardware. Empirical results show that even
with a large number of processors, we are able to achieve
nearly linear acceleration of fault simulation as a function of the
number of processors. With "N" processors, each operating at
1 Mips, we can achieve close to the performance obtainable
from a single processor with a processing power of "N" Mips.


2. Status of Fault Simulator

Fault simulators are used for developing manufacturing and field
service tests with a particular view to detection and diagnosis
of defects in digital systems. They are used for evaluating the
effectiveness of candidate tests in screening digital systems for
defects. The same simulators can also be used for developing
diagnostic information that is used in conjunction with the tests
to help locate defects caused in manufacturing or defects that
develop in the field. The diagnostic information developed from
the use of fault simulators can then be used in the repair of
defective products as well as in the adjustment of the
manufacturing process.


Defects in digital systems can be quite varied ([6]), e.g. shorts,
opens, pinholes in oxide, improper alignments, etc. However,
it becomes impractical to simulate this diversity of defects 
when applying a fault simulator to digital systems of any 
reasonable complexity. The classical stuck fault model is most 
commonly used in conjunction with fault simulators since single 
stuck faults are relatively easy to simulate and empirical 
evidence suggests that this model is highly effective in practice 
when applied to digital systems. Since a large number of single 
stuck faults can exist in a digital system, the computer 
resources needed to simulate these faults for very large digital 
systems can be excessive ([14]). 


Innovations in fault simulation algorithms have made it practical 
to apply fault simulation to digital systems consisting of tens of 
thousands of logic gates. Starting with serial fault simulation 
([7]) involving simulation of one single stuck fault at one time, 
there has been an evolution of techniques that permit simulation 
of several faults at one time. Parallel fault simulators ([8])
enable a fixed number of faults to be simulated in parallel,
typically 31 faults at one time for a computer with 32 bit
words. Deductive ([9]) and concurrent ([10]) fault simulators
permit simultaneous simulation of a much larger number of
faults even more efficiently ([14]) than with parallel fault
simulators.


More recently there have been significant developments in
application of statistical ([11]) and approximate ([12]) methods
to achieve additional speed-up in the fault simulation process.
There is clear recognition of the value of approximate and
statistical methods of fault simulation in the design and test of
digital systems. However, the issue of whether or not one
needs accurate fault simulation is very much open ([13]); it is
subject to intense controversy and depends on the application
at hand and, as such, will not be elaborated upon further in this
paper.


Digital systems being designed today are large enough to create
compute-time bottlenecks for accurate fault simulation ([14])
even on the most advanced computers. Rigorous design for
testability techniques help significantly to alleviate these
bottlenecks. In spite of this, few large digital systems are
being designed with testability enforced rigorously. Hence,
there is a continuing need to accelerate fault simulators.


3. Acceleration of Fault Simulation

Special purpose computers ([3][4][5]) have been developed
expressly for accelerating verification simulation and extended
to accelerate fault simulation as well. These special purpose
computers, commonly referred to as hardware accelerators,
are just beginning to enter the marketplace for fault simulation.
Current hardware accelerators exploit special purpose
hardware as well as micro-code to directly implement the
simulation algorithms.


Another way to improve simulation speed is to use several 
general purpose processors to execute the same simulation 
algorithm. Rather than each processor simulating a piece of the 
algorithm, each processor can apply the complete algorithm on 
separate pieces of data. For logic simulation, each processor 
can simulate part of the circuit ([3]). However, it is not 
practical to obtain a linear performance improvement with an 
increasing number of processors by this method since there is 
no way to ensure an even distribution of simulation events 
across the multiple processors. In fact, the inter-processor 
communication overhead can quickly overtake the benefits 
realized from parallelism. A second approach is to let parallel 
processors simulate the entire circuit with independent subsets 
of the stimulus vectors. This is feasible only when the set of
stimulus vectors is such that independent subsets of stimulus
vectors can be identified with each subset not dependent on the
circuit state created by preceding stimulus vectors. In general,
without explicit designer assistance, it would be infeasible to
attempt this form of stimulus vector partitioning for sequential
logic circuits. In combinational logic circuits, however, each
stimulus vector can be viewed as being completely independent
of the preceding stimulus vectors. Hence, the technique of
parallel processing of vectors is quite viable for combinational
logic circuits. In a recent publication ([15]), this technique has
been applied by exploiting the bit-wise parallelism inherent
within all general purpose processors. In [15], 256 stimulus
vectors at a time are simulated in parallel with serial fault
simulation to achieve significant speed-up in accurate fault
simulation for large combinational logic circuits. The same 
publication also cited an application of the same approach to 
verification simulation of scan design logic which, for most 
verification purposes, can be treated in a manner similar to 
combinational logic circuits. 


Fault simulation lends itself to multi-processing even more so
than verification simulation. Besides a plurality of gates and
stimulus vectors, fault simulation also has another dimension:
a plurality of faults. The time used by a fault simulator can be
summarized by the following formula:


T = TP + TG·P + F·TF·P/2
where: 


TP is the time spent for the model building and the 
post-processing, 


P is the total number of patterns to be simulated, 


TG is the average time to simulate the good machine for one 
pattern, 


F is the total number of faults to be simulated, 


TF is the average time to simulate one faulty machine for one 
pattern. 


Here we assume that a fault will be dropped from further
simulation once it is detected by a pattern. By assuming all
faults are detected uniformly over these P patterns, a fault will be
simulated for only P/2 patterns on average.


Usually TF is about 0.01 to 0.1 of TG for a concurrent fault
simulator. For a circuit of 5,000 gates, there are about
10,000 stuck-at faults to be simulated. Hence, the third term
in the equation is about 50 times the second for this circuit
(taking TF = 0.01 TG, the ratio F·TF/(2·TG) is 50).
This ratio will increase proportionally with the size of the
circuit.
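The time formula above can be evaluated directly to see how the faulty-machine term compares with the good-machine term. The following is a minimal sketch; the function and parameter names are illustrative and do not come from AIDSSIM.

```python
def fault_sim_time(tp: float, tg: float, tf: float, patterns: int, faults: int) -> float:
    """T = TP + TG*P + F*TF*P/2, the single-processor time model from the text."""
    good_machine = tg * patterns
    # Each fault is simulated for P/2 patterns on average because of fault dropping.
    faulty_machines = faults * tf * patterns / 2
    return tp + good_machine + faulty_machines


def faulty_to_good_ratio(tg: float, tf: float, faults: int) -> float:
    """Ratio of the third term to the second; the pattern count P cancels out."""
    return faults * tf / (2 * tg)
```

With 10,000 faults and TF between 0.01 and 0.1 of TG, `faulty_to_good_ratio` ranges from 50 to 500, so the lower end of the stated TF range gives the factor of about 50 quoted in the text.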

In a multi-processor computer, each processor can simulate a
subset of the total fault set. At the end of the simulation, one
processor can merge the data generated by the other
processors and produce the combined simulation results for
viewing by the designer. During simulation time, very limited
communication is needed between the parallel processors. In
this way, each processor will spend about 1/Nth of the third
term on faulty machine simulation. As such, one would expect
to achieve an almost linear performance improvement with
increased numbers of parallel processors, even up to hundreds
of processors for a 50,000-gate fault simulation. Results
achieved by us from a practical implementation of these
concepts corroborate the expectation.
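The expectation of near-linear improvement can be illustrated with the same time model, dividing only the faulty-machine term by the number of processors N. This is a sketch under stated assumptions: every processor is assumed to pay the full TP and good-machine cost, communication is ignored, and the names are hypothetical.

```python
def parallel_time(tp: float, tg: float, tf: float,
                  patterns: int, faults: int, n: int) -> float:
    """Elapsed-time estimate when N processors each simulate F/N faults.

    Assumption: TP and the good-machine term TG*P are not divided among
    processors; only the faulty-machine term F*TF*P/2 is.
    """
    return tp + tg * patterns + (faults / n) * tf * patterns / 2


def speedup(tp: float, tg: float, tf: float,
            patterns: int, faults: int, n: int) -> float:
    """Ratio of the single-processor estimate to the N-processor estimate."""
    one = parallel_time(tp, tg, tf, patterns, faults, 1)
    many = parallel_time(tp, tg, tf, patterns, faults, n)
    return one / many
```

Under these assumptions the speed-up stays below N, and it approaches N more closely as the faulty-machine term grows relative to the other two, which is why larger circuits are expected to scale to more processors.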


4. AIDSSIM-A: Accelerated Fault Simulation With Parallel
Processing


AIDSSIM ([16]) is a concurrent fault simulator which has been
used in the industry for several years. Based on AIDSSIM, a
new multi-processor fault simulator, AIDSSIM-A, has been
implemented on ELXSI super-minicomputers and on a
multi-processor Apollo Domain network. The basic algorithm
can be summarized as follows:


Step 1. 
The main processor, called parent processor, accesses model 
and stimulus vectors for the circuit to be simulated. 


Step 2. 
The parent processor activates several children processors 
while passing along the circuit information. 


Step 3. 

Each processor (both parent and child) selects an independent 
non-overlapping subset of the total fault set; only one fault 
subset is to be simulated by one processor. 


Step 4. 

During fault simulation of each subset of faults, the 
corresponding processor creates the simulation data that is 
written out to unique files. 


Step 5.
When each processor finishes simulating the stimulus vectors
and its unique fault subset, it signals the parent processor and
returns summary information to the parent while terminating
its own processing.


Step 6. 
The parent processor merges the simulation data created by 
each of the children and produces the final result. 
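The six steps above can be sketched in a few lines. This is not the AIDSSIM-A implementation: threads stand in for the child processors, the callable `detects(vector, fault)` is a hypothetical stand-in for the fault simulator, and results are returned in memory rather than written to files.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_subset(detects, vectors, fault_subset):
    """Step 4 stand-in: return the faults in this subset detected by some vector."""
    return {fault for fault in fault_subset
            if any(detects(vector, fault) for vector in vectors)}


def run_parallel_fault_sim(detects, vectors, subsets):
    """Steps 2 to 6: one worker per disjoint fault subset, then a merge.

    `subsets` is the Step 3 partition of the fault set, supplied by the caller.
    """
    def worker(fault_subset):
        return simulate_subset(detects, vectors, fault_subset)

    # Steps 2, 4 and 5: start the workers, let each simulate its own subset,
    # and collect the per-worker results when they finish.
    with ThreadPoolExecutor(max_workers=max(1, len(subsets))) as pool:
        per_worker = list(pool.map(worker, subsets))

    # Step 6: the parent merges the per-worker results.
    return set().union(*per_worker)
```

Because the subsets are disjoint and the workers do not exchange data while simulating, the merged result is the same as a single-worker run over the whole fault set.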


Since fault simulators spend the majority of their time in the 
simulation of Step 4, the algorithm provides an almost linear 
reduction in elapsed time by using several processors running in 
parallel. Whether the algorithm can achieve the full linear 
performance acceleration depends on the effectiveness of the 
host multi-processing computer in reducing the communication 
overhead. The ELXSI-6400 ([17]) is a super-minicomputer with 
closely-coupled processors. In contrast, Apollo computers 
([18]) are microcomputers linked with a relatively slower 
speed inter-processor communication network. In spite of 
significant architectural differences, both the ELXSI and the 
Apollo systems show an almost linear performance gain with 
increasing number of processors while running AIDSSIM-A. 


We have experimented with different approaches to partitioning
the total fault set into subsets for parallel processing but have
not observed any striking difference from one to the other. The
results reported in this paper have been achieved by a fault set
partitioning that is quite arbitrary and is done as follows. The
fault set is assumed to have an arbitrary order. Each
successive group of "n" faults is assigned by associating one
fault to each of the "n" processors.
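The assignment just described amounts to dealing the faults out to the processors in turn. A minimal sketch, with a hypothetical function name:

```python
def deal_faults(faults, n):
    """Assign each successive group of n faults one-per-processor (round-robin)."""
    subsets = [[] for _ in range(n)]
    for index, fault in enumerate(faults):
        # Fault number `index` in the arbitrary order goes to processor index mod n.
        subsets[index % n].append(fault)
    return subsets
```

The resulting subsets are disjoint, cover the whole fault set, and differ in size by at most one fault.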


ELXSI-6400 is a multi-processor system with up to ten 
processors. The system can support up to 192 M-bytes of real 
memory. Four billion bytes of virtual memory are available per 
processor. Its GIGABUS, a synchronous, 64-bit data bus 
operating at 320 M-bytes per second, allows data sharing 
among processors with a minimum overhead. With cache 
memory, ELXSI-6400 allows memory segments to be shared 
among processors without any noticeable overhead. Taking this
into consideration, AIDSSIM-A implementation on the ELXSI
shares the circuit model's data structure between the multiple
processors. Only the single parent processor is used to 
perform good machine simulation over a small number of 
consecutive stimulus vectors. The good machine simulation 
result for each group is saved in common memory for access by 
all the children processors. When there is not enough memory 
space to hold good machine simulation data for the next group of 
stimulus vectors, the parent processor will enter a wait state. 
The parent processor's wait state is exited only when the 
children processors have processed the good machine data for a 
previous group, thereby making memory available for the new 
group. When a child processor finishes processing with the 
current good machine data, the child will also enter a wait state 
until new good machine data is available from the parent. 
Provision of shared memory to hold good machine data for up to 
10 stimulus vectors reduces the number of wait states for each 
processor. Further, to minimize wait states for the children 
processors, the parent processor is made to run faster than all 
children processors by having the parent simulate only half the 
number of faults simulated by the children processors. 
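The lighter parent load can be made concrete with a small calculation. The sketch below reads the last sentence as "the parent receives half as many faults as each child", which is an interpretation rather than something the text spells out; the function name and the rounding policy are likewise assumptions.

```python
def fault_shares(total_faults: int, n_processors: int):
    """Split faults so the parent gets about half of what each child gets.

    Returns (parent_share, list_of_child_shares). With N processors the
    parent counts as 1 unit and each of the N - 1 children as 2 units.
    """
    if n_processors == 1:
        return total_faults, []
    children = n_processors - 1
    units = 2 * children + 1
    parent = total_faults // units
    # Spread the remaining faults over the children as evenly as possible.
    base, extra = divmod(total_faults - parent, children)
    child_shares = [base + (1 if i < extra else 0) for i in range(children)]
    return parent, child_shares
```

For example, 70 faults on 4 processors give the parent 10 faults and each of the 3 children 20.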


We have processed several circuits on an ELXSI-6400
computer. Figure 1 shows the results for a circuit containing
6600 gates. There are 23741 faults; 8404 faults are
equivalence faults. After 104 test stimulus vectors, 16031
faults are detected. The fault coverage is 67.52%. Our results
are reported from an ELXSI-6400 with 32 M-bytes of main
memory and 4 processors. Note that the results go up to 9
processors. We used virtual processors to go beyond the 4
processors on the machine that was available to us. The total
processor (cpu) time is the summation of the individual 
processors’ times. Note that the total time increases only about 
10% from one processor to nine processors. As can be seen, 
the average processor time decreases inversely with the 
number of processors. There is about a 25% difference 
between the fastest and the slowest processors running 
independent fault subsets. The effective elapsed time reduction 
has to be determined from the run time associated with the 
slowest of the parallel processors. 


6. AIDSSIM-A On Apollo 


Apollo computers are based on the Motorola 68010/68020 
microprocessors as well as on proprietary bit-slice processors. 
They are linked by the high-speed Apollo Domain network which 
allows communication data rates of up to 10 M-bits per second. 
Hundreds of Apollo computers can be linked in a network. 


Due to the slower communication speed on an Apollo network
(compared to an internal bus on an ELXSI-6400), the AIDSSIM-A
implementation on the Apollos does not transfer data between
parent and children processors during the simulation step (Step
4). At the beginning of the simulation, the parent processor will
send the model and fault data to all the children processors.
Each processor will have its local model data structure, but the
stimulus vectors file is shared. At the end of simulation, the
parent processor accesses the simulation result files of
children processors and produces the final result.


The circuit used to produce results shown in Figure 1 was also 
processed with AIDSSIM-A on an Apollo network with six 
processors. The results are shown in Figure 2. The Apollo 
processors used are DN460 or DN660, each ranging from 4 
M-bytes to 16 M-bytes of physical main memory. Since only 
six processors were available, AIDSSIM-A results in Figure 2 
are limited to a maximum of six processors. As can be seen, 
the total processor (cpu) time undergoes relatively small 
change as we increase the number of processors. For the 
circuit used, we determined the communication overhead to be 
about 5 to 10 seconds per processor. The average processor 
time also decreases inversely with increasing number of 
processors. 


7. Conclusions and Future Extensions


AIDSSIM-A has demonstrated that fault simulation can be 
accelerated almost linearly with an increasing number of 
processors in a multi-processing environment. A major 
advantage of this technique is in permitting the exploitation of a 
large number of low cost processors to provide a powerful 
cost/performance alternative to the use of special purpose 
processors or hardware accelerators. We further intend to 
gather data from running circuits with one hundred thousand 
gates. Since Apollo networks permit hundreds of processors in 
the network, it will be quite interesting to see how AIDSSIM-A 
performs with a very large number of processors. More work 
is required to permit recovery from the failure of one or more 
processors being used in the network or in a multi-processing 
environment. It would also be desirable to dynamically 
determine the existing/projected workload on various 
processors in a network and allow for migration of the 
processing workload from high utilization processors to those 
processors which have low current utilization by other users. 
In an environment where some of the processors being operated 
in parallel are of unequal compute power, it will be necessary to 
change the allocation of fault subsets and allow for the slower 
processors to be synchronized with the faster processors 
without slowing down the latter. 


References 


[1] Beresford, R., "The Role of CAE: Views from Users",
presented at Design Automation Conference (6/85),
published by CMP Publications, Manhasset, NY as part of
"CAE: The User, the Industry", 9pp.

[2] Goel, P. and Moorby, P.R., "Fault Simulation Techniques
for VLSI Circuits", VLSI Design, (7/84) 22-26pp.

[3] Pfister, G.F., "The Yorktown Simulation Engine,
Introduction", 19th Design Automation Conference,
(1982) 51-54pp.

[4] Siegel, S. and Kaszynski, M.E., "The Design of a Logic
Accelerator", VLSI Systems Design, (10/85) 76-80pp.

[5] Blank, T., "A Survey of Hardware Accelerators Used in
Computer-aided Design", Design and Test of Computers,
(8/84) 21-89pp.

[6] Timoc, C. et al, "Logic Models of Physical Failures",
Proc. of International Test Conference, (10/83)
546-553pp.

[7] Roth, J.P. et al, "Programmed Algorithms to Compute
Tests to Detect and Distinguish Between Failures in Logic
Circuits", IEEE Trans. Electron. Computers, Vol EC-16,
(10/67).

[8] Seshu, S., "On an Improved Diagnosis Program", IEEE
Trans. on Electronic Computers, Vol EC-14, (1965)
76-79pp.

[9] Chang, H.Y.P. et al, "Comparison of Parallel and
Deductive Fault Simulation Methods", IEEE Trans. on
Computers, Vol C-23, No. 11, (11/74) 1132-1139pp.

[10] Ulrich, E.G. and Baker, T., "The Concurrent Simulation of
Nearly Identical Networks", Proc. of Design Automation
Conference, (1973) 145-150pp.

[11] Jain, S.K. and Agrawal, V.D., "STAFAN: An Alternative
to Fault Simulation", Proc. of 21st Design Automation
Conference, (1984) 18-23pp.

[12] Brglez, F., "A Fast Fault Grader: Analysis and
Applications", Proc. of International Test Conference,
(1985) 785-794pp.

[13] Panel Session #21, International Test Conference,
(1985)

[14] Goel, P., "Test Generation Costs Analysis and
Projections", Proc. of 17th Design Automation Conf.,
(1980) 77-84pp.

[15] Waicukauski, J.A., et al, "A Statistical Calculation of
Fault Detection Probabilities by Fast Fault Simulation",
Proc. of International Test Conference (1985)
779-784pp.

[16] AIDSSIM User Manual, Gateway Design Automation Corp.,
Littleton, MA.

[17] ELXSI System 6400 Introduction, ELXSI, San Jose, CA.

[18] "Getting Started With Your Domain System", Apollo
Computers, Chelmsford, MA.


Figure 1. AIDSSIM-A Performance on ELXSI Multiprocessor
(Fault simulation for a 6600 gate circuit: cumulative CPU time
and average process time, in seconds, versus number of
processors employed in the ELXSI 6400 configuration.)

Figure 2. AIDSSIM-A Performance on Apollo Network
(Fault simulation for a 6600 gate circuit, running 100% fault
set: cumulative CPU time and average process time, in seconds,
versus number of Apollo DN660 nodes.)


A SLICING ALGORITHM OF CONCURRENCY MODELING
BASED ON PETRI NETS


Carl K. Chang and Huiyu Wang 


Department of Electrical Engineering and Computer Science 
University of Illinois at Chicago 
Box 4348, Chicago, IL 60680 


ABSTRACT 


Since its birth two decades ago, Petri Nets
theory has been widely applied to system modeling. In
particular, Petri Nets are highly suitable for
modeling and analyzing distributed systems due to their
inherent property of concurrency. This paper
describes an algorithm which can slice out all
concurrency sets of a Petri Nets model. The
concurrency set is defined as the set of paths in
different processes which should be executed
concurrently. Determining the concurrency sets of a
concurrent or distributed system model is very
significant in performing analysis. The time
complexity of the algorithm is O((n/N)"* ). Here n is the 
number of transitions of the model and N is the 
number of processes of the model. An example of the 
application of the algorithm in testing distributed 
software systems is also presented. 


I. INTRODUCTION 


Petri Nets is a well-known theory concerned with the
concurrency or parallelism property of different kinds of
systems in the area of computer science and other related
fields [1]. As applications of Petri Nets continue to
emerge, many variations and extensions of Petri Nets
have also been developed, such as Modified Petri Nets [2],
Discrete Time Stochastic Petri Nets [3], extended Petri Nets
[4], etc. Certainly, they provide more powerful tools for
modeling and analyzing a variety of systems.


In general, the problem of analyzing a concurrent
system is to analyze the interactions among different
processes in the system. For the Petri Nets model, the
problem is the analysis of the dynamic relations among the
transitions in different processes. Typically, firing the
transitions in one process can affect the execution of
transitions in other processes. For example, in Figure 1,
there are two paths in process 1: P11=(t11, t12, t14),
P12=(t11, t13, t14) and two paths in process 2: P21=(t21,
t22, t23), P22=(t21, t23, t24). P11 and P12, and likewise P21
and P22, are mutually exclusive paths, i.e. execution of one
path will suspend the other. In Figure 1, if path P12 in process 1
is expected to be executed, the path P21 in process 2 should
be executed concurrently; otherwise the transition t13 will
never be enabled. We define the set of paths in different
processes which should be executed concurrently as a
concurrency set. Therefore, P12 and P21 are in the same
concurrency set. Identifying concurrency sets in Petri Nets
models is critical to the analysis of these models.


Figure 1. A concurrency set (Process 1 and Process 2)


Upon the modeling of a distributed software system, 
several processes can be executed concurrently on different 
processors. In regard to the testing problem, we want to 
select a set of input data to execute these processes 
concurrently in order to verify the correctness of the 
program [5]; for example, to detect deadlock. Because there 
may exist more than one path in each process, we have to 
determine which paths should be executed concurrently (i.e. 
in the same concurrency set). To date, few research results 
have been reported in this area. 


In Section II, a slicing algorithm is presented. The
algorithm is able to slice out all concurrency sets by
examining the Petri Nets model. The complexity of the
algorithm is discussed in Section III. An example is given in
Section IV. Finally, Section V presents the concluding
remarks.


II. SLICING ALGORITHM 


Before describing our slicing algorithm, we need the 
following definitions. 

Definition 1: In a Petri Nets model, if transition t is in 
Process i and I(t), the input place of t, is in 
Process j or O(t), the output place of t, is in 
Process k, with i ≠ j or i ≠ k, then t is termed 
a communication transition. 

Definition 2: A path in a process which contains one or 
more communication transitions is termed 
a communication path. 

Definition 3: For Process i and Process j, if there exists 
t in Process i, such that O(t) is in Process 
j, then Process i and Process j are 
interrelated, or Process i has a relation with 
Process j. 


Definition 4: A concurrency set is a set of 
communication paths, which are selected 
from some interrelated processes and can 
be executed concurrently. 
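
Definitions 1 to 3 can be restated as small predicates. The following is a minimal sketch, not the authors' implementation: it assumes each transition records its own process and the processes that own its input and output places. The `Transition` class, its field names, and the three sample transitions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    name: str
    process: int              # process that owns the transition
    input_procs: frozenset    # processes owning its input places I(t)
    output_procs: frozenset   # processes owning its output places O(t)


def is_communication(t):
    """Definition 1: some input or output place lies in another process."""
    return any(p != t.process for p in t.input_procs | t.output_procs)


def is_communication_path(path):
    """Definition 2: the path contains at least one communication transition."""
    return any(is_communication(t) for t in path)


def interrelated(i, j, transitions):
    """Definition 3: some t in Process i has an output place in Process j."""
    return any(t.process == i and j in t.output_procs for t in transitions)


# Hypothetical transitions, loosely modeled on the names used in Figure 1.
t11 = Transition("t11", 1, frozenset({1}), frozenset({1}))
t13 = Transition("t13", 1, frozenset({1, 2}), frozenset({1}))
t22 = Transition("t22", 2, frozenset({2}), frozenset({1, 2}))
net = [t11, t13, t22]
```

With these sample values, `t13` and `t22` are communication transitions while `t11` is not, and Process 2 has a relation with Process 1 through `t22`.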


The slicing algorithm can slice out all concurrency sets 
in a Petri Nets model. In this algorithm, we first find a base 
path which covers at least one communication transition and 
put it into the concurrency set S. We mark those transitions 
in other processes which have relations with this base path. 
Then we scan each process to select a path which covers all 
the marked transitions. This path may generate new 
communication transitions which have relations with it in 
either the previous (been-scanned) processes or the 
succeeding (not-scanned yet) processes. If this path does not 
involve any new communication transitions having relations 
with the previous processes or these transitions are already 
in the concurrency set, then we put this path into the 
concurrency set and mark those transitions having relations 
with the succeeding processes. Otherwise, if this path 
involves new communication transitions having relations 
with a previous process, say x, then we have to give 
temporary marks to these related transitions and backtrack 
to process x to find a new path to cover both marked and 
temporarily marked transitions. If we can find such a path, 
then we replace the one already in the concurrency set by 
this new path and mark again the transitions in other 
processes. Otherwise, we erase the temporary marks and try to 
find a new path other than the old one which was already in 
the concurrency set. Afterwards, we restart the scanning 
process from x, until all processes have been scanned and a 
concurrency set has been found. This procedure is repeated 
until all communication transitions are included in some 
concurrency set. 


Here we should consider a special case in which there is 
no concurrency set in the model. For example, in Figure 2, 
transitions t_i1 and t_i2 are mutually exclusive. There exists 
only one path which includes t_i1 in Process i. t_i1 has a 
relation with t_j in Process j, and t_j has a relation with t_i2 in 
Process i. When we backtrack to Process i we can find neither 
a path which covers both t_i1 and t_i2 nor a new path which 
covers t_i1. In this case, the algorithm should terminate; 
otherwise it will enter an infinite loop. 


Details of the algorithm now follow. 


Figure 2. A special case: there is no concurrency set. 


Input:     P[1], P[2], ..., P[N] : Process 1 to Process N. 
           CS = {t | t ∈ T, t is a communication transition} 
Output:    S[1], S[2], ..., S[I] : A set of concurrency sets. 
Variables: M  = {t | t ∈ T, t has a mark} 
           TM = {t | t ∈ T, t has a temporary mark} 
           WS = {t | t ∈ CS and t is in the current process 
                 being scanned} 
           N  = number of processes in the model. 


Algorithm_Slicing (P[1],...,P[N], CS : in; S[1],...,S[I] : out) 
   for j <-- 1 to N do 
      if there exists more than one path in P[j] 
         then changable[j] <-- true 
         else changable[j] <-- false; 
   I <-- 0; terminate <-- false; 
   while CS ≠ ∅ and terminate = false do 
      I <-- I + 1; S[I] <-- ∅; 
      M <-- ∅; TM <-- ∅; 
      WS <-- ∅; WS <-- WS ∪ {t};   /* pick up a t from 
                                      CS which belongs to P[1] */ 

      Procedure_findpath (P[1], WS : in; PA : out); 
      /* input process P[1], WS to find a path PA in P[1] 
         which covers all communication transitions in WS. If 
         there is no such path, PA = ∅ */ 

      S[I] <-- S[I] ∪ PA; 
      M <-- M ∪ {t | t ∈ P[x], x ≠ 1 and t has relation with PA}; 

      Procedure_scanning (P[1], ..., P[N] : in; CS, M, 
                          TM, S[I] : in&out); 
      /* scanning all processes according to the base path 
         to find a concurrency set */ 

      CS <-- CS - CS ∩ S[I] 
   end while 
end slicing 


The procedure findpath finds a path in the given 
process such that this path covers all communication 
transitions in WS. WS is the working set which contains 
the marked communication transitions in the process to be 
dealt with. Procedure_findpath is not complicated, so 
we will not give a detailed description here. 
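
Since the paper leaves Procedure_findpath undescribed, the following is only a minimal sketch of one way it could behave. It assumes that the paths of a process have already been enumerated as tuples of transition names, and it returns `None` where the pseudocode uses PA = ∅. The `exclude` parameter is a hypothetical addition for asking for "a new path other than the old one".

```python
def findpath(paths, ws, exclude=()):
    """Return the first path that covers every transition in ws, or None.

    paths   -- list of paths, each a tuple of transition names
    ws      -- working set of transitions the path must contain
    exclude -- paths already tried, which must not be returned again
    """
    required = set(ws)
    for path in paths:
        if path not in exclude and required <= set(path):
            return path
    return None


# The two paths of process 1 in Figure 1.
process1 = [("t11", "t12", "t14"), ("t11", "t13", "t14")]
```

For example, `findpath(process1, {"t13"})` selects P12, while no path covers both `t12` and `t13`, because the two paths are mutually exclusive.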


The procedure scanning is central to the algorithm. 
Executing this procedure once, we can get a concurrency set. 
The following is Procedure_scanning: 


Procedure_scanning (P[1], ..., P[N] : in; CS, M, TM, S[I] 
                    : in&out) 
   i <-- 2; flag <-- true; 
   while i ≤ N and terminate = false do 
      if flag = true 
         then WS <-- P[i] ∩ M; 
              if WS ≠ ∅ 
                 then Procedure_findpath (P[i], WS 
                                          : in; PA : out) 
                 else i <-- i + 1 
         else PA <-- ∅ 
      end if 

      if WS ≠ ∅ and PA = ∅ 
         then i <-- No. of nearest changable process; 
              flag <-- false; 
      if WS ≠ ∅ and PA ≠ ∅ 
         then if (there is no t in previous processes which 
                  has relations with PA) 
                 or (these t are in S[I]) 
                 then S[I] <-- S[I] ∪ PA; 
                      i <-- i + 1; 
                      M <-- M ∪ {t | t ∈ P[x] (x>i), and 
                                 t has relation with PA}; 
                      flag <-- true 
                 else TM <-- TM ∪ {t | t ∈ P[x] (x<i) and 
                                   t has relation with PA}; 
                      i <-- the least No. of process x 
                            (x<i) which has relation with 
                            PA; 
                      S[I] <-- S[I] - all paths in 
                               processes j, j>i; 
                      flag <-- false 
              end if 
      end if 
      if flag = false 
         then WS <-- P[i] ∩ (M ∪ TM); 
              Procedure_findpath (P[i], WS : in; PA 
                                  : out); 
              if PA = ∅ 
                 then TM <-- TM - TM ∩ P[i]; 
                      if there is no new path which 
                         covers WS 
                         then terminate <-- true 
                         else flag <-- true 
                      end if 
      end if 
   end while 
end scanning. 


III. TIME COMPLEXITY 


Algorithm_Slicing is basically a backtracking algorithm. 
Finding a concurrency set consists of slicing out N paths 
from N processes. Because there may be several paths in 
each process, say k, the total number of combinations of these 
paths is k^N. The backtracking in this algorithm is necessary to find 
a new path combination. Hence the worst case is that the 
algorithm tries all possible combinations. 
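
The worst case can be made concrete with an exhaustive baseline that visits every combination of one path per process. This is a sketch for illustration only and is not the backtracking procedure itself. The `consistent` predicate is a hypothetical stand-in for the relation checks, and it assumes that t22 is the transition of P21 on which t13 depends, which the text does not state explicitly.

```python
from itertools import product


def count_combinations(processes):
    """Number of ways to pick one path per process (k^N when each has k)."""
    total = 1
    for paths in processes:
        total *= len(paths)
    return total


def exhaustive_search(processes, consistent):
    """Try every combination and keep those accepted by the predicate."""
    return [combo for combo in product(*processes) if consistent(combo)]


# Paths of the two processes in Figure 1.
processes = [
    [("t11", "t12", "t14"), ("t11", "t13", "t14")],   # P11, P12
    [("t21", "t22", "t23"), ("t21", "t23", "t24")],   # P21, P22
]


def consistent(combo):
    # Assumed rule: a path using t13 needs a partner path containing t22.
    path1, path2 = combo
    return "t13" not in path1 or "t22" in path2
```

Here k = 2 and N = 2, so four combinations are examined; under the assumed rule only the pairing of P12 with P22 is rejected.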


If there are n transitions in the Petri Nets model and N 
processes, we assume that the n transitions are evenly 
distributed among the N processes. That is, each process has n/N 
transitions. Now the problem is how to determine the 
number of paths in each process. Let us consider the 
situation in Figure 3. In the upper half, each place has two 
output transitions, and in the lower half, each pair of 
transitions has one output place. Thus the number of paths 
in this model equals the number of transitions at the m-th 
level. Let x denote the number of transitions in this model. 
We take this model as consisting of two binary trees. Thus 
we have the equation: 

$$2(2^m - 1) - 2^{m-1} = x \qquad (1)$$

Here $2^m - 1$ is the number of transitions in a binary tree with 
depth m, and $2^{m-1}$ is the number of transitions at the m-th 
level. Solving equation (1): 

$$2^{m+1} - 2 - 2^{m-1} = x$$
$$2^{m-1}(2^2 - 1) = x + 2$$
$$2^{m-1} = (x + 2)/3$$
$$m - 1 = \lg((x + 2)/3)$$
$$m = \lg((x + 2)/3) + 1$$

Thus the number of paths (i.e. the number of transitions at 
the m-th level) is: 

$$2^{m-1} = 2^{(\lg((x+2)/3)+1)-1} = 2^{\lg((x+2)/3)} = (x + 2)/3$$

number of paths with x transitions = (x + 2)/3     (2) 


Figure 3. Maximum number of paths in a Petri Net model 


The number of paths in a process calculated by equation 
(2) is at least as large as in the general case, because we 
assume each process has one entry and one exit. 
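
Equation (2) can be checked numerically against equation (1). The short sketch below uses hypothetical function names: it builds x from the depth m as in equation (1) and confirms that (x + 2)/3 reproduces the number of transitions at the m-th level.

```python
def transitions_in_model(m):
    """Equation (1): two binary trees of depth m sharing their m-th level."""
    return 2 * (2 ** m - 1) - 2 ** (m - 1)


def max_paths(x):
    """Equation (2): number of paths in a model with x transitions."""
    return (x + 2) // 3


for m in range(1, 6):
    x = transitions_in_model(m)
    print(m, x, max_paths(x), 2 ** (m - 1))
```

The integer division is exact whenever x comes from equation (1), since x + 2 = 3 * 2^(m-1).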


Now we can continue to discuss the time complexity of 
our algorithm. Since each process has n/N transitions, the 
number of paths in each process is (n/N + 2)/3, so that the 
worst case is: 

$$[(n/N + 2)/3]^N$$

Thus the time complexity is 

$$T(n) = [(n/N + 2)/3]^N = O((n/N)^N).$$


Discussion: 


There are two factors affecting the time complexity. One 
is n, the total number of transitions, and the other is N, the 
number of processes. If N is given (i.e. N is a 
constant), then T(n) becomes simpler: the time increases as 
n becomes larger. The other case is more 
complicated. If n is given, the 
complexity varies as the ratio n/N changes. Therefore, N 
will affect the complexity considerably. Now let us analyze 
the complexity function $(n/N)^N$. 


Let $y = (n/N)^N$ and differentiate with respect to N: 

$$(\ln y)' = (N \ln(n/N))'$$
$$y'/y = (N \ln(n/N))'$$
$$y' = (N \ln(n/N))' \, y = \left[\ln(n/N) + N \cdot \frac{-n/N^2}{n/N}\right](n/N)^N = (\ln(n/N) - 1)(n/N)^N$$

Let $y' = 0$, that is, $(\ln(n/N) - 1)(n/N)^N = 0$. 

Because in our problem 0 < N < n, we have $(n/N)^N \neq 0$, so 

$$\ln(n/N) - 1 = 0$$
$$\ln(n/N) = 1$$
$$n/N = e$$
$$N = n/e$$
When N < n/e, y' > 0, i.e. $(n/N)^N$ is increasing as N 
gets larger; when N > n/e, y' < 0, i.e. $(n/N)^N$ is decreasing as N 
gets larger. Figure 4 shows the curve of the complexity function $(n/N)^N$ 
(when n is given). 


Figure 4. The curve of the complexity function 


For example, if n = 100, T(100) = $(100/N)^N$. The 
following table gives the time complexity at different values 
of N. 


N          5         10         20          25          37          40          50 
(n/N)^N    3.2x10^6  1.0x10^10  9.54x10^13  1.13x10^15  9.47x10^15  8.27x10^15  1.13x10^15 


Table 1. Example values of the complexity function 


When N = 100/e = 36.79, $((n/N)^N)' = 0$. 


Because N is an integer, we consider N = 37 as the 
extreme value. From Table 1, we can find that when N < 
37, $(n/N)^N$ is increasing; and when N > 37, $(n/N)^N$ is 
decreasing. 
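
The location of the extreme value can also be confirmed by direct evaluation. The sketch below uses hypothetical function names and sample values of N chosen only for illustration: it evaluates (n/N)^N and searches the integers 1 to n - 1 for the maximum.

```python
def complexity(n, N):
    """The complexity function (n/N)^N for n transitions and N processes."""
    return (n / N) ** N


def worst_process_count(n):
    """Integer N in 1..n-1 at which (n/N)^N is largest."""
    return max(range(1, n), key=lambda N: complexity(n, N))


n = 100
for N in (5, 10, 20, 25, 37, 40, 50):
    print(N, f"{complexity(n, N):.3g}")
print("maximum at N =", worst_process_count(n))
```

For n = 100 the integer maximum falls at N = 37, the integer nearest to 100/e.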


IV. AN EXAMPLE 


In this section, we shall show how to apply our 
algorithm to the testing of a distributed software system. 
As mentioned in the INTRODUCTION section, in order to 
generate test data for the system, we should first determine 
which parts of the system must be executed during the 
testing. There may exist a set of concurrency sets in the 
system model. We have to determine these concurrency 
sets, before we can generate different sets of input data to 
execute these concurrency sets. 


Figure 5 gives an example of a distributed software 
system modeled on Petri Nets. There are three processes in 
the system. Now we apply our slicing algorithm to this 
Petri Nets model to obtain all concurrency sets. 


Figure 5. A system model in Petri Nets 


Input:  P[1], P[2], P[3] 
        CS = {a1, c1, e1, a2, b2, d2, f2, g2, c3, d3, ...} 
        All three processes are changable. 


1. First select the base path (a1, b1, c1, e1) in process 1. 
   S1 = { (a1, b1, c1, e1) } 


2. Mark communication transitions in process 2. 
   M = {b2, d2}; WS = {b2, d2} 


3. Find a path in process 2 to cover WS: 
   (a2, b2, d2, e2, k2, h2) 
   Because this path only has a relation with process 3, 
   we put it into S1: 
   S1 = { (a1, b1, c1, e1), (a2, b2, d2, e2, k2, h2) } 
   Mark c3 in process 3. 
   M = {b2, d2, c3}, WS = {c3}. 


4. Find a path in process 3 to cover WS: 
   (a3, b3, c3, d3, g3) 
   Because this path has a relation with transition f2 in 
   process 2 and f2 is not in S1, we go back to process 2 
   and give a temporary mark to f2: 
   TM = {f2}, WS = {b2, d2, f2}. 


5. Find a new path in process 2 to cover WS: 
   (a2, b2, d2, e2, f2, g2, h2) 
   Because this path has a relation with e1 in process 1 
   and e1 is already in S1, we put this path into S1 and delete the 
   old one. 
   S1 = { (a1, b1, c1, e1), (a2, b2, d2, e2, f2, g2, h2) } 
   Mark transition c3 in process 3: 
   M = {b2, d2, c3}, WS = {c3}. 


6. Find a path in process 3 to cover WS: 
   (a3, b3, c3, d3, g3) 
   Because the transition f2, which has a relation with this path, 
   is in S1, we put this path into S1: 
   S1 = { (a1, b1, c1, e1), (a2, b2, d2, e2, f2, g2, h2), 
          (a3, b3, c3, d3, g3) }. 


7. Now we have found a concurrency set S1. Delete all 
   communication transitions in S1 from CS: 
   CS = {f3}. 


Repeat this procedure: select base path in process 3: 
( a3, b3, c3, f3, h3, g3 ); 
We can easily get another concurrency set S2: 
S2 = { (a1, b1, d1, e1), (a2, b2, c2, e2, f2, g2, h2), (a3, 
b3, c3, f3, h3, g3) } 


Now CS = @, the algorithm terminates. There are two 
concurrency sets S1 and S2 in this Petri Nets model. 


V. CONCLUSION 


A slicing algorithm of concurrency modeling based on 
Petri Nets is presented in this paper. Its application has 
been discussed and an example of an application in 
distributed software testing is given. The time complexity of 
the algorithm is also analyzed. This slicing algorithm will 
be a very useful tool for analyzing distributed systems 
modeled in Petri Nets. 


Abstract--Simulation, which is very often 
compute bound on uniprocessor computers, is an 
excellent candidate application for parallel 
processing. This paper describes the 
implementation of a simple military simulation 
on the BBN Butterfly parallel computer. The 
entity centered nature of the problem will be 
described, followed by an overview of the model 
that represents the essential characteristics of 
similar full scale simulations. Implementation 
issues on the Butterfly and performance will be 
given, with particular attention to scaling up 
to the 124 node size of the available machine. 


1. Background 


The performance problem in military 
simulation is illustrated by the CORBAN (Corps 
Battle Analysis) simulation, which runs only 
about twice as fast as real time for a large 
scenario [1]. Since the period simulated is 
measured in days, this is too slow for 
operational use in decision aids. Adequate speed is very 
unlikely to be achieved at reasonable cost without 
resort to parallelism. The effort reported 
in this paper investigated the potential for 
speedup by using a small, simple simulation 
having similar structure to CORBAN, implemented 
on the Butterfly computer. A full report on the 
effort has been prepared [2]. 


The CORBAN simulation represents ground 
combat and some aspects of air combat, including 
perception, attrition, movement, and 
decisionmaking processes [3][4]. A simulated 
scenario includes many entities, each 
representing a military unit, which may have a 
variety of types of weapons. Each such entity 
may only engage enemy units which it can 
detect. In addition, combat simulations include 
decisionmaking algorithms and movement 
algorithms that cause entities to change 
location and hence the combat relationships 
among them. A time stepped approach is used in 
CORBAN, so that in each time step Δt every unit perceives, 
shoots, makes decisions, and moves. Entities 
directly observe and modify state variables of 
each other. This style of programming maps well 
onto a shared memory parallel computer such as 
the Butterfly, where all state variables can be 
maintained in global memory [5]. 


2. The Butterfly Computer 


The Butterfly is a shared memory parallel 
computer having up to 256 nodes, each with an 
8 MHz 68000 microprocessor, a 2901 bit slice node 
controller, and 1 Mbyte of memory. Instructions 
referencing memory located locally take about 
1 1/2 microseconds, while those referencing 
memory on another node take about 5 microseconds. 
References to other nodes are through an omega, 
or "Butterfly," network topology from which the 
machine takes its name. Each link has a capacity 
of 32 Mbits/sec. The machine is programmed by 
the user as an MIMD (Multiple Instruction stream, 
Multiple Data stream) shared memory machine [6]. 


3. The Simulation 

With a parallelism approach that processes 
entities on different processors, potential for 
problems occurs where more than one entity may 
access the same data simultaneously during 
perception, combat processes, and movement. 
Decisionmaking is a function internal to an 
entity, so it has no intrinsic requirement to 
reference other entities. Only the perception, 
combat, and movement functions were included in 
this simple simulation since they are the problem 
areas. This simulation, named "Zipscreen," was 
written in C with the addition of parallel 
constructs available through the BBN "Uniform 
System" package, which supports allocation of 
global memory blocks and scattered matrices, and 
parallel process generation [7]. 


The data structures of the program are shown 
in Figure 1. As in CORBAN, units are represented 
by blocks of data of varying length, which 
contain a number of common attributes in the 
"unitsb" structure and a list of assets in the 
concatenated "asitem" structures. Units are 
accessed using the "unarry" array. Each asset 
item specifies the type and quantity of each 
asset. The variable "hexloc" specifies an octal 
hex location address. These index into the array 
"phexar" which contains pointers to the linked 
list of units for each hex. The hex address 
"newhex" is the new hex a unit moves to during a 
time step. 


The program structure is illustrated in Figure 2. 
The "runsim" routine cycles through all 
of the time steps. The "GenOnIndex" is a 
mechanism provided by the Butterfly "Uniform 
System" to initiate a task for each unit. The 
"ciunit" routine calls "percev," the perception 
routine, to generate a target list of adjacent 
enemy units. The "hxadd" function is used to 
generate the hex addresses to be searched. If 
targets are found, the subroutine "combat" is 
called. The movement routine then updates the 
unit's position. After "ciunit" is completed for 
all units, "hxmove" is called for all units to 
update the occupancy lists. The combat routine 
makes separate calculations of effects for each 
of the weapons in the unit's asset list. A 
separate task, "OneShot," is spawned for each 
asset. Each weapon then calculates attrition 
against each asset of each target unit. The 
"movent" routine causes each unit to move at a 
constant speed in the direction "hexdir." When 
the unit moves into another hex, "hexdir" is 
added to the hex address "hexloc" to determine 
the new hex location "newhex." The original data 
structure for hex access had to be modified to 
use the "AllocateScatterMatrix" function of the 
Uniform System. For the hex pointer array, 
"phexar," the columns correspond to the hex 
numbers. A second element in the column is used 
for the lock. The matrix column pointer array is 
duplicated on all of the processors. Figure 3 
illustrates how this is done. 


Figure 1. Data Structures 


Figure 2. Parallel Program Structures 


Figure 3. Distribution of ScatterMatrix Pointers 
(1) scatter array ptphexar allocated in main 
(2) ptphexar inserted into glb by ciacty 
(3) Probs, the pointer to glb, passed via GenOnIndex 
(4) ptphexar retrieved from glb by Init Proc at node i 
(5) column pointers copied to local array by block 
    move in Init Proc 


4. Performance and Conclusions 
The original simulation, with no 
parallelism, required 29.81 seconds for a run 
with 99 units for five time steps. When 
parallelism was added for units but the machine 
was constrained to use only one processor, the 
run time was 30.59 seconds, a 3.4% increase in 
run time. Later, with concurrent processing of 
assets, the single processor run time increased 
to 31.00 seconds, for a total of 3.5% overhead. 
A version of the simulation that did not include 
movement was developed and run on the large 124 
node Butterfly. Figure 4 shows the results. The 
results are sensitive to the amount of 
combat. With an earlier scatter algorithm that 
permitted only 62 of the 800 units to engage in 
combat, efficiency for 124 nodes was still 
relatively high at 56%. 
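
The efficiency figures can be related to speedup with a short calculation. The sketch below assumes the usual definition, efficiency = speedup / number of processors, which the text does not state explicitly; the function names are hypothetical.

```python
def efficiency(speedup, processors):
    """Parallel efficiency under the assumed definition speedup / processors."""
    return speedup / processors


def implied_speedup(eff, processors):
    """Speedup implied by a reported efficiency on a given processor count."""
    return eff * processors


# Reported value: 56% efficiency on 124 nodes.
print(implied_speedup(0.56, 124))
```

Under that assumption, 56% efficiency on 124 nodes corresponds to a speedup of roughly 69 over a single processor.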


Figure 4. Zipscreen Performance on the Butterfly 
(Plotted curves: simx, 800 units, 10 time steps; simx, 480 units, 5 time steps.) 


In later histogramming, movement processing 
was found to consume a very small amount of the 
computational resources. Figure 4 also 
illustrates speedup curves using the most 
complete version of the simulation. Note that 
for the 480 unit case on 124 processors, removal 
of initialization processing raises the 124 node 
efficiency to 45%. Typical CORBAN runs range 
from about 500 to 1000 units. These two cases 
bound the range of performance that might be 
achieved without the use of optimized 
scheduling. Histograms indicated that load 
balancing was the dominant cause of 
inefficiency. 

The Butterfly computer running the 
Zipscreen simulation demonstrates the 
feasibility of parallel combat simulations of 
the type similar to CORBAN, having entity 
centered, time stepped architecture. Good 
scaling with additional processors was 
demonstrated for up to 124 processors, with 
efficiency limited by the task breakdown implied 
by data. 


An effort is currently underway to 
demonstrate parallel simulation on a larger, 
more detailed scale by implementing a subset of 
the CORBAN simulation on the BBN Butterfly and a 
hypercube architecture simulation. These 
projects are being pursued by the US Army 
Ballistic Research Laboratory, Los Alamos 
National Laboratory and The BDM Corporation, 
under the sponsorship of the Defense Advanced 
Research Projects Agency. 


Abstract -- This paper describes a new technique to 
analyze data-independent SIMD algorithms to obtain 
performance measures of their execution on a parallel 
processing system. The execution time of an SIMD algorithm 
is the greatest amount of time required by the control 
unit and any one processing element to complete 
execution. The operations performed by the control unit and 
this single processing element constitute the longest 
processing path or critical path through the parallel 
algorithm. A procedure is described to extract the critical 
path and to simulate the execution of the critical path on 
an SIMD control unit and a single processing element. 
This is a general technique that combines a complete 
parallel architecture hardware model with serial processing 
tools to obtain very detailed SIMD algorithm performance 
information. The simulation studies presented here are 
for SIMD signal processing algorithms written in 
Parallel-C and executed on the PASM parallel processing system. 


I. Introduction 


An SIMD machine typically consists of a control unit 
(CU), an interconnection network, and N processing ele- 
ments (PEs) where each PE is a processor-memory pair. 
The CU broadcasts instructions to the PEs and all 
enabled PEs execute the same instruction at the same 
time. Masking operations can be performed by the CU to 
enable or disable any set of PEs within the machine. The 
execution time of an SIMD algorithm is the greatest 
amount of time required by the CU and any one PE to 
complete execution of the algorithm. The operations 
performed by the CU and this single PE constitute the 
longest processing path or critical path through the parallel 
algorithm. Performance measures of a parallel algorithm 
such as execution time, throughput, utilization, and 
speedup can be determined by extracting the critical path 
and simulating a CU and a PE executing it. This paper 
describes a new technique to extract, compile, and simu- 
late the critical path of an SIMD algorithm. 

Many parallel processing systems are currently being 
designed or are under development. During the early 
stages of these systems parallel hardware and a complete 
set of parallel software tools are often not available. The 
simulation technique presented here is important because 
the critical path through the algorithm is a serial program 
and thus can be processed and simulated using existing 
serial processing tools. Simulation of the critical path 
permits analysis of all major components of an SIMD 
algorithm including enabling and disabling PEs, 
inter-processor communications, and CU operations. This 
technique is being used to design a real-time distributed 
speech understanding system [1]. 

Section II discusses the PASM architecture which is 
used as a model in this paper, the SIMD signal processing 
algorithms to be analyzed, and aspects of the Parallel-C 
programming language in which the algorithms are coded. 
Section III details the critical path analysis procedure. As 
an example of this procedure, Section IV presents the 
analysis of a fast Fourier transform algorithm. 


II. Architecture and Algorithms 


PASM Parallel Processing System 


PASM [10] is a partitionable SIMD/MIMD system 
which can be structured as one or more independent 
SIMD and/or MIMD machines. This dynamically 
reconfigurable multiprocessor system is being designed at 
Purdue University to serve as a research tool in speech 
and image processing applications. PASM consists of N 
PEs, Q Microcontrollers (MCs), a partitionable multistage 
interconnection network, multiple secondary storage units, 
and additional processors for job scheduling and I/O. 

N/Q PEs are associated with each MC to form an 
MC-group. An SIMD machine partition of MN/Q PEs is 
formed by combining the efforts of M MC-groups. The 
MCs act as CUs for their PEs in SIMD mode. The PASM 
operating system will schedule and protect system 
resources to support a multi-user environment. A proto- 
type of the PASM parallel system with N=16 PEs and 
Q=4 MCs is being constructed. The design uses the 
Motorola MC68010 microprocessor [5] as the computation 
engine of the PE, MC, and support processors. 


Parallel-C Signal Processing Algorithms 


Many common signal processing algorithms used in 
speech and image processing such as fast Fourier 
transforms (FFTs), digital filtering, power spectrum 
estimation, convolution, and histogramming [7, 8] have 
very regular data-independent control structures. It is 
this type of algorithm that this work addresses. 

The SIMD signal processing algorithms are written in 
Parallel-C [4], an extension to the C programming 
language. Parallel-C provides a “high-level” medium 
to describe a parallel algorithm without excessive 
emphasis on the details of the machine’s hardware. The 
language extensions for SIMD mode processing include the 
definition of parallel variables, functions, and expressions; 
a scheme for accessing parallel variables by using selec- 
tors; and extended control structure semantics. Selectors 
are special variables which are used to define the set of 
PEs that will perform a parallel operation. As in serial C, 
scalar variables, functions, and operations may also be 
defined. Variables are declared in Parallel-C by the new 
keywords parallel, scalar, and selector. 

Algorithms in Parallel-C can be written for whatever 
number of processors seems “natural” without regard for 
the number of physical processors that may actually be 
available at run time.† If the algorithm’s extent of 
parallelism exceeds the physical machine size, multiple virtual 
processing elements (VPEs) can be emulated on each real 
physical processing element (PPE) to create a virtual 
machine size (VMS) equivalent to the extent of 
parallelism. For each VPE emulated within a PASM PPE, a 
virtual microcontroller (VMC) is emulated within the 
associated physical microcontroller (PMC). If the extent of 
parallelism in an algorithm is E but only H PPEs are 
available in hardware, the operating system will map E/H 
VPEs into each real PPE and E/H VMCs into each PMC. 
Therefore, each of the M MC-groups in the machine 
partition emulates E/H virtual MC-groups. 

Conceptually, the ME/H MC-group emulations 
described above are separate processes, M of which are 
scheduled to run at a time. These processes are forced to 
perform a context switch at certain synchronization points 
in the algorithm to ensure correct emulation of the SIMD 
machine. For example, consider a synchronized SIMD 
inter-processor communication step in which all source 
VPEs write data to destination VPEs through the 
interconnection network. All VPEs must send data to their 
destinations before any is allowed to retrieve the data. A 
synchronization point is required in the network transfer 
routine to force each virtual MC-group to context switch. 
After E/H context switches have occurred all VPEs can 
continue. 


III. Analysis Procedure 


The execution time of an algorithm is obtained by 
extracting the critical path from the Parallel-C source 
program, compiling the resulting serial C program, assem- 
bling the code, adding the PASM library functions, and 
simulating the algorithm. 


Critical Path Extraction 


The source code of an algorithm written in Parallel-C 
describes each scalar and parallel operation and the 
appropriate masking information. The Parallel-C Critical 
Path (PCCP) preprocessor accepts the Parallel-C source 
program and converts it into an equivalent critical path C 
program using the PASM hardware machine size (Q MCs, 
N PEs) and the machine partition size (M MCs, MN/Q 
PEs) to specify the conversion. The critical path conver- 
sion process mimics some of the operations that would be 
performed by a Parallel-C compiler. 

PCCP extracts the critical path of an algorithm by 
transforming each Parallel-C input line into its critical 
path equivalent. As variables are declared, PCCP main- 
tains a list of each variable name and type while convert- 
ing the declaration into equivalent C language notation. 
Scalar variable declarations remain the same except for 
the elimination of the scalar keyword. Parallel variable 
declarations are converted into normal C language vari- 
ables. Selector variables are not declared in the critical 
path code. PCCP uses the variable information to 
differentiate between scalar and parallel expressions and 
to analyze parallel variable selector information. 

† This approach can be used for an operating system implementation 
in which multiple jobs may be active simultaneously and the available 
resources cannot be predicted at compile time. In real-time systems, 
programs would be coded for a specific number of processors and 
would be compile-time bound to those processors. 

A Parallel-C expression containing only scalar 
variables is equivalent to its critical path counterpart and no 
changes are necessary. Conversion of an expression 
containing parallel variables requires several steps. Initially, 
selector information is extracted from the expression. The 
selector information within a parallel statement specifies 
the MC masking operations required to execute the 
statement. It is necessary for the MC to perform masking 
operations only if the new selector differs from the selector 
of the previously executed parallel statement. The 
number of selectors, the selector type (select a fixed group 
of VPEs or a group of VPEs that varies depending on the 
value of a program variable), and the surrounding 
algorithm control structure may also indicate the necessity to 
perform MC masking operations. PCCP tracks selector 
information and generates special instructions for the MC 
to perform masking when required. After generating the 
masking information the parallel expression is converted 
by removing all selector information. 
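
The basic selector rule, that masking is needed only when the selector differs from that of the previously executed parallel statement, can be sketched as follows. This is an illustration and not the PCCP implementation: it ignores the additional cases mentioned above (variable selectors and control structure), and the representation of a program as a list of selector names, with `None` standing for a scalar statement, is a hypothetical simplification.

```python
def masking_points(selectors):
    """Indices of statements before which a masking operation is emitted.

    selectors -- one entry per statement: a selector name for a parallel
                 statement, or None for a scalar statement.
    """
    points = []
    previous = None
    for index, selector in enumerate(selectors):
        if selector is None:
            continue                  # scalar statements need no masking
        if selector != previous:
            points.append(index)      # selector changed: mask the PEs again
        previous = selector
    return points


program = ["all", "all", None, "even", "even", "all"]
print(masking_points(program))
```

In this sample program three masking operations are generated, one for each change of selector, and the scalar statement does not disturb the tracking.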

The PCCP preprocessor performs all of the opera- 
tions of the standard C preprocessor [3]. This permits the 
inclusion of special optimized algorithm macros and 
assembly language code to improve algorithm perfor- 
mance. The preprocessor also permits the selective elimi- 
nation of sections of Parallel-C source code from the final 
program. With the exception of the expressions that 
translate into MC control instructions, removing source 
code permits a closer dissection of the execution times of 
the operations which comprise the algorithm such as 
masking operations time, operating system overhead, and 
the time for inter-processor communications. 


Simulation 


The resulting serial C source program text is 
prepared for simulation by an MC68010 C compiler, 
assembler, and linker-loader. PASM library functions to 
perform C startup, initialization, mathematics, intercon- 
nection network, PE masking, and context switching 
operations are linked to the algorithm code. 

The output of the linker-loader is an object file and is 
interpreted by the simulator. The simulation program is 
used to simulate the CU and a PE executing the critical 
path of the parallel algorithm. The simulator is a general 
purpose uni/multi-processor hardware description and 
simulation program. It allows the modeling of a wide 
variety of machine interconnection structures and process- 
ing models. 

To obtain the algorithm critical path execution time 
the critical path VPE and associated VMC are simulated. 
When the machine partition size is less than the extent of 
parallelism, the critical path execution time must be mul- 
tiplied by E/H to obtain the complete algorithm execution 
time. This is accurate because the overhead of the 
operating system to perform context switching is included 
in the PASM library functions and hence in the critical 
path simulation results. 
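
The scaling rule above can be written as a one-line calculation. The following is a minimal sketch, assuming that E denotes the extent of parallelism (the number of VPEs) and H the number of PPEs in the machine partition; the function name and the sample numbers are hypothetical and not taken from the simulator.

```python
def total_execution_time(critical_path_ms, extent, partition_size):
    """Scale a simulated critical-path time to the complete algorithm.

    Assumption: `extent` is E (extent of parallelism) and
    `partition_size` is H (PPEs in the partition).
    """
    if extent <= 0 or partition_size <= 0:
        raise ValueError("extent and partition size must be positive")
    if partition_size >= extent:
        # Every VPE has its own PPE; the critical path is the whole run.
        return critical_path_ms
    # Each PPE executes E/H VPEs one after another.
    return critical_path_ms * extent / partition_size


# Hypothetical example: a 2 ms critical path, 256 VPEs on 64 PPEs.
print(total_execution_time(2.0, 256, 64))
```

The context-switching overhead is not added separately here because, as stated above, it is already contained in the simulated critical-path time.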


IV. Example Algorithm Analysis 


As an example of the detailed algorithm performance 
measures obtainable by the critical path extraction tech- 
nique, the analysis results for a 512-point complex 
decimation-in-time FFT algorithm are presented. The PE 
model used in this simulation uses the Motorola MC68010 
microprocessor and MC68881 floating-point coprocessor as 
the execution unit [5, 6]. A multistage cube interconnec- 
tion network [9] is used. 

Parallel SIMD algorithms to perform FFTs have been
described in [2]. The principal components of the algo-
rithm include computation steps called butterfly opera-
tions, interconnection network transfers, PE masking
operations, and algorithm control. The structure of the
512-point algorithm consists of 9 stages each containing
256 butterfly operations. Each butterfly operation
requires 2 complex floating point additions and 1 complex
floating point multiplication. The extent of parallelism in
the algorithm is 256 PEs. The complex data, D(i),
0 ≤ i < 512, is distributed 2 data items per VPE. At the
start of the algorithm VPE i contains D(i) and
D(i + 256). At algorithm completion VPE i contains data
items br(2i) and br(2i + 1), where br(i) indicates the bit-
reverse of i. The algorithm is optimized by pre-
calculating the FFT twiddle factors, masking operation
constants, and interconnection network transfer settings.
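
The data layout described above can be checked with a few lines of code. This is only an illustrative sketch: the function names are hypothetical, and it assumes that br(i) reverses the 9-bit binary representation of i (512 = 2^9).

```python
BITS = 9              # 512 = 2**9 points
POINTS = 1 << BITS    # 512 complex data items
VPES = POINTS // 2    # extent of parallelism: 256 VPEs


def bit_reverse(i, bits=BITS):
    """Return i with its `bits`-bit binary representation reversed."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result


def initial_items(vpe):
    """Indices of the two data items held by a VPE at the start."""
    return (vpe, vpe + VPES)


def final_items(vpe):
    """Indices of the two data items held by a VPE at completion."""
    return (bit_reverse(2 * vpe), bit_reverse(2 * vpe + 1))


def butterfly_count():
    """9 stages, each with 256 butterfly operations."""
    return BITS * VPES


print(initial_items(1), final_items(1), butterfly_count())
```

Because bit reversal is a permutation, the final items of all 256 VPEs together cover every index from 0 to 511 exactly once.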


Figure 1 is a graph of the cumulative execution times
of the operations that comprise the algorithm as a func-
tion of the number of PPEs within the machine partition.
Starting from the bottom of Figure 1, each solid line indi- 
cates the algorithm execution time resulting from the 
addition of the labeled operation to the operations 
graphed below. The top curve on the graph was obtained 
by simulating the complete FFT algorithm. The second 
curve from the top was obtained by using PCCP to elim- 
inate the code that performs the operating system func- 
tions. The difference between the top two curves is the 
operating system execution time. The difference between 
the next two curves is the execution time of the butterfly
operations. Additional performance measures such as
overhead ratio, speedup, and throughput [11] can be
obtained by combining these execution time results. Pro-
cessor utilization [11] can be inferred by using PCCP to
eliminate portions of the algorithm corresponding to each
PE mask setting, obtaining the execution time for the
remaining PEs, and comparing these times to the execu-
tion time of all PEs.

Figure 1 The cumulative execution times for the opera-
         tions that comprise a 512-point complex fast
         Fourier transform algorithm. (Plot of execution
         time in ms against the number of PPEs within the
         machine partition, 1 to 256; curves labeled
         operating system, butterfly operations, transfer
         operations, control operations, and masking.)


V. Summary 


A new method to obtain execution time related per- 
formance measures of certain data-independent SIMD 
algorithms running on a parallel processing system has 
been described. Performance measures such as execution 
time, speedup, throughput, and utilization can be 
obtained by extracting the critical path from a Parallel-C 
algorithm and simulating its performance. The critical 
path analysis procedure is described and the analysis of an 
FFT algorithm is presented. 

The technique allows detailed analysis of parallel 
algorithms at an early stage in the development cycle of a 
parallel system while the software and hardware tools are 
still under construction. Analysis of the critical path per- 
mits a more detailed algorithm analysis than is possible by 
purely analytical techniques because it includes actual 
compiler and operating system overhead details such as 
register allocation, function call overhead, high-level 
language initialization, context switch overhead, and 
specific details about the hardware implementation. 


Although PASM has been used as the specific model in 
this work, critical path analysis is a general technique and 
can be applied to other SIMD architectures executing a 
Parallel-C-like high-level language.


Abstract


This paper presents the numerical solution of
partial differential equations for the simulation of
physical phenomena on a class of multiprocessors
called EGPA systems. The multigrid methods used are
well-suited to the EGPA systems considered. An
implementation example is given. Communication
mechanisms are explained. The results are discussed
with respect to efficiency and speedup. Finally,
future work is outlined.


Introduction 


For many years only computer scientists have
been interested in multiprocessors consisting of
more than just a few elements. But with the appear-
ance of the Hypercube, such a system is commercially
available. In the field of vector computers the
trend to multiprocessors is also discernible. Until
now, these supercomputers were offered with only
a small number of processors, but the number is
continuously increasing.


At the University of Erlangen-Nurnberg, West
Germany, multiprocessors have been investigated for
several years. We are particularly interested in a
certain class called EGPA (Erlangen General Purpose
Array). Since we presented the features of such
systems in former papers [1-3], we will only give a
short description of the EGPA architecture in chap-
ter II. The main part of this paper deals with soft-
ware aspects.


Two EGPA systems were realized. The first one,
a small prototype (5 processors), was built in the
late seventies. In chapter III we will have a short
look at the other system, which is presently running
and consists of 21 processors.


With both multiprocessors, many algorithms were
implemented. Encouraging results [-6] demonstrate
that EGPA computers are well-suited to a broad
spectrum of applications. One area is the numerical
simulation of phenomena in physics, chemistry etc.
As a rule, the corresponding mathematical models
use partial differential equations. In order to
find a good solution of such an equation, the dis-
cretization must be fine and, as a consequence, the
required computer performance in FLOPS is very high.
The emergence of multigrid methods and the advance
of vector processors dramatically improved the si-
tuation. Many, but not all, phenomena can now
be successfully investigated by simulation. A lot
of supercomputer applications have emerged, and the
worldwide interest in multigrid algorithms and their
implementation is rapidly growing.

Since the main objective of multiprocessors is
high speed computation, such a system must be
suitably structured to fit these algorithms. In
chapter IV we present a general method for mapping
multigrid algorithms onto EGPA systems. A simple
example, the Poisson equation, is used in chapter V
to demonstrate in more detail the implementation
technique and related problems. The measured speed-
ups for this type of equation, together with the re-
sults for the more complicated Steady Stokes Equa-
tions, are discussed. Some reasons for the loss in re-
lation to the theoretically maximal values are given.


In the last chapter we give a short survey of our
recent research.


EGPA systems: 


EGPA systems form a subclass of multiprocessors
with the following features [3]:


Tight coupling: The architecture consists of pro-
cessor-memory-modules (PMM) which are connected in
two-dimensional orthogonal grid-like structures.
Each processor has access to the memories of the
four adjacent PMMs (bidirectional connections) and
each memory can be accessed by neighbouring proces-
sors through a multiport control device.


Restricted neighbourhood: By coupling each PMM
with four neighbours the local complexity remains
constant throughout the processor plane, whereas the
global complexity grows proportionally to the number
of PMMs in the system. The principle of constant lo-
cal complexity and linearly growing global complexity
can be realized in various size-independent topolo-
gies.


Regularity and homogeneity: In most cases a rect-
angular array of processors is most appropriate, as
it can be derived from many algorithmic schemes from
relaxation methods. Regularity regarding the inter-
connection structure, and homogeneity regarding the
PMM structure, are mainly directed to the requirements
of data exchange in multiprocessors.


Hierarchy: Two or more arrays of PMMs constitute 
a hierarchical system through "vertical" connections. 
Each processor (master) - except those at the lowest 
level - has access to the memories of four subordina- 
ted PMMs (slaves, unidirectional connections). 


For larger systems it is unnecessary to have many
hierarchical layers. Therefore a possible EGPA com-
puter may have the following structure (see fig. 1a):


- The number of PMMs in the lowest layer is four
  times larger than the number of PMMs in the
  second layer. Each PMM in the second layer is
  connected to a disk via I/O and has access to
  the memories of the slaves.

- The number of PMMs in the third layer will de-
  crease more strongly to produce, for instance,
  a tapering ratio of 1:16. Then, for hardware
  reasons, bus coupling should be preferred to me-
  mory coupling.


In such a system there is a functional partitio-
ning. The lowest layer is the main worker layer.
The next layer mainly has to perform operating
system functions, e.g. to manage the data swapping
from or to the disks. The processor (or processors)
in the top layer constitute an interface to a host
(or a workstation). All the upper layers are used
for broadcasting (or collecting) data or programs.


The EGPA system built with DIRMU modules 


DIRMU (Distributed Reconfigurable MUltiprocessor)
is a kit for building multiprocessors. The basic
element, a PMM, consists of two subunits: the P-sub-
module (8086/87 with 320 kB private memory and
8 exit-ports) and the M-sub-module (64 KByte 8-port
memory). Two PMMs can be connected by plugging a
cable into an exit-port of the first module and in-
to a memory port of the other module. Therefore,
every combination of memory coupled processors is
possible, whereby each processor can have up to se-
ven neighbours (1 exit-port and 1 memory port are
used to allow the P-sub-module to access its own mul-
tiport memory). One possible architecture is the
EGPA system shown in fig. 1b.


This system was used for the implementation,
among others, of the multigrid algorithms described
in the following. The DIRMU kit and the measured re-
sults are presented in several papers [5-7] in more de-
tail. For a better understanding, the following des-
cribes briefly the addressing technique. In this
case we want to build the 3-processor system shown
in fig. 2 top. The corresponding DIRMU system looks
like fig. 2 bottom and the addressing is explained in
fig. 3.


General Multigrid Approach on EGPA systems 


Let us assume that we are searching for a solution
u(x) of

$$L u = f, \quad x \in B,$$
$$C(u), \quad x \in \partial B,$$

where L is a (not necessarily) linear differential
operator and f is a function defined on the region
B contained in $\mathbb{R}^n$. C stands for a condition which
u has to fulfil at the boundary of B. Henceforth,
we assume that B is (or is almost) the unit
cube. Where necessary, this can be achieved by a
suitable transformation of the equations [8].


Such equations are usually solved by discretiza-
tion. For this purpose we choose a grid G with mesh
size h. Then we solve the approximate system by re-
laxation:

$$L_h u_h = f_h, \quad x \in B \cap G.$$

If h is small enough and the relaxation process
is executed sufficiently often, then the approxi-
mate solution $u_h$ is a good approximation of u. The
main problem is that the rate of convergence de-
creases strongly with decreasing h for all known
relaxation methods.


The situation can be essentially improved by
using multiple grids $G_i$ with $h = 2^{-i}$. A multigrid
procedure is given by a predefined sequence of
operations of the type R, E, L, $I_1$, $I_2$. The follo-
wing rules must be obeyed:


- Besides $G_0$, each grid under treatment is
  relaxed by operation R.

- In $G_0$ the equations are solved exactly
  (operation type E).

- In case of a transition from $G_i$ to $G_j$,
  $i = j \pm 1$.

- In case of a transition from $G_i$ to $G_{i-1}$, the
  residues $(f_h - L_h u_h)$ are restricted
  (operation L).

- In case of a transition from $G_i$ to $G_{i+1}$:

  a) If $G_{i+1}$ was already used,
     corrections must be interpolated and
     added to the existing approximations
     of the discrete u-values (operation $I_1$).

  b) If $G_{i+1}$ was not yet used, then the
     corresponding u-values must be inter-
     polated (operation $I_2$).

There are many possible orders in which the
grids can be passed. Examples are the V-cycle
shown in fig. 4a and a full multigrid cycle based
on V-cycles (fig. 4b).
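
The rules above determine the order of operations once the cycle type is fixed. The following sketch only generates such an operation sequence and performs no numerical work. It assumes one relaxation before each restriction and one after each interpolation, and that I1 denotes the interpolation of corrections and I2 the interpolation of u-values onto a grid that was not yet used; the function names are hypothetical.

```python
def v_cycle(level):
    """Operation sequence of a V-cycle starting and ending on grid `level`.

    Each entry is (operation, grid index). "L" is the restriction of the
    residues from grid `level` to grid `level - 1`; "I1" interpolates
    corrections from grid `level - 1` back to grid `level`.
    """
    if level == 0:
        return [("E", 0)]                 # exact solution on the coarsest grid
    ops = [("R", level), ("L", level)]    # relax, then restrict residues
    ops += v_cycle(level - 1)
    ops += [("I1", level), ("R", level)]  # add corrections, relax again
    return ops


def full_multigrid(finest):
    """Full multigrid based on V-cycles, from grid 0 up to grid `finest`."""
    ops = [("E", 0)]
    for level in range(1, finest + 1):
        ops.append(("I2", level))         # grid `level` is used for the first time
        ops += v_cycle(level)
    return ops


for op, grid in full_multigrid(2):
    print(op, grid)
```

Every transition in the generated sequence moves between neighbouring grids only, in agreement with the third rule.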


Multigrid methods are well-suited to EGPA systems
because all the operations are homogeneous and
local. Thus, a general approach for mapping such
an algorithm onto an EGPA system can be given.


We assume an EGPA system with m > 1 layers $L_1$ (top),
$L_2$, $L_3$, ..., $L_m$ (bottom). Mapping the algorithm onto
this computer will be effected in the following way:


- Assign each grid to the finest of all layers
  which possesses no more modules than the grid
  has points. This rule avoids situations where,
  for a calculation, a processor needs data which
  it cannot access directly. Another consequence
  is that as many grids as possible can be compu-
  ted by the processors in the lowest layer, so
  that the computing power is used very efficient-
  ly (see fig. 5).

- If the problem is 2-dimensional, each grid plane
  is partitioned and the parts are assigned to the
  modules of the corresponding processor layer in
  a natural way (fig. 6a). Partitioning is
  achieved in a load balancing manner. This means
  that all parts should be as equal as possible.
  Only parts with points of the boundary should
  be smaller to compensate for the higher computation
  load. Three-dimensional grids are projected on-
  to a 2-dimensional grid which is divided and
  assigned to processors in the same manner as
  described before. Each processor manipulates
  the 3-dimensional column of grid points pro-
  jected onto the assigned part (see fig. 6b).
  In any case, each processor has direct access
  to all needed data (fig. 6a).
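
The two mapping rules can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the layer sizes 1, 4, 16 correspond to the 21-processor pyramid, the grid point counts in the example are hypothetical, and the partitioning deliberately ignores the refinement that parts containing boundary points should be smaller.

```python
def assign_layer(grid_points, layer_sizes):
    """Index of the finest layer with no more modules than the grid has points.

    `layer_sizes` lists the number of PMMs per layer from top to bottom and
    is assumed to be increasing. If even the top layer is too large, the top
    layer (index 0) is returned.
    """
    chosen = 0
    for index, modules in enumerate(layer_sizes):
        if modules <= grid_points:
            chosen = index
    return chosen


def split_range(n, parts):
    """Split indices 0..n-1 into `parts` contiguous, nearly equal ranges."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


layers = [1, 4, 16]              # PMMs per layer, top to bottom
for points in (1, 9, 25, 4225):  # hypothetical grid sizes
    print(points, "points -> layer", assign_layer(points, layers))

# One grid direction with 10 points, divided among 4 processors.
print(split_range(10, 4))
```

Applying `split_range` to both grid directions yields the rectangular parts of a 2-dimensional grid plane; in the 3-dimensional case each part would carry the whole column of points projected onto it.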


It should be mentioned that the influence of the
boundary must be estimated by the programmer. Auto-
matic partitioning methods are not known at present.
The computational efficiency can be improved only
by analysing execution times obtained with test da-
ta.


At runtime, each processor executes a program
which differs from the monoprocessor program only in index
boundaries and in additional synchronisation parts.
If all is correctly implemented, the behaviour of
the multiprocessor program is completely equiva-
lent to that of the monoprocessor program. How
such a correct program can look will be shown in
the next chapter.


An implementation example 


In the following, we only discuss the parts of the
program which differ from those known from monopro-
cessors. For demonstration purposes we chose the
relatively simple Poisson equation with Dirichlet
conditions.


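$$-\frac{\partial^2}{\partial x^2} u(x,y) - \frac{\partial^2}{\partial y^2} u(x,y) = f(x,y), \quad (x,y) \in (0,1)^2 = G,$$
$$u(x,y) = g(x,y), \quad (x,y) \in \partial G.$$

Let, for $h = 2^{-k}$,

$$G_h = \{(x_i, y_j) : x_i = ih,\ y_j = jh;\ i, j = 0, 1, \ldots, 2^k\}.$$


A possible discretization of the equation is

$$\frac{1}{h^2}\left(4u_{i,j} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1}\right) = f_{i,j},$$

where $u_{i,j} = u(x_i, y_j)$ and $f_{i,j} = f(x_i, y_j)$.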


We parallelize a full multigrid algorithm (see
fig. 4b) implemented by Witsch [9] with the follo-
wing operations:

Relaxation: Each grid point is relaxed in red-
black order:

$$u_{i,j} := \frac{1}{4}\left(h^2 f_{i,j} + u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1}\right).$$

Red-black order means: All grid points are
coloured red or black so that the whole grid looks
like a checkerboard. All red points are relaxed be-
fore all black points. The main advantage is that
it is unimportant in which order the reds or the
blacks are relaxed. The reason is that the new red
values depend only on black values and vice versa.

Restriction: full injection
Interpolation 1: bilinear
Interpolation 2: of 4th order.
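
Red-black relaxation can be illustrated on a single processor. The sketch below assumes the five-point discretization of -Δu = f on a square grid with indices 0..n in each direction and fixed (Dirichlet) boundary values; it is not Witsch's program, the function name is hypothetical, and which colour is called "red" is an arbitrary choice.

```python
def red_black_sweep(u, f, h):
    """One red-black relaxation sweep over the interior points of u (in place).

    u and f are (n+1) x (n+1) nested lists; the boundary rows and columns of
    u hold the Dirichlet values and are never modified.
    """
    n = len(u) - 1
    for colour in (0, 1):                    # 0 = "red", 1 = "black"
        for i in range(1, n):
            for j in range(1, n):
                if (i + j) % 2 == colour:
                    # All four neighbours have the opposite colour, so the
                    # order within one colour does not matter.
                    u[i][j] = 0.25 * (h * h * f[i][j]
                                      + u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])
    return u


# Small demonstration: f = 0 and boundary values of u = x + 2y.
n = 8
h = 1.0 / n
exact = [[i * h + 2 * j * h for j in range(n + 1)] for i in range(n + 1)]
u = [[exact[i][j] if i in (0, n) or j in (0, n) else 0.0
      for j in range(n + 1)] for i in range(n + 1)]
f = [[0.0] * (n + 1) for _ in range(n + 1)]
for _ in range(200):
    red_black_sweep(u, f, h)
print(max(abs(u[i][j] - exact[i][j])
          for i in range(n + 1) for j in range(n + 1)))
```

Since points of one colour depend only on points of the other colour, each half-sweep can be distributed over the processors without changing the result, which is the property exploited in the parallel program.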


We now explain the control flow of the user pro-
gram for solving the Poisson equation. For the sake of sim-
plicity we assume that exactly one grid level is
manipulated by a processor layer. The following con-
trol variables are used:


deep    = 0      grid level is reached for the first time
                 (see the marked position in fig. 4)
          1      grid level is reached for the second time
                 (see the marked position in fig. 4)
          2      otherwise

deepest = true   the finest grid
          false  otherwise

end     = true   stop
          false  go on

Top processor:


The corresponding process is initialized by the sy-
stem. This process initializes the son processes
at each of its slave processors. Then the follo-
wing loop is executed:

   exact solution in G_0;

   interpolate into son memories;

   messages to sons;

   wait for messages from all sons;

   if message = "finished" then stop;

end loop.


All other processors execute the same program:


Initialisation: deep := 0; deepest := true/false;
                ν1 := number of relaxation steps
                      before interpolation;
                ν2 := number of relaxation steps
                      after restriction;
                initialize sons.


[Flow diagram of the program executed by all other
processors. Recoverable box labels: wait until all sons
have sent a message; copy; relax; restrict; message
'finished' to father; message 'go on' to father;
'go on' message.]


Although the above diagram is self-explaining, 
we have to clarify some features. 


The interpolations I_1 and I_2 produce data which
only affect the contents of the memories
of the next deeper PMM layer. In the first case the
corrections are added to the existing values of the
grid points; in the other case the grid is gene-
rated for the first time. In contrast to that, the
restriction operation writes the residues into the
memories of the same layer, since the processors
of a layer have no direct access to the memories in
the next higher layer. Therefore, the PMMs of the
higher layer first have to copy the data into their
memories.


While the above diagram shows synchronisation be-
tween layers, the following program demonstrates
synchronisation between processors of the same le-
vel.


We only look at the procedure "relax", which is
always called ν1 (ν2) times in sequence.


BEGIN
   WHILE blackrel & north # blackrel OR
         blackrel & west  # blackrel OR
         blackrel & south # blackrel OR
         blackrel & east  # blackrel
   DO wait (t) OD;

   relaxred;

   redrel := redrel + 1;

   WHILE redrel & north # redrel OR
         redrel & west  # redrel OR
         redrel & south # redrel OR
         redrel & east  # redrel
   DO wait (t) OD;

   relaxblack;

   blackrel := blackrel + 1

END
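
The behaviour of this routine can be imitated with threads that share counters. The following is a sketch only: threads stand in for PMMs, a one-dimensional chain of processors replaces the two-dimensional array, the actual relaxation is replaced by a log entry, and all names (`PMM`, `relax`, `run`) are hypothetical.

```python
import threading
import time


class PMM:
    """Stand-in for one processor-memory module with its two counters."""

    def __init__(self):
        self.redrel = 0      # number of completed red relaxations
        self.blackrel = 0    # number of completed black relaxations


def relax(me, neighbours, log, ident):
    # First WHILE: wait until every neighbour has relaxed its black
    # points as often as this processor has.
    while any(n.blackrel != me.blackrel for n in neighbours):
        time.sleep(0.0005)               # corresponds to wait (t)
    log.append(("red", ident, me.redrel))    # placeholder for relaxred
    me.redrel += 1
    # Second WHILE: wait until every neighbour has updated its red points.
    while any(n.redrel != me.redrel for n in neighbours):
        time.sleep(0.0005)
    log.append(("black", ident, me.blackrel))  # placeholder for relaxblack
    me.blackrel += 1


def run(num_pmms=4, sweeps=3):
    """Let `num_pmms` chained processors call relax `sweeps` times each."""
    pmms = [PMM() for _ in range(num_pmms)]
    log = []

    def worker(i):
        neighbours = [pmms[j] for j in (i - 1, i + 1) if 0 <= j < num_pmms]
        for _ in range(sweeps):
            relax(pmms[i], neighbours, log, i)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_pmms)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pmms, log


if __name__ == "__main__":
    pmms, log = run()
    print([(p.redrel, p.blackrel) for p in pmms])
```

Each counter is written only by its owner and merely read by the neighbours, so no lock is needed; this mirrors the use of variables in the multiport memory.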


To understand the routine we have to remember
that this program is running on every processor. This
is the reason why we write relative programs (rela-
tive to an assumed PMM location within the system).
We achieve this by using variables with relative
names. Two examples are "blackrel" and "redrel".
"redrel" ("blackrel") is the name of a set of vari-
ables. In each PMM, there is an incarnation of "red-
rel" located in the multiport memory. Such a set is
relatively addressed. Each processor writes or reads
the incarnation in its own multiport memory by using
the name "redrel". The corresponding incarnation of
the western neighbour (according to the gridlike
interconnection structure) is accessed by "redrel
& west". With DIRMU (fig. 3) we explain these facts
from a hardware point of view. Let us assume that
the connection from a processor to the multiport me-
mory of its western neighbour is plugged into exit-
port 3 and its own multiport memory is accessible by
addresses of port 0. Furthermore we assume that "red-
rel" has the address 8·64K+d. Then the address of
"redrel & west" is 11·64K+d. This address is trans-
formed to 8·64K+d by the multiport of the western
PMM, and this is the address of the variable "redrel"
from the point of view of the western processor.
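
The address arithmetic of this example can be reproduced directly. The sketch below follows the scheme described for DIRMU (private memory in pages 0-7, the memory reached through exit-port i in page 8+i, every such address seen by the owning PMM in page 8); the function names are hypothetical.

```python
PAGE_SIZE = 64 * 1024     # 64K per page
SHARED_BASE_PAGE = 8      # pages 0-7 are private memory
EXIT_PORTS = 8


def remote_address(exit_port, displacement):
    """Address a processor uses for the memory plugged into `exit_port`."""
    if not 0 <= exit_port < EXIT_PORTS:
        raise ValueError("exit port must be in 0..7")
    if not 0 <= displacement < PAGE_SIZE:
        raise ValueError("displacement must fit into one page")
    return (SHARED_BASE_PAGE + exit_port) * PAGE_SIZE + displacement


def owner_view(address):
    """Address of the same cell as seen by the PMM that owns the memory."""
    page, displacement = divmod(address, PAGE_SIZE)
    if page < SHARED_BASE_PAGE:
        raise ValueError("private memory is addressed without transformation")
    return SHARED_BASE_PAGE * PAGE_SIZE + displacement


d = 0x0040                               # hypothetical displacement of "redrel"
redrel = remote_address(0, d)            # own multiport memory via port 0
redrel_west = remote_address(3, d)       # western neighbour via exit-port 3
print(hex(redrel), hex(redrel_west), hex(owner_view(redrel_west)))
```

With eight exit-ports the shared pages 8 to 15 together with the private pages 0 to 7 cover sixteen pages of 64K each.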


In the above routine the variables "redrel" and
"blackrel" indicate how often the red
points and the black points, respectively, were relaxed. There-
fore these variables are set to zero during the ini-
tialisation. The first WHILE guarantees that the
data needed for relaxing the red points already
exists in the adjacent PMMs. The second WHILE is
necessary to be sure that, for the relaxation of the
black points, all needed red points residing in the PMMs
of the neighbours are updated.


Now we discuss an improvement of the main routine
given by the diagram. The synchronization between
layers is achieved by messages. In the implemented
program we actually used spinlocks, because this tech-
nique is much quicker than message passing (microseconds in
comparison to milliseconds). We illustrate this technique
for the synchronisation between layers (grids).


For controlling the states of the sons and of the
neighbours, we introduce the variables

   an  = activity number
   lan = last activity number

A father-process signals the next activity to the
sons by incrementing their activity number

   an & sons +:= 1

(we use this as an abbreviation for the real statements

   an & son 1 := an & son 1 + 1;
   ...
   an & son 4 := an & son 4 + 1;).

(Here we address the incarnations of "an" lying
in the memories of the slaves by adding "& son 1",
equivalent to the technique by which we select the incar-
nations in the memories of the neighbours. It is al-
so a kind of relative programming.)


Now let us assume that the sons have copied the
last value of "an" into "lan". If a son-process is
ready for a new activity, it executes a waiting-
loop

   WHILE an = lan DO wait (t) OD;

After the father has incremented "an", the proces-
sor executes the next statement after this WHILE.
For instance, it can check the states of the
neighbours by another waiting-loop:

   WHILE an & neighbours # an DO wait (t) OD

(an & neighbours # an stands for
   an & north # an OR ... OR an & west # an)


There are a few possibilities for upwards syn-
chronization (from son to father).


If the process terminates the current activity,
it can, e.g., change the sign of "an" (an := -an). Then
the father polls the values of "an" in the sons'
memories:


   WHILE an & sons > 0 DO wait (t) OD;


The father process leaves this waiting-loop only
when all son-processes have carried out the
requested activity.

In such a way we synchronize within most of our
programs. We found this time saving technique very
good and all users preferred it. There is one dis-
advantage in comparison to message passing: there
is no system control of the synchronization. To
overcome this we used hardware monitors for tra-
cing. Finally we want to emphasize one fact in con-
nection with our Poisson program. No matter which tech-
nique we use, only local synchronization takes place.
This is important, because a global synchronization
would be much more time-consuming.


The possibility to synchronize locally (practi- 
ced in most of our programs) and the possibility to 
program relatively are two very important features 
of our system. 


Some results 


We implemented two multigrid algorithms on the EGPA
system built with 21 PMMs (see fig. 1b) [2,13]. In
both cases, the processor at the top operates only
on grid 0 and the four processors of the middle
layer manipulate grid 1. All other grids were as-
signed to the 16 processors of the bottom layer.
The results are given in table 1.


Some comments on the results:


With an increasing number of grids the speedup in-
creases, since the synchronization and communica-
tion part of the program decreases in comparison to
the calculation part. If the problem is too small,
it makes no sense to use such a multiprocessor.
Some lower bounds for the size which a problem
should have are given by Mierendorff and Kolp [11].
They showed that in the case of a 3-dimensional multi-
grid algorithm the problem size has to be O(p^3),
where p is the number of processors in the lowest
layer. The results for the Steady State Stokes
Equation are better than those for Poisson, since
in the former case there is more work to do.


Since not all processors are always busy (only
one layer is working at a time), the maximal speed-
up cannot be equal to the number of processors. The
ideal speedup gives the upper bound which can
be achieved under ideal circumstances. The dif-
ference between measured speedup and ideal speedup
reflects the loss.
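
As a small worked example, speedup and the resulting fraction of the 16 worker processors can be computed from a pair of runtimes. The values 12.35 s (monoprocessor) and 1.28 s (16 processors) are read from the first column of part B of table 1; the function names and the notion of "efficiency" as speedup divided by the number of bottom-layer processors are assumptions made for this illustration.

```python
def speedup(mono_runtime, parallel_runtime):
    """Ratio of monoprocessor runtime to multiprocessor runtime."""
    if mono_runtime <= 0 or parallel_runtime <= 0:
        raise ValueError("runtimes must be positive")
    return mono_runtime / parallel_runtime


def efficiency(measured_speedup, processors):
    """Speedup per processor of the working layer."""
    return measured_speedup / processors


s = speedup(12.35, 1.28)
print(round(s, 2), round(efficiency(s, 16), 2))
```

The same two functions can be applied to an ideal speedup to quantify the loss discussed above.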


This loss is not caused by the time for the
synchronization action itself but by the waiting
times. These periods of idleness are caused by un-
balanced load, which results from the fact that
it is impossible to divide the grids used into
parts with equal load.


Future work 


We introduced EGPA systems and demonstrated the
programming of such systems.


For two multigrid algorithms the speedups mea-
sured on our 21-processor system were given. To-
gether with earlier encouraging results for
many other problems, they encourage us to continue
the investigation of such systems.


In order to increase the performance of the
DIRMU-PMM we decided to improve the hardware by
adding a powerful coprocessor. First measurements
with a first prototype yielded speedups of more
than 100 in comparison to the existing system. In
the near future we will test whether these good re-
sults can be generalized to a whole system.

In this context we will investigate the I/O pro-
blem which will occur if the data to be handled are
too numerous and swapping algorithms have to be used.
Provided that our theoretical estimates are cor-
rect, and we believe they are, we will be able to
manage this problem with additional hardware with-
out slowing our system down.


Fig. 3: Addressing scheme for private and shared me-
        mory in the DIRMU system.
        The private memory is addressed without address
        transformation (pages 0-7). Each PMM accesses a
        shared memory plugged into exit-port i by addres-
        ses corresponding to page 8+i. The PMM of which
        the shared memory is a part sees the addresses
        as belonging to page 8. Therefore each address
        (8+i)·64K+d (d = displacement) is transformed to
        8·64K+d.


Fig. 1: EGPA systems


Fig. 4: Multigrid method
        a) V-cycle
        b) full multigrid with V-cycles
        (Symbols: exact solving (E), relaxation (R),
        interpolations I_1 and I_2, restriction (L))


Fig. 5: Mapping the problem structure onto the EGPA
        system.
        Left, a system with 4 layers. The total
        number of processors in each layer is
        given. On the right are the grids G_i
        (2^{3(i+1)} points in the 3-dimensional case).
        The arrows show which processor manipulates
        which grid.


Fig. 6: Mapping one grid onto one layer
        a) 2-dimensional case. For one example the
           area to which one processor needs access
           is hatched.
        b) 3-dimensional case


Table 1: Measured results with 16 processors in the lowest
         level of the EGPA system (21 processor pyramid).

A) Poisson equation with Dirichlet conditions (MGOOD, Witsch)

   Number of grid layers               ...      ...      ...
   Grid points in the finest grid      4225     16641    65025
   Mono runtime / sec                  ...      ...      ...
   16 processors / sec                 ...      ...      ...
   Speedup                             ...      ...      ...

B) Steady State Stokes equation

   ν1 = ν2 = 1 (number of relax steps before/after coarsening)

   Number of grid layers               ...      ...      ...
   Number of cells in the finest grid  1024     4096     16384
   Mono runtime / sec                  12.35    48.43    192.82
   16 processors / sec                 1.28     ...      ...
   Speedup                             9.65     ...      ...

   ν ... (second parameter setting)

   Mono runtime / sec                  20.03    78.72    313.77
   16 processors / sec                 ...      ...      20.5
   Speedup                             10.3     15.57    15.3


In case B we parallelized a proposal of A. Brandt
[10]. This algorithm uses staggered grids. There-
fore the region is partitioned into cells.
CONSTRUCTING THE VORONOI DIAGRAM ON A MESH-CONNECTED COMPUTER

Abstract


In this paper, we present a Mesh-Connected Computer algo-
rithm to construct the Voronoi diagram of a set of planar points.
Given a set of n planar points, our algorithm constructs a Voronoi
diagram on an O(√n × √n) MCC with constant storage per pro-
cessor in O(√n log n) time. Using the Voronoi diagram, the prob-
lems of determining the nearest neighbor between two sets and
constructing the Euclidean minimum spanning tree can be solved
with the same time complexity on the MCC. The best sequential
algorithms for constructing the Voronoi diagram have an optimal
O(n log n) time complexity. The previously known parallel algorithm
for this problem requires O(log*n) time on a Parallel Random
Access Machine and O(log^4 n) time on the Cube-Connected-
Cycles with O(log n) storage per PE.


I. Introduction 


The Voronoi diagram (also called Thiessen diagram) of a set S
of n points is a well known structure which makes explicit some
proximity information about S. In the Voronoi diagram [1], each
point is surrounded by a convex polygon enclosing that territory
which is closer to the surrounded point than to any other point in
the set. Shamos [2] applied two-dimensional Voronoi diagrams to
obtain elegant solutions in computational geometry, such as
finding the nearest neighbor, construction of the minimum spanning
tree, etc. Generalizations of the Voronoi diagram were considered
by several authors. Shamos [3] describes an O(n log n) time
sequential algorithm to construct the planar Voronoi diagram for a
set of planar points. The strategy used in the serial algorithm is
divide-and-conquer. In the merging step the Voronoi edges are
generated in a "zigzag" sequential manner which seems difficult to
parallelize.


Our algorithm is based on Brown's [4] approach, which
transforms the problem of constructing a planar Voronoi diagram
for an $n$-point set to the construction of a convex hull of $n$ points
in three-dimensional space. Chow [5] worked on parallelizing the
algorithm to solve this problem. Using the shared memory
machine (SMM) with $n$ processors, the time performance of her
algorithm is $O(\log^3 n \log\log n)$. A recent paper [6] presented an
$O(\log^3 n)$ time algorithm to solve this problem with $O(n\log n)$
space. An algorithm for the construction of the Voronoi
diagram on Cube-Connected-Cycles with $O(\log n)$ storage per
processor was also proposed in [5], with $O(\log^4 n)$ time complexity.
Our algorithm has $O(\sqrt{n}\log n)$ time complexity, and is
performed on an $O(\sqrt{n} \times \sqrt{n})$ MCC with constant storage per PE,
which is a simpler model than the SMM. It needs to be pointed
out that $\sqrt{n}\log n < \log^3 n$ for practical $n$.

In the MCC, identical processors are connected via simple 
and regular communication paths to form a two dimensional array, 
each with a constant number of registers. It is a
single-instruction-stream, multiple-data-stream (SIMD) computer. (For
other geometric algorithms on MCCs, see [7-9].)


Several well-known MCC algorithms, such as sorting [10] and
performing a Random Access Write (RAW) and a Random Access
Read (RAR) [11], will be used in our paper. Section II describes
the algorithm for constructing the Voronoi diagram, and section
III provides some applications of the Voronoi diagram. A summary
will be given in section IV.


II. Constructing Voronoi Diagram 


Given two points $p_i$ and $p_j$ in the plane, the set of points
closer to $p_i$ than to $p_j$ is the half-plane containing $p_i$ that is
bounded by the perpendicular bisector of $p_i$ and $p_j$. Denote this
half-plane by $H(p_i,p_j)$. Given a set $S$ of $n$ points in the plane, the
locus of points in the plane that are closer to $p_i$ than to any other
point is a convex polygon (possibly unbounded). It is known as the
Voronoi polygon and denoted by $V_i$:

$$V_i = \bigcap_{j \neq i} H(p_i, p_j).$$

The Voronoi diagram is then composed of the $n$ polygons $V_i$
(see Figure 1). The vertices of the diagram are called Voronoi
points and the line segments are Voronoi edges.
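The definition of $V_i$ as an intersection of half-planes can be checked directly for a single query point. The following is a minimal sequential sketch, not part of the MCC algorithm; the function name `in_voronoi_polygon` and the list-of-tuples representation of the sites are assumptions made for illustration.

```python
def in_voronoi_polygon(q, i, sites):
    """Return True if q lies in V_i, i.e. in every half-plane H(p_i, p_j), j != i.

    q lies in H(p_i, p_j) exactly when it is at least as close to p_i as to p_j,
    so squared distances are enough (no square roots needed).
    """
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    return all(
        d2(q, sites[i]) <= d2(q, sites[j])
        for j in range(len(sites))
        if j != i
    )
```

This brute-force test costs $O(n)$ per query and is only meant to make the definition concrete.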


To construct a 2-dimensional Voronoi diagram, Brown [4]
used a technique called inversion to transform the planar points to
3 dimensions. Given an inversion center $P_0$ and an inversion
radius $r$, we can transform point $Q$ to point $Q'$ by inversion,
where $P_0Q'$ is in the same direction as $P_0Q$, and $|P_0Q'| =
r^2/|P_0Q|$. The inversion has the following properties:


1. An inversion transforms a plane which does not pass through
the inversion center to a sphere which passes through the
inversion center, and vice versa. Denote the former type of
transformation as a plane→sphere transform and the latter as
a sphere→plane transform.


2. The interior of the sphere corresponds to one of the half 
spaces bounded by the plane and the exterior of the sphere 
corresponds to the other half space. 


3. The inversion is involutory, i.e. application of inversion 
twice yields the original point. 


Given a set of $n$ points on the xy plane, we examine the
inversion of three of them with respect to an inversion center $P_0$
not in the xy plane. We can consider the inversion as a
plane→sphere transform such that the three points are on the xy
plane originally and their images are on a sphere passing through
the inversion center $P_0$. On the other hand, we can consider the
inversion as a sphere→plane transform in which the three points are
originally on a sphere passing through their circumcircle on the xy
plane and through $P_0$ outside the xy plane, and their images are on a
plane. In this case, if the other $n-3$ points are in the exterior of the
circumcircle, then due to property 2, their images are in one half space
bounded by the plane determined by the images of those three points
(see Figure 2). So, the problem of determining for each set of
three points whether their circumcircle excludes all other points
becomes that of determining for each face if it is a convex face.
Note that the number of faces is at most $2n-4$ for a hull with $n$
vertices.


The method used to construct a Voronoi diagram will be 
presented in a top-down manner below. 


Algorithm Voronoi: 


input: A set of $n$ points represented by their Cartesian coordinates
$(x_i,y_i)$. They are distributed on a $\sqrt{n} \times \sqrt{n}$ Mesh-Connected
Computer, one point per PE.


output: A set of Voronoi edges contained in the mesh. 


1. Perform the inversion for each point. For simplicity, choose
point $(0,0,1)$ as the inversion center and choose $r=1$. See
Figure 3. Let $(x'_i,y'_i,z'_i)$ be the coordinates of the inverted
image of a point, i.e. $x' = x/(x^2+y^2+1)$; $y' = y/(x^2+y^2+1)$;
$z' = (x^2+y^2)/(x^2+y^2+1)$.

2. Construct the 3-dimensional convex hull CH for the set of
image points. This is done by algorithm Convex Hull, which
will be described later in this section. At the end of this
step, each convex face $F_i$, represented by the three points through
which it passes, is stored in a PE with the flag of this face
extn=1. The addresses (i.e. PE indices) of the PEs in which
its adjacent faces are stored are also known by the PE. If
we exclude degeneracies, each convex face is a triangle and
has three adjacent faces.


3. Each PE performs the "reinversion" for the three points on
each convex face it contains. The image $(x'',y'')$ of a
point $(x',y',z')$ on the face $F_i$ is given by $x'' = x'/(1-z')$;
$y'' = y'/(1-z')$. Find the center of the circumcircle of the
three points corresponding to each face $F_i$. Let $C_i$ denote the
center of this circle. If $P_0$ and the CH are on the same side of
$F_i$, set $v_i$ to 1; $C_i$ is then a Voronoi point.


4. Each PE containing $C_i$ with $v_i=1$ does a RAR of $v_j$ from the
PE containing $F_j$ such that $F_j$ is an adjacent face of $F_i$ in
CH.
If $v_j=1$, $(C_i,C_j)$ is a Voronoi edge.
Else C; starts a ray in the (oF j direction. 


End of algorithm Voronoi 
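Steps 1 and 3 are purely local computations, which is why each takes constant time per PE. The sketch below illustrates them sequentially for the inversion center $(0,0,1)$ and $r=1$ chosen in step 1. The function names are hypothetical, and the check that every image lies on the sphere of radius 1/2 centered at $(0,0,1/2)$ is an added illustration of inversion property 1, not a step of the algorithm.

```python
def invert(x, y):
    """Step 1: inverse image of the planar point (x, y, 0) about (0, 0, 1), r = 1."""
    d = x * x + y * y + 1.0              # squared distance from the center to (x, y, 0)
    return (x / d, y / d, (x * x + y * y) / d)


def reinvert(xp, yp, zp):
    """Step 3: map an image point (x', y', z') back to the xy plane."""
    return (xp / (1.0 - zp), yp / (1.0 - zp))


def on_image_sphere(pt, tol=1e-12):
    """Property 1: images of the xy plane lie on a sphere through the center.

    For this choice of center and radius the sphere has radius 1/2 and
    center (0, 0, 1/2).
    """
    x, y, z = pt
    return abs(x * x + y * y + (z - 0.5) ** 2 - 0.25) < tol
```

Because the inversion is involutory (property 3), `reinvert(*invert(x, y))` returns the original point up to rounding.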


Theorem: Algorithm Voronoi finds the Voronoi diagram for a set
of $n$ planar points in $O(\sqrt{n}\log n)$ time on an $O(\sqrt{n} \times \sqrt{n})$
Mesh-Connected Computer.


Proof: We give this proof with the assumption that the algorithm
Convex Hull is correct and takes $O(\sqrt{n}\log n)$ time. (This will be
justified once we describe algorithm Convex Hull later in this
section.) Steps 1 and 3 can be done in constant time. Step 4 involves
at most three RARs and needs $O(\sqrt{n})$ time. Thus algorithm
Voronoi has $O(\sqrt{n}\log n)$ time performance, subject to the assumption
about algorithm Convex Hull to be presented. □


The rest of this section deals with the description of
algorithm Convex Hull. The strategy we use is divide-and-conquer,
based on [4]. Given two sets of faces, each belonging to a convex
hull, we first need to decide for each face if it is an external face.
We then remove the faces which are not external. Finally we add
new faces along the "circuit" which is shared by pairs of faces
consisting of an external face and a non-external face.


To judge if a face is external, we need to test if the two convex
hulls to be merged, say convex hulls A and B, are in the same
half space bounded by the face. Let the face to be judged be $F_i^A$ in
convex hull A. Denote all the faces in convex hull B as $F_j^B$'s.
Instead of testing all the $F_j^B$'s against $F_i^A$, we only test two
representatives $F_1^B$ and $F_2^B$ such that if they are in the same half
space as convex hull A, every face in B is. When we refer to a face, we
indicate it by three points and define each point by its coordinates
$(x_i,y_i,z_i)$. The equation of the plane can be expressed as follows:

$$\begin{vmatrix} x & y & z & 1 \\ x_1 & y_1 & z_1 & 1 \\ x_2 & y_2 & z_2 & 1 \\ x_3 & y_3 & z_3 & 1 \end{vmatrix} = 0.$$
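Expanding the determinant gives a linear equation in $x$, $y$, $z$, and the sign of its left-hand side tells on which side of the face a point lies. The sketch below is one way to carry out the half-space test sequentially; it obtains the coefficients from a cross product, which agrees with the determinant expansion up to a common scale factor. The function names are assumptions for illustration.

```python
def plane_through(p1, p2, p3):
    """Coefficients (alpha, beta, gamma, delta) of the plane through three points.

    The normal (alpha, beta, gamma) is the cross product (p2 - p1) x (p3 - p1).
    """
    ux, uy, uz = (p2[k] - p1[k] for k in range(3))
    vx, vy, vz = (p3[k] - p1[k] for k in range(3))
    alpha = uy * vz - uz * vy
    beta = uz * vx - ux * vz
    gamma = ux * vy - uy * vx
    delta = -(alpha * p1[0] + beta * p1[1] + gamma * p1[2])
    return alpha, beta, gamma, delta


def side(plane, q):
    """Return +1, -1 or 0 according to the sign of alpha*x + beta*y + gamma*z + delta."""
    a, b, c, d = plane
    s = a * q[0] + b * q[1] + c * q[2] + d
    return (s > 0) - (s < 0)


def same_half_space(plane, points):
    """True if no two of the points lie strictly on opposite sides of the plane."""
    signs = {side(plane, q) for q in points} - {0}
    return len(signs) <= 1
```

In the algorithm only the vertices of the two representative faces need to be passed to such a test, rather than every vertex of convex hull B.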


The general form of the above equation is $\alpha_i x + \beta_i y + \gamma_i z + \delta_i = 0$.
The details of algorithm Convex Hull are as follows:


Algorithm Convex Hull: 


input: A set of points represented by their coordinates $(x_i,y_i,z_i)$.
They are distributed on a $\sqrt{n} \times \sqrt{n}$ Mesh-Connected Computer, one
point per PE.

output: A set of convex faces where each face is represented by
three points; a flag indicating whether the face corresponds to a
Voronoi point; and the indices of the PEs where the information
about the adjacent faces is stored.


2.1 Sort the $n$ points by x-coordinates into non-decreasing order
in shuffled row major.

2.2 Recursively solve the convex hull problem in parallel on
each block of the mesh, where a block consists of $2^k$ PEs in
the $k$-th iteration.

2.3 Combine two convex hulls each generated in a submesh.
(Fig. 4(a) is a reference.)

2.3.1 Decide for each face $F_i$ if it is an external face, and
indicate if so by setting the flag extn to 1, as described
below.

Each PE which contains a face of convex hull A,
$F_i^A$, performs the following actions:

Find the two representative faces of convex hull B,
$F_1^B$ and $F_2^B$. (This is done by invoking the algorithm
Represent which will be presented later.)

Test if $F_1^B$, $F_2^B$ and convex hull A are in the
same half space bounded by $F_i^A$.

If they are, set flag extn to 1.

PEs containing faces of convex hull B perform similar
actions independently and simultaneously, with A
and B reversed.


2.3.2 Find the circuit edges.

The PE which contains $F_i^A$ and has set $extn_i=1$ does
a RAR from the PE where $F_j^A$, an adjacent face of $F_i^A$,
lies. If $extn_j = 0$, mark the points shared by $F_i^A$ and
$F_j^A$. (They form a circuit edge.) All of the above is repeated
three times, once for each adjacent face of $F_i^A$.

(Do the same thing for the PEs containing $F_i^B$'s in
parallel.)

To order the edges in the circuit, independently and
concurrently sort all the marked points $p$ in A and
in B by $\cos^{-1}\left(y/\sqrt{y^2+z^2}\right)$, which is the angle between
$op'$ and the y axis, where $p'$ is the projection of $p$ on the
yz plane and $o$ is the origin.


2.3.3 Find the new convex faces between the circuit in A
and the circuit in B.
comment: Each face is decided by an edge $E_i^A$ in circuit A
(respectively $E_i^B$ in circuit B) and a vertex in
circuit B (respectively A). Using the method in [5],
we need to find for each $E_i^A$ ($E_i^B$ respectively) a vertex
from all the vertices in circuit B (A respectively) to
form a convex face. Once a vertex, say $p_j$, is chosen,
it divides the vertices in circuit B into two subsets.
Those edges with index smaller (respectively greater)
than $E_i^A$ need only search among the vertices with
index smaller (greater respectively) than $p_j$. Here is
the implementation. (See Fig. 4(b).)

Originally all the edges in circuit A form a set and
all the PEs containing points in circuit B (marked
points) are the submesh associated with the set. Do
the following recursively:
(**) Choose an edge with the middle index in each set
and broadcast its record to its associated submesh.

For each associated submesh concurrently, do the
comparison in each row of the submesh in parallel
and in the rightmost column of the submesh afterward.
Thus the vertex to form the convex face can be
found by the criterion in [5]. Denote the index of the
PE containing it as $PE_m$.

Let the edges with index smaller (greater respectively)
than or equal to the middle be a set, and the PEs
with index smaller (greater respectively) than or equal
to $PE_m$ be the associated submesh of the set. (For
simplicity, assume the submesh is a rectangular block, since
the points in an incomplete row may be distributed to
the PEs in the regular block with at most a constant
increase in the number of points handled per PE. See
Fig. 5.) Go to the line marked "**".

For each new face, set extn to 1.


End of algorithm Convex Hull 


Theorem: Algorithm Convex Hull finds for a set of $n$ points its
convex hull in $O(\sqrt{n}\log n)$ time on an $O(\sqrt{n} \times \sqrt{n})$
Mesh-Connected Computer.


Proof: Sorting in step 2.1 takes $O(\sqrt{n})$ time. Denote the running
time of algorithm Convex Hull as $T(n)$. $T(n)$ can be expressed by
the recurrence equation $T(n)=T(n/2)+M(n)$. To find $M(n)$,
assume algorithm Represent in step 2.3.1 is correct and has
$O(\sqrt{n}\log n)$ time complexity. (This will be shown after the
description of algorithm Represent.) Step 2.3.2 involves a constant
number of RARs and sorting steps; therefore $O(\sqrt{n})$ time is
sufficient. In step 2.3.3, broadcast and comparison in each row or
column each take $O(\sqrt{n})$ time. Step 2.3.3 needs to be
repeated for $\log n$ iterations, and thus takes $O(\sqrt{n}\log n)$ time in total.
Since $M(n) = O(\sqrt{n}\log n)$, we have $T(n) = O(\sqrt{n}\log n)$. A convex
hull with $n$ vertices has at most $3n-6$ edges and $2n-4$ faces. We
can distribute the elements on the Mesh-Connected Computer of
size $\sqrt{n} \times \sqrt{n}$, with a constant number of points, edges or faces per
PE. The proof is complete subject to the correctness of the
assumption about algorithm Represent, which will be shown as
follows. □
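The step from the recurrence to the closed form can be spelled out as follows. This is a sketch assuming $M(n) \le c\sqrt{n}\log n$ for some constant $c$ (the constant $c$ is introduced here only for illustration), with the recurrence unrolled over its $\log n$ levels:

```latex
\begin{aligned}
T(n) &= T(n/2) + M(n)
      \;\le\; \sum_{k=0}^{\log n} c\,\sqrt{n/2^{k}}\,\log\!\left(n/2^{k}\right) \\
     &\le\; c\,\sqrt{n}\,\log n \sum_{k=0}^{\infty} 2^{-k/2}
      \;=\; \frac{\sqrt{2}}{\sqrt{2}-1}\; c\,\sqrt{n}\,\log n
      \;=\; O(\sqrt{n}\log n).
\end{aligned}
```

The merge costs form a geometric series dominated by the top level, so the recursion adds only a constant factor (about 3.41) to the cost of the final merge.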


The representative face of $F_i^A$ is the face $F_j^B$ in CH(B) such
that angle($F_i^A$, $F_j^B$) is the smallest. Let the normal vectors of face
$F_i^A$ and face $F_j^B$ be $\langle a_i,b_i,c_i\rangle$ and $\langle a_j,b_j,c_j\rangle$ respectively. Then

$$\cos \mathrm{angle}(F_i^A, F_j^B) = \langle a_i,b_i,c_i\rangle \cdot \langle a_j,b_j,c_j\rangle = a_i a_j + b_i b_j + c_i c_j.$$

If angle($F_i^A$, $F_j^B$) is the smallest, $a_i a_j+b_i b_j+c_i c_j$ is the greatest when
$0 \le \mathrm{angle} \le \pi$. Investigate the distance between point $p_i$ with
coordinates $(a_i,b_i,c_i)$ and point $p_j$ with coordinates $(a_j,b_j,c_j)$:

$$\mathrm{distance}(p_i,p_j) = \sqrt{2\left(1-(a_i a_j+b_i b_j+c_i c_j)\right)}.$$

We can see that if $a_i a_j+b_i b_j+c_i c_j$ is the greatest, distance($p_i,p_j$) is the
smallest. To find for $F_i^A$ with normal vector $\langle a_i,b_i,c_i\rangle$ the face $F_j^B$,
we need only find for point $p_i$ with coordinates $(a_i,b_i,c_i)$ its
nearest neighbor among all the points $p_j$ with coordinates
$(a_j,b_j,c_j)$, where $\langle a_j,b_j,c_j\rangle$ is the normal vector of $F_j^B$.
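The equivalence used here (smallest angle, greatest dot product, smallest Euclidean distance between unit normals) can be checked numerically. The sketch below is a brute-force illustration under the assumption that all normals have been normalized to unit length; the function names are hypothetical, and the $O(n)$ scan stands in for the point location that the MCC algorithm actually performs.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def representative_by_angle(n_i, normals_b):
    """Index of the normal in B with the greatest dot product (smallest angle)."""
    return max(
        range(len(normals_b)),
        key=lambda j: sum(a * b for a, b in zip(n_i, normals_b[j])),
    )


def representative_by_distance(n_i, normals_b):
    """Index of the normal in B nearest to n_i as a point on the unit sphere."""
    return min(range(len(normals_b)), key=lambda j: math.dist(n_i, normals_b[j]))
```

For unit vectors the squared distance equals $2(1 - \text{dot product})$, so the two functions select the same face whenever the best candidate is unique.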

All the points $p_j$ corresponding to faces of CH(B) are on the
unit sphere, and we can construct a spherical Voronoi diagram for
them by projecting onto the sphere the edges of the circumscribing
polyhedron which is formed by the planes tangent to the sphere at
the $p_j$'s [4]. Performing point location for query point $p_i$, we can find
its nearest neighbor. Observing the Voronoi diagram in Figure 1, we
can see that it is composed of a set of broken lines generated at
different levels of the merging step, when the Voronoi diagram is
being constructed by a divide-and-conquer approach. In Figure
6(a), the bold broken line was generated in the final (level $\log n - 1$)
merge step. The fine broken line was generated one step earlier
and the dashed broken line was generated two steps earlier. If we
sort all the points, on whose Voronoi diagram we are locating the
query point, by their x-coordinates in non-decreasing order so that
their binary indices are $b_{\log n-1} \cdots b_1 b_0$, each segment in the $i$-th level
broken line is a perpendicular bisector of two points such that the
most significant bit that differs between their indices is the $i$-th bit.
To locate the query point $p_i$, we first want to decide whether it
falls in the left half or right half of the Voronoi polygons. Then we
find which quarter it falls in, and so on. For the example of Fig.
6(a), we first want to decide which side of the bold broken line the
query point $p_i$ lies on. Then we find which side of the fine broken line
it lies on, and so on. To decide which side of the broken line a
point lies on, first we divide the plane into horizontal slabs each
containing one segment (see Fig. 6(b)). Once we find the slab the
point falls in, it becomes very easy to examine on which side of the
segment, and thus of the broken line, point $p_i$ lies. The point
location on the spherical Voronoi diagram is similar. The MCC
algorithm is as follows:


Algorithm Represent 


input: Two sets of faces belonging to the convex hulls of A and B
respectively, where face $F_i$ is represented by the equation $\alpha_i x +
\beta_i y + \gamma_i z + \delta_i = 0$. The faces are distributed one per PE.

output: For each convex face in A, two representative convex faces
in B (the PE which contains a face in A will have at the end of
the algorithm the indices of the PEs which contain the latter), and
for each convex face in B, two representative convex faces in A.


2.3.1.1 A PE which contains a face $F_i^A$ ($F_j^B$ respectively) finds
its normal vectors $\langle a_i,b_i,c_i\rangle$ and $\langle -a_i,-b_i,-c_i\rangle$
($\langle a_j,b_j,c_j\rangle$ and $\langle -a_j,-b_j,-c_j\rangle$ respectively), where

$$a_i = \frac{\alpha_i}{\sqrt{\alpha_i^2+\beta_i^2+\gamma_i^2}}, \quad
b_i = \frac{\beta_i}{\sqrt{\alpha_i^2+\beta_i^2+\gamma_i^2}}, \quad
c_i = \frac{\gamma_i}{\sqrt{\alpha_i^2+\beta_i^2+\gamma_i^2}}.$$

The following steps find for each point $p_i(a_i,b_i,c_i)$ and
$\bar{p}_i(-a_i,-b_i,-c_i)$ in A (B respectively) its nearest
neighbor among the $p_j(a_j,b_j,c_j)$'s in B (A respectively). (They
are all on the unit sphere.)

2.3.1.2 Construct the spherical Voronoi diagram for the points
in A and B respectively. We describe the steps for the
points in A. Steps for the points in B are similar.

(a) Find for all the points $p_i(a_i,b_i,c_i)$ corresponding to
face $F_i^A$ the planes tangent to the unit sphere at that
point. These planes will form a circumscribing
polyhedron of the unit sphere. Each PE which contains
a face $F_i^A$ generates the equation of the plane
$a_i x + b_i y + c_i z = 1$.

(b) To find the edges of the circumscribing polyhedron,
each PE containing $F_i^A$ does a RAR from the PE
which contains an adjacent face $F_j^A$. The intersection
of the planes

$$a_i x + b_i y + c_i z = 1, \qquad a_j x + b_j y + c_j z = 1$$

decides an edge of the circumscribing polyhedron.
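(c) By examining in convex hull A all the faces adjacent
to $F_i^A$ of CH(A), we can determine the vertices
$(u_i,v_i,w_i)$ of the circumscribing polyhedron.

(d) Find the Voronoi points $\lambda_i(\xi_i,\eta_i,\zeta_i)$ and Voronoi arcs $e_i$
of the spherical Voronoi diagram. Each PE computes

$$\xi_i \leftarrow \frac{u_i}{\sqrt{u_i^2+v_i^2+w_i^2}}, \quad
\eta_i \leftarrow \frac{v_i}{\sqrt{u_i^2+v_i^2+w_i^2}}, \quad
\zeta_i \leftarrow \frac{w_i}{\sqrt{u_i^2+v_i^2+w_i^2}},$$

$$e_i \leftarrow \left\{\; x^2+y^2+z^2 = 1, \quad
\begin{vmatrix} x & y & z \\ \xi_1 & \eta_1 & \zeta_1 \\ \xi_2 & \eta_2 & \zeta_2 \end{vmatrix} = 0 \;\right\}.$$

Each Voronoi arc "remembers" the two points with
which it is associated.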

2.3.1.3 Mark the levels for the Voronoi arcs in the spherical
Voronoi diagram, as described below.

Sort all the points $p_i$ (the points surrounded by the
Voronoi arcs) by $\theta_i = x_i/\sqrt{x_i^2+y_i^2}$. Each point has its
binary index.

Each PE which contains a Voronoi arc $e_i$ does a RAR
to get the indices of the pair of points it is associated with.

Find the "bit exclusive or", say $\Psi$, of the binary indices
of each pair of points associated with $e_i$. The level of $e_i$,
$l_i$, can be found by $l_i = \lfloor \log \Psi \rfloor$.
$[(\text{2's complement}(2^{l_i}) - 2^{l_i}) \wedge (\text{index of } e_i\text{'s associated point})] / 2^{l_i+1}$
is the index of the chain to which $e_i$ belongs. For example, if $e_i$ is
associated with points 0010 and 0101, the "bit exclusive or"
of 0010 and 0101 is 0111 and $\lfloor \log 0111 \rfloor = (2)_{10}$, so $e_i$ is of
level 2. Furthermore, 2's complement($2^2$) $- 2^2$ =
2's complement(0100) $-$ 0100 = 1100 $-$ 0100 = 1000, and
(1000 $\wedge$ 0010) (or 1000 $\wedge$ 0101) = 0; so $e_i$ is indexed as
0 in the chains of level 2.


2.3.1.4 From level $i = \log n - 1$, $i = i - 1$, until $i = 0$,
each PE containing an unlocated query point in subset
$j$ checks which side of the Voronoi arc chain $j$ of level
$i$ it lies on.

(a) Sort all the points in the chain using the chain index
as the major key and their y-coordinates as the minor
key, together with the query points using the index of
the subset they belong to as the major key and their
y-coordinates as the minor key. Each query point
records the index of the PE which it is in as
global_rank.

(b) Sort the query points only, using the same key
described above. Each query point records the index
of the PE which it is in as local_rank and calculates
dest = global_rank − local_rank − 1.
Then sort all the points in the Voronoi arc chain only,
using the same key described above.

(c) Each PE containing a query point does a RAR from
the PE with index dest to get the equation of the
Voronoi arc and the two points associated with it.
Decide whether the Voronoi arc lies to the left/right
side of the query point and record it as
$bound_L$/$bound_R$.

(d) If a query point has $bound_L$ and $bound_R$ bounding the
same region, i.e. they have the same associated point, the
query point has been located. Record the index of
the PE in which the "owner" point of the region lies.
Let all the unlocated query points on the left/right
sides of the Voronoi chain of level $i$ be a subset
whose index is $b_{\log n-1} \cdots b_i \cdots b_1 b_0$ with $b_i$ equal to 0/1.

comment: In a sorted list, the rank of an element is the
number of the elements before it in the list. Given two sets
P $\{p_i\}$ and Q $\{q_j\}$, if they are sorted together, the global_rank of
an element indicates the sum of the number of $p_i$'s and $q_j$'s
previous to it. Its local_rank indicates the number of $q_j$'s
previous to it only. So global_rank − local_rank
indicates the number of $p_i$'s previous to it, and the one closest to
it has its own local_rank equal to global_rank − local_rank
− 1. For the details, see [8].


End of algorithm Represent 
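The rank arithmetic in steps (a) and (b) of 2.3.1.4 can be illustrated sequentially. The sketch below is a simplified model under stated assumptions: each element has a single numeric key (standing in for the chain index and y-coordinate pair), chain elements are placed before query points on ties, and the function name `predecessor_by_ranks` is hypothetical. On the mesh the three sorts would be performed by the MCC sorting algorithm rather than by Python's `sorted`.

```python
def predecessor_by_ranks(chain_keys, query_keys):
    """For each query, find the position (in the sorted chain) of the closest
    chain element that precedes it, using dest = global_rank - local_rank - 1.

    Returns (sorted chain keys, list of dest values); dest is -1 when no chain
    element precedes the query.
    """
    # (a) Sort chain elements and queries together; tag 0 sorts chain first on ties.
    merged = sorted(
        [(k, 0, i) for i, k in enumerate(chain_keys)]
        + [(k, 1, i) for i, k in enumerate(query_keys)]
    )
    global_rank = {i: pos for pos, (_, tag, i) in enumerate(merged) if tag == 1}

    # (b) Sort the queries alone to obtain local ranks, then the chain alone.
    queries_sorted = sorted((k, i) for i, k in enumerate(query_keys))
    local_rank = {i: pos for pos, (_, i) in enumerate(queries_sorted)}
    chain_sorted = sorted(chain_keys)

    dest = [global_rank[i] - local_rank[i] - 1 for i in range(len(query_keys))]
    return chain_sorted, dest
```

For example, with chain keys `[4.0, 1.0, 7.0]` and query keys `[5.0, 0.5, 7.0, 9.0]`, the second query has no predecessor and the others point at the sorted chain positions of 4.0, 7.0 and 7.0.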


Theorem: Algorithm Represent finds for each face in A the
representative faces in B in $O(\sqrt{n}\log n)$ time on an $O(\sqrt{n} \times \sqrt{n})$
mesh.


Proof: Step 2.3.1.1 needs only constant time. In step 2.3.1.2, steps
(a), (c) and (d) each take constant time and (b) needs $O(\sqrt{n})$ time
for the RAR. Thus step 2.3.1.2 uses $O(\sqrt{n})$ time to finish the
construction of the spherical Voronoi diagram. Step 2.3.1.3 involves
a sort and a RAR, each requiring $O(\sqrt{n})$ time. For step 2.3.1.4,
sorting in (a) and (b) each takes $O(\sqrt{n})$ time; the RAR in (c) takes
$O(\sqrt{n})$ time and (d) needs constant time only. Step 2.3.1.4 will be
executed for $\log n$ iterations, thereby requiring $O(\sqrt{n}\log n)$ time.
Thus the time complexity of algorithm Represent is $O(\sqrt{n}\log n)$.
The number of edges in a Voronoi diagram constructed for $n$
points is $O(n)$. If there are $k$ Voronoi arc chains of level $i$, each
with $O(n)$ arcs, there will be $2k$ Voronoi arc chains of level $i-1$,
each with $O(n/2)$ arcs. So we can always distribute the $O(n)$
vertices or arcs in the Voronoi chains together with $O(n)$ query points
on an $O(\sqrt{n} \times \sqrt{n})$ mesh, such that each PE has only a constant
number of arcs (vertices).
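The level and chain index assigned to a Voronoi arc in step 2.3.1.3 depend only on the binary indices of its two associated points. The sketch below reproduces the worked example given in that step. It assumes non-negative integer indices that differ, and it replaces the fixed-width two's complement mask followed by a division with an equivalent right shift; the function name is hypothetical.

```python
def level_and_chain(index_a, index_b):
    """Level and chain index of the arc associated with two distinct point indices.

    level = floor(log2(index_a XOR index_b)), the most significant differing bit.
    chain = the bits above that position, which both indices share.
    """
    psi = index_a ^ index_b            # "bit exclusive or" of the two indices
    level = psi.bit_length() - 1       # floor(log2(psi)) for psi > 0
    chain = index_a >> (level + 1)     # same as (mask AND index) / 2**(level + 1)
    return level, chain
```

For the points 0010 and 0101 this gives level 2 and chain index 0, matching the example in step 2.3.1.3.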


We have presented an algorithm to construct the Voronoi
diagram for a set of $n$ planar points on a $\sqrt{n} \times \sqrt{n}$ MCC with
constant storage per PE, and proved that its time performance is
$O(\sqrt{n}\log n)$.


III. Applications of the Voronoi Algorithm


The Voronoi diagram can be used to solve many geometrical 
problems such as finding the nearest neighbors between two sets 
and constructing the minimum spanning trees. 


Given two sets P and Q with $p$ and $q$ points respectively,
the problem of all nearest neighbors between two sets is to find for
each point in Q its nearest neighbor in P. To solve this problem
on an MCC, we first construct the Voronoi diagram for the points in P,
which takes $O(\sqrt{p}\log p)$ time. Then we perform point location for
each point in Q using the same method described in algorithm
Represent. The nearest neighbor of each point in Q can be found
in $O(\sqrt{p+q}\,\log p)$ time.

Given a set of points in the plane, its minimum spanning
tree is a subgraph of the Voronoi dual [2]. Based on a Voronoi
diagram, the Voronoi dual can be constructed in constant time
on the $\sqrt{n} \times \sqrt{n}$ mesh by joining each pair of points which are
associated with a Voronoi edge. The dual is a planar graph.
Using, for instance, the mesh algorithm proposed in [12], the
minimum spanning tree for a planar graph can be constructed in
$O(\sqrt{n}\log n)$ time. Thus the time complexity of the algorithm
solving the Euclidean minimum spanning tree problem on a $\sqrt{n} \times \sqrt{n}$
mesh is $O(\sqrt{n}\log n)$.


IV. Summary 


We have presented an algorithm for constructing the Voronoi
diagram of a set of planar points on a Mesh-Connected Computer
with a constant amount of storage per PE. Given a set of $n$
planar points distributed on an $O(\sqrt{n} \times \sqrt{n})$ Mesh-Connected
Computer, our algorithm can find the Voronoi diagram for the set in
$O(\sqrt{n}\log n)$ time.

We have also discussed some geometrical problems which 
can be solved on the Mesh-Connected Computer once the Voronoi 
diagram has been determined. The problem of determining the 
nearest neighbors between two sets and constructing Euclidean 
minimum spanning trees can also be solved with the same time 
complexity. 
Figure 3


Figure 4 (a), (b): circuit A and circuit B


Originally B is a submesh associated with the set of
edges stored in A. $E_i^A$, the edge with the middle index, finds
$p_j$, which divides mesh B into two submeshes, the white one
and the shaded one. Thus, the white submesh is associated
with set 1 and the shaded one with set 2. The dashed arrows
indicate the search in the next iteration.


Figure 5


Figure 6 (a), (b)

OPTIMAL ALGORITHMS FOR MESH-CONNECTED PARALLEL PROCESSORS WITH SERIAL MEMORIES*
Abstract -- We examine the problems of computing the
discrete Fourier transform and performing matrix multiplication
on an array of processors, where each processor consists
of an ALU, a constant ($\le 3$) number of storage registers, and
has a single serial memory attached to it. We show that the
discrete Fourier transform can be computed optimally on one
or two-dimensional arrays with $p$ processors in time
$\Theta(n + n^2/p^2)$ and $\Theta(n/\sqrt{p} + n^2/p^2)$ respectively.
For the problem of multiplying two $n \times n$ matrices, we show the
optimal bound of $\Theta(n^2 + n^4/p^2)$ for a linear array of $p$
processors, and a lower bound and an upper bound for a
$\sqrt{p} \times \sqrt{p}$ array of processors. We also settle the case of
multiplying a matrix by a vector. Optimal bounds are also
established for the corresponding bit model. Our lower bound
techniques use a combination of crossing sequence and
communication complexity arguments.


1. Introduction 


In an attempt to achieve the maximum possible speedups,
researchers have studied various architectures and examined
their suitability to solve different classes of problems.
Many schemes for specific problems have been developed and
have been shown to achieve optimal speeds on these architectures.
However, almost all of these designs require that the
number of processing elements must grow linearly with the
size of the problem, a condition that makes the hardware cost
too high.


We examine, in this paper, a simple array architecture
(one and two-dimensional) with a fixed number of processors
such that a serial memory is attached to each processor.
Serial memories appear in magnetic disk, magnetic bubble,
charge coupled devices, and even VLSI shift registers. Because
of their high information density and low cost, serial
memories seem to be quite attractive for use in applications
requiring the processing of a large amount of data. However,
their overall performance seems to be severely limited due to
the fact that the elements can be only accessed in one particular
order. A similar model has been studied by several
authors ([BL], [BW], [CLW], [CW], [MTW], [OJ], [TW], [WC],
[W]). In particular, it was shown in [OJ] that sorting $n$
numbers on a linear array of $p$ processors (comparators) can
be done in time $O(n + n^2/p^2)$, and in time $\Theta(n/\sqrt{p} + n^2/p^2)$ on
a $\sqrt{p} \times \sqrt{p}$ array of processors. These results show that the
performance of serial memories is comparable to that of random
access memories whenever $p$ is large enough. In this
paper, we consider numerical problems that have high computational
requirements. Assuming that each processor is an
ALU, we show that computing the discrete Fourier transform
on $n$ points requires $\Theta(n + n^2/p^2)$ steps in the word model,
and $\Theta(nq + n^2q^2/p^2)$ in the bit model, where each number
consists of $q$ bits. For the case of a $\sqrt{p} \times \sqrt{p}$ array of
processors, we show the optimal bounds of $\Theta(n/\sqrt{p} + n^2/p^2)$ and
$\Theta(nq/\sqrt{p} + n^2q^2/p^2)$ for the word and bit models respectively.
The lower bound techniques use a combination of crossing
sequence and communication complexity arguments. We also
consider the problem of multiplying two $n \times n$ matrices and
exhibit matching upper and lower bounds of $\Theta(n^2 + n^4/p^2)$
for a linear array of $p$ processors, together with a lower bound
and an upper bound for a $\sqrt{p} \times \sqrt{p}$ array of processors. The
problem of multiplying a matrix by a vector is also settled with
matching bounds.

* Supported by the U.S. Army Research Office, Contracts No.
GAAG29-82-K-0110 and DAAG29-83-K-0126.
† Current address: Department of Computer Science, Penn
State University, University Park, PA 16802.


The rest of the paper is organized as follows. The formal 
model is introduced in the next section, while the upper and 
lower bounds for the discrete Fourier transform are presented 


in section 3. The matrix multiplication problem is considered 


in section 4 and the last section is devoted to the problem of 
multiplying a matrix by a vector. 


2. The Model 


We will consider a network of $p$ interconnected processors,
where each processor $P_i$, $0 \le i < p$, consists of one ($R_i$
in the case of the discrete Fourier transform) or more ($A_i$, $B_i$,
and $C_i$ in the case of matrix multiplication) storage registers
and an ALU capable of adding, subtracting, and multiplying
numbers. A single $q$ bit number can be stored in each
storage register. Attached to each processor $P_i$ is a single
serial memory $S_i$. Each serial memory $S_i$ consists of a linearly
connected array of storage cells $C_{i,j}$, $0 \le j < l$. Either a
single $q$ bit number (the word model) or a bit (the bit model)
can be stored in each cell $C_{i,j}$. The length of each serial
memory $S_i$ is $l$.


When a serial memory $S_i$ is clocked, the content of cell
$C_{i,j}$, $1 \le j < l$, is transferred to cell $C_{i,j-1}$. Furthermore,
the input to serial memory $S_i$ is transferred to cell $C_{i,l-1}$ and,
hence, to the serial memory itself. The output of the serial
memory is the content of cell $C_{i,0}$.
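The clocking rule can be modelled with a double-ended queue. The sketch below is an illustrative model of a single serial memory $S_i$ only; the class name `SerialMemory` and its methods are assumptions, and the processor, its registers, and the interconnection are not modelled.

```python
from collections import deque


class SerialMemory:
    """A serial memory of length l: cells[0] is C_{i,0}, cells[-1] is C_{i,l-1}."""

    def __init__(self, contents):
        self.cells = deque(contents)

    def output(self):
        """The output of the memory is the content of cell C_{i,0}."""
        return self.cells[0]

    def clock(self, value):
        """Shift every cell one position toward C_{i,0} and load value into C_{i,l-1}.

        Returns the word that was in C_{i,0} before the shift.
        """
        out = self.cells.popleft()
        self.cells.append(value)
        return out
```

Feeding the output back as the input, as in `m.clock(m.output())`, recirculates the contents, so after $l$ clocks the memory returns to its starting state. This is why one full pass over a memory of length $l$ costs $l$ steps.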


Suppose processor $P_i$ is connected to processors $P_{i_j}$,
$0 \le j < n_i$. Then, at the beginning of each step, processor
$P_i$ has available to it the contents of the storage register(s) of
processors $P_{i_j}$, $0 \le j < n_i$, the contents of its own storage
register(s), and the output of its serial memory. At the end of
each step, each processor is expected to supply a number to
its serial memory and possibly change the contents of its
storage registers.


In this paper we will consider two particular interconnections.
In the first case, the processors are interconnected to
form a linear array of length $p$ (linear model). Figure 2.1
illustrates this configuration for $p = 4$ and $n = 24$.


P_0 -- C_{0,0} -- C_{0,1} -- C_{0,2} -- C_{0,3} -- C_{0,4} -- C_{0,5}
 |
P_1 -- C_{1,0} -- C_{1,1} -- C_{1,2} -- C_{1,3} -- C_{1,4} -- C_{1,5}
 |
P_2 -- C_{2,0} -- C_{2,1} -- C_{2,2} -- C_{2,3} -- C_{2,4} -- C_{2,5}
 |
P_3 -- C_{3,0} -- C_{3,1} -- C_{3,2} -- C_{3,3} -- C_{3,4} -- C_{3,5}


Figure 2.1 Linearly Connected Array.


In the second, the processors are interconnected to form a
$\sqrt{p} \times \sqrt{p}$ array (two-dimensional model).


Our model is consistent with most of the architectures 
which have used serial memories as the principal storage 
memory. In particular the model is consistent with most 
architectures which use rotating magnetic disks, charge cou- 
pled devices, bubble memories, and equal length VLSI shift 
registers. 


3. The Discrete Fourier Transform 


The Discrete Fourier Transform, DFT, of $n$ points $x_0$,
$x_1, \ldots, x_{n-1}$ is given by:

$$y_i = \sum_{j=0}^{n-1} w^{ij} x_j, \qquad 0 \le i < n, \qquad \langle DFT(n) \rangle$$

where $w$ is a primitive $n$'th root of unity. In matrix form,
$y = Wx$, where
$x = [x_i,\ 0 \le i < n]^T$, $y = [y_i,\ 0 \le i < n]^T$,
and $W = [w^{ij},\ 0 \le i,j < n]$.
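The definition can be evaluated directly as a matrix-vector product. The sketch below is a plain $O(n^2)$ sequential reference, not the array algorithm of this section; it assumes the particular primitive root $w = e^{-2\pi i/n}$, which is one common choice and is not fixed by the text above.

```python
import cmath


def dft(x):
    """Evaluate y_i = sum_j w**(i*j) * x_j for 0 <= i < n, with w = exp(-2*pi*i/n)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)   # assumed choice of primitive n'th root of unity
    return [sum(w ** (i * j) * x[j] for j in range(n)) for i in range(n)]
```

A reference of this kind is mainly useful for checking the output of a faster scheme on small inputs.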


In this section we will show that DFT(n) can be computed in time $O(n + n^2/p^2)$ on a linear array of p processors
and $\Theta(n/\sqrt{p} + n^2/p^2)$ on a $\sqrt{p} \times \sqrt{p}$ array of processors. In
both cases, a serial memory of length $l = n/p$ is attached to
each processor. To establish the upper bound for the case of
a linear array, we will do the following:

1) show how DFT(n) can be computed on a $p \times n/p$ array
of processors using O(n) steps; and

2) show how each step of the array can be emulated using
$O(n/p)$ steps on our linear model.


The fast Fourier transform (decimation in frequency scheme
[BB]) can be computed on a $p \times n/p$ array of processors in
time $O(p + n/p)$. In this case p and n are assumed to be
both powers of two. From [JO2], we see that this algorithm
consists of $\log n$ steps. Step t, $1 \le t \le \log p$, involves a
column shuffle of length $p/2^t$, the evaluation of expressions of
the form $a x + b y$ at each processor, and a column
unshuffle of length $p/2^t$. Step t, $\log p < t \le \log n$, involves a
row shuffle of length $n/2^t$, the evaluation of expressions of the
form $a x + b y$ at each processor, and a row unshuffle of
length $n/2^t$. From figure 3.1, we see that a column[row]
shuffle/unshuffle of two sequences of length $l$ can be per-
formed using $l - 1$ column[row] exchange operations in a rec-
tangular array.


Figure 3.1 Shuffling.


Hence, DFT(n) can be computed on a rectangular array
using $O(p + n/p)$ steps.
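The arithmetic of the decimation in frequency scheme can be sketched sequentially as follows. This is an illustration only: it does not model the array, the shuffles, or the serial memories, the root $w = e^{-2\pi i/n}$ is an assumed choice, and the name `fft_dif` is hypothetical. Each level of the recursion corresponds to one of the $\log n$ steps, and every value it produces has the form $a x + b y$.

```python
import cmath

def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    w = cmath.exp(-2j * cmath.pi / n)  # assumed primitive n-th root of unity
    # Butterflies pair elements that are n/2 apart.
    top = [x[j] + x[j + half] for j in range(half)]
    bottom = [(x[j] - x[j + half]) * w ** j for j in range(half)]
    out = [0] * n
    out[0::2] = fft_dif(top)      # even-indexed outputs y_0, y_2, ...
    out[1::2] = fft_dif(bottom)   # odd-indexed outputs y_1, y_3, ...
    return out
```

The pairing distance halves at every level, which is why the shuffle lengths in the array algorithm shrink geometrically from step to step.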


We now consider the emulation of a $p \times n/p$ rectangular
array on our model. To emulate the array, we use a linear
array of p processors. A shift register of length $n/p$ is
attached to each processor. In the emulation the contents of
storage cell $C_{i,j}$, $0 \le i < p$, $0 \le j < n/p$, represent the state
of processor $P_{i,j}$ of the array being emulated. From [OJ], we
see that each step on the rectangular array can be emulated
using $n/p$ steps on our linear model by shifting the contents
of all the shift registers through the processors once. Hence,
the discrete Fourier transform of n points can be computed
using $O((p + n/p)\, n/p) = O(n + n^2/p^2)$ steps on our linear
model.
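The step count of this emulation can be checked with a short calculation. The sketch below assumes all hidden constants are 1 and that p divides n; the function name is hypothetical.

```python
def linear_model_steps(n, p):
    """Steps to emulate the p x (n/p) array algorithm on the linear model."""
    assert n % p == 0, "p is assumed to divide n"
    array_steps = p + n // p       # steps on the rectangular array
    steps_per_array_step = n // p  # one full pass of each shift register
    return array_steps * steps_per_array_step

# (p + n/p) * (n/p) expands to n + n^2/p^2.
n, p = 1024, 32
assert linear_model_steps(n, p) == n + (n * n) // (p * p)
```

For n = 1024 and p = 32 both terms equal 1024, which is the point $p = \sqrt{n}$ where the two terms of the bound balance.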


The weights used in the algorithm can either be serially 
supplied to the processors or computed. Computing the 
weights requires doubling the length of the serial memories 
and adding another storage register to each processor ([JO2]). 


In the case of the two-dimensional array, it is straight-
forward to show a bound of $O((\sqrt{p} + n/p)\, n/p) = O(n/\sqrt{p} + n^2/p^2)$ steps using an emulation of an algorithm for a
$\sqrt{p} \times \sqrt{p} \times n/p$ three dimensional array of processors.


Note that if we assume that a RAM is attached to each
processor, we can obtain the bounds of $O(n)$ (linear array)
and $O(n/\sqrt{p})$ (two-dimensional array). Hence, if at least $\Omega(\sqrt{n})$ (linear array) or
$\Omega(n^{2/3})$ (two-dimensional array) processors are used, a RAM
based architecture is not superior to a serial memory based
architecture.
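The two processor thresholds can be illustrated numerically. The sketch assumes serial-memory bounds of $n + n^2/p^2$ (linear) and $n/\sqrt{p} + n^2/p^2$ (two-dimensional), RAM bounds of $n$ and $n/\sqrt{p}$, and all constants equal to 1; the function name is hypothetical.

```python
import math

def serial_over_ram(n, p, two_dimensional=False):
    """Ratio of the serial-memory bound to the RAM bound (constants = 1)."""
    ram = n / math.sqrt(p) if two_dimensional else n
    serial = ram + n * n / (p * p)
    return serial / ram

# At p = sqrt(n) (linear) and p = n^(2/3) (two-dimensional) the extra
# n^2/p^2 term exactly matches the RAM bound, so the ratio is 2.
ratio_linear = serial_over_ram(4096, 64)
ratio_2d = serial_over_ram(4096, 256, two_dimensional=True)
```

Beyond these thresholds the ratio stays below 2, so the two architectures differ by at most a constant factor.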


To establish the lower bounds, we will elaborate on the
basic model. We state precisely the conditions we need for
the proof.

(1) Each serial memory $S_i$ contains $n/p$ numbers, where $2p$
evenly divides n. The numbers are initially assumed to
be stored in a predetermined order independent of their
values.

(2) Each number consists of q bits and each step operates
on a number as a single entity.

(3) The serial memories are unidirectional.

(4) A general step of the algorithm at processor $P_i$ consists
of:

(a) computing two quantities $\alpha_i$ and $\beta_i$ that depend only
on the output of $S_i$, $R_i$, $R_{i-1}$, and $R_{i+1}$.
(b) updating $R_i$ with $\alpha_i$ and shifting $\beta_i$ into $S_i$.

(5) The steps of all the serial memories are synchronous.

(6) The output numbers will reside in the serial memories in
a predetermined order independent of their values.

We will call a cycle a set of $n/p$ consecutive shifts. Without
loss of generality, we assume that the output is generated
after precisely c cycles, i.e., after $c\, n/p$ steps, for some
positive integer c.

Theorem 3.1: Under assumptions (1) - (6), computing the
discrete Fourier transform of n points on p linearly con-
nected processors with serial memories requires $\Omega(n^2/p^2 + n)$
steps. For a $\sqrt{p} \times \sqrt{p}$ array of processors, we need at least
$\Omega(n^2/p^2 + n/\sqrt{p})$ steps.

Proof: With each set of input values, we associate two
sequences of p-tuples which we call cycle sequences. The first
sequence can be defined as the contents $\langle \alpha_0^{(t)}, \alpha_1^{(t)}, \ldots, \alpha_{p-1}^{(t)} \rangle$
of the R registers (of the processors) at the steps
$\frac{n}{2p} + t\,\frac{n}{p}$, $0 \le t < c$, where c is the number of cycles as
defined above. The second sequence is the contents $\langle \beta_0^{(t)}, \beta_1^{(t)}, \ldots, \beta_{p-1}^{(t)} \rangle$
of the R registers at steps $t\,\frac{n}{p}$, $1 \le t \le c$.
Note that the number of possible distinct cycle sequences is
$2^{2cpq}$. We will next establish a lower bound on this quantity.


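Recall that the n-point discrete Fourier transform of x
is given by

$$y = Wx$$

Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ be such that $x_1$ is the length $n/2$ vector
corresponding to the contents of the lower half of the serial
memories and $x_2$ is the length $n/2$ vector corresponding to the
contents of the upper half of the serial memories. Similarly,
we can define the two vectors (again each of length $n/2$) $y_1$ and
$y_2$ which form the output vector. We can reorder the rows
and columns of W to obtain a matrix $\tilde{W}$ such that

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \tilde{W} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} W_{1,1} & W_{1,2} \\ W_{2,1} & W_{2,2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ and $\bar{x} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix}$ be two input vectors which
have the same cycle sequences. Let $y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$ and
$\bar{y} = \begin{bmatrix} \bar{y}_1 \\ \bar{y}_2 \end{bmatrix}$ be the corresponding output vectors. Let's exam-
ine the output vector corresponding to the input vector $\begin{bmatrix} x_1 \\ \bar{x}_2 \end{bmatrix}$.

For the first $n/(2p)$ steps, the algorithm works exactly as in
the case of the input $x$. At the beginning of the $(\frac{n}{2p} + 1)$'th
step, the situation is identical to the beginning of the
$(\frac{n}{2p} + 1)$'th step on input $\bar{x}$ since $x$ and $\bar{x}$ have the same cycle
sequences. Hence, the algorithm behaves exactly as in the
case of the input $\bar{x}$ for the next $n/(2p)$ steps. Again, at the
$(\frac{n}{p} + 1)$'th step, the situation is identical to that of input $x$
at the $(\frac{n}{p} + 1)$'th step, and so on. Therefore, the output
will be $\begin{bmatrix} y_1 \\ \bar{y}_2 \end{bmatrix}$, i.e., we have

$$\begin{bmatrix} y_1 \\ \bar{y}_2 \end{bmatrix} = \begin{bmatrix} W_{1,1} & W_{1,2} \\ W_{2,1} & W_{2,2} \end{bmatrix} \begin{bmatrix} x_1 \\ \bar{x}_2 \end{bmatrix}$$

But this implies that

$$W_{1,2}(x_2 - \bar{x}_2) = 0, \qquad W_{2,1}(x_1 - \bar{x}_1) = 0 \qquad (*)$$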
Let rank($W_{1,2}$) = $\mu$ and rank($W_{2,1}$) = $\nu$. We know that
(TT) $\mu + \nu \ge n/2$ regardless of how W is partitioned.
Hence, range$\{W_{1,2}u\}$ has $2^{q\mu}$ distinct points, say $\{W_{1,2}u_1,
W_{1,2}u_2, \ldots, W_{1,2}u_{2^{q\mu}}\}$ and range$\{W_{2,1}v\}$ has $2^{q\nu}$
distinct points, say $\{W_{2,1}v_1, W_{2,1}v_2, \ldots, W_{2,1}v_{2^{q\nu}}\}$.
Consider the input vectors $\begin{bmatrix} v_i \\ u_j \end{bmatrix}$ and $\begin{bmatrix} v_k \\ u_l \end{bmatrix}$ such that
$(i,j) \ne (k,l)$. In this case (*) does not hold. Hence any
two such vectors must have two distinct cycle sequences. The
number of such vectors is at least $2^{nq/2}$. Therefore we must
have

$$2^{2cpq} \ge 2^{nq/2}$$

Hence, $c \ge \frac{n}{4p}$. From this it follows that $T = \Omega(n^2/p^2)$.
We now show that in the case of a linear array $\Omega(n)$
steps are required no matter how large p is. Let's look at the
link between processor $P_{p/2}$ and processor $P_{p/2+1}$. Half the
input numbers are to the left of that link and the rest are to
the right. Computing DFT(n) requires $\Omega(n)$ communication
between these two halves. Therefore, $\Omega(n)$ steps are required.
The same argument will give $\Omega(n/\sqrt{p})$ for a $\sqrt{p} \times \sqrt{p}$ array
of processors. □

Note that if we assume that RAM's are attached to the
processors, we get the lower bounds of $\Omega(n)$ and $\Omega(n/\sqrt{p})$ for
the linear and two-dimensional arrays respectively.


We can also establish similar lower bounds even if we
assume that each number need not be treated as an atomic
unit. We modify assumptions (1), (2), and (6) as follows.

(1') Each serial memory $S_i$ contains $nq/p$ bits of the input,
where $2p$ evenly divides $nq$. The bits are stored in a
predetermined order independent of their values.

(2') Each step operates on single bits.

(6') The bits of the output will reside in the serial memories
in a predetermined order independent of their values.
The bits of an output number are stored in consecutive
locations in a single serial memory.

We now define a cycle to be a set of $nq/p$ consecutive steps.

Theorem 3.2: Under the modified assumptions, computing the
discrete Fourier transform of n points, each consisting of q
bits, on p linearly connected processors with serial memories
requires $\Omega(n^2q^2/p^2 + nq)$ steps. Under the same assumptions,
we have the lower bound $\Omega(n^2q^2/p^2 + nq/\sqrt{p})$ for a $\sqrt{p} \times \sqrt{p}$
array of processors.


Proof: The proof is similar to that of theorem 3.1 with the fol-
lowing modifications. Partition the elements of x into two
subsets L and U such that |U|, |L| > 7 and 2;€L


(z;€ U) implies that at least = of the bits of z; are initially 


stored in the lower (upper) half of the serial memories. 


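Now let $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ be such that $x_1$ and $x_2$ are the vec-
tors corresponding to the elements in L and U respectively.
Let $y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$ be defined as before. We thus have

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} W_{1,1} & W_{1,2} \\ W_{2,1} & W_{2,2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

after reordering the rows and columns of the matrix W.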
Notice that the bits of x, are not necessarily stored initially in 
the lower half of the serial memories. However, we know that 


at least = bits of each element of x, are initially stored in the 


lower half of the serial memories. Consider all the possible 
input values for which all the bits of any element of x,, not in 
the lower half of the serial memories, are set to zero, and all 
the bits of any element of xg, not in the upper half, are set to 
ng 

zero. The number of such input values is at least 24. We 
now follow the same argument as in theorem 3.1 in identify- 
ing those values which yield identical cycle sequences. 


Notice that if rank($W_{2,1}$) = $\mu$, then there exist non-
singular matrices P and Q such that

$$P\, W_{2,1}\, Q = \mathrm{diag}\{1, 1, \ldots, 0\},$$

where the number of 1's is equal to $\mu$. Hence, if u and v have
different first $\mu$ entries, then

$$P\, W_{2,1}\, Q u \ne P\, W_{2,1}\, Q v$$

But this implies that $W_{2,1}(Qu) \ne W_{2,1}(Qv)$ and thus
HG 


=e OO; 


range (W,,,u) contains at least 2 * points in this case. 


4. Matrix Matrix Multiplication 


The product of two $n \times n$ matrices
$A = [a_{i,j},\ 0 \le i, j < n]$ and $B = [b_{i,j},\ 0 \le i, j < n]$
is given by:

$$C = \Big[ c_{i,j} = \sum_{k=0}^{n-1} a_{i,k}\, b_{k,j},\ 0 \le i, j < n \Big]$$


In this section we will show that this product can be
computed in time $\Theta(n^2 + n^4/p^2)$ on a linear array of p pro-
cessors and in time $O(n^3/p + n^4/p^2)$ on a $\sqrt{p} \times \sqrt{p}$ array of
processors. In both cases, a serial memory of length $l = n^2/p$
is attached to each processor. To establish the upper bound
for the case of a linear array, one can do the following:

1) present an algorithm which computes the product of two
$n \times n$ matrices on an $n \times n$ cyclic square array (an
array with cyclic end around connections) using O(n)
steps;

2) show how each step on the square array can be emulated
using $O(p/n + n/p)$ steps on a $p \times l$ cyclic rectangular
array;

3) show how each step on the rectangular cyclic array can
be emulated using O(1) steps on a rectangular noncyclic
array; and

4) show how each step on a rectangular noncyclic array can
be emulated using $O(n^2/p)$ steps on our model.

We leave the details of steps 1-4 to the reader.
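Since the details are left to the reader, step 1 can be illustrated with a sequential simulation of Cannon's algorithm, a standard method for a cyclic square array. It is used here as an assumed stand-in and is not necessarily the authors' algorithm; the function name is hypothetical.

```python
def cannon_multiply(A, B):
    """Simulate n multiply-accumulate rounds on an n x n cyclic array."""
    n = len(A)
    # Initial skew: row i of A is rotated left by i, column j of B up by j.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every cell multiplies its two resident operands and accumulates.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Cyclic shifts using the end-around connections:
        # A moves one cell left, B moves one cell up.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c

print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# prints [[19, 22], [43, 50]]
```

On a real array the initial skew itself takes at most n - 1 shift steps, so the whole computation remains O(n) steps.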


We now establish the lower bounds. 


Theorem 4.1: Under the assumptions (1)-(6) of section 3, mul-
tiplying two $n \times n$ matrices on p linearly connected proces-
sors with serial memories requires $\Omega(n^2 + n^4/p^2)$ steps. For a
$\sqrt{p} \times \sqrt{p}$ array of processors, we have the lower bound of


Proof: We actually prove a slightly stronger result by estab-
lishing the lower bounds for the problem of matrix squaring.
The main strategy is similar to that of theorem 3.1. We show
that the number of cycle sequences must be large.

Let X be an $n \times n$ matrix such that each entry consists
of q bits and let $Y = X^2$. Let $X = X_l + X_u$, where $X_l$ and
$X_u$ correspond to the entries initially stored in the lower and
upper halves respectively of the serial memories. Suppose that
$\bar{X} = \bar{X}_l + \bar{X}_u$ generates the same set of cycle sequences as
X. If $Y_l$ and $Y_u$ denote the output arrays corresponding
respectively to the lower and upper halves of the serial
memories, then it is easy to see that we have the following:

$$Y_l + Y_u = (X_l + X_u)^2 = X_l^2 + X_u^2 + X_l X_u + X_u X_l$$
$$\bar{Y}_l + \bar{Y}_u = (\bar{X}_l + \bar{X}_u)^2 = \bar{X}_l^2 + \bar{X}_u^2 + \bar{X}_l \bar{X}_u + \bar{X}_u \bar{X}_l$$
$$Y_l + \bar{Y}_u = (X_l + \bar{X}_u)^2 = X_l^2 + \bar{X}_u^2 + X_l \bar{X}_u + \bar{X}_u X_l$$
$$\bar{Y}_l + Y_u = (\bar{X}_l + X_u)^2 = \bar{X}_l^2 + X_u^2 + \bar{X}_l X_u + X_u \bar{X}_l$$

It is easy to deduce from the above equations the following
identity:

$$(X_l - \bar{X}_l)(X_u - \bar{X}_u) + (X_u - \bar{X}_u)(X_l - \bar{X}_l) = 0 \qquad (*)$$

But in [JK] it was shown that $\Omega(2^{\alpha n^2 q})$ matrices exist such that no
pair of them satisfies (*), where $\alpha > 0$ is some constant.
Therefore the number of cycle sequences must be at least as
large, i.e.,

$$2^{2cpq} \ge 2^{\alpha n^2 q}$$

and hence $T = \Omega(n^4/p^2)$. The rest of the proof follows as
before.

Note that as in the case of theorem 3.2, we can establish
similar results for the bit model. Moreover, it is worth men-
tioning that, for the case of a $\sqrt{p} \times \sqrt{p}$ array of processors,
the bounds are optimal if $p < n$ or $p = \Omega(n^2)$. On the other
hand, if $1 < p < n^2$ and only the semiring operations of addi-
tion and multiplication are allowed, then some processor will
have to perform $\Omega(n^3/p)$ operations and hence we have the
lower bound of $\Omega(n^3/p + n^4/p^2)$ which matches the upper
bound.

5. Matrix Vector Multiplication

The product of an $n \times n$ matrix
$A = [a_{i,j},\ 0 \le i, j < n]$ by an n vector
$b = [b_i,\ 0 \le i < n]^T$ is given by:

$$c = \Big[ c_i = \sum_{j=0}^{n-1} a_{i,j}\, b_j,\ 0 \le i < n \Big]^T$$

In this section we will show that this product can be
computed in time $O(p/n + n^2/p)$ on a linear array of p proces-
sors and $O(\sqrt{p}/n + n^2/p)$ on a $\sqrt{p} \times \sqrt{p}$ array of proces-
sors. In both cases, a serial memory of length $l = n^2/p$ is
attached to each processor. To establish the upper bound, we
will do the following:

1) show how this product can be computed on a $p \times l$
array of processors using O(n) steps; and
2) show how this product can be computed on our linear
model using $O(p/n + n^2/p)$ steps.


We will first consider the case where p = n. The algorithm
for an $n \times n$ array begins with the storage registers $A_{i,j}$, $B_{i,j}$
and $C_{i,j}$ of processor $P_{i,j}$, $0 \le i, j < n$, containing $a_{i,j}$, $b_j$,
and 0 respectively. Note that the elements of b are repli-
cated. Figure 5.1 illustrates this initial configuration.

A storage registers

    a_{0,0}     a_{0,1}     ...   a_{0,n-1}
    a_{1,0}     a_{1,1}     ...   a_{1,n-1}
    ...
    a_{n-1,0}   a_{n-1,1}   ...   a_{n-1,n-1}

B storage registers

    b_0   b_1   ...   b_{n-1}
    b_0   b_1   ...   b_{n-1}
    ...
    b_0   b_1   ...   b_{n-1}

C storage registers (contents shift rightward along each row)

    0 -> 0 -> ... -> 0
    0 -> 0 -> ... -> 0
    ...
    0 -> 0 -> ... -> 0

Figure 5.1 Initial Configuration for p = n.

The algorithm consists of n steps. At the k'th step,

1) the contents of the C storage registers are shifted right-
ward as indicated in figure 5.1, i.e., the contents of
$C_{i,k-1}$, $0 \le i < n$, $1 \le k < n$, or 0, $0 \le i < n$,
$k = 0$, are transferred to $C_{i,k}$; and
2) the content of $C_{i,k}$, $0 \le i < n$, is replaced by
$A_{i,k} B_{i,k} + C_{i,k}$.

It is clear that this algorithm can be emulated on our model
in only O(n) steps.
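A sequential simulation of the p = n algorithm, with the A, B and C registers held as explicit $n \times n$ tables. This is an illustrative sketch; the function and variable names are hypothetical.

```python
def matvec_on_array(A, b):
    """Simulate the n-step register algorithm; returns the vector A*b."""
    n = len(A)
    a_reg = [list(row) for row in A]     # A_{i,j} holds a_{i,j}
    b_reg = [list(b) for _ in range(n)]  # B_{i,j} holds b_j (b is replicated)
    c_reg = [[0] * n for _ in range(n)]  # C_{i,j} starts at 0
    for k in range(n):
        for i in range(n):
            # 1) rightward shift into column k (a 0 enters when k = 0)
            shifted_in = c_reg[i][k - 1] if k > 0 else 0
            # 2) multiply-accumulate in column k
            c_reg[i][k] = a_reg[i][k] * b_reg[i][k] + shifted_in
    # After step n-1 the results sit in the rightmost column.
    return [c_reg[i][n - 1] for i in range(n)]

print(matvec_on_array([[1, 2], [3, 4]], [5, 6]))
# prints [17, 39]
```

At step k only column k is active, so the partial sum of row i sweeps across the row and picks up one product per step.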


We now consider the case where p < n. In this case, we 
assume that p evenly divides n. The algorithm fora p X n 
array begins with the storage registers A, j B, j and C, 3 of 
processor Pyj, Oo<1t,7 <n containing 4; j, b;, and O 
respectively, where i—|i2| and j=j +1 mod (i ,—~). The 

n p 


2 
algorithm consists of — steps. At the k’th step, 
Pp 


1) the contents of Cyy_,, 0 <1 <n, 0A mod(k,n) or 


0,0<1 <n, 0= mod(k,n) are transferred to Cy,; 


and 


2) the content of Cyy, O< i <n, is replaced by 


Ayx Baa + Ox. 
It is clear that this algorithm can be emulated on our model 


n? 
in only O(—-) steps. 
Pp 


We now consider the case where p > n. In this case, we 


assume that n evenly divides p. The algorithm fora p Xn 
array begins with the storage registers A, * B, yj and C, j of 


processor Py;, 0o<1,j <n containing Gj b;, 


respectively, where i=|i7] and j=j +l mod G2). 
, n . P 
Clearly, in this case, the product can be computed in
$O(n^2/p + p/n)$ steps. Taking all the above facts together, we
get that the product of a matrix by a vector can be computed
in $O(p/n + n^2/p)$ steps on our linear model.

The algorithm for a $\sqrt{p} \times \sqrt{p}$ array follows from that
of the linear array. The difference being that the product can
be computed in only $O(n^2/p + \sqrt{p}/n)$ steps instead of
$O(n^2/p + p/n)$ steps, if p > n.


and O 




It is obvious that these bounds are optimal even if
RAM's are used.
Abstract


Programming for parallel architectures often includes 
the specification of a process embedding in which logical 
processes and their interconnections are mapped onto 
physical processor elements and their connecting links, 
respectively. Few parallel programming environments 
provide assistance in performing this embedding. We 
are investigating the use of grammar-based descriptions 
of embeddings and we report here on a specific grammar 
that we have developed for generating embeddings of ar- 
bitrarily large, complete binary trees in square processor 
arrays. The embedding has an efficient processor utiliza- 
tion and it is easily automated. Further, if the number 
of logical processes exceeds the number of physical pro- 
cessors, our embedding has a simple contraction which 
yields an optimal, near-uniform distribution of computa- 
tion and communication tasks. 


Introduction 


Within the next decade, parallel architectures composed of 
a thousand or more processing elements will be commercially 
available. As the number of processors grows, however, their 
programming will become an increasingly cumbersome task.
This is especially apparent in the activity of program embed- 
ding in which logical processes and their interconnections are 
mapped onto physical processing arrays. Aspects of this map- 
ping problem have received significant amounts of attention, 
yet current parallel programming environments have been able 
to provide only limited support. We are investigating the use 
of grammar-based graph descriptions in the development of a 
strategy for program embedding, and we report here on the 
characteristics of a specific embedding that we have found for 
mapping arbitrarily deep, complete binary trees in square grid 
processor arrays. 

Previous embeddings of binary trees into rectangular arrays
have been of two types: hand embeddings and recursive embed-
dings. Hand embeddings[8] have optimal processor utilization
but they will not be feasible for the huge numbers of process-
ing elements that we anticipate in future machines. Recursive
embeddings[4]-[5] can be automated but they have so far failed
to achieve the efficiency of hand embeddings. The Hyper-H
embedding, for example, is a useful method for laying out the
tree for VLSI but its processor utilization is not optimal. Our
strategy, based on specialized shape grammars[6], automatically
generates optimal embeddings for arbitrarily large trees, even
in the case where the number of logical nodes exceeds the num-
ber of available processors.


Much of this work was supported by the Office of Naval Research un-
der contract N000014-84-K-0647. Duane Bailey was supported under an
American Electronics Association Fellowship from ComputerVision. 
In the next section, we describe the embedding of a 2n-
level complete binary tree in a $2^n \times 2^n$ processor array with an
interconnection network of the CHiP computer[8]. In the following
section, we describe the mapping of such a tree onto a smaller
grid. And finally, in the last section, we draw some conclusions
on extending this program embedding strategy to other regular
structures.


The Tree Embedding 


In this section we describe a technique for performing tree
embeddings for the CHiP architecture[8]. A CHiP machine
(Figure 1) consists of a $2^n \times 2^n$ array of processors with lo-
cal memory. Corridors of switches are located horizontally and
vertically between rows of processor elements. We require that
each switch have degree 8 and a crossover level of 3, allowing it
to host three simultaneous communication paths.

Our tree embedding technique uses a two-phase shape gram-
mar¹ to determine the node assignments and switch settings. In
the first phase, processes are embedded into the processor array. 
For this we use a type of parallel shape grammar, called a tem- 
plate grammar, in which derivations are constrained to rewrite 
the entire sentential shape at each step; that is, the left-hand 
sides of the set of productions applied must be a minimal set 
covering the sentential form. In the second phase, communica- 
tion channels are routed. For this we use a type of sequential 
shape grammar, called a ruler grammar, in which scaling trans- 
formations are not allowed. 

The template grammar for the first phase of our tree em- 
bedding is shown in Figure 2. The letters that appear on the 
shapes are not actually part of the shapes - they are simply 
coordinate labels; non-terminal shapes contain an arrow, while 
terminal shapes do not. A sample derivation is shown in Fig- 
ure 3. Because the entire shape must be rewritten in every 
step, all successful derivations in this grammar have the same 
form: first odd numbered productions are applied for an arbi- 
trary number of steps and then even numbered productions are 
applied for a single step. The very first production divides the 
processor array into quadrants. The root of the tree will be lo-
cated in the quadrant labeled P and its left and right children
will be located in the quadrants labeled L and R respectively. 
The single unassigned, “orphan processor” will be located in 
the upper right quadrant, labeled O. To embed the remainder 
of the tree, the array is recursively subdivided into quadrants 
which will host lower level subtrees. In the final step, processor 


¹A shape grammar operates on shapes in much the same way that a
conventional grammar operates on strings: the portion of a sentential form
that matches the left-hand side of a production is replaced by its right-hand
side. Unlike string grammars, however, the productions of a shape grammar
may undergo arbitrary Euclidean transformations before matching [6].


nodes and “buds” for their communication channels are laid 
down, all of the nonterminal shapes are removed, and we are 
left with a start shape for the second phase of the grammar. 

A ruler grammar for the channel embedding phase is given 
in Figure 4. In this case, the letters are again used for la- 
bels but their positioning is significant. The grammar works 
by “growing” the channels from the buds left by the first phase in
such a way that no two channels share a common wire. Fig- 
ure 5 shows several steps in a sample derivation. Derivations 
which erroneously create channels that are not part of our em- 
bedding will fail to rewrite all non-terminal markers[1]. Final 
embeddings produced by our grammar for a 63 node tree in an 
8 x 8 grid and a 255 node tree in a 16 x 16 grid are shown in 
Figure 1. Note the symmetry of the embeddings and that the 
8 x 8 embedding, rotated by 180°, is found in the upper right 
quadrant of the 16 x 16 embedding. 


Contractions Induced by Tree Embeddings 


When the number of logical processes exceeds the number of 
physical processors, it is necessary to contract the logical struc- 
ture to fit the array. The contraction must map sets of processes 
to processors while preserving their connectivity. Fishburn and
Finkel[3] have suggested using quotient maps to generate equiv-
alence classes of processes which may be multitasked on a pro-
cessor. If the cardinality of each equivalence set is the same,
the contraction is said to be computationally uniform. A quo-
tient map also induces a set of equivalence classes on the logical

communication channels and, if every physical channel emulates 
the same number of logical channels, it is said to be exchange 
uniform. If a contraction is exchange and computationally uni- 
form, it is said to be totally uniform. 

Berman and Snyder[2] have suggested the quotient map 
shown in Figure 6 in which the depth of the tree has been 
reduced by one, the left and right subtrees of the root have 
been identified, and the root has been grouped with its two 
children. Further contractions would be accomplished by iter-
ating this procedure. Repeated coalescing of the root, however,
makes this contraction unattractive: it is not computationally 
uniform because the equivalence class of the root must always 
contain nearly twice the number of processes found in any other 
class.
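The growth of the root's equivalence class can be tallied with a short sketch. It assumes one reading of the contraction: each iteration merges every non-root class with its mirror image in the other subtree, and the root class absorbs its two child classes. The function name is hypothetical.

```python
def contracted_class_sizes(iterations):
    """Return (root class size, size of every other class)."""
    root, other = 1, 1  # uncontracted tree: one process per class
    for _ in range(iterations):
        # The root is grouped with its two children; mirror-image nodes
        # of the left and right subtrees are identified.
        root, other = root + 2 * other, 2 * other
    return root, other

for k in range(1, 4):
    print(k, contracted_class_sizes(k))
```

Under this reading the root class holds $2^{k+1} - 1$ processes after k iterations while every other class holds $2^k$, which is the "nearly twice" imbalance.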

Quotient maps for grids, however, contract more uniformly. 
Figure 7 details such a contraction. If the grid is folded like a 
napkin — horizontally and vertically — groups of four nodes are 
identified to form a smaller grid of equivalence classes. This 
contraction is totally uniform. It might be expected, therefore, 
that embeddings of trees that take advantage of grid contrac- 
tions may yield more uniform distribution of computation and 
communication. We have found this to be true. 
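The napkin fold can be written as a coordinate map. The sketch below assumes a square grid with an even side length and zero-based coordinates; the function names are hypothetical.

```python
from collections import Counter

def fold(i, j, size):
    """Class of cell (i, j) after one horizontal and one vertical fold."""
    return (min(i, size - 1 - i), min(j, size - 1 - j))

def class_sizes_after_fold(size):
    """Count how many grid cells land in each equivalence class."""
    return Counter(fold(i, j, size) for i in range(size) for j in range(size))

sizes = class_sizes_after_fold(8)
```

For an 8 x 8 grid this yields 16 classes of exactly four cells each, arranged as a 4 x 4 grid, so the contraction is computationally uniform.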

Any embedding of a contracted binary tree in a square grid 
can off-load extra processes from the root to the unused pro- 
cessor, provided it is possible to route a channel between them. 
Our embedding assures that the orphan processor is always lo- 
cated horizontally opposite the root, easily accessible to such 
a channel. We call a complete binary tree in which an extra 
process has been attached to the root in this way an augmented 
tree. 

Extending our grammar to augmented trees in the obvious
way produces an embedding with quadrants that are identical
under vertical and horizontal flips and can thus be folded into a
contraction. This contraction identifies nodes and channels in
much the same way as Berman's quotient map except in its use of
the extra processor. It off-loads nearly half the processes that
would otherwise be located at the root, making it exchange uni-
form and as computationally uniform as possible. Our strategy
of first mapping the tree onto a large logical processor grid,
and then contracting it to an augmented tree embedding of the
correct size is conceptually distinct from the contract-then-map
paradigm and yields a more uniform distribution of the com-
putation and communication.


Conclusions 


Because shape-related information is often important to the 
embedding process, we have investigated the use of shape gram- 
mars for describing layouts of regular process structures. The 
strategy we have developed first recursively assigns processes 
to localities and then routes their interconnections. Using it, 
we have been able to create efficient embeddings of arbitrarily 
large, complete binary trees in a square grid. We expect that 
this strategy will be useful in creating layouts for other regular 
structures and we are now concentrating on the development 
of tools to automate some aspects of these grammatical de- 
scriptions. In addition, we have found that our tree embedding 
permits contractions induced by quotient maps of square grids, 
resulting in a uniform distribution of tasks and communication. 
We expect that this embed-then-contract approach will also be 
useful in producing uniform maps of other recursively described 
structures and we are investigating the extent to which it too 
can be automated. 
Figure 1: Tree embeddings in the CHiP processor array. Large squares are processors, and
small squares are switches.


Figure 2: The process embedding grammar. Note that the start shape (left side of produc-
tion 1) may be scaled before the derivation begins.
Abstract - An important consideration while designing
a large-scale parallel processing system is whether it provides 
adequate bandwidth to sustain its communication require- 
ments. We show that for a major class of topologies, if global 
communication is allowed, the communication requirements 
increase more rapidly than the bandwidth with increasing 
number of PEs. We explore the strategy of restricting the 
communication neighborhood of PEs. We give a method for 
maximizing the neighborhood while satisfying the bandwidth 
constraints for the point-to-point topologies and bus topolo- 
gies, and demonstrate the advantages of bus topologies. 
1 Introduction 


Massive parallel processing systems, with tens of
thousands of processors, are feasible with current and foresee-
able hardware technologies and costs. An important concern
while designing such systems is their scalability. A system is
scalable if one can add an arbitrary amount of hardware and
get a corresponding increase in performance.

In this paper, we first prove that any multi-processor
interconnection scheme that allows communication between
arbitrary pairs of PEs cannot be scalable. A possible solution is
to restrict the maximum communication distance of each PE.
For many reasons, including the desirability of uniform load
distribution, we wish to maximize the number of PEs reachable
in that distance from an individual PE. The problem is that of
maximizing this neighborhood without sacrificing scalability.

For point-to-point topologies, this problem can be solved
by choosing the largest communication distance that is con-
sistent with the bandwidth requirements. Bus-based topolo-
gies are more interesting in this context because they provide
an additional degree of freedom for the designer: namely, the
number of PEs connected by each bus.


In the next section, we show why global-communication 
schemes are not scalable. Section 3 enunciates the strategy of 
restricted communication neighborhoods. In Section 4, we 
apply this strategy to point-to-point topologies. Section 5 
develops a method for determining the optimal communication 
neighborhoods for bus topologies, and applies it to some specific
topologies. Section 6 summarizes the results. 

2 Global Communication is Not Scalable 


Theorem: Any topology that connects a number of PEs
with uniformly distributed communication requirements without
using additional switching elements is not arbitrarily scalable.

‘Uniformly distributed communication requirements’ 
means that both the source and destination of the messages are 
found with equal probability from any of the PEs of the 
system. The intuitive idea behind the proof is simple: as the 
number of PEs in a network increases, so does the number of
communication channels, thus creating a balance between mes- 
sage data generated per unit time and the total communication 
bandwidth. But each message now has to traverse more chan- 
nels than before, and so the communication requirement even- 
tually catches up with the available bandwidth. 


0190-391 8/86/0000/0823 $01.00 © 1986 IEEE 


Let $B_a$ and $B_r$ be the available and the required
bandwidth of a network as a whole, respectively. The available
bandwidth can be written as:

$$B_a = \mathrm{channels}(p)\, B \qquad (1)$$

where B is the bandwidth of an individual communication
channel and channels(p) is the number of communication
channels available in a topology with p PEs. Each PE must
have connection to a bounded number of channels to be scal-
able. Let that bound be c. Assuming each channel connects
w PEs ($w \ge 2$),

$$\mathrm{channels}(p) = pc/w \qquad (2)$$

Then,

$$B_a = Bpc/w \qquad (3)$$

The required bandwidth depends on two factors, the mes-
sage data generated by each PE, say m bytes/sec, and the
average number of channels that each message has to travel,
say hops(p).

$$B_r = m\, p\, \mathrm{hops}(p) \qquad (4)$$

Given that the number of connections per PE is bounded,
the average number of hops must increase with p. To ensure
that the system always provides the required bandwidth, we
must have $B_a \ge B_r$. However:

$$B_a \ge B_r \;\Rightarrow\; Bpc/w \ge m\, p\, \mathrm{hops}(p) \;\Rightarrow\; Bc/(mw) \ge \mathrm{hops}(p)$$

The left hand side is constant, while the right hand side
increases with p, and so there exists a p beyond which this
requirement cannot be satisfied. □
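The inequality $Bc/(mw) \ge \mathrm{hops}(p)$ can be explored numerically. The sketch below is illustrative only: the parameter values are invented, the search is restricted to powers of two, and the hop model $\mathrm{hops}(p) = \sqrt{p}$ is an assumed stand-in for a grid-like topology. The function name is hypothetical.

```python
import math

def max_scalable_p(B, c, w, m, hops, limit=10**7):
    """Largest power of two p <= limit with available >= required bandwidth."""
    best = None
    p = 2
    while p <= limit:
        available = B * p * c / w   # B_a = B p c / w
        required = m * p * hops(p)  # B_r = m p hops(p)
        if available >= required:
            best = p
        p *= 2
    return best

# Hypothetical numbers: Bc/(mw) = 100*4/(10*2) = 20, so sqrt(p) <= 20.
largest = max_scalable_p(B=100, c=4, w=2, m=10, hops=math.sqrt)
print(largest)
# prints 256
```

Raising B or c moves the crossover to a larger p but cannot remove it as long as hops(p) keeps growing.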


The preceding theorem has to be understood in proper 
perspective. It does not preclude global communication alto- 
gether: The constant factors may require an unreasonably 
large number of PEs before the communication stagnation 
occurs. Communication strategies may bias the probability 
distribution of the PEs involved in a message towards shorter 
paths so that even though the length of the longest path 
increases with p, the average path length does not. Decomposition 
strategies may increase the grain-size along with the 
number of PEs, thus reducing the overall communication need. 


The result applies to a variety of topologies including the 
shuffle-exchange, the cube-connected cycles (CCC), and the 
grid of PEs. For example, in CCC, each PE is connected to 3 
channels, and so the number of channels is (3/2)p. So if we 
have designed a CCC system with 384 PEs ($6 \cdot 2^6$) with 
well-matched processor speed and communication bandwidth, then 
the theorem guarantees that for the next possible CCC, which 
has 896 PEs, communication stagnation is bound to occur. 
The boolean hypercube escapes this fate because the number of 
channels in it increases as p log p while hops(p) is at most 
log p. However, it achieves this by increasing the number of 
connections per PE: there are log p connections/PE. This is 
inconsistent with our scalability requirement. 
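
The sizes quoted above follow from a CCC of dimension n having n * 2^n PEs. A small sketch of the arithmetic (the function names are illustrative only):

```python
import math

def ccc_size(n):
    """Number of PEs in a cube-connected-cycles network of dimension n."""
    return n * 2 ** n

def ccc_channels(n):
    """Each PE has 3 links and each link joins 2 PEs, giving (3/2)p channels."""
    return 3 * ccc_size(n) // 2

def hypercube_degree(p):
    """Connections per PE in a boolean hypercube with p = 2**k PEs."""
    return int(math.log2(p))
```

Here ccc_size(6) and ccc_size(7) give the 384 and 896 PEs mentioned in the text, while hypercube_degree grows with p, which is what the scalability requirement rules out.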


The result does not apply to systems with switching net- 
works because the channels within the switching network pro- 
vide the required bandwidth. For example, the omega network 
has p log p communication channels. In this paper we confine 
ourselves to topologies without additional switching elements. 


3 Restricted Communication Neighborhoods 


The theorem of last section points to a fundamental limi- 
tation of global communication. A system designer has two 
approaches for dealing with its consequences. 


1. Allow global communication and accept the concomitant 
limit on the maximum size of the system. This is the 
only possible approach when the application critically depends 
on global communication. That is a reason why Ullman [1] 
suggests this approach for fast sorting on multiprocessors. 


2. In many application domains, global communication is 
not necessary. So, it is feasible to limit the communication 
neighborhood of each PE and still retain the scalability of the 
system. Such domains include parallel execution of functional 
and logic programs. Here, the sub-computations can be contracted 
out to any PE, and in particular to a PE within the 
neighborhood. Considerations such as uniform load-distribution 
dictate that the neighborhoods be as large as possible. Finding 
the largest neighborhood while satisfying the 
bandwidth constraints is the topic of the rest of the paper. 

4 The point-to-point Topologies 

Let us consider the point-to-point topologies to begin our 
analysis. For these, $w = 2$, i.e. each channel connects only 2 
PEs. Let there be $p$ processors, each generating $m$ bytes/sec. 
Let $d$ be the (average) distance each message is allowed to 
travel. Then the required bandwidth of the whole system is 
$B_r = pmd$. By eqn. (3), the total available bandwidth is 
$B_a = (pc/2)B$, where $B$ is the bandwidth of an individual 
channel. For the maximum communication distance $d_{max}$, we 
equate the required bandwidth with the available bandwidth to 
get $p\,m\,d_{max} = (pc/2)B$. So $d_{max} = cB/(2m)$. 


Notice that $d_{max}$ is the bound on the average communication 
distance. It is possible to design communication strategies 
that allow different communication distances, including those 
larger than $d_{max}$, with varying probability so that the 
average is $d_{max}$. Here we consider a simpler, more 
conservative strategy: restrict each message to travel at most 
$d_{max}$ hops. 


Let the number of PEs reachable in $d$ hops be nbrhood(d) 
for a given topology $\tau$. Then the maximum communication 
neighborhood for $\tau$ is nbrhood($d_{max}$). As an example, 
consider the 2-dimensional grid of PEs. Here, 
nbrhood(d) $= 2d^2 + 2d + 1$. 
Let the bandwidth of each channel be 100K bytes/sec. and 
each PE generate a communication load of 10K bytes/sec. 
Each PE is connected to 4 channels ($c = 4$), so 
$d_{max} = (4 \times 100)/(2 \times 10) = 20$ and the maximum 
communication neighborhood includes 
$2 \times 20^2 + 2 \times 20 + 1 = 841$ PEs. 
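
The grid example can be checked with a few lines of code. This is a sketch with illustrative function names; the breadth-first search assumes an unbounded grid, so that edge effects do not reduce the count.

```python
from collections import deque

def d_max(c, B, m):
    """Maximum average communication distance for w = 2: cB/(2m)."""
    return (c * B) / (2 * m)

def grid_nbrhood(d):
    """Closed form for the 2-D grid: PEs within d hops, source included."""
    return 2 * d * d + 2 * d + 1

def grid_nbrhood_bfs(d):
    """Count the same neighborhood by breadth-first search on an unbounded grid."""
    seen = {(0, 0)}
    frontier = deque([(0, 0, 0)])
    while frontier:
        x, y, dist = frontier.popleft()
        if dist == d:
            continue                      # do not expand beyond d hops
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt[0], nxt[1], dist + 1))
    return len(seen)
```

For c = 4, B = 100 and m = 10 (both in K bytes/sec), d_max gives 20, and both the closed form and the search give 841 PEs.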


Figure 1. Square Grid of Buses. 

Figure 2. A linear extension of the Square Grid of Buses. 


5 Bus Topologies 

A bus is a communication channel that connects 2 or 
more PEs. A bus topology is an interconnection scheme that 
uses a number of buses. A few bus-topologies have been 
reported in the literature [2-5]. The number of PEs on each 
bus is called the width of the bus, and denoted by w. We 
assume that all the buses in any specific instance of a scheme 
have the same width. An interconnection scheme may organize 
a given number of PEs using a variety of bus-widths. Thus 
w is available as a design parameter that can be chosen with 
some flexibility. This flexibility makes the problem of finding 
the optimal communication neighborhoods interesting. 


We use the notation defined in Section 4. In addition, let 
$b$ be the total number of buses in the system. Equation (2) 
can then be rewritten as: 
    $pc = bw$    (5) 
Intuitively, the total number of 'wires' coming out of the PEs 
must equal the total number of connections available on the 
buses. As before, the required bandwidth is $B_r = pmd$, and 
the available bandwidth is $B_a = bB = (pc/w)B$ (by eqn. 5). 
Equating $B_r$ and $B_a$ gives $pmd = (pc/w)B$. Therefore: 
    $wd = cB/m = w_0$    (6) 
For a given application domain, $m$ is fixed, and for a given 
communication technology, $B$ is fixed. $c$ is constant for any 
interconnection scheme. So the equation says that the product 
$wd$ must remain constant. We denote this constant by $w_0$. 


Let nbrhood(w,d) be the number of PEs reachable from 
some PE in $d$ hops in an instance of the topology $\tau$ with 
a bus-width of $w$. Obviously, nbrhood(w,d) increases with 
either $w$ or $d$. Equation 6 tells us that we cannot hope to 
increase both simultaneously: their product must be kept 
constant ($= w_0$). This sets up an interesting trade-off between 
the two. The optimization problem thus reduces to the 
maximization of nbrhood(w,d) subject to the constraint that 
$wd = w_0$. Alternatively, it can be expressed as the 
maximization of nbrhood($w$, $w_0/w$), or of nbrhood($w_0/d$, $d$). 


The approach for finding the optimal design parameters 
for a given topology is therefore to first find its nbrhood 
function, and then to maximize the special instance of it 
expressed above. If we cannot find a closed-form expression for 
nbrhood(w,d), we may calculate and plot nbrhood($w_0/d$, $d$) 
empirically, and look for the maximum. 

5.1 A simple topology 


We first consider a simple topology to illustrate the 
method. An extension of the square grid of buses (Figure 1) 
leads to the topology of Figure 2. If $w$ is the width of the 
buses, then in 1 step we reach $2w - 1$ PEs. For $d > 1$ it can 
be seen that nbrhood(w,d) $= (d-1)w^2 + w^2/8$, assuming the 
source is at the center of a bus. The optimal $w$ occurs at the 
maximum of nbrhood($w$, $w_0/w$) $= (w_0/w - 1)w^2 + w^2/8$. 
Expanding gives $w_0 w - (7/8)w^2$, whose derivative vanishes at 
$w = 4w_0/7$, so the optimal $w$ is $4w_0/7$. This corresponds to 
$d = w_0/w = 7/4$. Rounding to the nearest integer, we conclude 
that the best way to arrange this topology is to let each bus 
span $w_0/2$ PEs and allow communication to a distance of 2. 

Figure 3. Tree of Buses. 

Figure 4. A Lattice-mesh. 
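
The optimum can also be located empirically, in the spirit of the general method of Section 5. The sketch below scans candidate widths on a fine grid; the function names and the grid resolution are illustrative choices, not part of the paper.

```python
def simple_nbrhood(w, d):
    """nbrhood(w, d) for the topology of Figure 2, as given in the text (d > 1)."""
    return (d - 1) * w ** 2 + w ** 2 / 8

def best_width(w0, steps=10000):
    """Scan w over (0, w0) with d = w0/w and return the width with the largest neighborhood."""
    candidates = (w0 * k / steps for k in range(1, steps))
    return max(candidates, key=lambda w: simple_nbrhood(w, w0 / w))
```

For w0 = 56 the scan returns a width very close to 4 * 56 / 7 = 32. Restricting d to integers, simple_nbrhood(28, 2) is larger than simple_nbrhood(56 / 3, 3), consistent with choosing a distance of 2.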

5.2 Tree of buses 


An instance of the tree-of-buses topology is shown in 
Figure 3. Notice that any node can be made to look like the 
root of two identical trees (ignoring the discontinuity 
introduced by the root and the leaves of the physical tree). So, 
the nbrhood function can be written as: 
nbrhood(w,d) $= 2(w_1^{d+1} - 1)/(w_1 - 1) - 1$, where 
$w_1 = w - 1$. For pedagogical value, we approximate this to 
$2w_1^d - 1$. The optimization problem then reduces to 
maximizing $w_1^{w_0/w}$. The maximum occurs when 
$w_1 + 1 = w_1 \ln(w_1)$. This leads to $w_1 = 3.59$, i.e. 
$w = 4.59$. So we conclude that the width of the buses should 
be 5. The communication distance can then be calculated as 
$w_0/5$. 

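
The stationarity condition above has no closed-form solution, but it is easy to solve numerically. The following sketch uses plain bisection; the bracketing interval [2, 10] is an assumption chosen so that the function changes sign across it.

```python
import math

def tree_optimal_w1(lo=2.0, hi=10.0, tol=1e-9):
    """Solve w1 + 1 = w1 * ln(w1) by bisection on [lo, hi]."""
    def f(x):
        return x * math.log(x) - x - 1    # negative at lo, positive at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

The root lies close to 3.59, which agrees with the value of w1 quoted in the text and hence with a bus width of about 4.59.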
5.3 The lattice-mesh 


The lattice-mesh [4] can be thought of as a two-dimensional 
extension of the square grid of buses. The PEs are 
laid out on the lattice points in a rectangular 
$X_{max} \times Y_{max}$ matrix, where $X_{max}$ and $Y_{max}$ 
are both multiples of $w$. The layout of buses can be best 
explained by associating a label with each PE. The label of a 
PE at (x,y) is (1 + (x+y) mod w). All the buses parallel to the 
X-axis start at a PE with label equal to one, and all the buses 
parallel to the Y-axis start at a PE with a label $Y_s$, where 
$Y_s$ is any label other than 1. Figure 4 shows a lattice-mesh 
of 100 PEs with a bus-width of 5. Buses extending beyond one 
end are connected consistently at the other end, forming a 
toroidal structure. 


In $d$ hops, the maximum distance one can traverse along 
either axis is $wd/2$, because any path is a sequence of 
alternating X and Y axis movements. Moreover, it can be 
shown that the coordinates of the reachable PEs, assuming 
(0,0) to be the coordinates of the starting PE, are bounded by: 
$|X| + |Y| < wd/2 + w - 2$. (Intuitively, every pair of hops can 
traverse only a distance of $w$, except at the boundaries.) 
Assuming $d \gg 1$, we ignore the smaller terms to get 
$|X| + |Y| < wd/2$. So the reachable PEs are confined within a 
diamond with distance from the center to a corner equal to 
$wd/2 = w_0/2$. Therefore, nbrhood(w,d) is bounded from above 
by $2(w_0/2)^2$, or $w_0^2/2$, and so is independent of $w$ and 
$d$! (Even if $w$ and $d$ vary, their product must remain 
constant by equation 6.) I.e., beyond a certain point, 
increasing $d$ does not lead to any increase in nbrhood(w,d). 


A general calculation of nbrhood(w,d) for the lattice-mesh 
seems difficult. Instead, we derive the number of PEs reachable 
in a distance of 2. Figure 5-a shows the PEs reachable in 2 
hops from the darkened PE, in a lattice-mesh with w=7. 
Figure 5-b depicts the reachable PEs schematically. The doubly 
shaded areas represent PEs that can be reached by two alternate 
paths. The reachable PEs are those within the shaded 
area. By elementary geometry,¹ 
nbrhood(w,2) $\approx 4(w^2)/2 - 3(1/2)(w/2)^2 = (13/8)w^2$. 
For two hops, $w = w_0/2$ (by Equation 6), and we get 
nbrhood(w,2) $= (13/32)w_0^2$. I.e., in two hops alone, we can 
reach about 80% of the upper bound 
on the number of reachable PEs! 
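
The "about 80%" figure is the ratio of the two-hop estimate to the diamond upper bound. A sketch of the arithmetic using exact fractions (function names are illustrative, and w0 is taken to be an integer):

```python
from fractions import Fraction

def two_hop_estimate(w0):
    """(13/8) w^2 with w = w0/2, i.e. (13/32) w0^2."""
    w = Fraction(w0, 2)
    return Fraction(13, 8) * w * w

def diamond_upper_bound(w0):
    """Upper bound w0^2 / 2 obtained from the diamond argument."""
    return Fraction(w0 * w0, 2)
```

The ratio two_hop_estimate(w0) / diamond_upper_bound(w0) is 13/16 for every w0, i.e. slightly over 81%.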


We therefore did some computations to find the maximum 
of nbrhood($w_0/d$, $d$) for specific values of $w_0$. The results 
are depicted in Figure 6. In all the instances examined, the 
maximum occurs at $d = 3$, and the difference between 
nbrhood($w_0/2$, 2) and nbrhood($w_0/3$, 3) was very small. 


Notice that when the communication distance is 
increased from 1 to 2, for $w_0 = 24$, the number of reachable 
PEs increased from 47 to 223, even though the width of the bus 
dropped from 24 to 12. Thus, for a small (two-fold) 
increase in the delay we get a large increase in the 
neighborhood, without increasing the total communication 
requirements. Because of the low value of d, very few (1 if d=2) 
PEs incur the overhead of relaying the message. Therefore, the 
conclusion seems clear: limit communication to a distance of 
two or three and choose the width of buses accordingly. 

5.4 A Bus-topology better than the grid 


Is there a bus-topology that is 'better' than the grid of 
processors, in the sense of having a larger neighborhood with 
similar hardware constraint (i.e. cost) and communication 
technology? To answer this question, we construct a bus 
topology that is a generalization of the grid, i.e. one of which 
the grid is an instance with w=2. Such a topology is shown in 
Figure 7. It has 4 connections per PE like the grid. It is called 
a double lattice-mesh (DLM) because it can be thought of as two 
lattice-meshes overlapped on each other. A PE at (x,y) has 
the label 1 + (x+y) mod w. Analogous to the lattice-mesh, the 
DLM is parameterized by 4 numbers $X_{s1}$, $X_{s2}$, $Y_{s1}$, 
$Y_{s2}$, in addition to w. There are 2 sets of buses in each 
row (and column). Each set is arranged just like in the 
lattice-mesh, except that buses in one set start from PEs with 
label $X_{s1}$, while buses in the other set start from PEs with 
label $X_{s2}$. The buses parallel to the Y-axis are similarly 
defined. It can be easily seen that with w=2, this topology 
reduces to the grid of PEs. 


Again, for the lack of a simple nbrhood function, we 
performed some computations to plot nbrhood($w_0/d$, $d$). 
Figure 8 shows the plots for different values of $w_0$. The 
rightmost point on each plot corresponds to the point-to-point 
grid. It can be verified from the plots that the bus topologies 
(w > 2) always do better than the grid. Moreover, d=3 is either 
the optimal distance or is very close to it. 


¹ More accurate calculations for nbrhood(w,2) show that when w 
is odd, nbrhood(w,2) $= (13w^2 - 4w - 1)/8$, and when w is even, 
nbrhood(w,2) $= (13w^2 - 6w)/8$. 


6 Conclusion 


We demonstrated the unscalability of global communication 
for a major class of interconnection schemes, namely those 
with no additional switching elements. For many application 
domains, this limitation can be overcome by limiting the 
communication horizon of each processor. For the point-to-point 
topologies this strategy led to a simple limit on the 
communication radius. For the bus topologies, the problem of 
finding optimal communication neighborhoods is more interesting 
because of an additional degree of freedom for design: the 
width of the bus. We gave a general method of finding such 
neighborhoods, and applied it to several bus-topologies. Such 
calculations for a topology that generalizes the grid of 
processors were used to show that bus topologies can perform 
better in this respect than the point-to-point topologies. 


We are investigating bus topologies in the context of a 
system for parallel processing of combinatorially explosive 
symbolic computations expressed as Logic Programs. In the 
absence of global communication, the question of distribution 
of work becomes more important. The contribution of different 
topologies towards this needs to be investigated. Also, 
strategies such as (1) probabilistic distribution of 
communication distance instead of the absolute bound suggested 
in this paper and (2) dynamic variation of this distance in 
response to variations in communication load need to be 
investigated. Many simulation studies are planned to that end. 


Abstract. The KAP/205 is a Fortran source-to-source vectorizer 
which translates serial Fortran DO loops into explicit vector syntax for 
the Cyber 205, a supercomputer with memory-to-memory vector 
instructions. The translated program may be compiled using the 
Fortran-200 compiler and executed on the Cyber 205. This paper 
explains the optimizations of the KAP/205 which are peculiar to the 
Cyber 205 architecture and gives some speedups attained relative to the 
Fortran 200 compiler alone. 


1. Introduction 


The Control Data Cyber 205 supercomputer [1] has memory-to- 
memory vector instructions. Because of the penalty for starting a vec- 
tor instruction and the requirement that vector operands be gathered 
when they are not contiguous, the Cyber 205 performs best when the 
program uses many long contiguous vector operands. The Cyber 205 
is programmed predominantly in Fortran, using the Fortran-200 com- 
piler [2]. Fortran-200, a Fortran-77 [3] based language, includes exten- 
sions such as explicit vector syntax and in-line machine code. These 
extensions allow users to get good performance from their programs, 
but require much recoding and make the programs entirely non- 
portable. Users of the Cyber 205 have long desired a better automatic 
vectorizer that would allow them to access the power of the Cyber 205 
without recoding their programs. The KAP/205 fills this need. 


The KAP/205 is a Fortran precompiler which accepts and re- 
generates Fortran-200 programs. The KAP/205 examines the DO loops 
in the source program and generates vector code where possible. The 
output of the KAP/205 is an equivalent Fortran-200 program with the 
vectorized DO loops replaced by explicit vector syntax and in-line 
machine code. The advantage of using the KAP/205 is that programs 
remain portable, and the users need not concern themselves with the 
peculiarities of the machine. 


The second section of the paper describes important features of 
the Cyber 205 with respect to optimization. The third section describes 
the KAP/205 and how it maps programs onto the Cyber 205. The 
fourth section describes the vector optimizer of the KAP/205, which 
chooses the best method to execute a loop nest on the Cyber 205. The 
final section presents some speedup results of various programs exe- 
cuted on the Cyber 205. 



2. The Cyber 205 Supercomputer 


The Cyber 205 is a high-speed, pipelined, memory-to-memory 
vector supercomputer. The basic clock is 20 nanoseconds, and the 
Cyber 205 can issue one instruction per clock in the best case. The 
machine is divided into a scalar unit and a vector unit sharing a com- 
mon main memory, as shown in Figure 1. The floating point units are 
pipelined (even in the scalar unit), so floating point results can be gen- 
erated at a high rate. In addition to simple vector arithmetic operations 
such as add and multiply, the Cyber 205 has vector reduction instruc- 
tions, such as sum of a vector, product of a vector, and maximum and 
minimum of a vector. Vectors may have up to 65,535 elements. The 
vector unit can have one, two or four pipelines, which means that one, 
two or four 64-bit floating point results can be generated in each 20- 
nanosecond clock period during a vector instruction. Each vector pipeline 
has a multiplier and an adder, so linked-triad operations like 
A(I) = B(I)+C(I)×D, with an add, a multiply, and two vector inputs, can be 
executed with a single pass through the vector pipeline. In this case, 
each vector pipeline executes two floating point operations per clock 
(add and multiply); with four pipelines, the Cyber 205 can execute 
eight floating point operations every 20-nanosecond clock for a peak 
rate of 400 million floating point operations per second (400 
megaflops). In addition, the Cyber 205 has HALF PRECISION opera- 
tions; for 32-bit operands, each vector pipeline splits in half and can 
execute two adds and two multiplies in each clock period. Thus the 
maximum rate of the Cyber 205 is 800 megaflops using 32 bit 
operands. 
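
The peak rates quoted above follow directly from the clock period and the number of pipelines. A minimal sketch of the arithmetic (the function and parameter names are illustrative):

```python
def peak_mflops(pipes, clock_ns=20.0, flops_per_pipe_per_clock=2, half_precision=False):
    """Peak rate in megaflops for linked-triad operation.

    Each pipe delivers an add and a multiply per clock; with 32-bit
    (half precision) operands each pipe splits in two, doubling the rate.
    """
    per_clock = pipes * flops_per_pipe_per_clock * (2 if half_precision else 1)
    return per_clock * 1000.0 / clock_ns   # 1000/clock_ns = millions of clocks per second
```

With four pipes this gives 400 megaflops for 64-bit operands and 800 megaflops for 32-bit operands, matching the figures in the text.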


Each vector instruction incurs a certain amount of startup time. 
Scalar instructions must be executed to set up the vector operand 
descriptors, but scalar instructions are very fast. Each vector instruc- 
tion itself needs many clocks (as many as 50 clocks, or more than a 
microsecond, for some instructions) before the vector unit actually 
starts producing results. Some of this time is related to the pipeline 
depth of the vector arithmetic units, but much of it is required to start 
the vector operands streaming from the main memory. 


2.1 Gathers and Scatters 


All the vector arithmetic instructions require that the vector 
operands be stored in contiguous memory. Non-contiguous operands 
must be gathered into temporary vectors before use, and 
non-contiguous results must be scattered after generation. Periodic gathers 
and scatters are used when the stride is fixed, and indexed gathers and 
scatters are used for nonlinear indexing. For example, in Figure 2(a), 
the array C would be gathered before being added to the array B and 
stored in array A. Two vector instructions would be used to execute 
this loop: an indexed gather (using the index vector IP) and a vector 
add. In Figure 2(b), the first row of the matrix D would be gathered 
before being added, because Fortran stores arrays in column-major 
order. Again, two vector instructions would be 
used: a periodic gather (with a period of N, the size of a column of the 
matrix D), and a vector add. 


Figure 1. Architecture of the Cyber 205 Supercomputer. (Labels: Scalar 
Processor with pipelined arithmetic units and registers; Vector 
Processor with 1, 2 or 4 vector pipes, each pipe having add and multiply.) 


Gathers and scatters on the Cyber 205 are slow compared to 
arithmetic operations. Since the speed of a gather or scatter does not 
increase with the number of vector pipelines, a gather or scatter opera- 
tion is even slower relative to vector arithmetic on Cybers with multi- 
ple vector pipelines. 


      REAL A(N),B(N),C(N) 
      INTEGER IP(N) 

      DO 100 I = 1,N 
  100 A(I) = B(I) + C(IP(I)) 

      (a) Indexed Gather. 


      REAL A(N),B(N),D(N,N) 

      DO 200 I = 1,N 
  200 A(I) = B(I) + D(1,I) 

      (b) Periodic Gather. 


Figure 2. Gathering operands. 
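

The two kinds of gather in Figure 2 can be mimicked on ordinary lists. This is only a sketch of the semantics, with illustrative function names and 0-based indices; it says nothing about how the Cyber 205 hardware implements the instructions.

```python
def indexed_gather(src, index):
    """Collect src[index[i]] into a contiguous temporary (0-based indices)."""
    return [src[i] for i in index]

def periodic_gather(flat, start, period, count):
    """Collect every period-th element of a flat array, beginning at start."""
    return [flat[start + k * period] for k in range(count)]

def vector_add(x, y):
    """Element-wise add of two contiguous vectors of equal length."""
    return [a + b for a, b in zip(x, y)]
```

For an N by N array stored in column-major order, the elements D(1,I) sit N apart in memory, so periodic_gather(flat, 0, N, N) collects them into a contiguous temporary that can then be passed to vector_add.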


2.2 Vector IF hardware 


A vector compare on the Cyber 205 generates a bit vector, with 
1’s where the test is true and 0’s elsewhere. There are several different 
methods to use bit vectors to execute conditional vector operations [4]. 
The choice of the best method depends on various run-time parameters, 
such as the vector length, the density of the vector (percent of 1’s in 
the bit vector), and the number of distinct vector operands in the code 
under control of the IF test. 


The vector arithmetic operations of the Cyber 205 have an 
optional control vector field, so that the result is stored only where the 
bit vector is 1 (or only where it is 0, for the ELSE part of an IF). 
Where the bit vector is 0, the result vector is left unchanged. Use of 
control vectors is a natural method to execute simple vector IF state- 
ments. 


When the density of the bit vector is small, other code sequences 
may be more efficient. An alternate method is to compress each 
operand vector into a temporary vector before performing any arith- 
metic, and decompress the result vectors afterwards. This shortens the 
vector length of the arithmetic operands. Obviously if the number of 
compresses and decompresses is larger than the number of arithmetic 
operations, then the control vector method will be faster, since 
compress and decompress operations run at the same speed as vector 
arithmetic. However, if the number of arithmetic operations is rela- 
tively large compared to the number of compress operations and the bit 
vector is sparse, then compressing the operands can pay off with short 
vectors throughout the IF block. 


A third method to handle IF statements is applicable when the bit 
vector is very sparse, and the number of compress operations is rela- 
tively small. In this method an index vector containing the index 
values of the true elements of the bit vector is built. The compress 
(and decompress) operations above are replaced with indexed gather 
(and scatter) operations using this index vector. Even though the 
indexed gather instruction generates fewer results per clock period, this 
method can be faster since the vector length of the indexed gather 
operation is N×d, while the vector length of the compress operation is 
N (d is the density of the bit vector, i.e. the fraction of 1's in the bit 
vector; 0≤d≤1, so N×d≤N). So, while the compress instruction may 
produce 2 results per clock (on a 2-pipe Cyber 205), thus taking N/2 
clocks (after startup), the gather instruction produces 0.8 results per 
clock (on average), thus taking 1.25×N×d clocks (after startup). If d is 
less than 0.4, this method has a chance of being faster. The additional 
cost of generating the index vector must also be taken into account. 


An example of a DO loop containing an IF statement is given in 
Figure 3(a). Using control vectors for this loop 
gives code similar to that in Figure 3(b); the vector code is shown here 
using Fortran 200 primitives. The first vector instruction generates the 
bit vector into a compiler temporary vector, CV. The vector 
comparison has a vector length of N. The next four vector instructions use 
the bit vector to control which elements of the result vector to store. 
Notice the linked-triad statement, performing a vector add and a vector 
multiply together. Each of these vector arithmetic instructions has a 
vector length of N. Assuming a 2-pipe Cyber 205, this vectorized loop 
would take 5×N/2 or 2.5×N clock periods, plus the time to start the 
five vector instructions, plus time for any scalar code. 

Figure 3(c) uses compress/decompress code. The first vector 
instruction is the same as before, generating a bit vector into a 
compiler temporary vector. The next two vector instructions compress the 
vector operands into two more compiler temporary vectors. The vector 
compress instruction uses the bit vector to control the compress 
operation, and simultaneously computes the vector length of the result, 
shown as ND in the program. Each of the four vector arithmetic 
instructions has a vector length of N×d, where d is the density of the 
bit vector. The last instruction decompresses the short result vector 
back into the original user array, again with a vector length of N. The 
three compress/decompress instructions would take 3×N/2 or 1.5×N 
clock periods (plus three startups). The four arithmetic operations 
would take 4×N×d/2 or 2×N×d clock periods (plus four startups). 


      REAL A(N),B(N) 

      DO 100 I = 1,N 
         IF (A(I) .GT. 0) THEN 
            B(I) = (A(I) * A(I) + B(I) * B(I)) * 0.5 + X 
         ENDIF 
  100 CONTINUE 

      (a) Original Program. 


 vector 
 length   vector code 
 N        CV(1;N) = A(1;N) .GT. 0 
 N        WHERE (CV(1;N)) TA(1;N) = A(1;N) * A(1;N) 
 N        WHERE (CV(1;N)) TB(1;N) = B(1;N) * B(1;N) 
 N        WHERE (CV(1;N)) B(1;N) = (TA(1;N) + TB(1;N)) * 0.5 
 N        WHERE (CV(1;N)) B(1;N) = B(1;N) + X 

 total time = 5*N/#pipes + 5 startups 

      (b) Control Vectors. 


 vector 
 length   vector code 
 N        CV(1;N) = A(1;N) .GT. 0 
 N        TA(1;ND) = Q8VCMPRS( A(1;N), CV(1;N) ; TA(1;ND)) 
 N        TB(1;ND) = Q8VCMPRS( B(1;N), CV(1;N) ; TB(1;ND)) 
 N*d      TA(1;ND) = TA(1;ND) * TA(1;ND) 
 N*d      TB(1;ND) = TB(1;ND) * TB(1;ND) 
 N*d      TB(1;ND) = (TA(1;ND) + TB(1;ND)) * 0.5 
 N*d      TB(1;ND) = TB(1;ND) + X 
 N        B(1;N) = Q8VDECMP( TB(1;ND), CV(1;N) ; B(1;N)) 

 total time = 4*N/#pipes + 4*N*d/#pipes + 8 startups 

      (c) Compress/Decompress. 


 vector 
 length   vector code 
 N        CV(1;N) = A(1;N) .GT. 0 
 N        IV(1;N) = Q8VINTL(1,1;IV(1;N)) 
 N        IV(1;ND) = Q8VCMPRS( IV(1;N), CV(1;N) ; IV(1;ND)) 
 N*d      TA(1;ND) = Q8VGATHR( A(1;ND), IV(1;ND) ; TA(1;ND)) 
 N*d      TB(1;ND) = Q8VGATHR( B(1;ND), IV(1;ND) ; TB(1;ND)) 
 N*d      TA(1;ND) = TA(1;ND) * TA(1;ND) 
 N*d      TB(1;ND) = TB(1;ND) * TB(1;ND) 
 N*d      TB(1;ND) = (TA(1;ND) + TB(1;ND)) * 0.5 
 N*d      TB(1;ND) = TB(1;ND) + X 
 N*d      B(1;ND) = Q8VSCATR( TB(1;ND), IV(1;ND) ; B(1;ND)) 

 total time = 3*N/#pipes + 3*N*d/0.8 + 4*N*d/#pipes + 10 startups 

      (d) Indexed Gather/Scatter. 


Figure 3. Vector IFs. 
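

The three code sequences of Figure 3 compute the same result by different routes. The sketch below emulates their semantics on ordinary lists to make the data movement explicit; the function names are illustrative, indices are 0-based, and nothing here models timing or the actual Q8 intrinsics.

```python
def body(a, b, x):
    """The arithmetic under control of the IF test in Figure 3(a)."""
    return (a * a + b * b) * 0.5 + x

def control_vector(A, B, x):
    cv = [a > 0 for a in A]                        # bit vector, length N
    full = [body(a, b, x) for a, b in zip(A, B)]   # arithmetic over all N elements
    return [r if bit else b for r, b, bit in zip(full, B, cv)]  # store only where bit is set

def compress(v, cv):
    """Keep the elements of v whose control bit is set."""
    return [e for e, bit in zip(v, cv) if bit]

def decompress(short, cv, old):
    """Spread short back to full length, keeping old where the bit is clear."""
    it = iter(short)
    return [next(it) if bit else o for bit, o in zip(cv, old)]

def compress_decompress(A, B, x):
    cv = [a > 0 for a in A]
    ta, tb = compress(A, cv), compress(B, cv)      # short operands, length N*d
    tb = [body(a, b, x) for a, b in zip(ta, tb)]   # arithmetic on short vectors only
    return decompress(tb, cv, B)

def gather_scatter(A, B, x):
    cv = [a > 0 for a in A]
    iv = compress(list(range(len(A))), cv)         # indices of the true elements
    tb = [body(A[i], B[i], x) for i in iv]         # gather, then arithmetic
    out = list(B)
    for i, r in zip(iv, tb):                       # scatter results back
        out[i] = r
    return out
```

All three functions return the same list for the same inputs; they differ only in how many elements each step touches, which is what the timing analysis in the text compares.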


Adding in the time to generate the bit vector, this comes out to 
(2×d+2)×N clock periods + 8 startups. If d is less than 0.25, the 
compress/decompress method may be faster than the control vector 
method for this loop; of course, the extra startups must be factored in. 


Figure 3(d) uses indexed gather/scatter code to execute the vector 
IF. Again, the first vector instruction generates the bit vector. The 
second vector instruction initializes an index vector with the index set 
(1,2,3,...,N). The third vector instruction compresses this index vector 
into itself; this creates a short index vector which contains the indices 
of the true elements of the bit vector. Each of these instructions has a 
vector length of N. The next two vector instructions gather the vector 
operands into compiler temporary vectors, analogous to the compress 
instructions in 3(c), with the difference that the vector length is N×d, 
not N as in the compress instruction. The four vector arithmetic 
instructions have vector lengths of N×d, and the last vector instruction 
scatters the short result vector back into the user array, also with a 
vector length of N×d. The first three instructions would take 3×N/2 or 
1.5×N clock periods (plus three startups). The three gather/scatter 
instructions would take 3×N×d/0.8 clock periods, plus three startups 
(remember gathers operate at a rate independent of the number of 
pipes). The four arithmetic operations would take 2×N×d clock 
periods, plus four startups, just as in 3(c). This totals to 
(1.5+5.75×d)×N clock periods, plus ten startups. Thus, if d is less than 
0.17, this has a chance of being faster than the control vector method 
(again, the startups must be taken into account). If d is less than 0.13, 
this has a chance of being faster than the compress method. 
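
The crossover densities quoted above can be reproduced from the "total time" lines of Figure 3. The sketch below is a simplified model that deliberately ignores startup time and scalar code; the function names and the dictionary keys are illustrative.

```python
def control_vector_clocks(n, pipes=2):
    """Figure 3(b): five full-length vector instructions."""
    return 5 * n / pipes

def compress_clocks(n, d, pipes=2):
    """Figure 3(c): four full-length instructions plus four of length n*d."""
    return 4 * n / pipes + 4 * n * d / pipes

def gather_clocks(n, d, pipes=2, gather_rate=0.8):
    """Figure 3(d): three full-length, three gather/scatter, four short arithmetic."""
    return 3 * n / pipes + 3 * n * d / gather_rate + 4 * n * d / pipes

def cheapest(n, d, pipes=2):
    """Name of the method with the fewest clocks, startups ignored."""
    costs = {
        "control vector": control_vector_clocks(n, pipes),
        "compress": compress_clocks(n, d, pipes),
        "gather": gather_clocks(n, d, pipes),
    }
    return min(costs, key=costs.get)
```

On a 2-pipe machine this model picks control vectors at d = 0.5, compress/decompress at d = 0.2, and gather/scatter at d = 0.05. Because startups are left out, short vectors would shift these boundaries in favor of the methods with fewer instructions.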


2.3 Fortran 200 Language 


Programming the Cyber 205 is done predominantly in Fortran 
200, a Fortran-77 based language with extensions to allow users to 
access the vector instruction set of the Cyber 205. Some of the vector 
extensions are shown in the example programs in Figure 3. The semicolon 
notation (A(1;N)) describes a starting point (A(1)) and a vector 
length (N), which is how the Cyber 205 hardware describes vector 
operands. Certain vector intrinsics (Q8VCMPRS, Q8VGATHR) use 
Cyber 205 vector instructions. In addition there are "special calls" to 
predefined subroutine names which map directly onto most Cyber 205 
machine instructions. 


Fortran 200 implements ROWWISE arrays, which are stored in 
row-major order (like Pascal) instead of column-major. In Figure 2(b), 
if the array D were declared ROWWISE, then no gather would be 
necessary in the vector code. 


The Fortran 200 compiler includes a vectorizer which, compared 
to other automatic vectorizers, is relatively weak [5]. Because of this, 
users of the Cyber 205 learn and use the Fortran 200 vector extensions 
when they require better performance from their programs. The exten- 
sions are hard to use, and make programs non-portable and hard to 
maintain. 


3. The KAP/205 Vectorizer 


The KAP/205 was designed to alleviate the difficulty of getting 
good performance on the Cyber 205. The KAP/205 accepts Fortran 
200 programs and vectorizes the DO loops, producing an equivalent 
program with some or all of the DO loops replaced by vector code. 
The KAP/205 uses Fortran 200 explicit vector notation, vector intrinsic 
functions and special Q8 calls (in-line assembler code) for vectorized 
DO loops. Messages and questions from the KAP/205 translator point 
out the parts of the program which might be further improved with 
additional help from the user. 


The KAP/205 is one of a family of KAP products. The KAP is 
designed to be a rehostable and retargetable Fortran translation system 
for discovery of parallelism. Our first products are all source-to-source 
vectorizers. 


The KAP uses the latest software technology for discovery of 
parallelism in programs. Much of our work is similar to that done at 
the Parafrase project at the University of Illinois [6,7,8]. Related tech- 
niques are also used in the PFC project at Rice University [9,10]. In 
Parafrase, as well as in the KAP, parallelism is discovered through the 
use of a precise data dependence graph which shows where data values 
are generated and where they are used within a DO loop or DO loop 
nest. The data dependence graph is processed (with simple graph 
traversal techniques) to find potential problem areas, which appear as 
cycles in the graph. Each data dependence cycle is carefully examined 
to see if it can be broken and the loop executed in parallel, or if part or 
all of the loop must be executed serially. 
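
As a rough illustration of finding the cycles mentioned above, the sketch below marks every statement that lies on a cycle of a dependence graph. It is not the KAP's algorithm: the graph representation (a dictionary from a statement name to the statements that depend on it) and the statement names are hypothetical, and a production vectorizer would more likely use a linear-time strongly-connected-components method.

```python
def reachable(graph, start):
    """Nodes reachable from start by following one or more edges."""
    seen, stack = set(), list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen

def statements_in_cycles(graph):
    """Statements that can reach themselves, i.e. lie on a dependence cycle."""
    return {s for s in graph if s in reachable(graph, s)}
```

For a hypothetical loop in which S2 feeds S3 and S3 feeds S2 on a later iteration, only S2 and S3 are reported; statements outside any cycle are candidates for straightforward vectorization.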


The KAP includes powerful algorithms for building a high qual- 
ity data dependence graph. Exact data dependence tests for the most 
common types of array references are implemented, as well as sophisti- 
cated symbolic tests when variables appear as loop bounds or in sub- 
scripts. Several auxiliary transformations improve the data dependence 
graph, such as recognition of loop induction variables and promotion of 
scalars into arrays. 


3.1 Retargeting KAP to the Cyber 205 


A general purpose vectorizer does not solve the problem facing 
the users of the Cyber 205. Not all vector operations are equally 
efficient on this machine, as explained in Section 2. Some operations 
to be avoided are gathers and scatters of arrays. Short vector opera- 
tions can sometimes be slower than performing the same code in scalar 
mode, due to the large startup time for vector instructions. (Interest- 
ingly, a vector divide with a vector length of 2 is faster than two scalar 
divides, because the vector divide unit is pipelined while the scalar 
divide unit is not; this is not true of other arithmetic operations.) Also, 
temporary arrays and vectors must be managed efficiently. Finally, 
many operations cannot be performed in vector mode on the Cyber, 
such as DOUBLE PRECISION arithmetic. 


Retargeting the KAP to the Cyber 205 is made easier due to the 
modular design of the KAP. One of the first steps of the KAP is the 
candidate finder. This step examines DO loops to see that the opera- 
tions within the DO loop are legal candidates for vectorization. For 
the KAP/205, statements with operations that cannot be executed in 
vector mode, such as DOUBLE PRECISION operations, are marked as 
non-vectorizable. READ and WRITE statements are similarly marked. 
DO loops consisting of all non-vectorizable statements are marked as 
completely serial, and are not studied further. 


Much of the KAP, such as the induction variable finder, is 
machine independent, so retargeting these steps was no work at all. 
The two largest tasks we faced were customizing the vectorizer to pro- 
duce and optimize vector code for the Cyber 205, and representing 
these vector constructs in efficient Fortran 200. 


3.2 Retargeting the Vectorizer to the Cyber 205 


The vectorizer of the KAP looks at different ways to execute 
each DO loop nest and chooses the best. For instance, the standard 
matrix multiply program shown in Figure 4(a) can be vectorized at 
least three different ways. In Figure 4(b), the inner DO loop was vec- 
torized, as might be attempted by a simple-minded vectorizer; this 
results in an inner product of a row of A with a column of B. For the 
Cyber 205, however, a periodic gather of the row of A is necessary. 
In Figure 4(c), the second DO loop was interchanged to be the inner 
loop and was vectorized; this produces a vector operation in the inner 
loop. On the Cyber 205, this vector statement is actually a linked-triad 
(an add and a multiply, with one scalar and two vector operands), and 
would execute in one pass through the vector pipe. However, now two 
periodic gathers are required and a scatter of the result is also needed. 
In Figure 4(d), the outermost DO loop was interchanged and was vec- 
torized; this produces a simple linked-triad vector operation in the 
inner loop with all stride-one vector operands. 


The vectorizer step of the KAP actually attempts these different 
DO loop orderings. If DO loops are not perfectly nested, DO loop dis- 
tribution may be used in order to allow more loop interchanging. A 
Cyber 205-specific vector optimizer decides which of the loop order- 
ings should be chosen. Given the loop nest in Figure 4(a), the 
KAP/205 chooses the ordering in 4(d). 
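Why the ordering in 4(d) needs no gathers can be seen by computing the memory stride of each operand for each candidate vector loop. The following Python sketch is an illustration only; the helper names `stride` and `operand_strides` are hypothetical, Fortran column-major storage is assumed, and the array shapes are taken from the declaration in Figure 4(a). A stride of 1 is a contiguous vector, a stride of 0 is an operand that does not vary with the vector loop (a scalar in the vector statement), and any other stride requires a periodic gather or scatter on the Cyber 205.

```python
def stride(dims, subscripts, loop_var):
    """Stride in words between successive values of loop_var for a
    column-major array reference; 0 if the reference does not use loop_var."""
    step = 1
    for extent, sub in zip(dims, subscripts):
        if sub == loop_var:
            return step
        step *= extent
    return 0


def operand_strides(loop_var, L, M, N):
    # Declared shapes as in Figure 4(a): C(L,M), A(M,N), B(L,N).
    refs = {
        "C(I,J)": ((L, M), ("I", "J")),
        "A(I,K)": ((M, N), ("I", "K")),
        "B(K,J)": ((L, N), ("K", "J")),
    }
    return {name: stride(dims, subs, loop_var)
            for name, (dims, subs) in refs.items()}


for var in ("K", "J", "I"):
    print(var, operand_strides(var, 100, 100, 100))
```

Vectorizing K leaves A with stride M (one gather), vectorizing J leaves both C and B with stride L (two gathers and a scatter), and vectorizing I gives stride one for C and A with B as a scalar, which matches the three versions in Figure 4.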


The KAP/205 also attempts to collapse loops. The maximum 
hardware vector length on the Cyber 205 is 65,535; nested loops often 
have a combined limit of less than this, and can sometimes be executed 
as a single long vector. Loop collapsing combines two or more loops. 
Other simple and well-known transformations are statement reordering 
to allow vectorization, and cycle breaking to eliminate simple data 
dependence cycles. Figure 5(a) shows a simple DO loop that can be 
vectorized only if the two statements are reordered; Figure 5(b) shows 
the equivalent vector Fortran 200 code. Figure 6(a) shows a DO loop 
with a cycle in its data dependence graph. Semantic analysis shows 
that this loop simply broadcasts an invariant value to the whole vector 
A, and thus can be easily vectorized. 
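The effect of statement reordering can be checked with a small Python model of the loop in Figure 5. This is a sketch for illustration: Python lists stand in for the Fortran arrays (index `i - 1` holds element `I`), and the function names are hypothetical. Each "vector statement" is modeled as a whole-slice update that completes before the next statement starts.

```python
def serial_loop(a, b):
    """Figure 5(a) executed serially, one iteration at a time."""
    a, b = a[:], b[:]
    n = len(a)
    for i in range(2, n):                  # I = 2 .. N-1
        a[i - 1] = a[i - 1] + b[i - 2]     # A(I) = A(I) + B(I-1)
        b[i - 1] = b[i - 1] + 1.0          # B(I) = B(I) + 1.
    return a, b


def reordered_vector(a, b):
    """Figure 5(b): the B update is moved ahead of the A update."""
    a, b = a[:], b[:]
    n = len(a)
    b[1:n - 1] = [x + 1.0 for x in b[1:n - 1]]                     # B(2;N-2)
    a[1:n - 1] = [x + y for x, y in zip(a[1:n - 1], b[0:n - 2])]   # A(2;N-2)
    return a, b


def original_order_vector(a, b):
    """Vector statements in the original order read B before it is updated."""
    a, b = a[:], b[:]
    n = len(a)
    a[1:n - 1] = [x + y for x, y in zip(a[1:n - 1], b[0:n - 2])]
    b[1:n - 1] = [x + 1.0 for x in b[1:n - 1]]
    return a, b


a0 = [float(i) for i in range(8)]
b0 = [10.0 * i for i in range(8)]
print(serial_loop(a0, b0) == reordered_vector(a0, b0))
print(serial_loop(a0, b0) == original_order_vector(a0, b0))
```

The reordered version reproduces the serial results because every B(I-1) that the A statement reads has already received its increment, exactly as in the serial loop for I greater than 2, while B(1) is never modified in either version.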


      REAL C(L,M), A(M,N), B(L,N) 

      DO 100 I = 1,L 
      DO 100 J = 1,M 
      DO 100 K = 1,N 
  100 C(I,J) = C(I,J) + A(I,K)*B(K,J) 

(a) Standard Matrix Multiply Program. 


      DO 100 I = 1,L 
      DO 100 J = 1,M 
      TA(1;N) = Q8VGATHP(A(I,1;N), M, N; TA(1;N)) 
  100 C(I,J) = C(I,J) + Q8SDOT(TA(1;N), B(1,J;N)) 

(b) Original inner loop (K loop) vectorized. 


      DO 100 I = 1,L 
      DO 100 K = 1,N 
      TC(1;M) = Q8VGATHP(C(I,1;M), L, M; TC(1;M)) 
      TB(1;M) = Q8VGATHP(B(K,1;M), L, M; TB(1;M)) 
      TC(1;M) = TC(1;M) + A(I,K)*TB(1;M) 
  100 C(I,1;M) = Q8VSCATP(TC(1;M), L, M; C(I,1;M)) 

(c) J loop vectorized. 


      DO 100 J = 1,M 
      DO 100 K = 1,N 
  100 C(1,J;L) = C(1,J;L) + A(1,K;L)*B(K,J) 

(d) I loop vectorized. 


Figure 4. Loop Interchanging. 


      REAL A(N),B(N),C(N) 

      DO 100 I = 2,N-1 
      A(I) = A(I) + B(I-1) 
  100 B(I) = B(I) + 1. 

(a) Unvectorizable statement order. 


      B(2;N-2) = B(2;N-2) + 1. 
      A(2;N-2) = A(2;N-2) + B(1;N-2) 

(b) Equivalent vectorizable statement order. 


Figure 5. Statement Reordering. 


      REAL A(N) 

      DO 100 I = 1,N 
  100 A(I) = A(5) 

(a) 


      A(1;N) = A(5) 

(b) 


Figure 6. 


3.3 Producing Efficient Fortran 200 


The hardest task of retargeting the KAP to the Cyber 205 was 
the production of efficient Fortran 200 code. Since the vector con- 
structs of Fortran 200 are so close to the hardware of the machine 
(even going so far as to define a vector DESCRIPTOR which maps 
onto the hardware vector operand descriptor), this code generation was 
very similar to code generation in a compiler. Our job was easier 
because we did not have to produce our own run-time system and did 
not have to compile the whole Fortran language. 


However, translating the vector constructs into Fortran 200 res- 
tricted the operations we could perform. For instance, Fortran 200 
does not allow LOGICAL vectors. In order to vectorize IF statements 
that test LOGICAL array elements, such as the one in Figure 7(a), an 
artificial method is needed in order to point a descriptor at the proper 
LOGICAL array. The KAP/205 generates a call to subroutine 
KQQASS (Figure 7(b)) which returns a descriptor that is used in the 
controlling expression. The code for KQQASS is a simple descriptor 
assignment, with the parameter types declared as INTEGER, since the 
type parameters are ignored by Fortran. 


Another problem was that while most of the Fortran intrinsic 
functions had a vector form, many of them did not allow for condi- 
tional evaluation. To use AMOD under control of a conditional 
expression, for instance, the arguments need to be compressed and the 
result decompressed. 
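The compress, evaluate, decompress sequence can be illustrated in Python. This is a sketch of the general idea only and says nothing about the code the KAP/205 actually emits: the function name `masked_amod` is hypothetical, Python lists stand in for vectors, and `math.fmod` is used as a stand-in for AMOD (both return a remainder with the sign of the first argument).

```python
import math


def masked_amod(result, x, y, mask):
    """Model of  IF (mask(I)) result(I) = AMOD(x(I), y(I))  for a vector
    function that has no conditional form."""
    # Compress: keep only the operand pairs selected by the mask.
    xs = [xi for xi, m in zip(x, mask) if m]
    ys = [yi for yi, m in zip(y, mask) if m]
    # Unconditional vector operation on the shorter, dense operands.
    rs = [math.fmod(xi, yi) for xi, yi in zip(xs, ys)]
    # Decompress: put results back in their original positions,
    # leaving unselected elements of result untouched.
    out = result[:]
    it = iter(rs)
    for i, m in enumerate(mask):
        if m:
            out[i] = next(it)
    return out


x = [7.0, 5.0, -7.0, 1.0]
y = [3.0, 0.0, 3.0, 0.0]
mask = [yi != 0.0 for yi in y]          # guard against a zero divisor
print(masked_amod([0.0] * 4, x, y, mask))
```

Compressing first matters for more than speed: evaluating the function on every element and discarding the unwanted results would apply it to operands, such as the zero divisors here, that the original IF statement was protecting against.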


The only dynamic memory feature in Fortran 200 is dynamic 
allocation of descriptors. For this reason, the KAP/205 sometimes 
needs to generate index arithmetic to modify the pointer and length 
fields of the temporary descriptors. This makes the output of the 
KAP/205 for vectorized loops look very much like machine code. 


      LOGICAL L(N) 
      REAL A(N),B(N) 

      DO 100 I = 1,N 
      IF (L(I)) A(I) = B(I) 
  100 CONTINUE 

(a) Original program. 


      DESCRIPTOR LD 
      INTEGER LD 
      CALL KQQASS( LD, L(1), N ) 
      WHERE (LD .EQ. 1) A(1;N) = B(1;N) 

      SUBROUTINE KQQASS( D, A, N ) 
      DESCRIPTOR D 
      INTEGER D, A(*), N 
      ASSIGN D,A(1;N) 
      END 

(b) Vectorized version. 


Figure 7. Logical Vectors. 


4. Vector Optimization in the KAP/205 


As mentioned in Section 3.2, the vectorizer of the KAP/205 actu- 
ally discovers many different possible methods to execute loop nests. 
The "best" loop ordering is the one that will execute in the shortest 
time in all cases. Clearly this can be data dependent; take for example 
the matrix multiply case shown in Figure 4. To a casual observer, the 
vector code in Figure 4(d) is clearly better than the vector code in 
either of Figures 4(b) or 4(c); 4(d) has no gathers or scatters, and the 
vector operation is a linked-triad. It seems that this could hardly be 
improved upon. This assumes that the loop bounds are relatively close, 
which is certainly true for square matrices. However, cases do arise 
where the vector length of 4(d), L, is much smaller than either of the 
other two loop bounds. We can make a rough estimate of the amount 
of time to execute each vector code segment. For Figure 4(b), one 
gather operation (taking one startup time plus 1.25 clocks per item 
gathered) and one inner product (taking one startup plus N/2 clocks, 
assuming a two-pipe Cyber 205) are required, resulting in two startups 
plus 1.75 × N clocks to execute the vector code. This must be multi- 
plied by the outer loop bounds to get the real time. Note that this is a 
very rough estimate; in fact startup times are not equal for all instruc- 
tions, and the serial loop overhead should be factored in. 


A similar study of Figure 4(c) counts two gather and one scatter 
operation (taking three startups plus 3x1.25 clocks per item, total) and 
one linked-triad (which takes three startups, one for the LINK, one for 
the MPY and another for the ADD, plus M/2 clocks). Figure 4(d) 
requires just the linked-triad code (three startups plus L/2 clocks). 
These numbers are summarized in Figure 8. 


If L, M and N are all the same order of magnitude, then the code 
in Figure 4(d) will be fastest. When L is relatively small, the time for 
4(d) can be dominated by the startup time, and 4(b) can look more 
attractive. If both L and N are small, but M is large, then even 4(c) 
can turn out to produce the best performance. When all the loop 
bounds are small, scalar execution of the original loop may be the best 
method. 


Arnold [11] showed that for the Cyber 205, performance of vec- 
tor Fortran 200 code could be estimated with reasonable accuracy, but 
the performance of scalar code is much less predictable. The presence 
of IF statements in the loop makes the prediction even less precise, 
but the estimate is still adequate for a first approximation. 


Obviously this analysis can only be applied when the loop 
bounds are known. In many cases, the loop bounds are parameters to a 
subroutine or are read in. Also, the probability that each conditional 
branch is taken must be known to determine the scalar speed and to 
choose the best method to vectorize that conditional from among the 
three methods shown above (see Section 2.2). When these values are 
not known, some other process must be used to decide what loop ord- 
ering to generate. A simpler process is to count important characteris- 
tics, such as number of vector statements, number of gathers and 
scatters needed, and so on. 


A third method to approach this problem is to symbolically esti- 
mate the execution time of each loop ordering, as we did in Figure 8. 
IF statements would be inserted to compare the symbolic time esti- 
mates using the actual run-time loop bounds in order to choose which 
loop ordering to execute. This requires generating code for all possible 
loop orderings. 


The actual method used in the KAP/205 is a simple time estima- 
tor based on Arnold's results. The single best loop ordering is chosen 
based on the time estimates, and the code is generated from that loop 
ordering. 


Code    Startups      Clocks 
4(b)    2 × L × M     1.75 × L × M × N 
4(c)    6 × L × N     4.25 × L × M × N 
4(d)    3 × M × N     0.50 × L × M × N 


Figure 8. 
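The estimates in Figure 8 can be turned into a small selection routine. The Python sketch below is an illustration under stated assumptions, not the KAP/205 estimator: it uses one startup cost for every vector instruction (the text notes that real startup times differ), it ignores serial loop overhead and the scalar alternative, and the default of 50 clocks per startup is a hypothetical placeholder rather than a measured Cyber 205 figure. The names `estimates` and `best_ordering` are also hypothetical.

```python
def estimates(L, M, N, startup):
    """Rough clock counts for the three orderings, following Figure 8.
    `startup` is the assumed cost in clocks of one vector startup."""
    volume = L * M * N
    return {
        "4(b)": 2 * L * M * startup + 1.75 * volume,
        "4(c)": 6 * L * N * startup + 4.25 * volume,
        "4(d)": 3 * M * N * startup + 0.50 * volume,
    }


def best_ordering(L, M, N, startup=50):
    """Name of the ordering with the smallest estimated time."""
    times = estimates(L, M, N, startup)
    return min(times, key=times.get)


print(best_ordering(100, 100, 100))   # bounds of similar size
print(best_ordering(2, 1000, 1000))   # L small
print(best_ordering(2, 10000, 2))     # L and N small, M large
```

With these assumptions the three calls select 4(d), 4(b) and 4(c) respectively, which reproduces the qualitative behavior described in the text. The same comparison could be evaluated at run time with the actual loop bounds, at the cost of generating code for every ordering.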


4.1 User Interface 


The drawbacks of all of these methods are the time required to 
perform the transformations and estimate execution times, and the need 
to tell the user what happened to the program. A simple vectorizer 
that looks only at innermost loops is much less powerful than the 
KAP, but it is at least easy to understand: if a loop did not vectorize, 
it is easy to look at the loop and see why. The KAP/205 has the 
additional problem of telling the user that any of three loops could 
have been vectorized, but that the DO I loop (say) was vectorized 
because it was estimated to be best; or, in some cases, that a loop was 
vectorizable but was left scalar because it will be faster that way. 
This occasionally produces results that the user does not expect. The 
user interface of the vector optimizer will have to be tuned as the 
KAP/205 receives more use. 


5. Speedup Results Using the KAP/205 


In this section, we will look at several speedup results using the 
KAP/205. 


Figure 9(a) shows execution time for three different versions of a 
matrix multiply program for a range of matrix sizes. The line marked 
"FTN200" shows the execution time for the program shown in Figure 
4(a) without the use of the KAP/205. The Fortran 200 compiler does 
not vectorize the inner loop, giving only scalar performance. The line 
marked "simple vectorizer" shows the execution time for the program 
in Figure 4(b); this is the code that a simple vectorizer which only 
looks at innermost loops might produce, and it is much better than 
scalar execution. The line marked "KAP/205" shows the execution 
time for the program in Figure 4(d), which is what the KAP/205 pro- 
duces for this program; this shows how important loop interchanging 
and vector optimization are for vector machines. Figure 9(b) shows the 
megaflops obtained by each of these versions of the program. Figure 
9(c) shows the speedup obtained by the KAP/205 over the Fortran 200 
compiler and over the hypothetical simpler vectorizer. 


Figure 10(a) shows execution times obtained by executing an 
EISPACK benchmark program with and without using the KAP/205, 
for a range of problem sizes. Figure 10(b) shows the speedup obtained 
by the KAP/205 over the Fortran 200 compiler alone. No hand code 
modification to the EISPACK subroutines was necessary to obtain these 
results. 


Not all programs get the speedup exhibited by these two exam- 
ples. However, over a wide range of programs, the KAP/205 obtains 
an average speedup of 1.7-2.0 over the Fortran 200 compiler alone. 
Some programs will not speed up at all, while others will get reason- 
able performance and still others will get fantastic speedup. 


Abstract. The KAP/S-1 is a Fortran source-to-source vectorizer 
which translates serial Fortran DO loops into explicit vector syntax 
optimized for execution on the S-1 uniprocessor. This paper explains 
the optimizations of the KAP/S-1 which are peculiar to the S-1 archi- 
tecture, including transformations to improve performance of the data 
cache. 


1. Introduction 


The S-1 Mark IIa supercomputer [1] is a shared memory mul- 
tiprocessor. Each processor has a memory-to-memory vector instruc- 
tion set. In order to relieve the load on the shared memory, each pro- 
cessor has a private cache memory. This computer is designed to exe- 
cute existing and future programs used at the Lawrence Livermore 
National Laboratory. Most of these programs are now written in 
LRLTRAN, an extended Fortran language [2]. 


The KAP/S-1 is a Fortran precompiler which optimizes programs 
for the S-1 uniprocessor. The KAP/S-1 accepts Fortran-77 [3] with 
selected LRLTRAN extensions, and produces a modified program using 
vector syntax for those statements for which the S-1 compiler should 
generate vector instructions. Several optimizations are also done to 
improve the performance of the cache memory on the S-1, both for 
vectorizable and non-vectorizable loops. 


The next section of this paper describes the S-1 supercomputer in 
more detail, concentrating on the features that affect vectorization and 
optimization in the KAP/S-1. Following that we describe the KAP/S-1 
and how it optimizes programs. The fourth section focuses on the 
cache memory of the S-1 and how the KAP/S-1 attempts to improve 
the cache hit ratio. 


2. The S-1 Mark IIa Supercomputer 


The S-1 Mark IIa computer includes up to sixteen processors, 
which communicate through shared memory. Each processor has a 
private data cache memory and a vector instruction set. In this paper 
we deal only with optimization for a single processor. 


The cache memory is 16K 36-bit words organized into 4-way 
set-associative, 16-word cache lines. Because cache loads are done 16 
words at a time, random memory reference patterns will load down the 
cache-to-main memory link much more on the S-1 than on other 
machines. This is because each cache miss will result in 16 words 
being loaded from the main memory, even if only one of these words 
will be used. However, if memory references are stride-one (meaning 
contiguous memory addresses are referenced), then the cache memory 
will work well, since the first cache miss will load the missing word 
and 15 more words that will be referenced soon. This fits in well with 
the vector instruction set. 


The vector instruction set of the S-1 is relatively limited. Vector 
adds, multiplies, and other simple operations, as well as MAX, MIN, 
sum and product reductions are all available. The operands of all of 
these operations must be stride-one, however. The S-1 has a matrix 
TRANSPOSE instruction which can be used to transpose arrays which 
are referenced in the wrong order, or to gather a column from an array 
into contiguous locations. None of the vector instructions include any 
method to handle conditionals. 


3. The KAP/S-1 Vectorizer 


The KAP/S-1 vectorizes and optimizes programs for execution on 
the S-1 uniprocessor. The KAP/S-1 accepts Fortran-77 programs aug- 
mented by selected LRLTRAN extensions and produces an equivalent 
program with some or all of the DO loops replaced by vector code. 
Vectorized DO loops are represented with vector syntax similar to the 
syntax proposed by the ANSI X3J3 committee, known as Fortran 8x 
[4]. The S-1 Fortran compiler accepts this syntax and generates vector 
code for the S-1. The KAP/S-1 vectorizer is related to the KAP/205 
and KAP/ST-100 [5,6], and uses many of the same vectorization 
enhancement techniques. 


The vectorization was significantly tuned for the S-1. Because 
all vector operations on the S-1 require stride-one operands, stride-one 
array references are preferred. Non-stride-one vector operations are 
handled by the S-1 compiler, which inserts TRANSPOSE instructions 
to gather and scatter non-stride-one operands. DO loops can often be 
interchanged to obtain stride-one array references. For example, the 
DO J loop in Figure 1(a) could be vectorized, but the resulting vector 
operations would not have stride-one operands. However, by inter- 
changing the loops before vectorization, the KAP/S-1 will generate the 
code in Figure 1(b), which has all stride-one array references. 


      REAL A(100,100),B(100,100),C(100,100) 

      DO 100 I = 1,N 
      DO 100 J = 1,N 
  100 A(I,J) = A(I,J) + B(I,J) 

(a) Non-stride-one. 


      REAL A(100,100),B(100,100),C(100,100) 

      DO 100 J = 1,N 
  100 A(1:N,J) = A(1:N,J) + B(1:N,J) 

(b) Interchanged for stride-one. 


Figure 1. 


Vectorization of the stride-one loop is not always possible; the 
stride-one loop in Figure 2(a), the DO I loop, cannot be vectorized. In 
order to obtain vector code, the arrays must be left non-stride-one, as 
in Figure 2(b). The KAP/S-1 tries different loop orderings and chooses 
the ‘“‘best’’ one, according to its criteria. These criteria include prefer- 
ring vector code over serial code, and stride-one array operands over 
non-stride-one array operands. The KAP/S-1 also optimizes the perfor- 
mance of the cache memory, as discussed in the next section. 

Interestingly, much of the initial work of customizing the KAP 
for the S-1 involved restricting the scope of vectorization to that which 
the S-1 hardware and the S-1 Fortran compiler would handle. For 
example, the KAP/S-1 precludes vector IFs, since the S-1 has no vector 
IF hardware. Also, the number of reduction and recurrence operations 
that the KAP/S-1 would recognize was limited to those for which the 
S-1 had instructions. Since the vector extensions used only the triplet 
(colon) notation for vector assignments, diagonal references (such as 
A(I,I)) could not be vectorized. 


      REAL A(100,100),B(100,100),C(100,100) 

      DO 100 J = 1,N 
      DO 100 I = 1,N 
  100 A(I+1,J) = A(I,J) + B(I,J) 

(a) Non-stride-one. 


      REAL A(100,100),B(100,100),C(100,100) 

      DO 100 I = 1,N 
  100 A(I+1,1:N) = A(I,1:N) + B(I,1:N) 

(b) Vectorized non-stride-one. 


Figure 2. 


4. Optimizing the Cache Memory 


Good performance of the cache memory is critical for good per- 
formance of the S-1 supercomputer. As was already mentioned, the 
KAP/S-1 interchanges loops to get stride-one array operands for vector 
code. For scalar code, the KAP/S-1 interchanges loops to get stride-one 
array references to improve cache performance. Two other transforma- 
tions were implemented to improve the cache memory performance. 
These were adapted from work by Abu-Sufah and others [7,8,9] which 
applies program transformations to improve virtual memory perfor- 
mance. The application of these transformations to cache memories is 
clear. 


4.1 Name Partitioning 


The first transformation is name partitioning. Name partitioning 
splits a DO loop into several loops, each of which refers to a non- 
intersecting subset of the arrays in the loop. This increases locality of 
reference by reducing the number of arrays to which each DO loop 
refers. In Figure 3(a), the statements in the DO loop can be split into 
two distinct groups: the first and third assignments refer to the set of 
arrays {A,B}, and the second and fourth assignments refer to the set of 
arrays {C,D}. Figure 3(b) shows how name partitioning would distri- 
bute the DO loop into two loops, each of which refers to a subset of 
the arrays. The original loop refers to four different arrays during each 
iteration. Each of the distributed loops refers to only two different 
arrays, thus improving locality of data reference. 


      DO 100 I = 1,N 
      A(I) = A(I) + B(I) + T 
      C(I) = C(I) + D(I) + T 
      B(I) = B(I) / 2. 
  100 D(I) = D(I) / 2. 

(a) Original loop. 


(b) Name partitioned. [Listing not recoverable from the source.] 


      A(1:N) = A(1:N) + B(1:N) + T 
      C(1:N) = C(1:N) + D(1:N) + T 
      B(1:N) = B(1:N) / 2. 
      D(1:N) = D(1:N) / 2. 

(c) Original loop vectorized. 


      A(1:N) = A(1:N) + B(1:N) + T 
      B(1:N) = B(1:N) / 2. 
      C(1:N) = C(1:N) + D(1:N) + T 
      D(1:N) = D(1:N) / 2. 

(d) Name partitioned and vectorized. 


Figure 3. 
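The grouping step of name partitioning amounts to finding the connected components of statements that share array names. The Python sketch below illustrates that step only, using a union-find structure. It is an assumption-laden illustration rather than the KAP/S-1 algorithm: the function name `name_partition` is hypothetical, every statement is assumed to reference at least one array, scalars such as T are ignored, and no check is made that the data dependences allow the loop to be distributed.

```python
def name_partition(statements):
    """statements: list of (label, set_of_array_names) in program order.
    Returns groups of labels; statements in different groups share no array."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Arrays named in the same statement must end up in the same loop.
    for _, arrays in statements:
        names = sorted(arrays)
        for other in names[1:]:
            union(names[0], other)

    # Collect statements by the representative of their array set.
    groups = {}
    for label, arrays in statements:
        root = find(sorted(arrays)[0])
        groups.setdefault(root, []).append(label)
    return list(groups.values())


# The four assignments of Figure 3(a), in order.
fig3 = [("S1", {"A", "B"}), ("S2", {"C", "D"}),
        ("S3", {"B"}), ("S4", {"D"})]
print(name_partition(fig3))
```

For the loop of Figure 3(a) this yields the two groups named in the text: the first and third assignments, and the second and fourth.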


If the original DO loop were vectorized, as in Figure 3(c), the 
first vector statement will load arrays A and B into the cache memory, 
while the second vector statement will load arrays C and D. If the 
vector length, N, is very large, the cache will fill up when C and D are 
loaded, thus making A and B non-resident. This means that B will 
have to be reloaded into the cache when the third vector statement is 
executed. When the distributed loops are vectorized, as shown in Fig- 
ure 3(d), the second vector statement refers to the array B, which will 
already be cache-resident. 


4.2 Strip Mining 


Strip mining [10] is a more important transformation for the S-1. 
Strip mining is usually thought to be important for vector machines 
with a small maximum vector length, such as vector register machines 
(like the Cray [11]). Proper strip mining, however, can also improve 
virtual memory (and cache memory) performance. 


The DO loop in Figure 4(a) may be easily vectorized, producing 
the code in Figure 4(b). However, if N is very large, greater than 
8,000, then the first vector statement will overflow the cache memory, 
and will have to start reusing cache memory positions for both the A 
and B arrays. When the first statement is done, the cache memory 
contains A(N-8000:N) and B(N-8000:N). When the second statement 
is started, the first elements accessed are A(1) and B(1); however, these 
locations were flushed back to the main memory by the first statement, 
and thus need to be reloaded. To make matters worse, during the exe- 
cution of the second statement, each new element of A and B that is 
loaded will replace some later element of A and B in the cache, which 
will then need to be reloaded later in the vector statement, until all of 
the arrays A and B have been from main memory to cache and back 
twice. For loops which use more arrays, this effect is seen at shorter 
vector lengths. 


The solution is to strip mine the DO loop before vectorizing it, 
as shown in Figure 4(c). The DO IBLOCK loop steps between strips 
of the inner loop, and is left serial by the vectorization process. The 
inner loop can be vectorized, as shown in Figure 4(d). The BLOCK- 
SIZE is chosen so that the cache memory never overflows; a rough 
estimate is the size of the cache divided by the number of distinct 
arrays referenced in the loop (assuming stride-one array references); for 
this loop, BLOCKSIZE=8000 would probably be appropriate. The 
memory reference patterns of the inner loop are very different from the 
original loop. Since the vector length is never so large as to overflow 
the cache memory, the second statement will always find its operands 
in the cache memory. When the second strip starts, it overwrites the 
operands from the first strip in the cache memory; since they are not 
needed anymore (for this loop), they can be flushed with no cost later 
in the execution. The total amount of traffic between the cache 
memory and the main memory is reduced by a factor of two for long 
vectors, with a small overhead cost associated with executing the strip 
mining code. Notice that this type of strip mining only affects memory 
traffic when there is more than one reference to some array, since the 
memory traffic that is reduced is the second and subsequent cache 
loads. If there is only one reference to each array in the loop, then this 
type of strip mining will not affect the memory traffic at all. 


      DO 100 I = 1,N 
      A(I) = A(I) + B(I) + T 
  100 B(I) = B(I) / A(I) 

(a) Original loop. 


      A(1:N) = A(1:N) + B(1:N) + T 
      B(1:N) = B(1:N) / A(1:N) 

(b) Original loop vectorized. 


      DO 100 IBLOCK = 1,N,BLOCKSIZE 
      ISTART = IBLOCK 
      IEND = MIN(N, IBLOCK + BLOCKSIZE - 1) 
      DO 100 I = ISTART, IEND 
      A(I) = A(I) + B(I) + T 
  100 B(I) = B(I) / A(I) 

(c) Strip mined. 


      DO 100 IBLOCK = 1,N,BLOCKSIZE 
      ISTART = IBLOCK 
      IEND = MIN(N, IBLOCK + BLOCKSIZE - 1) 
      A(ISTART:IEND) = A(ISTART:IEND) + B(ISTART:IEND) + T 
  100 B(ISTART:IEND) = B(ISTART:IEND) / A(ISTART:IEND) 

(d) Strip mined and vectorized. 


Figure 4. 
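The factor-of-two reduction can be reproduced with a toy cache model. The Python sketch below is a simplification, not a model of the real machine: it uses a fully associative LRU cache instead of the S-1's 4-way set-associative organization, counts only line loads (not write-backs), and uses a scaled-down cache of 64 lines so that it runs quickly. The class `LRUCache` and function `run` are hypothetical names; only the 16-word line size comes from the text.

```python
from collections import OrderedDict

LINE = 16  # words per cache line, as on the S-1


class LRUCache:
    """Fully associative LRU cache of whole lines; counts line loads."""

    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.lines = OrderedDict()
        self.loads = 0

    def touch(self, address):
        line = address // LINE
        if line in self.lines:
            self.lines.move_to_end(line)          # mark as most recently used
        else:
            self.loads += 1
            self.lines[line] = True
            if len(self.lines) > self.n_lines:
                self.lines.popitem(last=False)    # evict least recently used


def run(n, block, cache_lines):
    """Line loads for the two vector statements, strip mined with the given
    block size. block == n corresponds to the unblocked vector code."""
    cache = LRUCache(cache_lines)
    base_a, base_b = 0, n                  # A and B laid out back to back
    for start in range(0, n, block):
        end = min(n, start + block)
        for i in range(start, end):        # A = A + B + T
            cache.touch(base_a + i)
            cache.touch(base_b + i)
        for i in range(start, end):        # B = B / A
            cache.touch(base_b + i)
            cache.touch(base_a + i)
    return cache.loads


n = 4096                                   # 256 lines per array
print(run(n, n, 64))                       # unblocked: every line loaded twice
print(run(n, 512, 64))                     # strips of 32 + 32 lines fit the cache
```

With 4096-word arrays and a 64-line cache, the unblocked version loads 1024 lines and the strip-mined version 512, the factor of two claimed for long vectors. Note the use of `min` for the strip end, so that the last strip stops at N.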


More dramatic performance improvements can appear in DO 
loop nests. Consider the doubly-nested DO loop in Figure 5(a). The 
DO J loop can be vectorized, resulting in the code in Figure 5(b). This 
code can have problems similar to those which occurred in the loop in 
Figure 4. If the loop bound, N, is very large, then the first iteration of 
the serial DO I loop will overflow the cache memory with elements of 
A and B. On the second iteration of the DO I loop, the initial part of 
the A array will have to be reloaded from the main memory. In 
essence, the A array will have to be loaded from the main memory N 
times, once for each trip around the serial loop. Simple strip mining, 
shown in Figure 5(c), will not alleviate this problem, since the refer- 
ence pattern for the A array will not change. However, once the DO J 
loop is strip mined, the outer DO JBLOCK loop can be independently 
interchanged with the DO I loop, giving the code in Figure 5(d). Vec- 
torizing the inner loop, as shown in Figure 5(e), will produce a 
different memory reference pattern. In Figure 5(e), BLOCKSIZE is 
chosen so that each vector statement will not overflow the cache. That 
way, a strip of the A array can remain cache resident for the entire 
execution of the serial DO I loop. Each strip of A will be used N 
times, and then can be flushed back to the main memory since it is not 
used again in this loop. The memory traffic due to the A array is thus 
reduced by a factor of N. Loop interchanging of strip mined loops as 
shown here has clear applications for vector register machines also 
[12]. 


      DO 100 I = 1,N 
      DO 100 J = 1,N 
  100 A(J) = A(J) + B(J,I) 

(a) Original loop. 


      DO 100 I = 1,N 
  100 A(1:N) = A(1:N) + B(1:N,I) 

(b) Original loop vectorized. 


      DO 100 I = 1,N 
      DO 100 JBLOCK = 1,N,BLOCKSIZE 
      JSTART = JBLOCK 
      JEND = MIN(N, JBLOCK + BLOCKSIZE - 1) 
      DO 100 J = JSTART, JEND 
  100 A(J) = A(J) + B(J,I) 

(c) Strip mined. 


      DO 100 JBLOCK = 1,N,BLOCKSIZE 
      JSTART = JBLOCK 
      JEND = MIN(N, JBLOCK + BLOCKSIZE - 1) 
      DO 100 I = 1,N 
      DO 100 J = JSTART, JEND 
  100 A(J) = A(J) + B(J,I) 

(d) Strip mined and interchanged. 


      DO 100 JBLOCK = 1,N,BLOCKSIZE 
      JSTART = JBLOCK 
      JEND = MIN(N, JBLOCK + BLOCKSIZE - 1) 
      DO 100 I = 1,N 
  100 A(JSTART:JEND) = A(JSTART:JEND) + B(JSTART:JEND,I) 

(e) Strip mined, interchanged, and vectorized. 


Figure 5. 
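The reduction in traffic for the A array in the nested loop can be checked with the same kind of toy model. The Python sketch below is a simplified illustration: a fully associative LRU cache stands in for the real 4-way set-associative cache, the sizes are scaled down (N = 256 words, an 8-line cache), only loads of lines belonging to A are counted, and the function name `a_line_loads` is hypothetical.

```python
from collections import OrderedDict

LINE = 16  # words per cache line, as on the S-1


def a_line_loads(n, block, cache_lines):
    """Cache-line loads of A for the strip mined, interchanged, vectorized
    loop nest. block == n corresponds to the plain vectorized nest."""
    cache = OrderedDict()
    loads_a = 0
    base_a, base_b = 0, n                  # A(N), then column-major B(N,N)
    for jstart in range(0, n, block):      # serial DO JBLOCK loop
        jend = min(n, jstart + block)
        for i in range(n):                 # serial DO I loop
            for j in range(jstart, jend):  # one vector statement
                for is_a, addr in ((True, base_a + j),
                                   (False, base_b + i * n + j)):
                    line = addr // LINE
                    if line in cache:
                        cache.move_to_end(line)     # most recently used
                        continue
                    if is_a:
                        loads_a += 1
                    cache[line] = True
                    if len(cache) > cache_lines:
                        cache.popitem(last=False)   # evict least recently used
    return loads_a


n = 256                                    # A occupies 16 cache lines
print(a_line_loads(n, n, 8))               # unblocked: A reloaded on every I
print(a_line_loads(n, 64, 8))              # blocked: each line of A loaded once
```

In this model the unblocked nest loads lines of A 4096 times and the blocked nest 16 times, a ratio of 256, which equals N as the text predicts. The traffic for B is unchanged, since each element of B is referenced only once in either version.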


4.3 Vectorization and Cache Performance 


One unfortunate side effect of vectorization is a decrease in 
locality of reference. This happens because a loop which formerly 
executed several statements for a particular index value will, after vec- 
torization, execute each statement for all the index values before going 
to the next. Thus, if more than one statement in a loop referenced the 
same array element, as in Figure 3, in the scalar version that array ele- 
ment would remain in the cache through all of the references. After 
vectorization, the cache will be filled with all of the operands of the 
first statement before executing the second statement. If the vector 
length is long enough that all the operands do not fit in the cache, then 
cache misses will result when the second statement is executed. This 
effect is countered by proper use of name partitioning and strip mining, 
as shown above. In nested loops such as those in Figure 5, vectoriza- 
tion and cache transformations will result in fewer cache misses than 
the scalar code, rather than more. 


5. Conclusion 


In this paper we have shown an example of vectorization for the 
S-1, a vector supercomputer with a cache. By using loop interchang- 
ing, name partitioning, and strip mining, the KAP/S-1 improves the 
cache hit ratio by a significant factor both for loops that are vectorized 
and for those which remain scalar. 


Doacross: Beyond Vectorization for Multiprocessors (Extended Abstract) 


Ron Cytron 


IBM T. J. Watson Research Center 
Yorktown Heights, New York 
(work began while the author was at 
the University of Illinois at Urbana-Champaign) 


Although vector constructs could be executed with speedup on a multiprocessor, the resources of a 
multiprocessor are wasted unless there are constructs for which it outperforms a vector processor. This 
paper describes a method of executing iterative loops on a multiprocessor. Loops are classified by their 
parallelism and a class of loops is exposed that must execute sequentially on vector machines yet can execute 
with speedup on a multiprocessor. 


1.0 Motivation and Background 


Program performance can be improved through the execution 
of vector operations, where a single operation is applied 
concurrently to multiple operands. With the advent of vector 
processors came the need for vectorization: the compile-time 
identification of statements within loops that yield vector 
operations. However, programs do not consist solely of vector 
operations. Some form of multiprocessing is required to achieve 
concurrency for constructs of greater complexity than vector 
loops. However, current methods for programming 
multiprocessors either rely heavily on user assistance, or involve 
automatic techniques that are confined to identifying constructs 
that are even simpler than vector loops. This paper presents a 
compile-time technique for determining the parallelism of an 
arbitrary iterative loop, based on the dependences among 
statements within that loop. In particular, loops that must 
execute sequentially on a vector processor can execute concur- 
rently on a multiprocessor. 


1.1 Architectures 


Vector processors and multiprocessors both achieve concur- 
rency through multiple processing elements, but the capability 
and cost of each can be characterized by the following stylized 
architectures: An SEA machine (Single Execution of Array 
instructions) drives its processing elements in a parallel or 
pipelined manner from a control unit that processes a single 
stream of instructions. In a multiprocessor, each processing 
element is driven by a stream of instructions issued from its own 
control unit. One would expect that the multiprocessor, albeit 
more costly, would provide increased flexibility and performance 
over the SEA machine. 


1.2 Dependence 


A program can be regarded as a collection of operations that are 
constrained in their execution by dependences: certain 
operations must complete before other operations can 
commence. Let a dependence graph be constructed for a 
program, such that directed edges (arcs) connect dependent 
operations.¹ A dependence exists between two operations, either 
because they must share or not share some data (data 
dependence), or because one operation determines whether the other 
operation is performed (control dependence). 


Consider two statements from a program: $S_i$ and $S_j$. If $S_i$ must 
store any of its output before $S_j$ can fetch all of its input, then a 
flow dependence exists between statements $S_i$ and $S_j$, denoted 
$S_i \, \delta \, S_j$. An anti-dependence exists from $S_i$ to $S_j$, denoted 
$S_i \, \bar{\delta} \, S_j$, if $S_i$ requires data from an area in which $S_j$ 
stores its output. Finally, an output dependence exists from $S_i$ to 
$S_j$, denoted $S_i \, \delta^{o} \, S_j$, if statements $S_i$ and $S_j$ store into 
common storage locations. Whichever statement executes last in the 
sequential program must store its output last in the concurrent program. 
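
The three data dependences defined above can be illustrated with a small sketch. It is not taken from the paper: the `Stmt` record, its `reads`/`writes` sets, and the function name `classify` are assumptions introduced here, and each statement is reduced to the set of storage locations it reads and writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stmt:
    """Hypothetical statement summary: the locations it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def classify(first: Stmt, second: Stmt) -> set:
    """Data dependences from `first` to `second`.

    `first` is assumed to execute earlier in the sequential program.
    """
    kinds = set()
    if first.writes & second.reads:
        kinds.add("flow")      # second fetches what first stored
    if first.reads & second.writes:
        kinds.add("anti")      # second overwrites what first still needs
    if first.writes & second.writes:
        kinds.add("output")    # both store into a common location
    return kinds


# S1: a = b + 1    S2: b = a * 2
s1 = Stmt("S1", reads=frozenset({"b"}), writes=frozenset({"a"}))
s2 = Stmt("S2", reads=frozenset({"a"}), writes=frozenset({"b"}))
print(sorted(classify(s1, s2)))
```

In this example S2 reads the value of `a` stored by S1 (a flow dependence) and overwrites `b`, which S1 reads (an anti-dependence); the two statements store into different locations, so there is no output dependence.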


If statement $S_i$ determines whether statement $S_j$ is executed, then 
a control dependence exists between statement $S_i$ and statement 
$S_j$, written $S_i \, \delta^{c} \, S_j$. 


1.3. Concurrent Execution of Loops 


This paper describes the effects of concurrently executing 
iterative loops. An iterative loop contains a sequence of 
statements (the loop body) that is performed for a sequence of 
iteration variable values. The iteration space for an uninterrupted 
normalized loop with upper bound $N$ is the sequence $1, 2, \ldots, N$.² 


A compiler for an SEA architecture must search a program for 
vector loops: loops containing operations that can be executed 
simultaneously for all iterations. Vectorizing compilers consider 
the dependence issues mentioned above to determine the vector 
loops of a program [1, 2, 9, 15]. Statements that participate in a 
dependence cycle cannot be executed as vector loops. 


Consider the execution of a vector loop on a multiprocessor, 
where the iterations of the loop are executed on potentially 
different processors. The absence of a dependence cycle 
guarantees that the statements of the loop can be ordered, such 
that all dependences point lexically forward. All dependences 
then have the form $S_i \, \delta \, S_j$, where $i < j$. If processors are driven 
synchronously, then a compiler can schedule instructions such 
that dependences are honored without explicit synchronization. 
Synchronization is necessary if processors operate 
asynchronously; however, by the time statement $S_j$ executes, statement 
$S_i$ should have finished. The processor executing statement $S_j$ 
would probably not have to wait for statement $S_i$ to complete. 
For more details on synchronization, see [11]. 


¹ In practice, the nodes of a dependence graph are often statements, which results in smaller graphs of coarser granularity. 
² Any loop can be transformed into a normalized loop through do-loop normalization [14]. 


Loops with cycles of dependence must be executed sequentially 
on an SEA architecture. However, as shown in this paper, such 
loops can exhibit parallelism when executed on a multiprocessor. 


2.0 The Doacross Technique 


This paper presents a technique that models the execution of 
sequential loops, vector loops, and loops of intermediate 
parallelism. The analysis is first presented for a single loop and 
is subsequently extended for arbitrarily-nested loops. This work 
is then compared with other methods for executing loops on 
multiprocessors. 


2.1 The Doacross Schedule 


Consider a single loop $L$ of $s$ statements ($S_1, S_2, \ldots, S_s$) and $N$ 
iterations. Let virtual processors³ be assigned to loop $L$ such 
that virtual processor $VP_i$ executes iteration $i$. If loop $L$ is a 
vector loop, all $N$ iterations can execute concurrently. Consider 
the execution of statement $S_i$ of the vector loop. An instance 
of that statement exists in each virtual processor assigned to the 
loop. There is effectively no delay between the execution of 
consecutive instances of $S_i$, because all virtual processors can 
start simultaneously. This style of execution corresponds to the 
total partition algorithm [8] that produces the doall schedule 
[5]. 


Consider the execution of loop $L$ as a sequential loop. With 
virtual processors assigned as above, the sequential schedule 
requires that processor $VP_i$ waits until processor $VP_{i-1}$ finishes its 
iteration. 


Vector and sequential loops can therefore be modeled by a 
doacross schedule with delay $d$: each iteration is assigned to a 
virtual processor, but each virtual processor delays executing its 
loop body for the time period $d$, as shown in Figure 1. 


      DO i=1,N 
         delay (d*(i-1)) 
         S_1 
         S_2 
          : 
         S_s 
      ENDDO 
      CONTINUE 

Figure 1. A Doacross Loop 


In general, the delay $d$ between consecutive iterations (virtual 
processors) can range from no delay (the vector loop case) to the 
time of the loop body (the sequential loop case). If $T(S_i, S_j)$ is 
the time for executing statements $S_i$ through $S_j$, inclusively, 
within an iteration of the loop ($i < j$), then a doacross loop has 
$d = 0$ for a vector loop and $d = T(S_1, S_s)$ for a sequential loop. 
These loops, and loops of intermediate parallelism, can be 
characterized by: 


$$\text{parallelism} = \frac{T(S_1, S_s) - d}{T(S_1, S_s)} \times 100 \text{ percent}$$


Sequential loops have 0% parallelism and vector loops have 
100% parallelism. Loops of intermediate parallelism are of the 
greatest interest in this paper, because they achieve no speedup 
on an SEA architecture. 


In a doacross loop of $N$ iterations, the last iteration will 
accumulate $(N - 1)d$ delay time before executing the loop body. 
When the loop body of the last iteration has been executed, the 
execution of the loop is complete: 


$$T(\text{loop}) = (N - 1)d + T(S_1, S_s) \qquad [1]$$


2.2 Generalization of the Doacross Model 


In this section, the doacross model is extended to accommodate 
multiple loops. Figure 2 shows the syntactic model of 
multidimensional doacross for $L$ nested loops; loop 1 is the outermost 
loop and loop $L$ is the innermost loop. 


      B_0 
      DO_1 i_1=1,N_1 
         delay ((i_1-1) * d_1) 
         B_1 
         DO_2 i_2=1,N_2 
            delay ((i_2-1) * d_2) 
            B_2 
             : 
            DO_L i_L=1,N_L 
               delay ((i_L-1) * d_L) 
               B_L 
            ENDDO_L 
            A_{L-1} 
             : 
            A_2 
         ENDDO_2 
         A_1 
      ENDDO_1 
      A_0 

Figure 2. Multidimensional Doacross Model 

³ Virtual processors conveniently model an infinite processor system. Limited processor results appear in [4]. 

The execution of a doacross loop at nest level $l$ consists of an 
$N_l$-way fork, where the branch corresponding to the iteration 
variable $i_l$ delays $(i_l - 1)d_l$ clocks before beginning the execution 
of block $B_l$. At the end of a doacross loop, the branches join 
together. For example, a doubly-nested loop structure with no 
delay at either nest causes all iterations of both nests to execute 
in parallel. 


Consider the timing of a doacross loop at nest level $l$. Iteration 
$N_l$ delays $(N_l - 1)d_l$ clocks, and then executes block $B_l$, $DO_{l+1}$, 
and block $A_l$. The time to execute the statements of Figure 2 is 
therefore: 


$$T(B_0) + \sum_{l=1}^{L} \bigl( (N_l - 1)d_l + T(B_l) + T(A_l) \bigr) + T(A_0) \qquad [2]$$


where $T(B_l)$ and $T(A_l)$ represent the time to execute a particular 
block of code from Figure 2. 
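
The timing model of Formulas 1 and 2 can be evaluated with a short sketch. The function names and the simplification of lumping all block times into a single `body_time` argument are assumptions made here for illustration; they do not appear in the paper.

```python
def doacross_time(bounds, delays, body_time):
    """Execution time of a doacross loop nest (Formula 2).

    bounds[l] and delays[l] are the upper bound and delay of loop l,
    outermost first. All block times T(B_l), T(A_l) are lumped into
    body_time, which is an assumption of this sketch.
    """
    if len(bounds) != len(delays):
        raise ValueError("one delay is required per loop")
    return sum((n - 1) * d for n, d in zip(bounds, delays)) + body_time


def parallelism_percent(body_time, delay):
    """Parallelism of a single doacross loop, in percent."""
    return (body_time - delay) / body_time * 100.0


# A single loop of 10 iterations whose body takes 7 time units.
print(doacross_time([10], [0], 7))   # vector loop: d = 0
print(doacross_time([10], [4], 7))   # intermediate parallelism
print(doacross_time([10], [7], 7))   # sequential loop: d = body time
print(parallelism_percent(7, 4))
```

With a single loop the function reduces to Formula 1; the sequential case returns $N$ times the body time, as expected.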


2.3 Dependence Considerations 


This section considers the effects of dependences on delay. 
Although all dependences in a loop can contribute to a loop’s 
delay, each dependence can be considered separately,⁴ with the 
delay for a loop chosen as the minimum delay that satisfies all 
dependences. 


2.3.1 Data Dependence 


Consider a loop of $s$ statements as shown in Figure 3. Consider 
two statements, $S_i$ and $S_j$, where $S_j$ lexically precedes $S_i$ ($j < i$). 
An arc in the dependence graph of Figure 3 from $S_j$ to $S_i$ means 
that during the sequential execution of the loop, statement $S_j$ 
computes data used by $S_i$. Let $I_j$ represent the iteration in which 
$S_j$ computes the data, and let $I_i$ represent the iteration in which 
$S_i$ uses the data. The semantics of sequential program execution 
guarantee that $I_j \le I_i$; the iteration in which the data is used 
cannot occur earlier than the iteration in which the data is 
computed. 


DO I=1,N 
delay ((I-1) * ??) 
Sy 


Figure 3. A Loop with Unknown Delay 


Consider a virtual processor assignment of the iterations of 
Figure 3 with no delay, as shown in Figure 4. Each column 
represents the statements that are executed sequentially within 
a virtual processor. If processors execute at approximately the 
same rate, virtual processor $VP_{I_i}$ will not reach statement $S_i$ of 
iteration $I_i$ (a column of Figure 4) before statement $S_j$ in virtual 
processor $VP_{I_j}$ has finished. The dependence, as shown in the 
schedule of Figure 4, is satisfied by the time it is needed, even 
in the absence of any delay. Because this dependence points 
forward in the program source, it is called a lexically-forward 
dependence; such dependences do not contribute to a loop's 
delay. As mentioned in Section 1.3, some architectures may 
require synchronization to honor such dependences. 


Figure 4. Loop with Lexically-Forward Dependences 


Suppose the dependence were reversed, such that $S_i \, \delta \, S_j$. This 
type of dependence points backward in the program source, so 
it is called a lexically-backward dependence (abbreviated LBD). 
A lexically-backward dependence can be transformed into a 
lexically-forward dependence, if $S_i$ and $S_j$ do not participate in 
a dependence cycle, by reordering the two statements. Those 
lexically-backward dependences that cannot be eliminated by 
reordering are the only LBDs of interest in this paper: 
statements $S_i$ and $S_j$ participate in a dependence cycle, $S_i \, \delta \, S_j$, 
and $j < i$. The semantics of sequential programs guarantee that 
$I_i < I_j$, since $S_j$ executes no later than $S_i$ within the same 
iteration. A zero-delay loop schedule will not satisfy the 
dependence, since virtual processor $VP_{I_j}$ will arrive at statement 
$S_j$ before virtual processor $VP_{I_i}$ can satisfy the dependence by 
executing $S_i$. Delay must be introduced, such that $S_i$ of virtual 
processor $VP_{I_i}$ executes before $S_j$ of virtual processor $VP_{I_j}$, as 
shown in Figure 5. 


⁴ Multiple dependence relationships can exist between nodes of a dependence graph. The dependence graph is therefore a multigraph, but each 
dependence can be treated separately. 

[Figure: schedule with one column per virtual processor ($VP_1$, $VP_2$, $VP_3$, ..., $VP_N$); each processor is idle until its delayed start, executes the statements of the loop body, and is idle afterward.] 

Figure 5. Correct Schedule of a Loop with LBDs 


2.3.2 Control Dependence 


Control dependences are caused by conditional and uncondi- 
tional branching in programs. There are two cases of branching 
that must be considered: intra-loop branching and loop exits. 
The result of an intra-loop branch causes either the omission or 
repetition of a portion of the loop body, which affects the 
execution time of a loop. Delay is a probabilistic measure, and 
represents the time interval observed at the start of an iteration 
that is expected to satisfy dependences. The issue of intra-loop 
branching and the calculation of delay will be resolved by 
associating probabilities with statement execution times. 


In the absence of loop exits, all iterations of a loop are executed. 
Consider the program with a loop exit at statement S,. During 
its sequential execution, some iterations of the loop might be 
skipped. Like a loop with an LBD, delay must be introduced to 
accommodate the loop exit, as shown in Figure 6. 


[Figure: schedule with one column per virtual processor ($VP_1$, $VP_2$, $VP_3$); each processor is idle until the loop exit test of the preceding iteration has executed, then executes its loop body.] 


Figure 6. Correct Schedule of a Loop with a Loop Exit 


2.4 Calculation of Delay 


There are only two constructs that contribute to a loop’s delay: 
loop exits and lexically-backward dependences. The following 
sections show how to estimate delay for these two cases. After 
some definitions, a recursive timing formula is presented. 
Invoking the recursive timing formula once yields the (closed) 
formula for calculating delay in a single loop. The recursive 
formula can be invoked again to yield an equation for calculating 
delay in a double loop structure. The optimal solution to this 
equation lies in the linear programming domain, but several 
special cases are discussed. The general formulation for a loop 
nest of L loops is found by invoking the recursive timing formula 
L times. 


2.4.1 Dynamic Statement Instances 


Let a program consist of a static sequence of $s$ statements: $S_1$, 
$S_2$, ..., $S_s$. Consider statement $S_i$ that is surrounded by $L$ 
normalized iterative loops. Let $I_1, I_2, \ldots, I_L$ represent the iteration 
variables of the $L$ loops, where $I_1$ is the iteration variable of the 
outermost loop, and $I_L$ is the iteration variable of the innermost 
loop that surrounds $S_i$. Each iteration variable can assume only 
integer values in its iteration sequence. If the upper bound of 
iteration variable $I_l$ is $N_l$, then the iteration sequence of that 
iteration variable is $\langle 1, 2, \ldots, N_l \rangle$ in the absence of loop exits. 


The $L$ loops form an $L$-dimensional integer Cartesian space. 
The space is composed of discrete points, where each point 
represents a particular assignment of values to iteration variables 
from their corresponding iteration ranges [15]. In the absence 
of loop exits, statement $S_i$ will be executed for all possible values 
that the iteration variables can assume. The number of times 
statement $S_i$ executes is therefore $\prod_{l=1}^{L} N_l$. 


A dynamic statement instance, or simply an instance, is the 
execution of a statement for a specific point in the iteration 
space of its surrounding loops [8]. The pair $(S_i, \hat{I})$ represents an 
instance of statement $S_i$ for the iteration vector $\hat{I}$. The 
components of the iteration vector correspond to values of 
iteration variables for the loops surrounding $S_i$: $(i_1, i_2, \ldots, i_L)$. 
Let $\hat{I}_{|q|}$ represent the iteration vector $\hat{I}$ truncated to its first $q$ 
elements ($1 \le q \le L$). 
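
As a small illustration of these definitions, the sketch below enumerates the iteration space of a loop nest and truncates an iteration vector. The helper names `iteration_space` and `truncate` are hypothetical and chosen only for this example.

```python
from itertools import product
from math import prod


def iteration_space(bounds):
    """Yield every iteration vector (i_1, ..., i_L) of a normalized nest.

    bounds[l] is the upper bound N of loop l, outermost first; the
    nest is assumed to contain no loop exits.
    """
    return product(*(range(1, n + 1) for n in bounds))


def truncate(vector, q):
    """Return the iteration vector truncated to its first q elements."""
    return vector[:q]


bounds = (2, 3)
points = list(iteration_space(bounds))
print(points)
# One dynamic instance of a statement exists per point of the space.
print(len(points) == prod(bounds))
print(truncate((2, 3), 1))
```

Each tuple produced corresponds to one dynamic instance of a statement nested in both loops, so the number of instances equals the product of the loop bounds.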


2.4.2 The Delta Vector 


Consider two statements, $S_i$ and $S_j$, that participate in a 
lexically-backward data dependence: $S_i \, \delta \, S_j$; some dynamic 
statement instance of $S_i$ generates data or control information 
that is used by a dynamic statement instance of $S_j$. Let the 
instance of $S_i$ be $(S_i, \hat{I})$ and let the instance of $S_j$ be $(S_j, \hat{J})$. Let 
$CM(S_i, S_j)$ be the number of loops that surround both $S_i$ and $S_j$; 
these are called the common loops of $S_i$ and $S_j$. The $\Delta$-vector is 
defined as the difference in iteration index values of the two 
instances:⁵ 


$$\Delta = \hat{I}_{|CM(S_i, S_j)|} - \hat{J}_{|CM(S_i, S_j)|}$$


As developed by Wolfe [15], dependence information is 
available only for the common loops of $S_i$ and $S_j$. This discussion 
considers only those $\Delta$-vectors that are valid over the common 
loops of $S_i$ and $S_j$, although the results can be generalized. 


⁵ A given dependence can give rise to multiple $\Delta$-vectors, but these can be accommodated as distinct edges of the dependence multigraph, each treated 
separately. 


As above, let $T(S_i, S_j)$ be the time for executing statements $S_i$ 
through $S_j$, inclusively, and let $T(S_i)$ be (an abbreviation for) the 
time for executing only statement $S_i$. Finally, let $\tau(S_i, \hat{I})$ be the 
time at which $(S_i, \hat{I})$ begins to execute. 


2.4.3 A Recursive Timing Formula 


Consider a statement instance $(S_i, \hat{I})$, where $L = |\hat{I}|$. Statement 
$S_i$ is surrounded by $L$ loops, with $I_L$ indexing the innermost loop. The 
time at which $S_i$ begins to execute with the iteration vector $\hat{I}$ can 
be defined in terms of the time at which the innermost loop starts 
and the time at which $S_i$ executes within the innermost loop. In 
terms of $\tau()$, 


$$\tau(S_i, \hat{I}) = \tau(DO_L, \hat{I}_{|L-1|}) + (i_L - 1)d_L + T(DO_L, S_i) - T(S_i) \qquad [3]$$


$\tau(DO_L, \hat{I}_{|L-1|})$ refers to the time at which the innermost do loop 
($DO_L$) begins to execute with the iteration vector $\hat{I}_{|L-1|}$. This 
term regards $DO_L$ as a statement within the scope of its outer 
loop ($DO_{L-1}$). The rest of the expression represents the time at 
which statement $S_i$ executes, for iteration $i_L$ and delay $d_L$: 
$T(DO_L, S_i)$ accounts for the execution of statements from the 
$DO_L$ through $S_i$, inclusively; $T(S_i)$ is subtracted to make the 
formula reflect the time at which $S_i$ starts. 


The dependence $S_i \, \delta \, S_j$ implies that for some iteration vectors, $\hat{I}$ 
and $\hat{J}$, statement $S_j$ must execute after statement $S_i$ has 
completed execution. In terms of $\tau()$, 


$$\tau(S_j, \hat{J}) \ge \tau(S_i, \hat{I}) + T(S_i)$$


Since $\hat{J} = \hat{I} - \Delta$, the above formula can be rewritten: 


$$\tau(S_j, \hat{I} - \Delta) \ge \tau(S_i, \hat{I}) + T(S_i) \qquad [4]$$


After adding $T(S_j, S_i) - T(S_i)$ to both sides of Formula 4: 


$$\tau(S_j, \hat{I} - \Delta) + T(S_j, S_i) - T(S_i) \ge \tau(S_i, \hat{I}) + T(S_j, S_i) \qquad [5]$$


By observing that for an iteration vector $\hat{K}$: 

$$\tau(S_i, \hat{K}) = \tau(S_j, \hat{K}) + T(S_j, S_i) - T(S_i)$$

Formula 5 becomes: 


$$\tau(S_i, \hat{I} - \Delta) \ge \tau(S_i, \hat{I}) + T(S_j, S_i) \qquad [6]$$


Statement $S_i$ with iteration vector $\hat{I} - \Delta$ cannot start execution 
until $T(S_j, S_i)$ time has elapsed since the start of statement $S_i$ 
with iteration vector $\hat{I}$. Since $\hat{J} = \hat{I} - \Delta$, the instance of 
statement $S_i$ in the iteration in which $S_j$ sinks the dependence for 
$S_i \, \delta \, S_j$ cannot execute until $T(S_j, S_i)$ time after the instance of 
statement $S_i$ begins to execute in the iteration in which statement 
$S_i$ sources the dependence. 

Using Formula 3, the references to $\tau()$ on the left and right of 
Formula 6 become: 


$$\tau(S_i, \hat{I} - \Delta) = \tau(DO_L, (\hat{I} - \Delta)_{|L-1|}) + (i_L - \Delta_L - 1)d_L + T(DO_L, S_i) - T(S_i)$$


and 


$$\tau(S_i, \hat{I}) = \tau(DO_L, \hat{I}_{|L-1|}) + (i_L - 1)d_L + T(DO_L, S_i) - T(S_i)$$

where $\Delta_L$ is the last (innermost) entry in the $\Delta$-vector. Formula 
6 can then be rewritten: 


$$\tau(DO_L, (\hat{I} - \Delta)_{|L-1|}) - \Delta_L d_L \ge \tau(DO_L, \hat{I}_{|L-1|}) + T(S_j, S_i) \qquad [7]$$


2.4.4 The Single Loop Case 


Formula 7 references the delay in the innermost loop, the 
innermost $\Delta$-vector entry, and the static execution time between 
statements $S_j$ and $S_i$. Formula 7 also requires the time at which 
the next outer loop begins to execute with iteration vectors $\hat{I}$ and 
$\hat{I} - \Delta$, each truncated before the innermost loop (length 
$L - 1$). Consider the case of a single iterative loop; because a 
single loop cannot have an outer loop, the recursive references 
to $\tau()$ vanish in Formula 7, leaving: 


$$-\Delta_L d_L \ge T(S_j, S_i)$$


In terms of linear programming [6, 7], the above formula 
represents a constraint for the delay parameter $d_L$. This 
constraint should be set to minimize the objective function: 


$$T(DO_L) = (N - 1)d_L + T(S_1, S_s),$$


which is the time to execute $N$ iterations of a doacross loop 
containing $s$ statements and delaying $d_L$ time between successive 
iterations. A single loop can contain a lexically backward 
dependence only when the iteration in which data is used occurs 
after the iteration in which data is computed. $\Delta_L$ was constructed 
from a particular $\hat{I} - \hat{J}$ for $S_i \, \delta \, S_j$. $\hat{J}$ must therefore occur after 
$\hat{I}$, assuring $\Delta_L < 0$. The delay $d_L$ is easily solved: 


$$d_L \ge \frac{T(S_j, S_i)}{-\Delta_L}$$


The delay due to the pair of statement instances $(S_i, \hat{I})$ and 
$(S_j, \hat{J})$ is calculated from the time interval between $S_j$ and $S_i$ and 
the distance (in iterations) across which the data is passed. Since 
the above formula provides a lower bound for $d_L$, all delays 
arising from pairs of statements in a loop can be satisfied by 
choosing the delay of maximum value for all pairs. An algorithm 
for determining the delay for a loop in this manner is given in 
[4]. 
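
The single-loop rule (take the largest lower bound over all lexically-backward dependences) can be sketched as follows. This is not the algorithm of [4]; the function name, the input format, and the sample numbers are assumptions made for illustration.

```python
from fractions import Fraction


def single_loop_delay(lbds):
    """Smallest delay satisfying every lexically-backward dependence.

    lbds is an iterable of (span_time, delta) pairs, where span_time
    is the time from the sink statement through the source statement
    within one iteration, and delta < 0 is the single entry of the
    dependence's delta vector.
    """
    delay = Fraction(0)
    for span_time, delta in lbds:
        if delta >= 0:
            raise ValueError("an LBD in a single loop must have delta < 0")
        # Each dependence gives the lower bound span_time / (-delta).
        delay = max(delay, Fraction(span_time, -delta))
    return delay


# Hypothetical loop: three backward dependences with these spans.
print(single_loop_delay([(4, -1), (3, -1), (6, -2)]))
```

Data passed across two iterations (delta of -2) tolerates a span twice as long for the same delay, which is why the third pair above requires only 3 time units per iteration.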

Consider the example shown in Figure 7. Each assignment 
statement is linked to the next assignment statement via a 
forward and backward arc in the data dependence graph. 
Figure 7 includes a table of estimated execution times for the 
statements. To estimate $d$ (the delay for the loop), all pairs of 
statements involved in lexically-backward dependences ($S_i \, \delta \, S_j$) 
must be considered. For each pair ($S_j$, $S_i$), the function $T()$ must 
be evaluated. Note that in this example, the $\Delta$-vector for each 
backward dependence is $(-1)$. The delay for the loop should be 
4 time units per iteration, since two of the pairs both 
yield $T() = 4$. Figure 7 contains the timing schedule for this 
example. 


By delaying each iteration, Figure 7 shows how the S,éS, 
dependence is satisfied. The second processor commences 
execution of S, at time step 5; S, finishes execution in the first 
processor during time step 4. C(1) will not be requested in VP, 
until VP, has calculated C(1). Note also that the third processor 
in Figure 7 is really unnecessary; VP, finished its loop body 
during time step 7, so VP, could have executed the third 
iteration. This is an example of limited processor scheduling, 
which is discussed in [4]. 


      DO i=1,N 
         delay((i-1)*d) 
S1:      A(i) = B(i-1) + 37 
S2: 
S3: 
S4: 

Figure 7. A Single Loop Example 


2.4.5 The Double Loop Case 

Because a single loop has no outer loop, the references to $\tau()$ 
vanished from Formula 7. A double loop contains an inner loop 
and an outer loop; if the references to $\tau()$ in Formula 7 are 
expanded using Formula 3, a new formula results that reflects 
the situation of two loops: one at nest $L$ and one at nest $L - 1$, 
surrounded by a third loop at nest $L - 2$: 


$$\tau(DO_{L-1}, (\hat{I} - \Delta)_{|L-2|}) - \Delta_{L-1} d_{L-1} - \Delta_L d_L \ge \tau(DO_{L-1}, \hat{I}_{|L-2|}) + T(S_j, S_i) \qquad [8]$$

Since there are only two loops, the references to $\tau()$ vanish 
from Formula 8, yielding: 


$$-\Delta_{L-1} d_{L-1} - \Delta_L d_L \ge T(S_j, S_i) \qquad [9]$$


Formula 9 imposes conditions on $d_L$ and $d_{L-1}$ such that 
dependences within two nest levels are satisfied; the optimal setting of 
the two delays for a single dependence in a two-loop 
environment cannot be determined from Formula 9. The global 
objective is to minimize the execution time of the two-loop 
construct; the execution time for the multidimensional doacross 
model, restricted to two loops surrounding $s$ statements, is given 
by: 


$$T(DO_{L-1}) = (N_{L-1} - 1)d_{L-1} + (N_L - 1)d_L + T(S_1, S_s) \qquad [10]$$


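The minimization of Formula 10 with respect to the constraint 
of Formula 9 is a linear programming problem, with Formula 10 
as the objective function. The objective can be simplified by 
symbolically replacing constant terms: 


$$T(DO_{L-1}) = \eta_1 d_{L-1} + \eta_2 d_L + \eta_3 \qquad [11]$$

where 

$$\eta_1 = N_{L-1} - 1, \qquad \eta_2 = N_L - 1, \qquad \eta_3 = T(S_1, S_s)$$


Consider the case in which both $\Delta_{L-1}$ and $\Delta_L$ are negative. The 
boundary expression for Formula 9 can then be written: 


$$d_{L-1} \ge \frac{T(S_j, S_i) + \Delta_L d_L}{-\Delta_{L-1}} \qquad [12]$$

or 

$$d_{L-1} \ge \alpha_1 - \alpha_2 d_L$$

where 

$$\alpha_1 = \frac{T(S_j, S_i)}{-\Delta_{L-1}}, \qquad \alpha_2 = \frac{\Delta_L}{\Delta_{L-1}}$$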

Consider the case where all delay is charged to the outermost 
loop $d_{L-1}$; no delay is charged to the inner loop, so $d_L = 0$. The 
objective function becomes: 


$$T(DO_{L-1}) = \eta_1 \alpha_1 + \eta_3 \qquad [13]$$


Similarly, all delay can be charged to the inner loop $d_L$; no delay 
is charged to the outer loop, so $d_{L-1} = 0$. The objective function 
becomes: 


$$T(DO_{L-1}) = \frac{\eta_2 \alpha_1}{\alpha_2} + \eta_3 \qquad [14]$$


A comparison of Formula 13 with Formula 14 decides whether 
it is best to charge all delay to the outer loop or to the inner loop. 
If $\eta_1 < \eta_2 / \alpha_2$, then all delay should be charged to the outer loop. 


Consider the assignment of delay when the loop bounds are the 
same ($\eta_1 = \eta_2$); if $\alpha_2 < 1$, then all delay should be charged to the 
outer loop. Recall that $\alpha_2$ was calculated from the ratio of $\Delta_L$ to 
$\Delta_{L-1}$. The decision as to which loop should be assigned all delay 
is reduced to a comparison of the $\Delta$-vector elements for the loops 
involved. 


When viewed as a linear programming problem, the two cases 
given above form the extreme points of the convex set delineated 
by Formula 9. The general solution must be one of the extreme 
points [7], so the two cases provide the general solution for a 
double loop where $\Delta_{L-1}$ and $\Delta_L$ are negative: 


Choose                                  When 

$d_L = 0$ (charge outer loop)           $\frac{N_{L-1} - 1}{N_L - 1} < \frac{\Delta_{L-1}}{\Delta_L}$ 

$d_{L-1} = 0$ (charge inner loop)       otherwise 
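
The choice between the two extreme points can be written as a small sketch that compares the two candidate execution-time contributions directly. The function name and argument names are assumptions of this example, and only the case in which both delta entries are negative is handled.

```python
from fractions import Fraction


def charge_double_loop(n_outer, n_inner, delta_outer, delta_inner, span_time):
    """Return (d_outer, d_inner) for one dependence in a double loop.

    All delay is charged to whichever loop adds less to the total
    execution time; both delta entries are assumed to be negative.
    """
    if delta_outer >= 0 or delta_inner >= 0:
        raise ValueError("this sketch covers only negative delta entries")
    outer_only = Fraction(span_time, -delta_outer)  # inner delay is 0
    inner_only = Fraction(span_time, -delta_inner)  # outer delay is 0
    cost_outer = (n_outer - 1) * outer_only
    cost_inner = (n_inner - 1) * inner_only
    if cost_outer < cost_inner:
        return outer_only, Fraction(0)
    return Fraction(0), inner_only


# Equal loop bounds: the loop with the larger distance gets the delay.
print(charge_double_loop(10, 10, -2, -1, 6))
print(charge_double_loop(10, 10, -1, -2, 6))
```

Comparing the two costs is algebraically the same test as the condition in the table, because the common factor $T(S_j, S_i)$ cancels from both sides.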


Similar inequalities can be derived for $\Delta$-vectors of other signs. 
An interesting case occurs where $\Delta_L$ is positive⁶ and $d_{L-1} = 0$. 
For Formula 9 to be satisfied, $d_L$ must become negative: 
iteration $i$ is started $d$ time units after iteration $i + 1$. A doacross 
schedule for a negative delay $d$ corresponds to running the loop 
in reverse with a delay of $-d$. The accommodation of negative 
delays requires straightforward modifications of the timing 
formulae (such as Formula 10) to account for the magnitude, 
and not the sign, of delays. Of greater concern is the conflict 
that could arise when two dependences place unsatisfiable 
constraints on delay: one dependence could require a positive 
delay while another dependence requires a negative delay. In the 
following section, an efficient algorithm is presented that 
sacrifices optimality but avoids such a conflict. 


2.4.6 A General Solution 


The single loop case was derived by eliminating the references 
to $\tau()$ from Formula 7 in the absence of any outer loops. 
Expanding $\tau()$ once by substituting Formula 3 into Formula 7 
exposed the solution to the double loop case. Further expansion 
of $\tau()$ leads to the general delay formula for a dependence nested 
in $L$ loops between statements $S_i$ and $S_j$: 


$$-\sum_{l=1}^{L} \Delta_l d_l \ge T(S_j, S_i) \qquad [15]$$


When posed as a linear programming problem, Formula 15 is the 
constraint under which the following objective function should 
be minimized: 


$$T(DO_1) = \sum_{l=1}^{L} (N_l - 1)d_l + T(S_1, S_s)$$

The objective function is exactly the formula used to time the 
multidimensional doacross model in Formula 2. The simplex 
algorithm can find optimal values for the delay in all loops, and 
although this algorithm's possible running time is exponential, its 
expected running time is reasonable [6]. However, the presence 
of multiple dependences and $\Delta$-vectors in a loop greatly 
increases the complexity of choosing optimal delays. This 
section presents an algorithm that performs a single pass over the 
loops surrounding a block of statements to establish the 
minimum delay that should be observed in the surrounding loops 
due to dependences in that block of statements. Consider a nest 
of loops as shown in Figure 2 and the calculation of delay $d_k$ at 
nest $k$ to satisfy the dependence $S_i \, \delta \, S_j$ when each statement is 
from a block $B$ or a block $A$ of Figure 2. Let $\Delta$ be the distance vector 
associated with $S_i \, \delta \, S_j$. Assume $d_{k+1}, d_{k+2}, \ldots, d_L$ are already known, but 
$d_1, d_2, \ldots, d_{k-1}$ are as yet undetermined. Since only one pass is 
desired over the loops, $d_k$ must be set such that Formula 15 is 
satisfied even if $d_1, d_2, \ldots, d_{k-1}$ are all 0. Therefore, $d_k$ is set such 
that for $S_i \, \delta \, S_j$ and the dependence's associated distance vector 
$\Delta$, 


$$-\sum_{l=k}^{L} \Delta_l d_l \ge T(S_j, S_i)$$


⁶ $\Delta_{L-1}$ is always negative for a flow dependence from a sequential program. 


The algorithm consists of two nested traversals of the loop nest 
shown in Figure 2. The outer traversal calls for the assignment 
of delays that satisfy the dependences of blocks $B_j$ and $A_j$. The 
inner traversal considers the dependences and assigns delays 
appropriately in the loops that surround blocks $B_j$ and $A_j$. 


Input: 
    A nested structure of $L$ loops as in Figure 2 
Output: 
    Delays $d_1, d_2, \ldots, d_L$ that satisfy all dependences in the 
    $L$ loops. 

for j := L downto 1 do 
    Let $\gamma$ be the dependence graph for statements in blocks 
    $B_j$ and $A_j$. 
    for k := j downto 1 do 
        Let $Q$ = all pairs $\{((S_j, S_i), \Delta)\}$ such that 
        $i > j$ and $(S_i \, \delta \, S_j) \in \gamma$ with distance vector $\Delta$. 
        Let $Q = Q \cup \{((S_1, S_e), -1)\}$ if $S_e$ exits from loop 
        $k$. 

        $$d_k = \max_{Q} \left( \frac{\sum_{l=k+1}^{L} \Delta_l d_l + T(S_j, S_i)}{-\Delta_k} \right)$$

    end for 
end for 
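
A simplified reading of the one-pass idea is sketched below. It is not the paper's algorithm verbatim: loop exits and the per-block outer traversal are omitted, every dependence is assumed to carry a delta vector over all loops of the nest, and the names `one_pass_delays` and `satisfied` are hypothetical. Each dependence is assumed to come from a sequential program, so the outermost nonzero entry of its delta vector is negative.

```python
from fractions import Fraction


def one_pass_delays(num_loops, deps):
    """Assign delays innermost-first so that Formula 15 holds.

    deps is a list of (delta, span_time) pairs; delta is a tuple of
    length num_loops, outermost entry first. Delays of loops not yet
    visited are treated as 0, as in the single-pass description.
    """
    delays = [Fraction(0)] * num_loops
    for k in range(num_loops - 1, -1, -1):
        need = Fraction(0)  # a negative requirement is never assigned
        for delta, span_time in deps:
            if delta[k] >= 0:
                continue  # loop k cannot satisfy this dependence
            inner = sum(delta[l] * delays[l] for l in range(k + 1, num_loops))
            need = max(need, (Fraction(span_time) + inner) / (-delta[k]))
        delays[k] = need
    return delays


def satisfied(delays, deps):
    """Check the constraint -sum(delta_l * d_l) >= span_time for all deps."""
    return all(
        -sum(dl * d for dl, d in zip(delta, delays)) >= span_time
        for delta, span_time in deps
    )


deps = [((-1, 1), 4), ((0, -1), 3)]
delays = one_pass_delays(2, deps)
print(delays, satisfied(delays, deps))
```

In the example the first dependence has a positive inner distance, so it cannot be satisfied at the inner loop; the outer loop then absorbs both its span and the inner delay already assigned for the second dependence.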


Note that the algorithm establishes the delay at each loop as the 
maximum delay required by dependences associated with that 
loop. At loop $k$, if some dependence required a negative delay 
while others required a positive delay, then loop $k$ would be 
assigned a positive delay. Any dependence that requires a 
negative delay will be eventually accommodated as the algorithm 
proceeds to the outermost nest, since an LBD must have some 
$\Delta_l < 0$, $l < k$ (the dependence is from a sequential program). 


2.5 Comparison with Other Methods 


This section compares doacross with two other techniques for 
multiprocessor compilation of loops. The first method, loop 
distribution, is the direct application of a transformation 
developed for SEA compilation that separates the vector and 
sequential statements of a loop into multiple loops [10, 14]. The 
second method, partitioning, is targeted for multiprocessors [8]; 
vector and sequential statements are not separated, but 
statements that participate in a dependence cycle are placed in 
a partition and the iterations of a loop are pipelined through the 
partitions [5]. Neither of these two methods provides the 
performance of doacross. 


2.5.1 Loop Distribution 


Consider a loop $L$ of $N$ iterations with $g$ statements participating 
in dependence cycles and $s - g$ statements that appear in vector 
loops after loop distribution. Let the statements be numbered 
such that statements $S_1, S_2, \ldots, S_g$ are the statements that 
participate in dependence cycles and statements $S_{g+1}, S_{g+2}, \ldots, S_s$ are 
the statements that appear in vector loops after distribution. 
Only one processor is used for the sequential execution of the $g$ 
statements; the time for executing the sequential statements is 
therefore $N \times T(S_1, S_g)$. $N$ processors are used for the vector 
execution of the $s - g$ statements; the time for executing the 
vector statements is therefore $T(S_1, S_s) - T(S_1, S_g)$. The total 
time for executing the statements of loop $L$ is 
$(N - 1) \times T(S_1, S_g) + T(S_1, S_s)$. The time for executing the 
distributed loops is identical to the time for executing a doacross 
schedule of loop $L$ when $S_g \, \delta \, S_1$ and the delay is therefore 
$T(S_1, S_g)$. In this example of loop distribution, $S_1$ and $S_g$ are 
members of a strongly-connected component of the dependence 
graph of loop $L$. Since this condition does not imply $S_g \, \delta \, S_1$, the 
delay for the doacross schedule of loop $L$ is bounded by 
$T(S_1, S_g)$. 


Distributing the loops causes another problem for 
multiprocessors even if S,6S,. The sequential portion of the loop 
uses One processor, but the vector portion utilizes N processors. 
If N processors are dedicated to the distributed loops of loop L, 
then N—1 processors are idle during the execution of the 
distributed sequential loops. These sequential loops dominate 
the computation for large N. If processors are allocated 
dynamically, the processor allocation for loop L grows from 1 
to N processors when a distributed vector loop is encountered 
and diminishes from N to 1 when a distributed sequential loop 
is encountered. This fluctuation in processor usage is expensive 
when the overhead for processor allocation and deallocation is 
significant. In [4], a formula is developed for determining the 
maximum number of processors that a doacross loop with a 
given delay can use. If $S_g \,\delta\, S_1$, the delay assigned by doacross is
$T(S_1, S_g)$. The number of useful processors (as determined by
[4]) is $T(S_1, S_s)/T(S_1, S_g)$. These processors are allocated for
the duration of the execution of loop L. In terms of program 
performance, doacross can use fewer processors yet provide at 
least the performance of loop distribution for multiprocessors. 
In terms of processor usage, the schedules produced by doacross 
utilize processors more uniformly than the schedules produced 
by loop distribution. 
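
The timing argument of this section can be checked numerically. The following Python sketch is illustrative only: the function names and the sample timings are assumptions introduced here, and the doacross formula is the one implied by the equality stated above for a delay of $T(S_1, S_g)$.

```python
def distributed_time(n, t_cycle, t_total):
    """Execution time after loop distribution.

    n       -- number of iterations N
    t_cycle -- T(S_1, S_g), time of the statements in dependence cycles
    t_total -- T(S_1, S_s), time of one complete iteration
    """
    sequential = n * t_cycle      # one processor runs all N iterations
    vector = t_total - t_cycle    # N processors run the vector statements
    return sequential + vector


def doacross_time(n, delay, t_total):
    """Execution time of a doacross schedule with the given delay."""
    # Iteration i starts `delay` time units after iteration i - 1.
    return (n - 1) * delay + t_total


def useful_processors(delay, t_total):
    """Processors a doacross loop can keep busy: T(S_1, S_s) / delay."""
    return t_total / delay
```

With the made-up values N = 100, $T(S_1, S_g) = 3$ and $T(S_1, S_s) = 10$, both schedules take 307 time units when the doacross delay equals 3. A smaller delay of 2 would give 208 time units while keeping only 5 processors busy.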


2.5.2 Partitioning 


In [8], there are two types of partitioning: total partitioning and 
partial partitioning. Total partitioning is possible only when a 
loop contains no dependence cycles; such loops correspond 
exactly to zero-delay loops under doacross. These loops are also 
called forall loops in [5]. Doacross assigns delays to loops that 
contain lexically-backward dependences that (by definition) 
participate in a dependence cycle. 


Consider a loop L of s statements: $S_1, S_2, \ldots, S_s$. Partial
partitioning uses the distance vectors of dependences to divide 
the index set of a loop L into independent partitions; each 
partition can then be assigned to a processor. The number of 
partitions created by the partial partitioning algorithm is given 
by the greatest common divisor of all negated distances in the
dependence graph of loop L. Consider the loop of Figure 7.
There is a dependence arc between each pair of adjacent
statements, and data is either passed within an iteration ($\Delta = 0$)
or across one iteration ($\Delta = -1$). The greatest common divisor
of the negated distances is 1; only one partition is created by the 
partial partitioning algorithm of [8]. Consider the pipelining 
algorithm of [5] that is based on partial partitioning. The 
statements of a loop are partitioned by the strongly-connected 
components of the data dependence graph associated with the 
loop. Let $\Pi$ be the most time-consuming partition of a loop.
The delay associated with a pipeline schedule is the time for
executing the statements of $\Pi$.
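
The partition count used by partial partitioning, described earlier in this paragraph, is a one-line computation. The sketch below is an illustration under assumptions: the function name is hypothetical and the distances are supplied as a plain list of integers rather than taken from a dependence graph.

```python
from functools import reduce
from math import gcd


def partition_count(distances):
    """Greatest common divisor of all negated dependence distances."""
    negated = [-d for d in distances]
    # gcd(0, x) == x, so 0 is a safe starting value for the reduction.
    return reduce(gcd, negated, 0)
```

For distances of 0 and -1 the result is 1, so only one partition is created. Distances of -2 and -4 would give 2 independent partitions.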


Doacross introduces a delay that accounts for the most 
time-consuming lexically-backward dependence. Consider the 
partition $\Pi$; unless there is a dependence from the last
statement of $\Pi$ to the first statement of $\Pi$, the delay introduced
by doacross is strictly less than the delay introduced by [8] and 
[5] pipelining. The example of Figure 7 shows that doacross 
provides improved performance over partitioning. Pipelining 
places $S_1$ through $S_s$ in a single processor; since the dependence
$S_s \,\delta\, S_1$ does not exist in Figure 7, doacross can schedule multiple
processors for the execution of the loop. 


3.0 Summary and Extensions 


This paper characterizes loops that obtain no performance gain 
on an SEA machine, yet benefit from execution on a 
multiprocessor. Such loops are parameterized by a measure of 
their parallelism. This measure defines the optimization problem 
of rearranging statements to minimize delay (thus maximizing 
parallelism). In [4], this problem is formally defined and a 
heuristic is given for its solution. For special dependence graphs, 
Munshi has shown optimal solutions [12]. Doacross also 
assumes that statements appear in the same order within each 
processor. Cuny has shown that by allowing the order to differ 
between processors, delay can be reduced [3]. In [4], the 
techniques presented in this paper are applied to routines from 
the EISPACK package [13], with very favorable results. 


For this work to be useful, a mapping of iterations to processors 
must be defined. [4] presents a processor allocation algorithm 
for doacross loops, modelled after the predictor-corrector 
algorithms for numerical analysis. 
Abstract—Large-Grain Data Flow (LGDF) combines dataflow
techniques with traditional sequential programming to provide a 
high level model for parallel processing. The user designs parallel 
applications using networks of program modules connected by 
datapaths. The resulting specification is automatically 
transformed for efficient parallel scheduling on a particular under- 
lying architecture. Program execution is controlled via 
producer/consumer data synchronization. This paper describes 
the LGDF parallel processing approach, and proposes methods for 
implementing efficient LGDF schedulers for a specific computer 
that employs a ring communication architecture, the CDC
CYBERPLUS.


1. INTRODUCTION 


The difficulty in achieving reliable, efficient, maintainable, and 
portable parallel programs argues strongly for the use of higher 
level, possibly more restricted, models of parallelism than those 
provided by basic machine facilities. Programmers can use the 
higher level model to design applications, and then can rely on 
automatic means (such as a preprocessor) to generate a concrete 
implementation tailored for a particular parallel machine. Our 
experience has shown that programming within the restrictions of 
a suitable higher level parallel model can make parallel program 
design significantly more reliable and even easier than design
using a sequential computation model [1].


Several other approaches to solving the problems of parallel programming
have been proposed. Automatic transformation, advocated
by Kuck and others [2], envisions development of software
tools to automatically transform "dusty deck" FORTRAN for 
parallel processing. Complex control and data dependencies make 
this approach very difficult, and results can be poor. Another 
problem with the automatic transformation approach is that it 
does not lead to development of novel, highly efficient parallel 
algorithms, which user-visible parallelism tends to encourage. 


At least for the near future, layered languages show promise for 
making up for the deficiencies of FORTRAN as a parallel process- 
ing language. Users code parallel operations using macros. The 
macros are expanded to produce low level synchronizing
operations suitable for a particular parallel machine. This tends
to lead to fewer bookkeeping errors in the coding of process syn- 
chronization, as well as providing a degree of portability of code 
between different parallel processors. Ideally, one can simply 
modify the definitions of the macros to get layered language pro- 
grams running on a different parallel machine. 


A particular layered language approach, Large-Grain Data Flow 
(LGDF) is described in Section 2. Section 3 presents a brief dis- 
cussion of the architecture of the CDC CYBERPLUS emphasizing 
characteristics that are relevant to parallel program interaction. 
Possible LGDF scheduler implementation strategies for 
CYBERPLUS are discussed in Section 4. Conclusions and exten- 
sions are given in Section 5. 


0190-3918/86/0000/0845 $01.00 © 1986 IEEE 
2. OVERVIEW OF LGDF


Large-Grain Data Flow (LGDF) combines some features of 
dataflow computation with subroutine-like code resembling tradi- 
tional sequential programming. A comparison between LGDF and 
sequential coding is illustrated in Fig. 1. The main difference 
between the data flow and sequential versions for this small pro- 
gram lies in the way that the specified computation (C=A+B) is 
scheduled. Scheduling occurs in traditional sequential program- 
ming via the flow of control, represented by the successive values 
of the program counter. The subroutine P10 will be called when 
the process reaches appropriate control points in the program. In 
contrast, the LGDF program P10 is scheduled for execution asyn- 
chronously, whenever appropriate, based on the arrival of the 
input values A and B, and on the availability of buffer space to 
write out a new computed value for C. 


SEQUENTIAL PROGRAMMING        LARGE-GRAIN DATA FLOW PROGRAMMING

SUBROUTINE ADD(A, B, C)       begin_(p10)
C = A+B                       C = A+B
RETURN                        clear_(A)
END                           clear_(B)
                              set_(C)
                              suspend_
                              end_(p10)

[Diagram: datapaths A and B feed program ADD (p10), which writes datapath C.]


Fig. 1. Traditional and LGDF programs for C = A+B. 


The program shown in Fig. 1 is more fine-grained than the typical 
LGDF program, which corresponds to 5 to 50 or more FORTRAN 
statements. 


2.1. LGDF NETWORKS 


Datapaths link LGDF programs to form networks (Fig. 2) that 
allow communication of data values from one program to 
another. Datapaths can be in one of two states: empty or full. 


[Network diagram not recoverable; datapath labels include d04, d10, d11, d13, and d14.]


Fig. 2. LGDF network. 


An LGDF program is activated (asynchronously) depending only 
upon the state of its associated input and output datapaths: 


Execution Rule—A program may start an execu- 
tion cycle if and only if all of its input datapaths are 
in the full state and all of its outputs are in the 
empty state. 


Programs can change the state of an output datapath from 
empty to full by means of the LGDF function set_. A set_ has the 
effect of making the data associated with a datapath available to 
activate downstream programs. In a similar fashion, programs 
can clear_ their inputs in order to make the space available for 
writing into by upstream programs. 
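
The Execution Rule and the set_ and clear_ functions can be modelled in a few lines. The Python sketch below is not the LGDF macro implementation; the class names, the string states, and the `may_start` method are assumptions chosen for illustration.

```python
from dataclasses import dataclass

EMPTY, FULL = "empty", "full"


@dataclass
class Datapath:
    name: str
    state: str = EMPTY
    value: object = None


@dataclass
class Program:
    name: str
    inputs: list
    outputs: list

    def may_start(self):
        """Execution Rule: all inputs full and all outputs empty."""
        return (all(d.state == FULL for d in self.inputs)
                and all(d.state == EMPTY for d in self.outputs))


def set_(datapath, value):
    """Make the data available to downstream programs."""
    datapath.value = value
    datapath.state = FULL


def clear_(datapath):
    """Make the space available to upstream programs."""
    datapath.state = EMPTY
```

For the program of Fig. 1, p10 becomes eligible only after both A and B have been set; once it sets C and clears its inputs it is no longer eligible until new inputs arrive and C is cleared by its consumer.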

LGDF networks coordinate parallel process activation via data 
flow interactions. Each datapath arc in a network represents
both a shared memory and an associated data flow interlock 
mechanism. The control mechanism provided by the program 
execution rule can be used to synchronize access to shared data 
areas with a simple producer/consumer protocol. This protocol 
precludes race conditions for shared data (write/read conflicts). 


An LGDF program that has finished computation and data flow 
interactions for the current execution cycle can suspend_. At that 
point, it is subject to the: 


Data Flow Progress Rule—Upon suspension of an 
execution cycle, a program must have cleared at 
least one input or set at least one output datapath. 
Otherwise, it is terminated. 


Programs that are in the terminated state cannot be scheduled for 
further execution. However, unaffected parts of the same network 
may still execute. By designing programs that satisfy the Data 
Flow Progress Rule, many types of parallel program design prob- 
lems can be avoided. 
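
The Data Flow Progress Rule amounts to a check made at suspension time. A minimal Python sketch follows, assuming a hypothetical `Cycle` object that records the datapath names touched during one execution cycle; the names and the string results are illustrative, not part of LGDF.

```python
class Cycle:
    """Records the data flow interactions of one execution cycle."""

    def __init__(self):
        self.cleared_inputs = set()
        self.set_outputs = set()

    def clear_(self, datapath_name):
        self.cleared_inputs.add(datapath_name)

    def set_(self, datapath_name):
        self.set_outputs.add(datapath_name)

    def suspend_(self):
        """Apply the Data Flow Progress Rule at the end of the cycle."""
        if self.cleared_inputs or self.set_outputs:
            return "suspended"    # may be scheduled again later
        return "terminated"       # no progress: never scheduled again
```

A cycle that neither clears an input nor sets an output leaves every datapath unchanged, so rescheduling the program would repeat the same cycle indefinitely; terminating it prevents that.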


LGDF networks can be combined into hierarchies, since any node 
in a network may be defined either by an LGDF program or by 
another LGDF network. The meaning of this is the same as if the
lower level network were substituted graphically for the higher
level network abstraction.


Applications designed using LGDF are implemented with the aid 
of a prototype set of software tools that rely on macro-expansion 
techniques to generate appropriate scheduling code for various 
target computers. Several steps are involved in LGDF program- 
ming. First, a hierarchical set of LGDF network data flow graphs 
is designed, based on logical data dependencies inherent in the 
application. The structure of the resulting network is encoded in 
a wirelist file. Data declarations for each datapath in the net- 
work are packaged separately. The programmer then writes the 
LGDF programs, combining FORTRAN with a small set of LGDF 
macros. Finally, the LGDF programs are macro-expanded to pro- 
duce compilable FORTRAN code for the particular machine, and 
this code is compiled and executed. For more details on the 
software tools and use of LGDF for parallel processing see [3]. 


The basic idea in LGDF programming is to balance the average 
grain size of LGDF programs in a network with scheduling over- 
head in order to reduce the significance of the overhead. This will 
in general depend on the capabilities of the given architecture, 
the efficiency with which the LGDF implementation mechanisms 
exploit those capabilities, the particular application algorithm, 
and the strategy for data flow parallelization of the algorithm. 


3. CYBERPLUS ARCHITECTURE


A Control Data CYBERPLUS Parallel Processing System [4] [5]
consists of an ensemble of up to 16 very fast processors (20
nanosecond cycle time) organized in a ring structure and interfaced
to a host CYBER computer. There is a great deal of internal
parallelism possible within one CYBERPLUS CPU. Each processor
can contain up to nineteen independent functional units. A
crossbar interconnection provides for the routing of results from
the output of each functional unit to the input of any or all func-
tional units on the next cycle. A CYBERPLUS processor contains
two types of memory, a bulk memory (up to 512K 64-bit words)
for data and a smaller (4K 240-bit words) program memory for
executable code. Code overlays can be stored in the bulk memory
for rapid transfer to the program memory as needed.


[Figure 3 diagram: CYBERPLUS processors (1) through (16) connected in a ring group by the System Ring (16 bits/20 ns), the Application Ring (16 bits/20 ns), the CYBERPLUS Memory Ring (64 bits/20 ns), and the 64-bit CYBER Memory Ring to host memory.]


Fig. 3. Typical CYBERPLUS ring group.


At a higher level of parallelism, up to 16 CYBERPLUS processors 
can be interconnected in a ring group and interfaced to the host 
mainframe, as shown in Fig. 3. Up to four ring groups can be 
interfaced to the one host computer. 


Two rings allow for the transfer of 16-bit data, the System Ring 
and the Application Ring. These rings interconnect the 
CYBERPLUS processors. The System Ring also provides an inter- 
face to the host CYBER computer through its standard I/O chan- 
nel. The CYBER Memory Ring provides a 64-bit wide data path 
between the CYBERPLUS bulk memories and the memory of the 
host computer through a direct memory access port. The 
CYBERPLUS Memory Ring allows 64-bit transfer of data 
between the CYBERPLUS bulk memories. 


3.1. SYSTEM AND APPLICATION RINGS 


This dual ring system provides a high speed parallel channel for 
interconnecting the processors and the host computer of a 
CYBERPLUS Parallel Processing System. The rings are formed 
by connecting the individual ring ports of the processors in a cir- 
cular structure. Each ring port has an input and output register. 
These channels are 29 bits wide, with 16 bits of data and 13 bits 
of control information making up a ring packet. Ring packets
move sequentially and synchronously around the ring from one 
register to the next, at the rate of 20 nanoseconds per transfer. 
This results in a peak data transfer capability between processors 
of 800 megabits per second per ring. A ring port can remove a 
ring packet from the ring and place the same or a new ring 
packet on the ring on every cycle. Ring packets currently on the 
ring have priority over ring packets trying to enter the ring from 
a processor. 


Each processor has two ring ports, one for the System Ring and 
one for the Application Ring. The System Ring ports respond to 
an extended instruction set that provides additional control over 
the CYBERPLUS processors. The host uses the System Ring to 
transfer programs and data to and from each processor in a ring 
group and to initiate the execution of programs in the processors. 


The System Ring and the Application Ring can be connected to 
circulate data in opposite directions, allowing processors equal 
access to both left and right neighboring processors. Both rings 
can also be used by application programs for the transfer of data 
among the processors and for synchronization of the application 
processes running in the processors. 


3.2. CYBER AND CYBERPLUS MEMORY RINGS 


The CYBER Memory Ring and the CYBERPLUS Memory Ring 
are supported by the Direct Memory Access Unit (DMAU). Via 
the System Ring, a program initiates, monitors, and terminates 
all of the data transfers conducted by the DMAU. The DMAU 
also provides a queueing system to allow the stacking of data 
transfer requests, reducing the transition time from one data 
transfer request to the next. A data transfer request may specify 
multiple destinations, allowing the broadcast of a block of data to 
any or all processors in the ring group on one trip around the 
ring. 

The CYBERPLUS Memory Ring allows block transfers between 
the local bulk memories of the processors to proceed at a rate of 
one 64-bit word each 20 nanosecond cycle. This results in a max- 
imum transfer rate of 3200 megabits per second. Block transfers 
between CYBERPLUS local memories and CYBER host memory 
use the CYBER Memory Ring and proceed at a rate limited by 
the maximum access rate of the host memory system (800 mega- 
bits per second). 


The DMAU provides a much higher bandwidth data path between 
the host computer and the CYBERPLUS local memories than the 
System Ring (800 vs 24 megabits per second). It also supports 
higher interprocessor data rates than the System and Application 
Rings alone (3200 vs 1600 megabits per second). 
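
The peak rates quoted in this section follow directly from the data width and the transfer time. The short Python sketch below reproduces the arithmetic; the function name is an assumption, and only the data bits of each transfer are counted, as in the text.

```python
def peak_megabits_per_second(data_bits, nanoseconds_per_transfer):
    """Peak rate of a ring moving `data_bits` every transfer period."""
    # bits per nanosecond equals gigabits per second; scale to megabits.
    return data_bits * 1000 / nanoseconds_per_transfer


system_ring = peak_megabits_per_second(16, 20)            # 16 data bits per 20 ns
application_ring = peak_megabits_per_second(16, 20)
cyberplus_memory_ring = peak_megabits_per_second(64, 20)  # one 64-bit word per 20 ns
both_16_bit_rings = system_ring + application_ring
```

This gives 800 megabits per second for each 16-bit ring, 1600 for the two together, and 3200 for the CYBERPLUS Memory Ring, matching the figures above.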


4. IMPLEMENTATION OF LGDF ON CYBERPLUS 


Our primary goal in designing an LGDF scheduler for the 
CYBERPLUS is to minimize scheduling overhead. To achieve 
this, we need to minimize the time when one processor must lock 
out other processors from accessing shared scheduler tables. 
Efficient mechanisms must also be designed for the transfer of 
data among LGDF processes. Secondary considerations include 
providing the ability to schedule arbitrarily large LGDF networks, 
and scheduling processes efficiently with any number of processors. 
Also important is the ability to simulate LGDF parallelism on a 
single CYBERPLUS processor. This aids in implementing large 
networks, provides the ability to debug applications in a more 
controlled (less parallel) setting, and provides greater flexibility in 
assigning processes to processors. 


The smaller the average grain size in a given application, the 
more opportunity there is for parallel computation. However, for 
the CYBERPLUS, it would be inefficient to schedule computations 
on a very low level, such as the basic arithmetic operations. The 
amount of sequential computation in an average grain should be 
large enough relative to the scheduling process to reduce overhead 
to an acceptable percentage, say ten percent or less. 


Since the amount of code required for an LGDF scheduler is typi- 
cally quite small, we have chosen to implement a resident 
scheduler in each processor. This avoids the overhead of passing 
messages between a processor and a single central scheduling process.
In addition, parts of the scheduling process can be accomplished
in parallel. However, to maintain scheduler integrity, the
distributed LGDF schedulers must obtain exclusive access to certain
system information contained in shared tables for a portion
of the scheduling process. 


4.1. LGDF SCHEDULER TABLE INTERLOCKING 


Even though the CYBERPLUS has no shared global memory, the 
effect of a shared table can be implemented by keeping a copy of 
it in each processor’s local memory. A processor wishing to 
update a system table does so under the protection of an interlock 
scheme. The processor then updates its local copy, broadcasts the 
new table to all other processors via the CYBERPLUS Memory 
Ring, and releases the interlock¹.


¹ It might seem more efficient to broadcast updates to system
tables, rather than the entire table, but the tables required to


To implement the required interlocking, a token passing scheme is
used. Two different types of tokens are required: D tokens to 
protect access to (subsets of) datapath state tables, and P tokens 
to protect access to (subsets of) LGDF process state tables. In the
following discussion, we will assume that only two tokens exist,
one of each type. The processor which currently owns each token 
is known by the schedulers on all other processors. If a processor 
needs a token which it does not own (e.g., because an LGDF pro- 
cess which it is executing requests a clear_ or set_), it sends an 
interrupt message over the ring to the current owner. If the 
current owner of the token no longer needs it, the token is sent to 
the requesting processor by broadcasting the processor number of 
the new owner to all processors via the ring. Thus, when owner- 
ship changes, each requesting processor is notified of the new 
owner and can determine if it now has the token. If a requesting 
processor did not get the token on a change of ownership, a new 
request can be made to the new owner for the token. 
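
The ownership protocol for a single token can be sketched as follows. This Python model is an illustration under several assumptions: the class and method names are hypothetical, ring messages are replaced by direct method calls, and a first-come waiting list stands in for the repeated requests that losing processors would send to the new owner.

```python
class TokenRing:
    """Models one token (D or P) shared by the schedulers of a ring group."""

    def __init__(self, n_processors, owner=0):
        # Every scheduler keeps its own record of the current owner.
        self.owner_seen_by = [owner] * n_processors
        self.busy = False        # True while the owner still needs the token
        self.waiting = []        # processors whose request is outstanding

    def request(self, proc):
        """Return True if `proc` holds the token after this request."""
        if self.owner_seen_by[proc] != proc:
            if self.busy:
                if proc not in self.waiting:
                    self.waiting.append(proc)
                return False
            self._broadcast_owner(proc)
        self.busy = True
        return True

    def done(self, proc):
        """The owner no longer needs the token; pass it on if requested."""
        if self.owner_seen_by[proc] != proc:
            raise ValueError("only the owner can give up the token")
        self.busy = False
        if self.waiting:
            self._broadcast_owner(self.waiting.pop(0))
            self.busy = True

    def _broadcast_owner(self, new_owner):
        # Stands in for broadcasting the new owner's number via the ring.
        for i in range(len(self.owner_seen_by)):
            self.owner_seen_by[i] = new_owner
```

In this model an owner that has finished with the token simply remains the recorded owner until another processor asks for it.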


The requirement that the scheduler be able to handle arbitrarily 
large LGDF networks, combined with the limited program 
memory of CYBERPLUS, indicates that LGDF programs should 
be stored as overlays in the CYBERPLUS bulk memory. When 
the scheduler wishes to execute an LGDF process, it loads the 
overlay from bulk memory into program memory and then 
branches to it. This also allows each processor to execute any of 
the LGDF processes of the application. 


P (process) token interlocking behaves as follows. First, the 
scheduler obtains the P token which allows it to examine the process
queue. The process queue contains the names of all processes
which have previously been identified as eligible to execute. A 
process is removed from the queue and the revised queue is broad- 
cast to all other processors via the CYBERPLUS Memory Ring. 
The P token is then released and the appropriate LGDF process 
overlay is loaded into program memory and executed. When the 
process suspends, it returns to the scheduler, which checks for violation
of the LGDF Data Flow Progress Rule, then attempts to
select a new process for execution.


If the process queue was empty when the P token was obtained, 
the scheduler must also obtain the D (datapath) token. All eligi- 
ble processes (i.e., that satisfy the LGDF Execution Rule) that are 
not currently executing are placed into the eligible process queue,
and scheduling resumes as described above². If the eligible process
queue is still empty, the scheduler enters an idle loop. When any 
processor changes the state of a datapath (full to empty or empty 
to full), it sends a special "change message" to all processors via
the ring. Any schedulers in the idle loop will be interrupted by 
this message. They then compete for the P token in order to 
obtain a process to execute. 
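
One scheduling decision, as described above, can be sketched in Python. The sketch makes simplifying assumptions: token acquisition and ring broadcasts are left out and only noted in comments, processes are plain dictionaries, and the function names are hypothetical.

```python
def eligible(process, state):
    """LGDF Execution Rule applied to a table of datapath states."""
    return (all(state[d] == "full" for d in process["inputs"])
            and all(state[d] == "empty" for d in process["outputs"]))


def next_process(queue, processes, state, running):
    """Return the name of the next process to run, or None to idle."""
    # The P token would be held from here on.
    if not queue:
        # The D token would also be needed to read the datapath states.
        queue.extend(name for name, process in processes.items()
                     if name not in running and eligible(process, state))
    if not queue:
        return None              # idle loop until a change message arrives
    chosen = queue.pop(0)
    # The revised queue would be broadcast before the P token is passed on.
    return chosen
```

A scheduler that receives None stays idle; a later change to `state` corresponds to the change message that lets idle schedulers try again.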


To achieve independence of the scheduling code from the number 
of processors, all processors are treated as equals (there are no 
masters or slaves). All CYBERPLUS processors contain the same 
application programs and scheduling code, as shown in Fig. 4.
The processors compete for the chance to advance the state of the
computation by running any eligible process, as described above. 


4.2. IMPLEMENTING LGDF DATAPATHS 


Since each processor can execute any of the LGDF processes, all 
datapaths of the network must be available to each processor. 
This can be accomplished by keeping a copy of each datapath in 
the bulk memory of each processor. A process which wants to 
write on a datapath first updates its local copy of that datapath, 


store the data flow state for LGDF networks tend to be very 
small (a few bits per datapath and per process). 


? Note that the P and D tokens are not explicitly released. 
The scheduler is merely willing to pass the tokens to whichever 
scheduler requests it next. 


SCHEDULER SCHEDULER 
TC Oe Sa” 
SYSTEM TABLE | SYSTEM TABLE SYSTEM TABLE SYSTEM TABLE 
CPI CP2 CP3 CP4 


Fig. 4. LGDF scheduler on CYBERPLUS. 


then broadcasts the update via the CYBERPLUS Memory Ring. 
The process then can then change the state of the datapath to 
full by executing a set_. (In the simplified producer/consumer 
model of LGDF parallel processing under consideration here, each 
datapath can only be written by a single LGDF process, so no 
write/write conflicts are possible). A drawback to this scheme is 
that when datapaths represent large arrays, the duplication can 
consume large amounts of bulk memory. 


Several alternatives exist when this duplication causes a problem 
with resources. As shown in Fig. 5, the memory of the host pro- 
cessor can be used as a global memory for the processors in the 
ring group. In this scheme, the producer process writes the array 
to host memory and places a pointer to it in the datapath, which 
is then broadcast to all processors in the ring group. When the 
consuming process executes, it can load the array directly into its 
local memory, thus avoiding duplication in other processors. 


Fig. 6 shows another alternative that avoids the transfer time to 
and from host memory. However, this method requires a memory 
manager on each processor as part of the scheduler. In this 
scheme, the producer process obtains a block of local memory 
from its scheduler to contain the large array to be passed to the 
consumer process. The address of this block and processor
number are placed in the datapath, which is broadcast to all
other processors in the ring group. The producer process can
suspend and allow other processes to execute in that processor, 
but the block containing the array is protected by being reserved 
through the memory manager. When the consumer process for 
this array executes, the datapath will contain the processor 
number and the address of the required array. The array is 
transferred directly to the consumer processor via the Memory 
Ring. The scheduler then sends a message via the ring to the pro- 
ducer processor’s memory manager, releasing the block of memory 
in that processor. 
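
The managed-memory scheme can be illustrated with a small Python model. Everything named here is an assumption made for the sketch: the `MemoryManager` class, the integer addresses, and the use of a list copy in place of a Memory Ring transfer.

```python
class MemoryManager:
    """Per-processor manager that reserves blocks of local memory."""

    def __init__(self):
        self.blocks = {}         # address -> reserved array
        self.next_address = 0

    def reserve(self, array):
        address = self.next_address
        self.next_address += len(array)
        self.blocks[address] = list(array)
        return address

    def release(self, address):
        del self.blocks[address]


def produce(processor, managers, array):
    """Reserve a block and return the descriptor placed in the datapath."""
    address = managers[processor].reserve(array)
    return (processor, address)  # broadcast to the other processors


def consume(descriptor, managers):
    """Fetch the array from the producer's memory, then release the block."""
    processor, address = descriptor
    local_copy = list(managers[processor].blocks[address])  # ring transfer
    managers[processor].release(address)  # message to the producer's manager
    return local_copy
```

Only the small descriptor is duplicated in every processor; the array itself exists once in the producer's memory until the consumer has copied it.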


Fig. 5. Datapath via CYBER memory. 
Fig. 6. Datapath via managed memory.


Another method for handling extremely large arrays is through 
use of disk storage on the host (or directly on the ring group). 
The array is written to disk, perhaps piecemeal, by the producer
processor. A descriptor pointing to it is placed in the datapath,
and the datapath is broadcast via the CYBERPLUS
Memory Ring to all other processors. The consuming process then 
references the datapath to obtain the disk descriptor and reads 
the data, again perhaps piecemeal, directly into its local memory. 


5. CONCLUSIONS AND EXTENSIONS TO WORK 


An alternative approach to placing schedulers and all code and 
data in each CYBERPLUS processor is to download code and 
data from the CYBER host onto a CYBERPLUS only as needed, 
with all scheduling done by the host. This implementation would 
reduce duplication of program and data storage, but the schedul- 
ing overhead would be much larger. Thus, a larger average grain 
size would be required to keep the significance of overhead small. 
Another possible strategy is to allow LGDF programs running on 
the host to interact transparently with LGDF programs on the 
CYBERPLUS processors, and migrate back and forth automati- 
cally. 


The design described in this paper envisions no distinction among 
CYBERPLUS processors -- any processor can run any LGDF pro- 
cess. For very large systems, there could be specialized schedulers 
on different CYBERPLUS nodes. One or more CYBERPLUS pro- 
cessors would be dedicated to each portion of the total LGDF net- 
work. This would require the ability to subset the shared system 
tables. For example, in a CYBERPLUS configuration with multi- 
ple ring groups, subsets of the network could be assigned to each 
ring group, with a master scheduler located on the CYBER host 
coordinating data flow between sub-nets. 
Macro vs Micro Dataflow: A Programming Example
Maya B. Gokhale 


University of Delaware 
Newark, Delaware 19716 


Although there are yet to be developed formal methodolo- 
gies for parallel algorithm design, several important principles (do- 
main decomposition, substructuring, dataflow) have been estab- 
lished as useful in writing parallel algorithms. This paper is an 
exploration into the principle of data flow, examined at two levels- 
macro (or task level) and micro (or instruction level). Several 
versions of concurrent algorithms are presented for a simple linear 
algebra problem. The macro dataflow programs use an extended 
Fortran for the Denelcor HEP. The micro dataflow programs use a 
dataflow programming language Manchester Dataflow (MaD). Pro- 
gramming effort and efficiency considerations in each environment 
are contrasted. 

1. Introduction 


"The design of efficient parallel algorithms is still in the realm of
black art." However, because of the great progress in developing
parallel algorithms, there are certain principles of parallel programming
which have been established. This paper investigates a
specific methodology, dataflow, by means of a case study. Two
parallel environments are studied. The first extends the tradi- 
tional sequential program by supporting concurrent cooperating 
tasks which share data structures (task level parallelism). The 
Denelcor HEP is our example of this programming environment. 
The second is a tagged token dynamic dataflow system, the Manch- 
ester Dataflow Machine [1] programmed in Manchester Dataflow 
Language (MaD). This machine supports parallelism at all levels, 
from task to individual instruction. 


The next section expands on the notions of task level and 
instruction level parallelism. Each machine is described briefly. 
Then, we discuss a simple problem: solving an upper triangular 
system of simultaneous equations using back substitution. An algorithm
is outlined to solve the problem under each of the two
computation models. Efficiency issues under each model are dis- 
cussed. Finally we compare and contrast the programming effort 
required for each model. 
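
For reference, the sequential form of the problem is shown below as a Python sketch. It is not one of the HEP Fortran or MaD programs discussed in this paper; the function name and the list-of-lists matrix layout are assumptions made for illustration.

```python
def back_substitute(U, b):
    """Solve U x = b for an upper triangular matrix U with nonzero diagonal."""
    n = len(b)
    x = [0.0] * n
    # Work upward from the last equation, which has a single unknown.
    for i in range(n - 1, -1, -1):
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / U[i][i]
    return x
```

Each x[i] depends on every x[j] with j > i, which is the data dependence that the macro and micro dataflow versions must respect.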


2. Two levels of Parallelism 


In both task level (also called macro) and instruction level (micro) 
parallelism, the programming model involves concurrent execution 
of instructions. At the macro level, the unit of concurrency is the 
process or task. Within a task, instructions are executed sequen- 
tially. A machine to support such a form of parallelism usually 
follows the sequential computer model. The concurrency is sup- 
ported in hardware by duplication of processing units (PU). Each 
PU has its own PC, ALU, and register set. The programmer spec- 
ifies the concurrency by using such primitives as “create task” (or 
“fork”) to start a new concurrent process, and “synchronize” (or 
“join”) to coordinate execution of concurrent processes. 


Instructional level parallelism might at first glance be thought 
to be at the extreme end of task level parallelism. However, in 
a task level parallel environment, there is some measurable overhead
associated with task setup and context switching among tasks.
Efficiency demands that the task be sufficiently large that con- 
text switch overhead does not overwhelm the computation time. If 
parallelism at the instruction level is to be supported, an entirely 
different computation model (and therefore machine realization) is 
required. 


In practice, data flow has been the most popular model de- 
veloped for instructional level parallelism. Rather than instruction 
execution controlled by one (or more) program counter(s), an in- 
struction is executed on a data flow machine when all the operands 
required by the instruction are available. Completion of an instruction
causes a new result to be propagated to other instructions
which use it as an operand. 
2.1 The Denelcor HEP


The HEP is a shared memory multiprocessor with a tagged data 
memory. Each location may be either full or empty. A location 
may be read only if the tag indicates that it is full. A location may 
be written only if the tag indicates that it is empty. The HEP is 
programmed in an extended Fortran IV. The extensions provide for 
asynchronous variables used for process synchronization; process 
creation (a fork command); a coroutine facility; and the ability to 
specify local variables. 
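
The full/empty discipline described above can be illustrated with a small sketch in Python (not HEP Fortran). The class name `FullEmptyCell` and its methods are hypothetical, and the sketch assumes, as section 3.1 notes for semaphored variables, that a read leaves the location empty.

```python
import threading


class FullEmptyCell:
    """One tagged memory word: reads wait for FULL, writes wait for EMPTY."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def write(self, value):
        with self._cond:
            while self._full:           # a write is allowed only when EMPTY
                self._cond.wait()
            self._value = value
            self._full = True           # the tag becomes FULL
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while not self._full:       # a read is allowed only when FULL
                self._cond.wait()
            value = self._value
            self._full = False          # assumed: reading empties the word
            self._cond.notify_all()
            return value

    def is_full(self):
        with self._cond:
            return self._full


cell = FullEmptyCell()
results = []
consumer = threading.Thread(target=lambda: results.append(cell.read()))
consumer.start()        # the consumer waits inside read() until a value arrives
cell.write(3.5)
consumer.join()
```

The consumer may start before the producer has written anything; it simply waits on the FULL condition, which is the synchronization behavior the tagged memory provides in hardware.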


The challenge in creating efficient programs for the HEP is to
set up independent concurrent instruction streams. The instruction
streams related to one problem constitute a Task System. The
processes can be synchronized through the scoreboarded memory [2].


2.2 The Manchester University Data Flow Machine 


Machine language for a data flow machine takes the form of a
directed graph rather than the traditional linear sequence of
instructions. A node in the graph represents a unit of computation;
an arc represents the flow of data into and out from the units of
computation. A node may “fire” or be executed whenever data
(also called a “token”) is available on the input arcs, and as a
result, produce data on the output arcs. Variable names in a data
flow language denote values rather than locations [4]. For this
reason, reassignment to variables (e.g., I := I + 1) is not generally
permitted.
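
The firing rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the Manchester machine's instruction format: the function `run_dataflow`, the node table, and the arc names are all hypothetical. Each arc receives exactly one token, which mirrors the single assignment property.

```python
import operator


def run_dataflow(nodes, initial_tokens):
    """nodes maps a node name to (function, input arc names, output arc name).

    A node fires once, as soon as every one of its input arcs holds a token.
    """
    tokens = dict(initial_tokens)       # arc name -> value
    fired = set()
    order = []
    progress = True
    while progress:
        progress = False
        for name, (fn, inputs, output) in nodes.items():
            if name not in fired and all(arc in tokens for arc in inputs):
                tokens[output] = fn(*(tokens[arc] for arc in inputs))
                fired.add(name)
                order.append(name)
                progress = True
    return tokens, order


# The expression (b - s) / a written as a graph. The textual order of the
# nodes is deliberately "wrong": only data availability decides who fires.
nodes = {
    "div": (operator.truediv, ["diff", "a"], "x"),
    "sub": (operator.sub, ["b", "s"], "diff"),
}
tokens, order = run_dataflow(nodes, {"a": 4.0, "b": 10.0, "s": 2.0})
```

Although `div` is listed first, it cannot fire until `sub` has produced the token on arc `diff`, so the firing order is determined by the data dependencies alone.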


The simple scenario outlined above becomes more complicated 
in the presence of multiple sets of data for an instruction. To keep 
data sets distinct, some implementations provide for data tagging. 
Each token carries with it an identification as to the block level at 
which it belongs. The tag can also be used to differentiate elements 
of an array. 


Definition of array elements poses a special problem for data 
flow machines. The semantics of the data flow programming lan- 
guage must ensure that reassignment to individual elements is not 
permitted. However, when a generalized subscript expression is 
used to index an array, there is no way to guarantee at compile 
time that reassignment will not occur. For this reason, data flow 
languages often have restrictions on the definition of arrays. In the
Manchester Dataflow language (MaD) [3], assignment to individual
array elements is not allowed. Instead, the programmer must
formulate one expression which computes the values of all elements
of the array.


3. Algorithm Design for Parallelism: An Example 


At this point, using dataflow as the guiding principle, we would 
like to investigate algorithm design for each form of parallelism dis- 
cussed above, macro and micro. Our goal is to discover: 1) styles of 
programming and 2) machine characteristics which influence both 
algorithm performance and ease of programming. 


The vehicle we use for the investigation is a very simple prob- 
lem in linear algebra. The problem is to solve a set of simultaneous 
equations. The coefficient matrix is assumed to have been reduced 
to upper triangular form by some standard technique. Thus, back 
substitution can be performed on the coefficient matrix to solve for 
the unknowns. 


The problem may be formulated as follows:

Given a set of n equations in n unknowns,

\[
\begin{array}{rcl}
a(1,1)\,x(1) + a(1,2)\,x(2) + \cdots + a(1,n)\,x(n) & = & b(1) \\
a(2,2)\,x(2) + \cdots + a(2,n)\,x(n) & = & b(2) \\
\vdots & & \vdots \\
a(n,n)\,x(n) & = & b(n)
\end{array}
\tag{1}
\]

find x(i), i = 1 ... n, where x(i) is given by

\[
x(i) = \frac{b(i) - \sum_{k=i+1}^{n} a(i,k)\,x(k)}{a(i,i)}
\tag{2}
\]


3.1 Back Substitution on the HEP


We first look at a concurrent task-based solution to (2). Let each
process P(i) be responsible for the computation of x(i).

P(i):
(1) Form the sum of a(i,k)*x(k), k = i+1 ... n.
(2) Take b(i) - (1).
(3) Let x(i) = (2) / a(i,i).
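
For reference, the three steps of P(i), carried out sequentially for i = n down to 1, amount to ordinary back substitution. The following Python sketch uses 0-based indices, and the function name `back_substitute` and the sample system are illustrative assumptions rather than part of the original programs.

```python
def back_substitute(a, b):
    """Solve a x = b for an upper triangular matrix a (list of rows)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # step (1): sum of a(i,k) * x(k) for k = i+1 ... n
        s = sum(a[i][k] * x[k] for k in range(i + 1, n))
        # steps (2) and (3): subtract from b(i), then divide by a(i,i)
        x[i] = (b[i] - s) / a[i][i]
    return x


# A 4-equation example whose solution is x = (1, 2, 3, 4).
a = [[2.0, 1.0, 1.0, 1.0],
     [0.0, 3.0, 1.0, 2.0],
     [0.0, 0.0, 4.0, 1.0],
     [0.0, 0.0, 0.0, 5.0]]
b = [11.0, 17.0, 16.0, 20.0]
x = back_substitute(a, b)
```

The loop over i is inherently ordered, since x(i) needs every x(k) with k > i; the parallel versions exploit the fact that the individual products within and across rows can still overlap.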


In addition to the code for each process, we must plan for task
initiation and synchronization among the tasks. For synchronization,
semaphored variables are used. We will make the x vector
a semaphored variable, a “$” variable in HEP Fortran. The “$”
variables are placed in the special scoreboarded memory. $x(n)
may only be read by P(n-1) when the location is FULL. The algorithm
is written so that only P(n) writes $x(n). Therefore, even
if P(n-1) tries to read $x(n) before the value has been computed,
the process will merely wait on the FULL condition. The programmer
is responsible for correct use of the scoreboarded variables to
synchronize tasks. The graph of Figure 1 shows the flow of shared
data (e.g., the $x variables) among tasks.


[Figure 1: graph of the flow of shared variables among tasks; graphic not reproduced.]


Unfortunately, we cannot merely prepend an x with “$”. When
process P(i) reads a semaphored variable, it sets the semaphore to
EMPTY. Thus all the other processes are stalled: each is waiting
for FULL on the variable, and this event will not occur again. Our
solution to this problem involves having multiple copies of the
variable, one for each process using it. Figure 2 shows the HEP version
of the back substitution algorithm. x(i) is the local copy of P(i)’s
result. The scoreboarded copies are written to $x(processor-num,
i), where processor-num can vary from 1 to i-1.


For i=n to 1 by -1
    Create process P(i)


P(i):
    sum = 0
    for k=n to i+1 by -1
        sum = sum + a(i,k) * $x(i,k)
    x(i) = (b(i) - sum) / a(i,i)
    for k=i-1 to 1 by -1
        $x(k,i) = x(i)


Figure 2. Synchronized Task-based Back Substitution 
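
The structure of Figure 2 can be imitated with Python threads, purely as an illustration: Python threads stand in for HEP processes, and the hypothetical `Cell` class stands in for a scoreboarded word whose read leaves it EMPTY. Indices are 0-based, and `sx[k][i]` plays the role of $x(k,i), the private copy of x(i) destined for process P(k).

```python
import threading


class Cell:
    """Full/empty word: read waits for FULL and empties it; write fills it."""

    def __init__(self):
        self.cond = threading.Condition()
        self.full = False
        self.value = None

    def write(self, v):
        with self.cond:
            while self.full:
                self.cond.wait()
            self.value, self.full = v, True
            self.cond.notify_all()

    def read(self):
        with self.cond:
            while not self.full:
                self.cond.wait()
            self.full = False
            self.cond.notify_all()
            return self.value


def hep_back_substitute(a, b):
    n = len(b)
    sx = [[Cell() for _ in range(n)] for _ in range(n)]
    x = [None] * n

    def process(i):
        s = 0.0
        for k in range(n - 1, i, -1):        # for k = n to i+1 by -1
            s += a[i][k] * sx[i][k].read()   # waits until P(k) has stored x(k)
        x[i] = (b[i] - s) / a[i][i]
        for k in range(i - 1, -1, -1):       # one copy per consuming process
            sx[k][i].write(x[i])

    threads = [threading.Thread(target=process, args=(i,))
               for i in range(n - 1, -1, -1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x


a = [[2.0, 1.0, 1.0, 1.0],
     [0.0, 3.0, 1.0, 2.0],
     [0.0, 0.0, 4.0, 1.0],
     [0.0, 0.0, 0.0, 5.0]]
b = [11.0, 17.0, 16.0, 20.0]
x = hep_back_substitute(a, b)
```

Because every cell is written exactly once and read exactly once, no process can be left waiting on a word that another reader has already emptied, which is the stall the single shared $x would cause.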


With this approach, counting only floating point operations
and synchronization stores, a 4-equation problem takes 20 timesteps.
Each additional equation adds a cost of 4 timesteps.

A slight modification to the order of evaluation of intermediate
terms of the program in Figure 2 results in an improved execution
time. We revise the body of P(i) as in Figure 3. The subtraction
of the sum of products on the left from b(i) on the right is done at
the beginning rather than at the end. With this approach, the cost
of each additional equation is 3 timesteps rather than 4 because of
the overlapped subtraction.

In summary, there are two classes of program design we have 
employed. The optimizations represented by Figure 3 could be 
made by a fairly standard mechanical optimizer. They involve 
expression rearrangement and partial loop unfolding. However, 
the decision at the outset to create a process for each equation,
to run the loops from n down to 1, and the choice of synchronization
method: these design decisions are not as readily mechanizable. This
class of decision is made by the knowledgeable programmer.


P(n):
    for k = n-1 to 1 by -1
        $x(k,n) = b(n)/a(n,n)
P(i) (i<n):
    sum = b(i) - a(i,n)*$x(i,n)
    for k=n-1 to i+1 by -1
        sum = sum - a(i,k)*$x(i,k)
    x(i) = sum / a(i,i)
    for k=i-1 to 1 by -1
        $x(k,i) = x(i)


Figure 3. Task-based Back Substitution Optimized 


3.2 Back Substitution on the Data Flow Machine 


We next implement an algorithm to solve the same problem on
the data flow machine. On the HEP we manually decompose a
problem into concurrent tasks. We must observe from the problem
description which groups of operations can be done in parallel and
then explicitly code those groups into discrete tasks. In contrast, on
the data flow machine, the concurrency is managed automatically
by the hardware. We can look at the algorithm as a whole rather
than as a collection of cooperative tasks. The difficulty on the data
flow machine is to formulate and code the algorithm correctly in
the data flow language.


On the HEP a vector $x in the scoreboard memory holds the
values of x(i). Elements of the vector are defined incrementally by
different tasks. In MaD this incremental definition is not allowed.
Instead the vector must be defined completely by one expression.
This constraint results in a programming style unfamiliar to most
conventional language programmers.


In MaD an array is represented by a (possibly multi-dimensional)
STREAM. A stream is a sequence of objects of some type
terminated by a special token called “end-of-stream”. The vector of
unknowns must be declared in MaD as a “stream of real”. Instead
of the familiar definition

x(i) = (b(i) - sum) / a(i,i)

we must use list processing operators such as CONS to put together
the vector x. By using CONS, we create a new list which is the old
list plus a new x(i).


The vector x is produced as follows.
The value of one element of x depends on values of other,
previously computed elements. For example, the value of x(n-1)
depends on the value of x(n). In MaD we handle this problem by
iteratively producing new versions of x. Initially x is null. Then a
new version of x is created which has one element, written [x(n)].
The next version of x has two elements, [x(n-1), x(n)], etc. Essentially,
we are manipulating the vector as a list, and using the cons
operator to put a new element at the front of the list.

This seems to violate the single assignment rule. However, this 
limited form of reassignment is permissible in MaD as long as the 
reassignment is done within a repetitive block, and we differentiate 
between “old” and “new” values of x at each iteration. The MaD 
code for computing x is as follows: 

new x := cons((b(i) - sum) / a(i,i), x)

The next problem is the computation of sum, which is the name
used for the

sum(a(i,k)*x(k)), k = i+1 ... n.


This computation illustrates another way to specify concurrency
in MaD, the FOR EACH clause. Given a stream, in this case, the
current list x, the clause “for each x” causes some action to be
performed involving each element of the list x. For the sum block,
the product a(i,k)*x(k) is formed. sum is computed in a nested
block within a while loop in Figure 4, which gives the complete
(but incorrect) block in which x and sum are computed. Although
this code seems correct at face value, it contains a subtle timing
problem. Consider the sequence of computation for x and sum:


Initially x = [b(n)/a(n,n)] and sum = 0.
The new value of sum is a(n-1,n)*x(n). The new value of x is
[(b(n-1) - 0) / a(n-1,n-1), x(n)]. This is incorrect for x(n-1) because the
sum used in the computation of x is the OLD sum. Because an
incorrect value has been generated for x(n-1), an incorrect value
is computed for sum at the next iteration, and the error is propagated
to each subsequent x(i). This problem occurs even if sum is
initialized to a(n-1,n)*x(n). The new x will be

[(b(n-1) - a(n-1,n)*x(n)) / a(n-1,n-1), x(n)]

but the new sum will be

a(n-2,n-1)*x(n) instead of

a(n-2,n-1)*x(n-1) + a(n-2,n)*x(n)


The problem is that the new sum computation needs the NEW 
x, not the old one. The correct code should require the keyword 
“new” in the for each clause: 


for each xi in new x do... 


Unfortunately, the implementation of the MaD compiler used 
for this example does not support the use of a “new” version of a 
variable on the right hand side of a definition. The final (correct) 
code from which results on parallelism are taken is shown in Figure 
5. We are forced to cycle through the loop 2n times. On the “odd” 
cycle, we update sum, and on the “even” cycle, we update x. 


declare x: stream real; i: integer; sum: real;
init i := n-1; sum := 0;
     x := [b(n)/a(n,n)];           [* x initially contains x(n) *]
while i > 0 do                     [* iterative loop *]
    new i := i-1;
    [* add the next x(i) to the front of x *]
    new x := cons((b(i)-sum)/a(i,i), x);
    new sum := declare             [* a nested block *]
                   prod: real; k: integer; xi: real;
               init k := i+1; prod := 0;
               for each xi in x do [* concurrent loop *]
                   new prod := prod + a(i,k)*xi;
                   new k := k+1;
               return prod;        [* as the new sum *]
return x;


Figure 4. (Incorrect) Data Flow Back Substitution


declare x: stream real; i: integer; sum: real;
        oddd: boolean;
init i := n-1; sum := 0; oddd := true;
     x := [b(n)/a(n,n)];           [* x initially contains x(n) *]
while i > 0 do                     [* iterative loop *]
    new i := if oddd then i else i-1;
    [* add the new x(i) to the front of x *]
    new x := if not oddd then cons((b(i)-sum)/a(i,i), x)
             else x;
    new sum := if oddd then
                   declare         [* a nested block *]
                       prod: real; k: integer; xi: real;
                   init k := i+1; prod := 0;
                   for each xi in x do [* concurrent loop *]
                       new prod := prod + a(i,k)*xi;
                       new k := k+1;
                   return prod     [* as the new sum *]
               else sum;
    new oddd := not oddd;
return x;


Figure 5. (Correct) Data Flow Back Substitution
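
The old/new timing problem can be reproduced in a conventional language by computing every “new” value of an iteration strictly from the old values and only then committing them together. The Python sketch below is an emulation under that assumption, not MaD code; the function names `fig4_style` and `fig5_style` are hypothetical, indices are 0-based, and x is held as a Python list whose front is the most recently computed element.

```python
def fig4_style(a, b):
    """One pass per row: x and sum are both updated from OLD values."""
    n = len(b)
    i, total = n - 2, 0.0
    x = [b[n - 1] / a[n - 1][n - 1]]
    while i >= 0:
        new_x = [(b[i] - total) / a[i][i]] + x      # uses the stale sum
        new_total = sum(a[i][i + 1 + j] * xj for j, xj in enumerate(x))
        i, x, total = i - 1, new_x, new_total       # commit all "new" values
    return x


def fig5_style(a, b):
    """Two passes per row: the odd pass refreshes sum, the even pass extends x."""
    n = len(b)
    i, total, odd = n - 2, 0.0, True
    x = [b[n - 1] / a[n - 1][n - 1]]
    while i >= 0:
        if odd:
            new_i, new_x = i, x
            new_total = sum(a[i][i + 1 + j] * xj for j, xj in enumerate(x))
        else:
            new_i, new_total = i - 1, total
            new_x = [(b[i] - total) / a[i][i]] + x
        i, x, total, odd = new_i, new_x, new_total, not odd
    return x


a = [[2.0, 1.0, 1.0, 1.0],
     [0.0, 3.0, 1.0, 2.0],
     [0.0, 0.0, 4.0, 1.0],
     [0.0, 0.0, 0.0, 5.0]]
b = [11.0, 17.0, 16.0, 20.0]          # exact solution is (1, 2, 3, 4)
wrong = fig4_style(a, b)
right = fig5_style(a, b)
```

In the one-pass version only the last element, x(n), is correct: every earlier element is computed with the sum belonging to the previous row. Splitting each row into two passes doubles the number of loop cycles but guarantees that the sum seen by the cons is the one just computed.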


The results for a 4-equation problem are summarized in Figure 6.
S1 is the total number of machine instructions executed.
Sinf is the total number of time steps needed for the computation
given as many processors as could be used. The Average Parallelism
figure determines the maximum number of processors which
can be utilized, in this case 9 (8.8). The number of result tokens
generated can be greater than the number of instructions because
some instructions generate two result tokens. The Max Matching
Store Occupancy indicates how many tokens were waiting in
the associative memory for another token with a matching tag to
arrive.


S1 = 2526    Sinf = 288
Average Parallelism of 8.8
3114 Result Tokens Generated
Max Matching Store Occupancy = 115 Tokens


Figure 6. Data Flow Back Substitution: Unoptimized 

In this example, as in the HEP case, there are optimizations
which the programmer can use to increase performance. In the
computation of sum, prod is created iteratively in spite of the for
each construct. The old value of prod is used at each step to compute
a new value. MaD (and most data flow languages) has a
more parallel construct to accomplish the same function. We can
create a stream of a(i,k)*x(k), k = i+1 ... n. All these multiplications
can occur in parallel. Another list processing operator, ALL, is
used to create a stream from the products. When individual data
items are defined separately, the all operator can be used to gather
them into a stream. This operator cannot be used to define x because
a new element of x is defined in terms of previously computed
elements, which would not be available incrementally. However, previous
values of prod are not needed in this case. When the entire stream
is produced, we can apply the reduction “+!” to the stream. This is
identical to the APL “/” operator: a “+” is inserted between adjacent
stream elements, and the new expression is evaluated. If
logarithmic reduction is implemented, the n-element stream can be
added in log n time steps. The results in emulation from this optimization
are summarized in Figure 7. Although the total number
of time steps and the maximum matching store occupancy have
decreased, there is an added cost in this optimization: the total
number of instructions executed has increased, and more result tokens
have been generated. A maximum of 10 PE’s could be utilized
in this case.
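
Logarithmic reduction can be sketched as repeated pairwise addition, where all additions within one level are independent and could run in the same time step. The Python function `tree_reduce` below is a hypothetical illustration of that scheme, not the MaD reduction operator itself; it returns the total together with the number of levels used.

```python
def tree_reduce(values):
    """Pairwise (logarithmic) sum; returns (total, number of parallel steps)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # every addition in this list comprehension is independent of the others
        nxt = [level[j] + level[j + 1] for j in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])       # an odd element passes through unchanged
        level = nxt
        steps += 1
    return (level[0] if level else 0), steps


total, steps = tree_reduce([1, 2, 3, 4, 5])
```

For n elements the loop runs ceil(log2 n) times, compared with the n - 1 dependent additions of the iterative prod computation.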


In this case, as in the task-based dataflow, we have observed
two categories of optimizations. The iteration replaced by the reduction
operator is an optimization well within the range of current
optimization techniques. The timing problem and the generation
of the stream are in the realm of the expert programmer.


S1 = 2604    Sinf = 265
Average Parallelism of 9.8
3210 Result Tokens Generated
Max Matching Store Occupancy = 101 Tokens


Figure 7. Data Flow Back Substitution: Optimized 
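
The Average Parallelism values reported in Figures 6 and 7 are consistent with the ratio S1/Sinf, and the processor counts quoted in the text with that ratio rounded up. The short Python check below assumes this interpretation; the helper name `average_parallelism` is illustrative.

```python
import math


def average_parallelism(s1, sinf):
    """Instructions executed divided by time steps on unlimited processors."""
    return s1 / sinf


unoptimized = average_parallelism(2526, 288)    # Figure 6
optimized = average_parallelism(2604, 265)      # Figure 7

unoptimized_pes = math.ceil(unoptimized)        # processors that can be kept busy
optimized_pes = math.ceil(optimized)
```

The optimized version executes 78 more instructions but needs 23 fewer time steps, which is why its average parallelism is higher.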


4. Conclusions from the Experiment 


We find even in this simple problem that the programmer bears 
the burden of management of parallelism for both task level and 
instruction level parallel machines. On the HEP, we can observe 
that 

- The programmer bears complete responsibility for organizing
  the task system, including the decisions as to which groups of
  operations should go into which tasks.

- The programmer must synchronize among tasks correctly. Shared
  variables must be allocated in the scoreboarded memory, and
  a proper discipline of definition and access maintained so that
  race conditions and deadlock are avoided.

- The programmer must arrange the order of evaluation within
  a task to maximize concurrency. Shared variables should be
  computed as early as possible so that waiting tasks can continue.

- Actual allocation of functional units to tasks is handled
  automatically. Thus at the Fortran level, the programmer need
  not statically assign processors to tasks.


The first two tasks are not readily automated. However, program
analysis tools such as graphical display of data dependencies
could help the programmer detect errors and/or possible optimiza- 
tions. The third can be accomplished to a certain extent by me- 
chanical optimization, and the fourth is already handled by the 
operating system and hardware. 


With instruction level dataflow, we have seen that the pro- 
grammer is not really freed from these concerns in the data flow 
environment. Data flow languages (of which MaD is a fair repre- 
sentative) do limit the expressive power of the programmer. We 
note that 


- Rather than having access to the more problem-oriented data
  structures of vectors and matrices, we are forced to use lists
  and list processing constructs to build the vectors.

- As illustrated in the first version of back substitution in MaD,
  subtle timing problems can still crop up. We are required
  to have a detailed understanding of timing considerations as
  in hardware design (“old” as opposed to “new” values in an
  iteration) to come up with a correct program. Although we
  do not have to explicitly code synchronization, we still have to
  be aware of what is produced when.

- The programmer also bears responsibility for using perhaps
  less familiar forms to optimize the program. In this example,
  we had to know about the combination of FOR EACH, ALL,
  and reduction to parallelize computation of sum. If we code
  the algorithm in a sequential way, it is not going to be magically
  parallelized by virtue of running on a data flow machine.




Of these points, the first could be remedied by using a language
which allows partial assignment to an array in
conjunction with runtime hardware to detect duplicate definition
(perhaps with I-structures [5]). The second is once again in the
realm of the expert programmer. The third could be done by a mechanical
optimizer.


We see that the principle we have chosen to investigate,
dataflow, although manifested differently at the macro and micro
levels, is nonetheless invaluable as a means of structuring an
algorithm. In order to design the task system for the HEP version
of the program we first observed the inherent data dependencies
of the problem (Figure 1), and then used those data dependencies
to generate a structured synchronization mechanism among
the tasks. We voluntarily observed the discipline that each variable
x(i) had one producer and many consumers, and used the
scoreboarded memory resource to implement this discipline. This
dataflow usage of the shared variables made the program more easily
understandable and better organized. At the instruction level,
we were also guided by the flow of data to operators to arrive at
a correct and efficient algorithm. Thus, the notion of data flow,
that data dependencies and only data dependencies should govern
the sequencing of a computation, is a valuable tool by which to structure
a concurrent algorithm.


1 Introduction


This paper presents several parallel algorithms for 
graph problems, in particular for perfect graphs. 
Our main result is a deterministic NC algorithm
for solving the two processor scheduling problem, 
answering an important open problem posed in [VV85]. 
We also present an NC algorithm for transitively 
orienting comparability graphs. By combining these two 
results, we obtain an NC algorithm for the matching 
problem on co-comparability graphs (the complements 
of comparability graphs) and nearly co-comparability 
graphs. In addition our transitive orientation algorithm 
gives us NC algorithms for several additional problems, 
such as identifying permutation graphs and finding 
the maximum weighted clique and optimal colorings in 
comparability graphs. Comparability, co-comparability, 
and permutation graphs are all subclasses of perfect 
graphs. 

Scheduling problems have a long history and ex- 
tensive literature. The most fundamental scheduling 
problems involve unit time execution tasks with prece- 
dence constraints restricting the order of execution. 
When the number of processors varies, the problem is 
NP-complete [Ull75]. There are no published poly- 
nomial time algorithms for a fixed number of proces- 
sors greater than two. The first polynomial time algo- 
rithm for the two processor case was published in 1969 
[FKN69]. Faster algorithms were given by Coffman and 
Graham [CG72] in 1972, and a decade later, Gabow 
[Gab82,GT83] found an asymptotically optimal algo- 
rithm. Recently, Vazirani and Vazirani have published 
a randomized parallel solution [VV85]. Like Fujii et al.,
they use the connection between matching and two pro- 
cessor scheduling, so their algorithm relies on an RNC 
matching subroutine such as [KUW85b] or [MVV]. 

In contrast, our scheduling algorithm [HM86b] is de- 
terministic and does not require the aid of a matching 
subroutine. Therefore we are able to exploit the rela- 
tionship between matching and two processor schedul- 
ing in the other direction, obtaining a deterministic par- 
allel matching algorithm for co-comparability graphs. 

This work was supported in part by a grant from the
AT&T Foundation, ONR contract N00014-85-C-0731, and NSF

The only ingredient required to convert our scheduling
algorithm into a matching result is an NC transitive
orientation subroutine. This routine takes an undirected
graph and directs the edges so that the resulting
digraph is transitively closed. Those graphs which
can be transitively oriented are called comparability
graphs. The complements of comparability graphs are
co-comparability graphs. Kozen, Vazirani and Vazirani,
in independent work, coupled a transitive orientation
routine with our two processor scheduling algorithm to
achieve an NC matching algorithm on co-comparability
graphs [KVV85]. Our transitive orientation subroutine
is also the key element in our algorithms for testing
for permutation graphs and finding maximum weighted
cliques and optimal colorings on comparability graphs.


2 Main Theorems and Applications 


In this section we give some definitions, state our main 
results, and prove several important consequences. We 
use [a, b] to denote a directed arc and (a,b) to denote
an undirected edge between vertices a and b.

A perfect graph is an undirected graph where the 
chromatic number and maximum clique size of every 
induced subgraph coincide. A precedence graph is an 
acyclic, transitively closed digraph. Thus if arcs [a, b]
and [b,c] are in a precedence graph, then so is the arc 
[a,c]. A comparability graph is an undirected graph with 
the property that every edge can be assigned a direction 
such that the resulting graph is a precedence graph. The 
complement of a comparability graph is a co-compara- 
bility graph. Precedence graphs are equivalent to partial 
orders. Some graphs, such as a simple three-cycle, are 
both comparability and co-comparability graphs. 

The undirected graph G = (V, E) is a permutation
graph if there exists a pair of permutations on the
vertices such that the edge (v,v’) ∈ E if and only if v
precedes v’ (or v’ precedes v) in both permutations.
Permutation graphs are equivalent to the comparability
graphs of partial orders with dimension two. A graph
is both a comparability graph and a co-comparability
graph if and only if it is a permutation graph
[PLE71]. Permutation graphs, comparability graphs
and co-comparability graphs are all subclasses of perfect
graphs [Gol80].
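
The definition of a permutation graph translates directly into a construction: two vertices are joined exactly when the two permutations agree on their relative order. The Python sketch below is only an illustration of the definition; the function name `permutation_graph` and the sample permutations are assumptions.

```python
from itertools import combinations


def permutation_graph(p1, p2):
    """Edges (as frozensets) of the permutation graph defined by p1 and p2."""
    pos1 = {v: i for i, v in enumerate(p1)}
    pos2 = {v: i for i, v in enumerate(p2)}
    return {
        frozenset((u, v))
        for u, v in combinations(p1, 2)
        # u and v are adjacent iff both permutations order them the same way
        if (pos1[u] < pos1[v]) == (pos2[u] < pos2[v])
    }


edges = permutation_graph([1, 2, 3, 4], [2, 1, 4, 3])
```

Here the pairs {1,2} and {3,4} are ordered differently by the two permutations, so they are the only non-edges and the result is the four-cycle 1, 3, 2, 4.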


An instance of the two processor scheduling problem
is given by a precedence graph G = (V,E). Each
vertex represents a task whose execution requires unit
time on either of two identical processors. If there is a
directed edge from task t to task t’, then task t must be
completed before task t’ can be started. A schedule is a
mapping from tasks to timesteps such that at most two
tasks are mapped to each timestep and for all tasks t
and t’, if t must precede t’ (t ≺ t’) then t is mapped to
an earlier timestep than t’. The length of a schedule is
the number of timesteps used. An optimal schedule is
one of shortest length.
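
The two conditions in the definition of a schedule can be checked mechanically. The Python sketch below is illustrative only: the names `is_valid_schedule` and `schedule_length` are hypothetical, a schedule is represented as a dictionary from task to timestep, and the length is taken to be the number of distinct timesteps used.

```python
from collections import Counter


def is_valid_schedule(tasks, precedes, schedule):
    """precedes is a set of (t, u) pairs meaning t must finish before u starts."""
    if set(schedule) != set(tasks):
        return False                        # every task must be scheduled
    load = Counter(schedule.values())
    if any(count > 2 for count in load.values()):
        return False                        # at most two tasks per timestep
    return all(schedule[t] < schedule[u] for t, u in precedes)


def schedule_length(schedule):
    return len(set(schedule.values()))


tasks = ["a", "b", "c", "d"]
precedes = {("a", "c"), ("b", "c"), ("a", "d")}
good = {"a": 1, "b": 1, "c": 2, "d": 2}
bad = {"a": 1, "c": 1, "b": 2, "d": 2}      # a and c share a timestep
```

The schedule `good` has length 2, which is optimal here because four unit-time tasks cannot finish on two processors in fewer than two timesteps.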

The maximum matching problem on co-comparability
graphs and the two processor scheduling problem
are closely related. If G is a co-comparability graph
and $\vec{G}$ is a transitive orientation of G’s complement,
then the pairs of tasks mapped to the same timestep
in an optimal two processor schedule of $\vec{G}$ correspond
to a maximum matching in G. Furthermore, there
is a sequential algorithm for converting any maximum
matching for G into an optimal two processor schedule
for $\vec{G}$ [FKN69]. In [VV85] it was conjectured that
this process is inherently sequential, but with our two
processor scheduling algorithm it can be solved quickly
in parallel.


Theorem 1: Two processor scheduling is in NC. 


Proof: We outline an $O(\log^2 n)$ time algorithm in
section 3. Further details can be found in [HM86b]. □


Theorem 2: There is an NC algorithm which detects 
if an undirected graph is transitively orientable, and if 
so finds a transitive orientation. 


Proof: We present such an algorithm in section 4. See
also [KVV85]. □


Corollary 2.1: There is an NC algorithm which detects
whether or not a graph is a permutation graph.


Proof: Graph G is a permutation graph if and only
if both G and its complement $\bar{G}$ are comparability graphs [PLE71].
Therefore, by running our transitive orientation algorithm
on both G and $\bar{G}$, we can determine if G is a
permutation graph. □


Corollary 2.2: There is an NC algorithm which finds 
a maximum node-weighted clique in comparability 
graphs. 


Proof: Given a comparability graph G, we find a
transitive orientation, $\vec{G}$. Examine any k-path in $\vec{G}$.
Because $\vec{G}$ is transitively closed, the nodes on the k-path
form a k-clique in G. Similarly, every k-clique
in G is a k-path in $\vec{G}$. Thus the problem of finding a
maximum node-weighted clique in G reduces to finding
a maximum weight path in $\vec{G}$. Since $\vec{G}$ is a DAG,
standard parallel techniques (i.e. transitive closure) can
be used to find a heaviest path in $\vec{G}$. □
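
The reduction in this proof can be tried out sequentially. The Python sketch below finds a heaviest path in a transitively oriented DAG by memoized search; it is a plain sequential stand-in, not the parallel transitive closure technique the proof refers to, and the function name `max_weight_clique` and the sample graph are assumptions. The arcs are assumed to form a transitive orientation that has already been computed.

```python
from functools import lru_cache


def max_weight_clique(vertices, arcs, weight):
    """Return (total weight, vertices) of a heaviest directed path."""
    succ = {v: [] for v in vertices}
    for u, v in arcs:
        succ[u].append(v)

    @lru_cache(maxsize=None)
    def best_from(v):
        # heaviest path starting at v; the path may also stop at v
        candidates = [(0, ())] + [best_from(w) for w in succ[v]]
        tail_weight, tail_path = max(candidates, key=lambda p: p[0])
        return (weight[v] + tail_weight, (v,) + tail_path)

    return max((best_from(v) for v in vertices), key=lambda p: p[0])


# Transitive orientation of a small comparability graph: 1 -> 2 -> 3, 4 -> 3.
vertices = [1, 2, 3, 4]
arcs = [(1, 2), (2, 3), (1, 3), (4, 3)]
weight = {1: 1, 2: 1, 3: 1, 4: 5}
best = max_weight_clique(vertices, arcs, weight)
```

The largest clique by size is {1, 2, 3}, but with these weights the two-vertex clique {4, 3} is heavier, which the path search reports.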


Corollary 2.3: There is an NC algorithm which finds 
a minimal node-coloring of comparability graphs. 


Proof: Given a comparability graph G, we find a
transitive orientation, $\vec{G}$. We say that a vertex v is
on level i in $\vec{G}$ if the longest (directed) path from v to
a sink contains exactly i vertices. Clearly any pair of
nodes on the same level are not adjacent in G, so they
can be assigned the same color. Every node on level
i > 1 is a predecessor of at least one node on level i − 1.
Therefore, if $\vec{G}$ has k levels then $\vec{G}$ has a path of length
k and G has a k-clique. Since no coloring can use fewer
colors than the size of the largest clique, this yields an
optimal coloring. □
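
The level construction in this proof is easy to sketch sequentially. In the Python fragment below, a vertex's color is simply its level in the transitive orientation; the function name `level_coloring` and the example graph are illustrative assumptions, and the orientation is again assumed to be given.

```python
def level_coloring(vertices, arcs):
    """Color each vertex by the number of vertices on its longest path to a sink."""
    succ = {v: [] for v in vertices}
    for u, v in arcs:
        succ[u].append(v)

    level = {}

    def level_of(v):
        if v not in level:
            # a sink is on level 1; otherwise one more than the deepest successor
            level[v] = 1 + max((level_of(w) for w in succ[v]), default=0)
        return level[v]

    return {v: level_of(v) for v in vertices}


vertices = [1, 2, 3, 4]
arcs = [(1, 2), (2, 3), (1, 3), (4, 3)]
colors = level_coloring(vertices, arcs)
```

Vertices 2 and 4 land on the same level and are indeed non-adjacent, and the three colors used match the size of the largest clique {1, 2, 3}.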


Theorem 3: There is an NC algorithm for finding 
maximum matchings on co-comparability graphs. 


Proof: One such algorithm is given in section 5. □


This theorem is extended to nearly co-comparability 
graphs in section 5. 


Corollary 3.1: Maximum matchings for permutation
graphs and partial orders of dimension 2 can be found
in NC.


Proof: As stated above, these graphs are co-comparability
graphs. □


Corollary 3.2: There is an NC algorithm which finds
maximum matchings on interval graphs.


Proof: Interval graphs are a subclass of co-comparability
graphs [GH64]. □


3 Two Processor Scheduling 


The scheduling algorithm is built around a routine that, 
for any precedence graph, computes the length of the 
graph’s optimal schedule(s). This length routine is 
applied repeatedly in order to find an actual optimal 
schedule for the input graph. 

Let G = (V, ≺) be the precedence graph we are
interested in. If t ≺ t’ then t is a predecessor of t’ and t’
is a successor of t. For any pair of tasks, t,t’ ∈ V, define
$V_{t,t'}$ to be the set of tasks which are both successors of t
and predecessors of t’ and $G_{t,t'}$ to be the subgraph of G
induced by $V_{t,t'}$. The schedule distance between tasks t
and t’, SD(t,t’), is the length of an optimal schedule for
$G_{t,t'}$. If t ⊀ t’ then SD(t,t’) = 0.


[Figure 1 graphic not reproduced: a precedence graph drawn by level, with columns labeled “level” and “jump” and the added tasks at top and bottom.]


Figure 1: This is a precedence graph containing fifteen
tasks (transitive arcs have been omitted). The special
tasks $t_{top}$ and $t_{bot}$ are added when computing the length
of optimal schedules for G. The levels of the original
graph are on the left and the jump sequence is on the
right.


Lemma 3.1: Let S be a set of tasks such that for all
s ∈ S:

i. t ≺ s ≺ t’;

ii. SD(t,s) ≥ k; and

iii. SD(s,t’) ≥ l.

Then SD(t,t’) ≥ k + l + |S|/2.


Proof: Count the number of timesteps required to
schedule those tasks between t and t’. There must be at
least k timesteps before the first task in S is scheduled.
It takes at least |S|/2 timesteps to complete the tasks
in S. After the last task in S has been completed,
at least l additional timesteps are required. Therefore
SD(t,t’) ≥ k + l + |S|/2. □


The distance algorithm (see figure 2) uses a method
like transitive closure to compute the schedule distance
between all pairs of tasks in a precedence graph G =
(V, ≺). It initially guesses that the scheduling distance
between each pair of tasks is zero. By repeatedly
applying lemma 3.1 to each pair of tasks in parallel the
algorithm refines its guesses. Below we prove that after
log |V| iterations, the algorithm’s guess for each pair of
tasks has converged to the schedule distance. The distance
algorithm has a straightforward implementation on an
n° processor P-RAM taking $O(\log^2 n)$ time.


d_0(*, *) := 0;
for i := 1 to ⌈log n⌉ do
    for all t, t’ with t ≺ t’ do in parallel
        for all 0 ≤ k, l ≤ n−1 do in parallel
            S_{t,t’,k,l} := {s : t ≺ s ≺ t’,
                                 d_{i−1}(t, s) ≥ k,
                                 d_{i−1}(s, t’) ≥ l};
        d_i(t, t’) := max over S_{t,t’,k,l} ≠ ∅ of {d_{i−1}(t, t’),
                                 k + l + ⌈|S_{t,t’,k,l}|/2⌉};
SD(*, *) := d_{⌈log n⌉}(*, *)


Figure 2: The distance algorithm. 


[Figure 3 graphic not reproduced: a chart with one row per processor over timesteps 1 through 10.]


Figure 3: This is a lexicographically maximal jump
schedule for the graph in figure 1. Each of the sets
$\chi_i$ is boxed.


Lemma 3.2: The distance algorithm always computes 
the schedule distance between every pair of tasks. 


Proof: Lemma 3.1 guarantees that the distances
computed by the algorithm are never greater than
the schedule distances.

In [CG72] it is shown how to construct sets of tasks
$\chi_0, \chi_1, \ldots, \chi_k$ for any precedence graph such that:


• Those tasks in any $\chi_i$ are predecessors of all tasks
in $\chi_{i-1}$, and

• The length of optimal schedules for G is $\sum_i \lceil |\chi_i|/2 \rceil$
(see figure 3).


Our algorithm doesn’t compute these $\chi$s; we simply use
their existence to prove that the distances the algorithm
does compute converge to the scheduling distance.
Examine how the distance algorithm computes the
schedule distance between an arbitrary pair of tasks, t
and t’. Let $\chi_1, \chi_2, \ldots, \chi_k$ be a set of $\chi$s for $G_{t,t'}$, $\chi_0 = \{t\}$,
and $\chi_{k+1} = \{t'\}$. After the first iteration of the outer
loop, the distance computed between any task in $\chi_i$
and one in $\chi_{i+2}$ is at least $\lceil |\chi_{i+1}|/2 \rceil$. After the second
iteration, the distance computed between any task in $\chi_i$
and any task in $\chi_{i+4}$ is at least
$\lceil |\chi_{i+1}|/2 \rceil + \lceil |\chi_{i+2}|/2 \rceil + \lceil |\chi_{i+3}|/2 \rceil$
($S = \chi_{i+2}$, $k = \lceil |\chi_{i+1}|/2 \rceil$, $l = \lceil |\chi_{i+3}|/2 \rceil$). Each
iteration we double the number of $\chi$s accounted for.
After log k iterations, the computed distance between
t and t’ is at least the optimal schedule length for $G_{t,t'}$,
and thus at least SD(t,t’).

Since G contains n tasks, each $G_{t,t'}$ has at most n − 2
$\chi$s. Therefore, after ⌈log n⌉ iterations the algorithm
computes the schedule distance for each pair of tasks. □


The distance algorithm can be used to compute the 
length of optimal schedules for a graph. Augment the 
graph with two dummy tasks, t_top and t_bot, which are
a predecessor and successor (respectively) of all other
tasks in G. Now SD(t_top, t_bot) is the length of G’s
optimal schedules, and can be found using the distance 
algorithm. 

The method for converting the distance algorithm 
into one which finds an optimal schedule involves several 
constructions. For the sake of brevity this paper 
contains only an outline of our method. Interested 
readers should consult [HM86b] for a more detailed 
presentation. 

The search for an optimal schedule can be restricted
to the class of Lexicographically Maximal Jump (LMJ)
schedules. Each task t in the precedence graph is
assigned a level equal to the number of tasks in the
longest path from t to a sink. A level schedule gives
preference to tasks on higher levels. More precisely,
suppose levels L, ..., l+1 have already been scheduled
and there are k unscheduled tasks remaining on level l.
If k is even, level schedules pair the k tasks with each
other and there is no jump from level l. If k is odd, a
level schedule pairs k−1 of the tasks with each other and
the remaining task t is paired with a task from a lower
level l' < l. In this case, level l jumps to level l'. We
assume that there are an unlimited number of dummy
tasks on level 0 which can be paired with any other
tasks. The jump sequence of a level schedule is the levels
jumped to, listed in decreasing order by level jumped
from (see figure 1). The Lexicographically Maximum
Jump (LMJ) sequence is the jump sequence (resulting
from some level schedule) that is lexicographically
greater than any other jump sequence resulting from
a level schedule. An LMJ schedule is a level schedule
whose jump sequence is the LMJ sequence. Note that
our definition of LMJ is similar to the definitions of
highest level first in [Gab82] and [VV85]. The following
theorem establishes the importance of LMJ schedules.
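
As a small illustration of the level assignment used above, the sketch below computes, for each task, the number of tasks on the longest path from that task to a sink. It assumes the precedence graph is given as a dictionary of immediate successors; the name `task_levels` is hypothetical and the code is sequential, not the parallel method of the paper.

```python
from functools import lru_cache

def task_levels(successors):
    """successors maps every task to an iterable of its immediate successors."""
    @lru_cache(maxsize=None)
    def level(t):
        # A sink has level 1; otherwise one more than the deepest successor.
        return 1 + max((level(s) for s in successors[t]), default=0)
    return {t: level(t) for t in successors}

levels = task_levels({"a": ("b", "c"), "b": ("d",), "c": (), "d": ()})
print(levels)
```

Tasks with equal level never precede one another, which is why pairs within a level can be formed arbitrarily.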


Theorem 4: [Gab82] Every LMJ schedule is optimal. □


Our two processor algorithm uses the distance
algorithm to find the LMJ sequence and which jump
(if any) a pair of tasks can be used for. In general,
there will be many possible pairs for each jump. A path
doubling computation finds a consistent set of task pairs
for the jumps. The remaining tasks are paired up within
levels. Since there are never precedence constraints
between two tasks on the same level, this pairing can
be done arbitrarily. An LMJ schedule is obtained by
sorting the resulting set of task pairs (both for jumps
and within levels).


4 Transitive Orientation 


The transitive orientation problem is nontrivial because
some edges cannot be oriented independently. If the
edges (a,b) and (b,c) are in the graph to be oriented,
but the edge (a,c) is not, then the edges (a,b), (b,c)
cannot be oriented independently. If we choose the arc
[a,b] then we are forced to include the arc [c,b] in the
transitive orientation (see figure 4). The binary relation
Γ reflects this simple kind of forcing [PLE71]. Given
G = (V,E), we say that [a,b] Γ [a',b'] if (a,b) ∈ E,
(a',b') ∈ E and either a = a' and (b,b') ∉ E or b = b'
and (a,a') ∉ E. Thus [a,b] Γ [a,b], and if [a,b] Γ [c,b] then
[b,a] Γ [b,c]; however, [a,b] is not Γ related to [b,a].

The reflexive, transitive closure of Γ, Γ*, is an
equivalence relation on the possible orientations of edges
in E. For obvious reasons, we call these equivalence
classes implication classes. If A is a set of arcs (e.g. an
implication class) then A* denotes the set of undirected
edges {(a,b) : [a,b] ∈ A ∨ [b,a] ∈ A}, and A^{-1} is the set
of arcs {[b,a] : [a,b] ∈ A}. A set of arcs A is consistent
if A ∩ A^{-1} = ∅, and is inconsistent when A ∩ A^{-1} ≠ ∅.
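
Implication classes can be computed directly from the definition of Γ by merging arcs that force one another. The sketch below is a sequential illustration under stated assumptions: the graph is simple, edges are given as vertex pairs, and a union-find structure stands in for the connected-components computation. The names `implication_classes` and `is_consistent` are hypothetical.

```python
def implication_classes(vertices, edges):
    """Return the implication classes of an undirected graph as sets of arcs."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    arcs = [(a, b) for a in adj for b in adj[a]]
    parent = {arc: arc for arc in arcs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in arcs:
        # [a,b] Γ [a,b2] when b and b2 are not adjacent
        for b2 in adj[a]:
            if b2 != b and b2 not in adj[b]:
                parent[find((a, b))] = find((a, b2))
        # [a,b] Γ [a2,b] when a and a2 are not adjacent
        for a2 in adj[b]:
            if a2 != a and a2 not in adj[a]:
                parent[find((a, b))] = find((a2, b))

    classes = {}
    for arc in arcs:
        classes.setdefault(find(arc), set()).add(arc)
    return list(classes.values())

def is_consistent(cls):
    """A class is consistent when it contains no arc together with its reversal."""
    return all((b, a) not in cls for a, b in cls)

path = implication_classes("abc", [("a", "b"), ("b", "c")])
cycle5 = implication_classes(range(5), [(i, (i + 1) % 5) for i in range(5)])
print(len(path), len(cycle5))
```

A three-vertex path yields two consistent classes that are reversals of each other, while a five-cycle collapses into a single inconsistent class, so it has no transitive orientation.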

Implication classes and Γ-decompositions have been
studied by M.C. Golumbic. Many of the lemmas in this 
section also appear in [Gol77] or [Gol80]. 


Lemma 4.1: If A ≠ B are implication classes of G
then either A = B^{-1} or A* ∩ B* = ∅.


Proof: Assume that (a,b) ∈ A* ∩ B*. Without loss of
generality, let [a,b] ∈ A. If [a,b] ∈ B then B = A since
implication classes are equivalence classes. Therefore
[b,a] ∈ B, and [b,a] ∉ A. By definition, if [a,b] Γ [a',b']
then [b,a] Γ [b',a']. Thus some [c,d] Γ* [a,b] if and only if
[d,c] Γ* [b,a], so A = B^{-1}. □


[Figure 4 graphics omitted: example graphs annotated with
chains of Γ relations, including [b,a] Γ [b,c].]


Figure 4: Graphs and Implication Classes


Given an undirected graph G_1 = (V,E) pick any
implication class B_1, and delete it, forming G_2 = (V, E −
B_1*). Next form G_3 by removing some implication class
B_2 from G_2. Continue the process until removing B_k
from G_k results in a graph with no edges. The sequence
of implication classes removed, B_1, B_2, ..., B_k, is called
a Γ-decomposition of G. The following theorem points
out the usefulness of Γ-decompositions.


Theorem 5: (TRO Theorem [Gol80]) Let B_1, B_2, ..., B_k
be a Γ-decomposition of an undirected graph G. The
following statements are equivalent:


i. G is a comparability graph.
ii. Every implication class of G is consistent.
iii. Each B_i in the Γ-decomposition is consistent.


Furthermore, when these conditions hold the union of
the B_i is a transitive orientation of G.


Proof: The proof of this theorem requires several
technical lemmas, and thus is beyond the scope of
this paper. The interested reader is referred to
[Gol77,Gol80]. □


The TRO theorem suggests a sequential algorithm 
for transitively orienting comparability graphs. One 
can take any edge, orient it arbitrarily, find the 
associated implication class, add the implication class 
to the transitive orientation and remove it from the 
comparability graph. Repeating this procedure yields 
a Γ-decomposition of the comparability graph and
therefore a transitive orientation. This is essentially the 
algorithm in [PLE71]. 
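
The sequential procedure suggested by the TRO theorem can be sketched as follows. This is a minimal illustration rather than the algorithm of [PLE71] itself: it assumes a simple graph given as vertex pairs, grows one implication class at a time by following Γ in the graph that remains, and reports failure by returning `None`. The function name `transitive_orientation` is hypothetical.

```python
def transitive_orientation(vertices, edges):
    """Return a set of arcs orienting the graph transitively, or None."""
    remaining = {frozenset(e) for e in edges}
    orientation = set()
    while remaining:
        adj = {v: set() for v in vertices}
        for e in remaining:
            a, b = tuple(e)
            adj[a].add(b)
            adj[b].add(a)
        seed = tuple(next(iter(remaining)))   # arbitrary edge, arbitrary direction
        cls, stack = {seed}, [seed]
        while stack:                          # close the class under Γ
            a, b = stack.pop()
            forced = [(a, b2) for b2 in adj[a] if b2 != b and b2 not in adj[b]]
            forced += [(a2, b) for a2 in adj[b] if a2 != a and a2 not in adj[a]]
            for arc in forced:
                if arc not in cls:
                    cls.add(arc)
                    stack.append(arc)
        if any((b, a) in cls for a, b in cls):
            return None                       # inconsistent class: not a comparability graph
        orientation |= cls
        remaining -= {frozenset(arc) for arc in cls}
    return orientation

square = transitive_orientation(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)])
pentagon = transitive_orientation(range(5), [(i, (i + 1) % 5) for i in range(5)])
print(square, pentagon)
```

Each pass removes one implication class of the current graph, so the classes removed form a Γ-decomposition in the sense defined above.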

In order to parallelize this algorithm it is necessary
to understand how implication classes change during a
Γ-decomposition. We will see below that the changes
are very simple: implication classes are either merged 
with other implication classes or remain unchanged. 


Lemma 4.2: Let A and B (A* ≠ B*) be implication
classes of G = (V,E). Then all of the arcs in A are in
the same implication class of G' = (V, E − B*).


Corollary 4.2.1: Let B be an implication class of G =
(V, E). Every implication class of G' = (V, E − B*) is
the union of implication classes of G.


Proof: A Γ relationship between two arcs is lost
only if one of the arcs is deleted or a triangular
edge added. Removing B* certainly doesn't add any
triangular edges, and by Lemma 4.1 A* ∩ B* = ∅, so all
edges of A* remain in G'. □


Three edges, (a,b), (a,c), (b,c) in an undirected
graph G form a tricolored triangle if G has three
implication classes, A, B, and C such that A* ≠ B* ≠
C* ≠ A* and (a,b) ∈ A*, (b,c) ∈ B*, and (a,c) ∈ C*.
We say that A and B are triangle related, written A Δ B,
if either A = B^{-1} or there is a tricolored triangle in G
with one edge in A* and another edge in B*.


Lemma 4.3: Let G = (V,E) be an undirected graph
with implication classes A and B, A* ≠ B*. A is not an
implication class of G' = (V, E − B*) ⟺ A Δ B.


Proof: (⇐) Since A Δ B, G contains a tricolored
triangle (b,c) ∈ A*, (a,c) ∈ B*, (a,b) ∈ C*. Without
loss of generality, let [b,c] ∈ A and C ≠ A be the
equivalence class containing [b,a]. The edge (a,c) is
not in E − B*, so [b,c] Γ [b,a] in G'. Since A is not closed
under the Γ relation, it cannot be an implication class
of G'.

(⇒) By Lemma 4.2 A is a proper subset of some
implication class for G'. Therefore in G' an arc [b,c] ∈
A is Γ related to an arc [b,a] ∉ A (the case where
[b,c] Γ [a,c] is analogous). Let C be the implication class
of G containing [b,a] (A ≠ C ≠ B). Since [b,c] Γ [b,a]
in G' but not in G, the triangular edge (a,c) must be
in B*. Thus (b,c) ∈ A*, (a,b) ∈ C*, (a,c) ∈ B* form a
tricolored triangle in G and A Δ B. □


Lemma 4.4: Let the implication classes B_1, ..., B_k of
G = (V, E) be an independent set under the Δ relation.
Then in G' = (V, E − B_1*), {B_2, ..., B_k} is an independent
set (under Δ) of implication classes.


Corollary 4.4.1: If implication classes B_1, ..., B_k of G
form an independent set under the Δ relation, then
B_1, ..., B_k are the first k implication classes in a
Γ-decomposition of G.


Proof: By Lemma 4.3, B_2, ..., B_k are all implication
classes of G'. Assume to the contrary that G' contains
a tricolored triangle (a,b) ∈ B_i*, (a,c) ∈ B_j*, (b,c) ∈ A*.
The arc [b,c] belongs to some implication class C of
G. By Lemma 4.2, B_i* ≠ C* ≠ B_j*. Therefore,
(a,b), (a,c), (b,c) is a tricolored triangle in G and B_i Δ B_j,
a contradiction. □


Lemma 4.5: Let B_1, B_2, ..., B_k be a maximal independent
set under the Δ relation for some graph G_i =
(V, E). Every implication class of G_{i+1} = (V, E − B_1* −
B_2* − ... − B_k*) is the union of at least two implication
classes of G_i.


Corollary 4.5.1: The number of implication classes
for G_i is at least twice the number of implication classes
for G_{i+1}.


Proof: Every other implication class of G_i is Δ
related to some B_j. Therefore, by Lemma 4.3 every
implication class of G_i is merged with at least one other
implication class of G_i during the formation of G_{i+1}.
Since implication classes are never split (Lemma 4.2),
every implication class of G_{i+1} must be the union of at
least two implication classes of G_i. □


The input to our algorithm is an undirected graph
G_1 = (V,E). Our algorithm's output is either \vec{G}, a
transitive orientation of G_1, or an indication that G_1
has no transitive orientation. The algorithm proceeds
in several iterations. Graph \vec{G}_1 is initialized to (V, ∅).
During iteration i the algorithm determines a maximal
independent set of implication classes for G_i. These
implication classes are deleted from G_i forming G_{i+1},
and added to \vec{G}_i forming \vec{G}_{i+1}. From Lemma 4.5 the
number of implication classes in G_i is at least halved each
iteration. Therefore, after log n iterations G_i will contain
no edges and \vec{G}_i will be a transitive orientation of the
original graph.
Each iteration consists of the following four steps: 
1. Determine the implication classes of G;. This can 
be done using standard parallel techniques such 
as solving two-SAT formulae or finding connected 
components [SV82]. 


2. Determine the A relation on implication classes. 


3. Use a maximal independent set subroutine, such 
as [KW84,Lub85], to obtain a maximal indepen- 
dent set, M of implication classes. 


4. In parallel, for each of the A_j in M, delete A_j*
from G_i, and add A_j to \vec{G}_i.


Step 3 is the most expensive of these steps, requiring 
O(log? n) time and n‘ processors. The logn iterations 
can therefore be done in O(log® n) time on n* processors. 
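
Steps 2 and 3 can be illustrated sequentially. The sketch below assumes the implication classes are already available as a list of arc sets covering both orientations of every edge, and that vertices are comparable (for example integers). It marks two classes as Δ related when one is the reversal of the other or when they share a tricolored triangle, then uses a simple greedy scan as a stand-in for the parallel maximal independent set routines of [KW84,Lub85]. The names `delta_relation` and `greedy_mis` are hypothetical.

```python
from itertools import combinations

def delta_relation(classes):
    """Return the set of index pairs (i, j), i < j, of Δ related classes."""
    owner = {arc: i for i, cls in enumerate(classes) for arc in cls}
    adj, color = {}, {}
    for (a, b) in owner:
        adj.setdefault(a, set()).add(b)
        # An undirected edge is "colored" by the classes of its two orientations.
        color[frozenset((a, b))] = frozenset((owner[(a, b)], owner[(b, a)]))
    related = set()

    def relate(x, y):
        for i in x:
            for j in y:
                if i != j:
                    related.add((min(i, j), max(i, j)))

    for ids in color.values():
        relate(ids, ids)                      # a class and its reversal
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            x, y, z = (color[frozenset(e)] for e in ((a, b), (b, c), (a, c)))
            if x != y and y != z and x != z:  # tricolored triangle
                relate(x, y)
                relate(y, z)
                relate(x, z)
    return related

def greedy_mis(count, related):
    """Sequential stand-in for a parallel maximal independent set subroutine."""
    chosen = []
    for i in range(count):
        if all((min(i, j), max(i, j)) not in related for j in chosen):
            chosen.append(i)
    return chosen

k3 = [{(0, 1)}, {(1, 0)}, {(0, 2)}, {(2, 0)}, {(1, 2)}, {(2, 1)}]
rel = delta_relation(k3)
print(len(rel), greedy_mis(len(k3), rel))
```

In a triangle every arc is its own implication class and the single triangle is tricolored, so all six classes are pairwise related and only one class can be removed in the first iteration.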


5 Maximum Matching 


The two processor scheduling and transitive orientation
algorithms can be used to find maximum matchings on
co-comparability graphs. To find a maximum matching
on the co-comparability graph G = (V, E), first create
the comparability graph \bar{G} = (V, {(a,b) : (a,b) ∉ E}).
Applying the transitive orientation routine converts \bar{G}
into a precedence graph. An optimal two processor
schedule can be found for the precedence graph using
our scheduling algorithm. We will see below that the
pairs of tasks scheduled together form a maximum
matching of G.

Let S be any optimal two processor schedule for
G. A task-pair of S is a pair of tasks mapped to the
same timestep by S. Since there are no precedence
relationships between tasks in a task-pair and each task
is mapped to a single timestep, the set of task-pairs
of S forms a matching in G. Because S is an optimal
schedule, no schedule has more task-pairs.


A task is available at some point in a schedule if 
it can be executed without violating the precedence 
constraints. 
Lemma 5.1: If co-comparability graph G has a perfect
matching then G has a schedule where every task is in
a task-pair.


Proof: We say a pair of tasks is mated if the pair is
in the perfect matching. Construct the schedule (and
modify the "mated" relationship) iteratively as follows:


If two mated tasks are both available,
schedule one such mated pair. Otherwise
find two mated pairs, (t,t') and (s,s'), such
that t and s are available and there is no
precedence relationship between t' and s'.
Schedule t with s and mate t' with s'.


Note that there are never precedence constraints be- 
tween a pair of mated tasks. This method clearly takes 
two tasks each timestep and does not violate the prece- 
dence constraints. What we must show is that it always 
constructs a schedule for G. 

Assume to the contrary that at some point it does
not find a pair of tasks to schedule. Let U be the set
of available tasks and U' be the set of tasks which are
mated to tasks in U. Since the method fails, U ∩ U' = ∅
and there is a precedence relationship between every
pair of tasks in U' (i.e., U' is totally ordered). Let
t' be the task in U' which precedes all other tasks in
U'. Since t' ∉ U, there must be some t ∈ U such that
t ≺ t'. However, by the transitivity of precedence, t also
precedes its mate, a contradiction. □
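
The construction in the proof of Lemma 5.1 can be written out as a short sequential routine. The sketch below assumes the precedence relation is given as a transitively closed set of pairs and the perfect matching as a dictionary mapping each task to its mate; the name `schedule_from_perfect_matching` is hypothetical.

```python
def schedule_from_perfect_matching(tasks, prec, mate):
    """Build a schedule in which every timestep holds a pair of tasks.

    prec: set of (a, b) meaning a precedes b, transitively closed.
    mate: dict giving each task's partner; partners must be unrelated.
    """
    mate = dict(mate)                 # the mated relationship is modified
    done, schedule = set(), []

    def unrelated(a, b):
        return (a, b) not in prec and (b, a) not in prec

    while len(done) < len(tasks):
        avail = [t for t in tasks if t not in done
                 and all(p in done for p, q in prec if q == t)]
        # Case 1: some mated pair is entirely available.
        pair = next(((t, mate[t]) for t in avail if mate[t] in avail), None)
        if pair is None:
            # Case 2: pair two available tasks and re-mate their partners.
            t, s = next((t, s) for t in avail for s in avail
                        if t != s and unrelated(mate[t], mate[s]))
            t2, s2 = mate[t], mate[s]
            mate[t2], mate[s2] = s2, t2
            pair = (t, s)
        schedule.append(pair)
        done.update(pair)
    return schedule

# Two chains 0 -> 1 and 2 -> 3, with perfect matching {0,3}, {1,2}.
result = schedule_from_perfect_matching(
    [0, 1, 2, 3], {(0, 1), (2, 3)}, {0: 3, 3: 0, 1: 2, 2: 1})
print(result)
```

In this example neither mated pair is available at the start, so tasks 0 and 2 are scheduled together and their former mates, 3 and 1, are mated with each other and scheduled in the next timestep.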


Lemma 5.2: Let G = (V, ≺) be a precedence graph
and S a two processor schedule for G' = (V − {t}, ≺).
A single timestep containing t can be inserted into S
yielding a schedule for G.


Proof: Let t' be the last predecessor of t in S. Insert
task t immediately after the timestep containing t'.
Obviously there are no precedence conflicts between t
and its predecessors. Since S is a valid schedule, there
are no precedence conflicts between tasks in V − {t}.
Therefore any precedence constraint which is violated is
of the form t ≺ t''. By transitivity t' also precedes t'',
so t'' comes strictly after t' in S. Since t is inserted
immediately after t', task t appears before t'' in the
modified schedule. □


Let M be the tasks in a maximum matching on
G. The above lemmas suggest a way to obtain a
schedule, S, for G = (V, ≺) where the paired tasks of
S are precisely the tasks in M. Start by finding an
optimal schedule, S', for the subgraph of G induced by
M and add the tasks in V − M one at a time. One
NC implementation of this algorithm involves bucket 
sorting the tasks in V — M based on which task-pair of 
S' they follow. By topologically sorting the tasks within 
each bucket we can quickly determine where each task 
should be inserted. 


Theorem 6: The task-pairs of any optimal schedule 
for G form a maximum matching on G. 


Proof: Let M be the tasks in some maximum matching 
of G. Let S be an optimal schedule for the subgraph 
of G induced by M. By Lemma 5.1, the task-pairs 
of S form a maximum matching on G. By Lemma 
5.2 we can insert the other tasks of G one at a time 
without disturbing the task-pairs. Therefore, the task- 
pairs of the resulting schedule for G form a maximum 
matching on G. Since every optimal schedule has the
same number of task-pairs and the task-pairs of every
schedule form a matching, the task-pairs of any optimal
schedule for G form a maximum matching on G. □


If G is not transitively orientable it may still be
possible to find a maximum matching in G = (V, E).
Assume we are given a set U, consisting of O(log n)
edges, such that G ∪ U is transitively orientable. The
following method finds a maximum matching in G.

For each S' ∈ 2^U such that S' is a matching find
(in parallel) a maximum matching in G' = (V − {v :
(v,v') ∈ S'}, E − S'). A maximum matching for G occurs
whenever the cardinality of the maximum matching for
G' plus |S'| is maximal.

A graph G is a k-nearly comparability graph when:

- G has at most k log n inconsistent implication
classes and

- each inconsistent implication class of G is split into
consistent implication classes by the addition of
at most k edges.


A k-nearly co-comparability graph is the complement of
a k-nearly comparability graph.

Let G be a k-nearly co-comparability graph (for 
some constant k). The following is an outline of an 
NC algorithm for finding a maximum matching in G. 
In parallel examine each set, T, of k edges not in G. 
Determine which inconsistent implication classes are 
split when T is added to G. For each inconsistent 
implication class A, pick any set of k edges which splits 
A into consistent implication classes. At most k^2 log n
edges are picked, so the above method can be used to 
find a maximum matching for G. 


6 Conclusions


Although the algebraic approach was used to obtain 
the first parallel matching algorithms [KUW85b,MVV],
these are randomized algorithms. It is interesting that 
we can obtain deterministic matching algorithms for 
wide classes of graphs using a purely combinatorial 
approach. Perhaps the combinatorial approach will 
yield deterministic algorithms for matching on other 
classes of graphs as well. 
It was surprising how much more difficult computing
the actual schedule was than simply computing its
length. In higher complexity classes such as P and NP 
it is often easy to go from the decision problem to com- 
puting an actual solution, because of self reducibility. 
However this does not necessarily seem to be the case 
for parallel complexity classes. To support this observa- 
tion we note that the random NC algorithm for finding 
the cardinality of a maximum matching is much sim- 
pler than the random NC algorithm for determining an 
actual maximum matching [KUW85a]. 

There are several open problems related to scheduling.
We are attempting to extend our two processor
result to the case when the tasks have nonuniform start
times and/or deadlines. When the precedence constraints
are restricted to in-trees or out-trees there are
parallel algorithms for generating schedules on an
arbitrary number of processors [DUW84,HM86a]. It is an
open problem whether interval-ordered tasks [PY79] can
be scheduled in parallel.

One variant of the two processor problem that
we know to be NP-complete (by reduction from the
clique problem) allows incompatibility edges as well
as precedence constraints. When there is an
incompatibility constraint between two tasks they can
be executed in either order, but not concurrently.
Incompatibility constraints arise naturally when two
or more tasks need the same resource, such as special
purpose hardware or a database file.


References


[CG72] E.G. Coffman, Jr. and R.L. Graham. Optimal
scheduling for two processor systems. Acta
Informatica, 1:200-213, 1972.

[DUW84] D. Dolev, E. Upfal, and M. Warmuth.
Scheduling trees in parallel. In Proc. International
Workshop on Parallel Computing and VLSI,
pages 1-30, Amalfi, 1984.

[FKN69] M. Fujii, T. Kasami, and K. Ninomiya. Optimal
sequencing of two equivalent processors. SIAM J.
Appl. Math., 17(4):784-789, 1969.

[Gab82] H.N. Gabow. An almost-linear algorithm for
two-processor scheduling. JACM, 29(3):766-780, 1982.

[GH64] P.C. Gilmore and A.J. Hoffman. A characterization
of comparability graphs and of interval graphs.
Canad. J. Math., 16, 1964.

[Gol77] M.C. Golumbic. Comparability graphs and a new
matroid. J. Combinatorial Theory (B), 22(1):68-90, 1977.

[Gol80] M.C. Golumbic. Algorithmic Graph Theory and
Perfect Graphs. Academic Press, New York, 1980.

[GT83] H.N. Gabow and R.E. Tarjan. A linear time
algorithm for special case of disjoint set union.
In Proc. 15th STOC, 1983.

[HM86a] David Helmbold and Ernst Mayr. Fast scheduling
algorithms on parallel computers. Advances in
Computing Research, 1986. To appear.

[HM86b] David Helmbold and Ernst Mayr. Two processor
scheduling is in NC. In Proceedings of the 1986
Aegean Workshop on Computing: VLSI Algorithms and
Architectures, July 1986.

[KUW85a] R.M. Karp, E. Upfal, and A. Wigderson. Are
search and decision problems computationally
equivalent? In Proc. 17th STOC, 1985.

[KUW85b] R.M. Karp, E. Upfal, and A. Wigderson.
Constructing a perfect matching is in random NC.
In Proc. 17th STOC, 1985.

[KVV85] D. Kozen, U.V. Vazirani, and V.V. Vazirani.
NC algorithms for comparability graphs, interval
graphs, and testing for unique perfect matching. In
Fifth Conference on Foundations of Software Technology
and Theoretical Computer Science, New Delhi, 1985.

[KW84] R.M. Karp and A. Wigderson. A fast parallel
algorithm for the maximal independent set problem.
In Proc. 16th STOC, 1984.

[Lub85] M. Luby. A simple parallel algorithm for the
maximal independent set problem. In Proc. 17th
STOC, 1985.

[MVV] K. Mulmuley, U.V. Vazirani, and V.V. Vazirani.
Parallel algorithms for rank and matching. Private
communication, Nov. 22, 1985.

[PLE71] A. Pnueli, A. Lempel, and S. Even. Transitive
orientation of graphs and identification of
permutation graphs. Can. J. Math., 23(1):160-175, 1971.

[PY79] C.H. Papadimitriou and M. Yannakakis. Scheduling
interval-ordered tasks. SIAM J. Computing, 8(3), 1979.

[SV82] Y. Shiloach and U. Vishkin. An O(log n) parallel
connectivity algorithm. J. Algorithms, 3(1):57-63, 1982.

[Ul75] J.D. Ullman. NP-complete scheduling problems.
J. Comput. System Sci., 10(3):384-393, 1975.

[VV85] U.V. Vazirani and V.V. Vazirani. The two-processor
scheduling problem is in RNC. In Proc. 17th STOC, 1985.

Communication-Efficient Parallel Graph Algorithms 


Charles E. Leiserson 
Bruce M. Maggs 


Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 02139 


Abstract—Communication bandwidth is a resource ignored by 
most parallel random-access machine (PRAM) models. This pa- 
per shows that many graph problems can be solved in parallel, not 
only with polylogarithmic performance, but with efficient com- 
munication at each step of the computation. We measure the 
communication requirements of an algorithm in a model called 
the distributed random-access machine (DRAM), in which com- 
munication cost is measured in terms of the congestion of memory 
accesses across cuts of an underlying network. The algorithms are 
based on a communication-efficient variant of the tree contraction 
technique due to Miller and Reif. 


1. Introduction 


Underlying any realization of a parallel random-access machine 
(PRAM) is a communication network that conveys information 
between processors and memory banks. Yet in most PRAM 
models, communication issues are largely ignored. The basic as- 
sumption in these models is that in unit time each processor 
can simultaneously access one memory location. For truly large 
parallel computers, however, computer engineers will be hard 
pressed to implement networks with the communication band- 
width demanded by this assumption. The difficulty of building 
such networks threatens the validity of the PRAM as a predictor 
of algorithmic performance. This paper introduces a more re- 
stricted PRAM model, which we call a distributed random-access 
machine (DRAM), to reflect an assumption of limited communi- 
cation bandwidth in the underlying network. 

We measure the cost of communication in a network in terms 
of the number of messages that must cross a cut of the network, 
as in [9] and [12]. Specifically, a cut S = (A, Ā) of a network¹ is
a partition of the network into two sets of processors A and Ā.
The capacity cap(S) is the number of wires connecting processors
in A with processors in Ā, i.e., the bandwidth of communication
between A and Ā. For a set M of messages we define the load
of M on a cut S = (A, Ā) to be the number of messages in M
between a processor in A and a processor in Ā. The load factor
of M on S is

    λ(M, S) = load(M, S) / cap(S),

and the load factor of M on the entire network is

    λ(M) = max_S λ(M, S).


This research was supported in part by the Defense Advanced Research
Projects Agency under Contract N00014-80-C-0622. Charles Leiserson is

¹ We assume that the communication network is an interconnection network,
meaning that the processors are interconnected as a graph, and routing of
messages is performed by the processors. The generalization to a routing
network, where routing can be done by switches that are not processors, is
straightforward, but complicates the definitions.

The load factor provides a simple lower bound on the time required
to deliver a set of messages. For instance, if there are 10
messages to be sent across a cut of capacity 3, the time required
to deliver all 10 messages is at least the load factor 10/3.
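
The load-factor definition can be checked on the example just given. The sketch below is an illustration only: it assumes messages are (source, destination) pairs of processor identifiers and that each cut is supplied as one side of the partition together with its capacity. The names `load` and `load_factor` are hypothetical.

```python
from fractions import Fraction

def load(messages, side_a):
    """Number of messages with exactly one endpoint in side_a."""
    return sum(1 for src, dst in messages
               if (src in side_a) != (dst in side_a))

def load_factor(messages, cuts):
    """Maximum of load/capacity over the given (side_a, capacity) cuts."""
    return max(Fraction(load(messages, side_a), cap) for side_a, cap in cuts)

# Ten messages from processors 0..9 to processors 10..19,
# crossing a single cut of capacity 3.
msgs = [(i, 10 + i) for i in range(10)]
factor = load_factor(msgs, [(set(range(10)), 3)])
print(factor)
```

Using exact fractions keeps the bound 10/3 visible instead of rounding it; a real network would have many cuts, and the maximum would be taken over all of them.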

Networks in which the load-factor lower bound can be met to 
within a polylogarithmic factor as an upper bound include volume 
and area-universal networks, such as fat-trees [9,12] and meshes 
of trees [13]. The load factor can be also met to within a poly- 
logarithmic factor as an upper bound by the standard universal 
routing networks, such as the Boolean hypercube, the butterfly 
(a.k.a. FFT, Omega), and the cube-connected cycles, but the 
lower bound is weak because every cut of these networks is large 
relative to the number of processors in the smaller side of the 
cut. Networks for which the load factor lower bound cannot in 
general be approached to within a polylogarithmic factor as an 
upper bound include linear arrays, meshes, and high-diameter 
networks in general. 

Whereas communication is essentially free in PRAM models, 
the cost of communication in a DRAM depends on the locality of 
memory accesses as measured by the load factor of an underly- 
ing network. The DRAM is an attempt to abstract the essential 
communication characteristics of volume and area-universal net- 
works without relying in detail on any particular network. Much 
as the PRAM can be viewed as an abstraction of a hypercube, in 
that algorithms for a PRAM can be implemented on a hypercube 
with only polylogarithmic performance degradation, the DRAM 
can be viewed as an abstraction of a volume or area-universal 
network. Fast, communication-efficient algorithms on a DRAM 
translate directly to fast, communication-efficient algorithms on, 
for example, a fat-tree. 

This paper shows that many graph problems for a graph 
G = (V, E) can be efficiently solved with O(|E|) processors in the 
DRAM model. The algorithms we give apply to all of the popular 
PRAM models because a PRAM can be viewed as a DRAM in 
which communication costs are ignored. In fact, the algorithms 
we give can all be performed on an exclusive-read, exclusive-write 
DRAM, and when run on a PRAM, they are nearly as efficient 
in the PRAM model as corresponding concurrent-read, exclusive- 
write PRAM algorithms in the literature. 

The remainder of this paper is organized as follows. Section 2 
contains a specification of the DRAM model and the implemen- 
tation of data structures in the model. The section demonstrates 
why the “recursive doubling” technique frequently used in paral- 
lel algorithms is inefficient in the DRAM model. It also defines 
the notion of a conservative algorithm as a concrete realization 
of a communication-efficient algorithm, and gives a “Shortcut 
Lemma” that forms the basis of the conservative algorithms in 
this paper. Section 3 presents a conservative “recursive pairing” 
technique that can be used to perform many of the same functions 
as recursive doubling. Section 4 presents a linear-space, conser- 
vative “tree contraction” algorithm based on the ideas of Miller 
and Reif [17]. Section 5 presents treefix computations, which are
generalizations of the parallel prefix computation [3,7,18] to trees. 
We show that treefix computations can be performed using the 
tree contraction algorithm of Section 4. Section 6 gives short, 


efficient parallel algorithms for tree and graph problems, most 
of which are based on treefix computations. Section 7 contains 
some concluding remarks. 


2. The DRAM model 


This section presents the DRAM model. We show how a data 
structure can be embedded in a DRAM and define the load factor 
of a data structure. We demonstrate that many existing PRAM 
algorithms are not communication efficient in the DRAM model 
by examining the “recursive doubling” technique [21] used exten- 
sively by algorithms in the literature. We introduce the notion 
of a conservative algorithm as one in which the load factor of 
each set of memory accesses can be bounded above by the load 
factor of the input data structure. Our conservative algorithms 
are based on a simple lemma that shows how pointers in a data 
structure can be “shortcut” without increasing the load factor. 

A DRAM consists of a set of n processors. All memory in the 
DRAM is local to the processors, with each processor holding 
a small number of O(lgn)-bit registers. A processor can read, 
write, and perform arithmetic and logical functions on values 
stored in its local memory. It can also read and write memory in 
other processors. (A processor can transfer information between 
two remote memory locations through the use of local tempo- 
raries.) Each set of memory accesses is performed in a memory 
access step, and any of the standard PRAM assumptions about 
simultaneous reads or writes can be made. Our algorithms use 
only mutually exclusive memory references, however, so these 
special cases never arise. 

The essential difference between a DRAM and a PRAM is that 
the DRAM models communication costs. We presume remote 
memory accesses are implemented by routing messages through 
an underlying network. Each cut S = (A, Ā) of the processors
has an assigned capacity cap(S). For a set M of memory accesses,
we define load(M, S) to be the number of accesses in M between
a processor in A and a processor in Ā. The load factor of M on S
is λ(M, S) = load(M, S)/cap(S), and the load factor of M on the
entire network is λ(M) = max_S λ(M, S). The basic assumption
in the DRAM model is that the time required to perform a set M
of memory accesses is λ(M).

Because the load factor of a set of memory accesses is the time 
required to perform them, the embedding of a data structure 
within a DRAM affects the time required to perform operations 
on the data structure. A natural way to embed a data structure 
in a DRAM is to put one record of the data structure into each 
processor. The record can contain pointers to records in other 
processors, as well as auxiliary local storage. As an example of
how the embedding of a data structure influences communica- 
tion costs, consider an embedding of a list in which alternate list 
elements are placed on opposite sides of a narrow cut. If each el- 
ement fetches a value from the next element in the list, the load 
factor across the cut is large. Thus, a set of memory accesses 
that theoretically takes unit time in the PRAM model can re- 
quire considerably more time due to network congestion. On the 
other hand, there may be a better embedding for the list in which 
the number of list pointers crossing any cut is small compared to 
the capacity of the cut. 

We can measure the quality of an embedding by generalizing
the concept of load factor to a set of pointers. The load of a
set P of pointers across a cut S = (A, Ā), denoted load(P, S),
is the number of pointers in P from a processor in A to a
processor in Ā or vice versa, and the load factor is λ(P) =
max_S load(P, S)/cap(S). The load factor of a data structure is
the load factor of the set of its pointers.

Figure 1: A cut of capacity 3 separating two halves of a linked list. The
load of the list on the cut is 1. At the final step of recursive doubling, each 
element on the left side of the cut accesses an element on the right, which 
induces a load of 8 on the cut. 


For many problems, good embeddings of data structures can 
be found (see Section 7). Even if a good embedding of a data 
structure is found, however, there is no guarantee that stan- 
dard PRAM algorithms will behave efficiently on the data struc- 
ture. To illustrate this point, consider the “recursive doubling” 
or “pointer jumping” technique, which is used extensively by algorithms
in the literature. The idea is that each element i of a list
initially has a pointer p(i) to the next element in the list. At each
step, element i computes p(i) ← p(p(i)), doubling the distance
d(i) between i and the element it points to (until it points to the
end of the list). This technique can be used, among other things,
to compute the distance of each element to the end of the list. For
each element i, d(i) is initially one. At each pointer jumping step,
element i computes d(i) ← d(i) + d(p(i)). In a PRAM model, the
running time on a linked list of length n is O(lg n). Variants of
this technique are used for path compression [17,19,21], vertex 
numbering [20,21], and parallel prefix computation [21]. 
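
The pointer jumping step can be simulated sequentially as a sketch. The conventions are assumptions of this illustration: the list is an array nxt of successor indices, the last element has a nil pointer (None) and distance 0, and each round reads the old arrays before writing the new ones to mimic a synchronous parallel step. The function name distances_to_end is hypothetical.

```python
def distances_to_end(nxt):
    """Pointer jumping on a list given as successor indices.

    nxt[i] is the index of the element after i, or None for the last
    element. Returns (d, rounds), where d[i] is the number of links
    from i to the last element.
    """
    n = len(nxt)
    p = list(nxt)
    d = [0 if p[i] is None else 1 for i in range(n)]
    rounds = 0
    while any(q is not None for q in p):
        # Synchronous step: every element reads the old p and d.
        new_p = [None if p[i] is None else p[p[i]] for i in range(n)]
        new_d = [d[i] if p[i] is None else d[i] + d[p[i]] for i in range(n)]
        p, d = new_p, new_d
        rounds += 1
    return d, rounds

chain = [i + 1 for i in range(15)] + [None]   # 0 -> 1 -> ... -> 15
d, rounds = distances_to_end(chain)
```

For the 16-element chain the loop runs four rounds, which matches the O(lg n) bound quoted for the PRAM model.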

In the DRAM model, however, recursive doubling can be ex- 
pensive even if a data structure has a good embedding. Figure 1 
shows a cut of capacity 3 separating the two halves of a linked 
list of 16 elements. In the first step of recursive doubling, the 
load on the cut is only 1 because the only access across the cut 
occurs when element 8 copies the pointer of element 9. In the 
second step the load is 2 because element 7 accesses element 9 
and element 8 accesses element 10. In the third step, the load 
is 4, and in the fourth step, the load is 8, as each of the first 
eight elements makes an access across the cut. Since the load 
factor of the cut in the fourth step is 8/3, this set of accesses 
will require at least 3 time units. Whereas the capacity of the
cut was large enough to support the memory accesses across it
in the first step, by the fourth step it was insufficient. The only
way to guarantee that recursive doubling runs fast is to ensure
that every cut of the network is sufficiently large to accommodate
worst-case communication patterns. In the next section, we shall
show how a recursive pairing strategy can perform many of the
same functions as recursive doubling in a communication-efficient
fashion.
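
The loads quoted for the 16-element list can be reproduced with a short simulation. This is only a sketch: elements are numbered from 0 rather than 1, the cut is assumed to separate elements 0..7 from elements 8..15, and an access in a step is modeled as element i reading from the element it currently points to. The name cut_loads is hypothetical.

```python
from math import ceil

def cut_loads(n, cut_after):
    """Load across one cut in each step of recursive doubling.

    Elements 0..cut_after-1 lie on the left of the cut and the rest on
    the right. In each step element i accesses the element p[i].
    """
    p = [i + 1 if i + 1 < n else None for i in range(n)]
    loads = []
    while any(q is not None for q in p):
        crossing = sum(
            1 for i, q in enumerate(p)
            if q is not None and (i < cut_after) != (q < cut_after)
        )
        loads.append(crossing)
        p = [None if q is None else p[q] for q in p]
    return loads

loads = cut_loads(16, 8)
# Lower bound on the time of each step for a cut of capacity 3.
times = [ceil(l / 3) for l in loads]
```

The load doubles in every step even though the list itself crosses the cut only once.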

All of our algorithms have the property that the load factor of 
memory accesses in any step is bounded by the load factor of the 
input data structure. We define a set M of memory accesses to be 
conservative with respect to another set M’ of memory accesses 
if λ(M) ≤ λ(M′), and we make the natural generalization of
this definition to pointers and data structures. A conservative 
algorithm is one all of whose memory accesses are conservative 
with respect to the input data structure. 

An algorithm that communicates only across pointers in the 
input data structure is conservative, but may require time linear 
in the diameter of the data structure to pass information between 
two elements. The following simple, but important, lemma shows 
how to shortcut pointers in the input data structure without increasing
communication requirements.

Figure 2: The Shortcut Lemma. In each of the four cases illustrated, the
load factor across the cut is either unchanged or diminished by replacing
a → b and b → c with a → c.


Lemma 1 (Shortcut Lemma) Let P be the set of pointers in
a data structure containing pointers a → b and b → c. Then the
set P′ of pointers defined by


P′ = P ∪ {a → c} − {a → b, b → c}
is conservative with respect to P.


Proof: We shall show that λ(P′, S) ≤ λ(P, S) for any cut S
of the underlying network, which implies that λ(P′) ≤ λ(P).
Consider the eight ways in which a, b, and c can be assigned to
sides of the partition induced by a cut S. Half the cases can be
eliminated by symmetry if we assume that a is on the left side.
In each of the four remaining cases, the load factor across the
cut is either unchanged or diminished when a → b and b → c are
replaced with a → c, as is shown in Figure 2. ∎


We shall typically use a straightforward generalization of the 
Shortcut Lemma. Specifically, we can shortcut any set of pointer- 
disjoint paths in a data structure without increasing the load 
factor. 
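
The case analysis behind the Shortcut Lemma is small enough to enumerate mechanically. The sketch below checks all eight placements of a, b, and c on the two sides of a cut (labelled "L" and "R" here purely for illustration) and confirms that the single pointer a → c never crosses the cut more often than the pair a → b, b → c. The function names are hypothetical.

```python
from itertools import product

def crossings(pointers, side):
    """Number of pointers whose endpoints lie on different sides."""
    return sum(side[u] != side[v] for u, v in pointers)

def shortcut_never_increases_load():
    """Check every assignment of a, b, c to the two sides of a cut."""
    for sa, sb, sc in product("LR", repeat=3):
        side = {"a": sa, "b": sb, "c": sc}
        before = crossings([("a", "b"), ("b", "c")], side)
        after = crossings([("a", "c")], side)
        if after > before:
            return False
    return True
```

Because the capacity of the cut is unchanged by the replacement, a load that does not increase implies a load factor that does not increase.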

Because the algorithms presented in this paper are based on 
the Shortcut Lemma, they are not only conservative, but they 
are also independent of the cut capacities of the DRAM and of 
the embedding of an input data structure in the DRAM. Thus, 
independent of the underlying network, the algorithms are cor- 
rect, and if the embedding of the input data structure is good, 
the algorithms run fast. Moreover, for a specific embedding on a 
specific DRAM, the running time can be analyzed precisely. 


3. List contraction 


In this section we show that a conservative “recursive pairing” 
algorithm, Algorithm LC, can perform many of the same functions
on lists as recursive doubling. The idea is to construct an O(lg n)-
height binary contraction tree whose leaves are the elements in
the list. After building the contraction tree, operations such as 
broadcasting from the root or parallel prefix can be performed in 
a conservative fashion. 

Algorithm LC is a randomized algorithm, and with high prob- 
ability, the height of the contraction tree and the number of steps 
on a DRAM are both O(lg n). A deterministic variant based on
deterministic coin tossing [5] runs in O(lg n lg* m) steps, where
m is the number of processors in the DRAM, and produces trees
of height O(lg n).

Algorithm LC requires a constant amount of extra space for 
each element in the input list. Each processor contains two elements:
an element in the list, and a spare element that will act
as an internal node in the contraction tree. We call the two elements
in the same processor mates. Each element holds a pointer
to an unused internal node, which for each list element initially 
points to its mate. The use of spare nodes allows the algorithm to 
distribute the space for the internal nodes of the contraction tree 
uniformly over the elements in the list. Spare internal nodes are 
used in [1] and [14] for similar reasons, but in a different context. 

Algorithm LC now proceeds as follows.² (For a more complete
description, see [16].) In each iteration, each element in the list
randomly picks either its left or right neighbor. If two elements 
pick each other, they merge, and the left element takes control. 
A new internal node of the contraction tree is made using the 
spare of the left element. The spare for the new node is the spare 
of the right element. The new node’s left child is the left element, 
and its right child is the right element. The new nodes and the 
unpaired nodes then form themselves into a “contracted” list in 
the terminology of Miller and Reif [17]. 
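
The pairing rule can be simulated sequentially as a rough sketch. Several simplifications are assumptions of this illustration and not part of the algorithm as described: the list is not circular, spare elements are not modeled (a fresh Node object stands in for the spare of the left element), and the random choices come from a seeded generator. The names Node, contract_list, leaves, and height are hypothetical.

```python
import random

class Node:
    """A leaf (label set) or an internal node of the contraction tree."""
    def __init__(self, left=None, right=None, label=None):
        self.left, self.right, self.label = left, right, label

def contract_list(labels, rng):
    """Return (root, iterations) of a contraction tree over labels."""
    active = [Node(label=x) for x in labels]
    iterations = 0
    while len(active) > 1:
        iterations += 1
        # Each active element picks its left or right neighbor.
        picks = [rng.choice("LR") for _ in active]
        merged, i = [], 0
        while i < len(active):
            # Elements i and i+1 merge only if they picked each other.
            if i + 1 < len(active) and picks[i] == "R" and picks[i + 1] == "L":
                merged.append(Node(left=active[i], right=active[i + 1]))
                i += 2
            else:
                merged.append(active[i])
                i += 1
        active = merged
    return active[0], iterations

def leaves(node):
    if node.left is None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)

def height(node):
    if node.left is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

labels = list(range(64))
root, iterations = contract_list(labels, random.Random(0))
```

The leaves of the resulting tree appear in list order, and the tree can gain at most one level per iteration.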

To describe the efficiency of randomized algorithms such as 
Algorithm LC, we shall use the term “with high probability,” 
by which we shall mean “with probability 1 − O(1/n^k) for any
constant k > 0,” where n is the size of the input. 


Theorem 2 With high probability, Algorithm LC takes O(lgn) 
steps to construct a contraction tree for a circular list of n 
elements. ∎


Theorem 3 With high probability, the height of the contraction
tree is O(lg n).


Proof: The height of the contraction tree is not greater than the 
number of iterations of Algorithm LC. ∎

We can now prove that Algorithm LC is conservative.


Theorem 4 Algorithm LC is conservative.


Proof: The key idea is that the order of the list elements and 
their spares is preserved by the operation of contraction. By 
convention, let the mate of an element in the input list lie in the 
order between that element and its right neighbor. Then in each 
iteration, an active element’s spare lies between the element and 
its right neighbor in the contracted list. Thus, both the pointers 
of the contracted list and the pointers between active elements 
and their spares correspond to disjoint paths in the input list. 
The memory accesses in a step of the algorithm correspond to 
a set of pointers between active elements and their left or right 
neighbors in the contracted list, or to a set of pointers between 
active elements and their spares. ∎

Once a contraction tree has been constructed, it can be used to 
broadcast a value to all of the elements of the list, to accumulate 
values stored in each element of the list, and more generally, 
for performing prefix computations. Let D be a domain with
a binary associative operation ⊗ and an identity element ε. A
prefix computation [3,7,18] on a list with elements x_1, x_2, ..., x_n
in D puts the value y_i in element i for i = 1, 2, ..., n, where
y_i = x_1 ⊗ x_2 ⊗ ... ⊗ x_i.

A prefix computation on a list can be performed by a conser- 
vative, two-phase algorithm on the contraction tree. The leaves 
of the contraction tree from left to right are the elements in the 
list from x_1 to x_n. The first phase proceeds bottom up on the
tree. Each leaf passes its x value to its parent. As the algorithm
proceeds, each internal node receives values from its left and right
children, call them x_l and x_r. The node saves value x_l, and passes
x_l ⊗ x_r to its parent. The second phase proceeds top down after
the root receives values from its children. The root then passes
ε to its left child and its x_l value to its right child. Each child
receives a value from its parent, call it x_p, and passes that value
to its left child and x_p ⊗ x_l to its right child. Each leaf receives
the correct y value.


² A similar algorithm works for circular lists.
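
The two-phase prefix computation can be sketched sequentially. The assumptions are labeled in the code: a balanced binary tree stands in for the contraction tree (any binary tree over the leaves in list order would do), and each leaf combines the value it receives from above with its own x value so that y_i includes x_i. The name prefix_on_tree is hypothetical.

```python
def prefix_on_tree(xs, op, identity):
    """Two-phase prefix over a binary tree whose leaves are xs."""
    def build(lo, hi):
        # Assumption: a balanced split stands in for the contraction tree.
        if hi - lo == 1:
            return {"leaf": lo}
        mid = (lo + hi) // 2
        return {"left": build(lo, mid), "right": build(mid, hi)}

    def up(node):
        # Phase 1: save the left value and pass the combination upward.
        if "leaf" in node:
            return xs[node["leaf"]]
        x_l = up(node["left"])
        x_r = up(node["right"])
        node["saved"] = x_l
        return op(x_l, x_r)

    ys = [None] * len(xs)

    def down(node, x_p):
        # Phase 2: the left child gets x_p, the right child gets x_p (+) saved.
        if "leaf" in node:
            # Assumption: the leaf combines the received value with its own.
            ys[node["leaf"]] = op(x_p, xs[node["leaf"]])
            return
        down(node["left"], x_p)
        down(node["right"], op(x_p, node["saved"]))

    root = build(0, len(xs))
    up(root)
    down(root, identity)
    return ys

sums = prefix_on_tree([1, 2, 3, 4], lambda a, b: a + b, 0)
words = prefix_on_tree(list("abcde"), lambda a, b: a + b, "")
```

String concatenation is used in the second call because it is associative but not commutative, so it exercises the order of the operands.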

The algorithm performs the prefix computation in O(lgn) 
steps. At each step, the algorithm communicates across a set 
of pointers in the contraction tree, all of which are the same dis- 
tance from the leaves in the first phase and from the root in the 
second. That this computation is performed in a conservative 
fashion is a consequence of the following theorem.


Theorem 5 Let CT be a contraction tree computed by Algorithm
LC on an input list L, and suppose P is a set of pointers
of CT in disjoint subtrees of CT. Then P is conservative with
respect to L.


Proof: An inorder traversal of CT alternately visits list elements 
(leaves) and their mates (internal nodes) in the same order that 
the list elements and mates appear in L. Thus, the pointers in P 
correspond to disjoint paths in L. By the Shortcut Lemma, any
set of pointers that correspond to disjoint paths in the list L is
conservative with respect to L. ∎

Algorithm LC, which constructs a contraction tree in O(lgn) 
steps, is a randomized algorithm. By using the “deterministic 
coin tossing” technique of Cole and Vishkin [5] the algorithm can 
be performed nearly as well deterministically. Specifically, the 
randomized pairing step can be performed deterministically in 
O(lg* m) steps on a DRAM with m processors, where lg* m is
the number of times the logarithm function must be successively
applied to reduce m to a value no greater than 1. The overall
running time for list contraction is thus O(lg n lg* m).
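
The definition of lg* translates directly into a few lines of code. This sketch assumes base-2 logarithms; the name lg_star is hypothetical.

```python
import math

def lg_star(m):
    """Number of times lg must be applied to bring m down to at most 1."""
    count = 0
    while m > 1:
        m = math.log2(m)
        count += 1
    return count
```

The function grows extremely slowly: lg_star(65536) is 4, since 65536 → 16 → 4 → 2 → 1.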


4. Tree contraction 


This section presents a conservative tree contraction algorithm, 
Algorithm TC, based on the tree contraction ideas of Miller and 
Reif [17]. The algorithm uses a recursive pairing strategy to build 
a contraction tree for an input rooted binary tree in much the 
same manner as Algorithm LC does for a list. Algorithm TC is 
a randomized algorithm, and with high probability, the height
of the contraction tree and the number of steps on a DRAM are
both O(lg n). A deterministic variant based on deterministic coin
tossing [5] runs in O(lg n lg* m) steps, where m is the number of
processors in the DRAM, and produces trees of height O(lgn). 

We now outline Algorithm TC, which performs tree contraction.
(For a more complete description, see [16].) At each step,
nodes in the input tree are paired. The pairing strategy has each 
node pick from among its neighbors according to how many chil- 
dren it has. A leaf picks its parent with probability 1. A node 
with exactly one child picks its child or its parent, each with 
probability 1/2. A node with two children picks each child with 
probability 1/2. The root, which has no parent, picks its children 
with equal probability. If two nodes pick each other, they merge 
and the parent takes over. A new internal node of the contraction 
tree is made from the spare of the parent in the pair. The spare 
of the new node is the spare of the child in the pair. One child 
in the contraction tree is the parent of the pair, and the other 
child in the contraction tree is the child of the pair. The new 
nodes and the unpaired nodes form themselves into a new tree, 
which is guaranteed to be binary by the pairing strategy. The
algorithm is applied recursively to the new tree. 
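
The pairing strategy can be simulated sequentially as a sketch. The simplifications are assumptions of this illustration: spares and the contraction tree itself are not built, a merged pair keeps the identifier of the parent, and the input is a random binary tree produced by a hypothetical helper. The names random_binary_tree, pairing_step, and contract are not from the source.

```python
import random

def random_binary_tree(n, rng):
    """Random rooted binary tree on nodes 0..n-1 with root 0."""
    parent, children = {0: None}, {0: []}
    for v in range(1, n):
        p = rng.choice([u for u in children if len(children[u]) < 2])
        parent[v], children[v] = p, []
        children[p].append(v)
    return parent, children

def pairing_step(parent, children, rng):
    """Return the (parent, child) pairs that picked each other."""
    pick = {}
    for v, kids in children.items():
        if len(kids) == 0:
            options = [parent[v]]               # a leaf picks its parent
        elif len(kids) == 1:
            options = [kids[0], parent[v]]      # child or parent
        else:
            options = list(kids)                # one of the two children
        options = [o for o in options if o is not None]  # root has no parent
        pick[v] = rng.choice(options) if options else None
    return [(parent[c], c) for c in children
            if parent[c] is not None
            and pick[c] == parent[c] and pick[parent[c]] == c]

def contract(parent, children, rng):
    """Contract the tree to a single node; return the number of steps."""
    steps = 0
    while len(children) > 1:
        steps += 1
        for p, c in pairing_step(parent, children, rng):
            # The merged node keeps p's identity and inherits c's children.
            children[p] = [k for k in children[p] if k != c] + children[c]
            for g in children[c]:
                parent[g] = p
            del children[c], parent[c]
            assert len(children[p]) <= 2   # the new tree stays binary
    return steps

rng = random.Random(1)
parent, children = random_binary_tree(200, rng)
steps = contract(parent, children, rng)
```

The internal assertion reflects why the new tree is binary: a node with two children never picks its parent, so the child in a merged pair contributes at most one child to the merged node.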

After the input tree has been contracted to a single node, it 
may be “expanded” by undoing the contractions in the reverse of 
the order in which they occurred. When an internal node of the
contraction tree expands to a parent-child pair, the tree pointers
of these nodes are used to restore the tree to the state it had when
the contraction took place. These pointers have been undisturbed 
by the algorithm since the nodes merged. The expansion requires 
only constant space at each node. In the next section we will see 
that tree expansion allows us to describe treefix computations 
recursively. 

The proof that, with high probability, Algorithm TC takes 
O(lg n) steps to contract an input rooted binary tree to a single
node requires three technical lemmas. The first lemma shows that 
in a binary tree, the number of nodes with two children and the 
number of leaves are nearly equal. The second lemma provides 
an elementary bound on the expectation of a discrete random 
variable with a finite upper bound. The last lemma presents a 
“Chernoff” [4] type bound on the tail end of a binomial distribu- 
tion. 

In the theorems and lemmas that follow, let V_0, V_1, and V_2
denote the sets of nodes (excluding the root) with zero, one, or
two children, respectively, in a rooted binary tree of |V| nodes.
Let |V_0|, |V_1|, and |V_2| denote the sizes of these sets and let d(r)
be the degree of the root.

The first lemma shows that in a binary tree, the number of
leaves equals the number of nodes with two children
(excluding the root) plus the degree of the root.


Lemma 6 |V_2| + d(r) = |V_0|. ∎
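
Lemma 6 is easy to spot-check on random trees. The sketch below is not a proof; it builds random binary trees with a hypothetical helper and compares the two sides of the identity, taking d(r) to be the number of children of the root.

```python
import random

def random_binary_tree(n, rng):
    """children[v] lists the (at most two) children of v; node 0 is the root."""
    children = {0: []}
    for v in range(1, n):
        p = rng.choice([u for u in children if len(children[u]) < 2])
        children[p].append(v)
        children[v] = []
    return children

def lemma6_holds(children, root=0):
    """Check |V_2| + d(r) = |V_0|, with the root excluded from both sets."""
    v0 = sum(1 for v, kids in children.items() if v != root and len(kids) == 0)
    v2 = sum(1 for v, kids in children.items() if v != root and len(kids) == 2)
    return v2 + len(children[root]) == v0

all_ok = all(
    lemma6_holds(random_binary_tree(n, random.Random(n))) for n in range(1, 60)
)
```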


The next lemma provides a lower bound on the probability 
that a discrete random variable with a finite upper bound will be 
larger than some fixed value. 


Lemma 7 Let X ≤ b be a discrete random variable with expected
value μ. For w < b,

\[ \Pr(X \ge w) \ge \frac{\mu - w}{b - w}. \] ∎


The final lemma presents a bound on the tail end of a binomial 
distribution. Consider a set of t independent Bernoulli trials, 
each occurring with probability p of success. The probability 
that fewer than s successful trials occur is 


\[ B(s,t,p) = \sum_{k=0}^{s-1} \binom{t}{k} p^k (1-p)^{t-k}. \]


The lemma bounds the probability B(s,t,p) that fewer than s
successes occur in t trials when t > 2s and p < 1/2.


Lemma 8 For t > 2s and p < 1/2,


\[ B(s,t,p) < \left(\frac{t-s}{t-2s}\right) (1-p)^t \left(\frac{et}{s}\right)^s. \] ∎


We now prove that with high probability, Algorithm TC takes 
O(lgn) steps to contract a rooted binary tree to a single node. 
The key observation in the proof is that for each node that pairs 
with its parent, the number of nodes in the tree decreases by one. 


Theorem 9 With high probability, Algorithm TC takes O(lgn) 
contraction steps to contract a rooted binary tree of n nodes to a 
single node. 


Proof: The proof has three parts. We use Lemma 6 to show
that if a rooted binary tree has |V| nodes, the expected
number of nodes pairing with a parent is at least |V|/4. Next,
we call a pairing step “successful” if at least |V|/8 nodes pair
with a parent. In the resulting contraction, the tree shrinks
to at most 7/8 of its previous size. We use Lemma 7 to prove
that the probability that a pairing step is successful is at least
1/3. Finally, we use Lemma 8 to show for any constant k that
after a log_{8/7} n steps, for some constant a > 2, the probability
that fewer than log_{8/7} n successful steps occur is O(1/n^k).

We first show that the expected number of nodes pairing with
a parent is at least |V|/4. A node is picked by its parent with
probability 1 when its parent is a degree 1 root, and 1/2 otherwise.
Thus a leaf pairs with its parent with probability at least
1/2, and a node (other than the root) with one child pairs with its
parent with probability at least 1/4. Let P be the number of nodes
pairing with a parent. Apply Lemma 6 to the simple bound on
the expected value of P,

\[ E(P) \ge \frac{|V_0|}{2} + \frac{|V_1|}{4}, \]

to yield the desired result:

\[ E(P) \ge \frac{|V_0| + |V_1| + |V_2| + d(r)}{4} \ge \frac{|V|}{4}. \]

Now we show that the probability that a pairing step is successful
is at least 1/3. At most half of the nodes pair with their parents.
Using Lemma 7 with b = |V|/2, w = |V|/8, and μ ≥ |V|/4,
we have

\[ \Pr(P \ge |V|/8) \ge \frac{|V|/4 - |V|/8}{|V|/2 - |V|/8} = \frac{1}{3}. \]


Finally, we show that with high probability, Algorithm TC 
takes O(lgn) contraction steps to contract the input tree to a 
single node. In the contraction following a successful pairing 
step, the size of the tree decreases by a factor of 7/8 or more. 
After log_{8/7} n successful steps, the tree must consist of a single
node. By Lemma 8 with p = 1/3, the probability that fewer than
s = log_{8/7} n successful steps occur in as steps is

\[ B(\log_{8/7} n,\; a \log_{8/7} n,\; 1/3) < 2 \left( \left(\tfrac{2}{3}\right)^{a} a e \right)^{\log_{8/7} n}. \]

For any value k, we can choose a so that B(log_{8/7} n, a log_{8/7} n, 1/3) is
O(1/n^k). In particular, for k = 1, setting (2/3)^a ae < 7/8 yields
a ≥ 8. ∎
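
The two numerical facts used in this proof can be checked with a few lines of arithmetic. The sketch below evaluates the Lemma 7 ratio with |V| normalized to 1, and searches for the smallest integer a at which (2/3)^a · a · e drops below 7/8. The search starts at a = 3, where the expression is already decreasing in a; that starting point and all names are assumptions of this illustration.

```python
import math
from fractions import Fraction

# Lemma 7 with b = |V|/2, w = |V|/8, mu = |V|/4, normalized so |V| = 1.
success_lower_bound = (Fraction(1, 4) - Fraction(1, 8)) / (
    Fraction(1, 2) - Fraction(1, 8)
)

def per_step_base(a):
    """The quantity (2/3)^a * a * e raised to the power log_{8/7} n."""
    return (2 / 3) ** a * a * math.e

def smallest_integer_a(target=7 / 8, start=3):
    """Smallest integer a >= start with per_step_base(a) < target."""
    a = start
    while per_step_base(a) >= target:
        a += 1
    return a
```

With the base at most 7/8, raising it to the power log_{8/7} n gives a bound of 1/n, which is the k = 1 case.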

We now prove that Algorithm TC is conservative. 


Theorem 10 Algorithm TC is conservative.


Proof: The key idea is that each active element in the contracted 
tree is a “representative” of a subgraph of the input tree that has 
been contracted to a single node. The contracted subgraphs, 
which are trees, are disjoint in the input tree. The representative 
and spare of a subgraph are either elements in or mates of ele- 
ments in the subgraph. The pairing strategy ensures that each 
subgraph is adjacent by an edge to at most one subgraph which 
is higher in the input tree, and to at most two subgraphs which 
are lower. The representative of the subgraph has pointers to 
the representatives of these subgraphs, and to the spare of the 
subgraph. 

As in the list contraction algorithm, memory accesses in a step
of the tree contraction algorithm correspond to a set of disjoint
paths in the input data structure. Since each subgraph is
connected, the pointers between representatives and spares cor- 
respond to disjoint paths in the input tree. Similarly, any set 
of pointers between each representative and the representative 
of at most one of the two adjacent subgraphs lower in the input
tree corresponds to a set of disjoint paths in the input tree. The
memory accesses in a step correspond to a set of pointers between
representatives and spares or to a set of pointers between each
representative and the representative of at most one of the two
adjacent subgraphs lower in the input tree. ∎

Tree contraction can be performed conservatively and deter- 
ministically in O(lgnlg* m) steps on a DRAM with m processors 
using the deterministic coin tossing algorithm of Cole and Vishkin 
[5]. The key idea is that in Algorithm TC, the nodes in the tree 
that can pair form chains, and by Lemma 6 these chains contain 
at least half the tree edges. The chains can be oriented from child 
to parent in the tree, and deterministic coin tossing can be used 
to perform the pairing step in O(lg* m) steps. 


5. Treefix computations 


This section presents a generalization of the parallel prefix computation
to binary trees. We present two kinds of treefix
computations, rootfix and leaffix, and show how they can be
implemented by an O(lg n)-step conservative algorithm in linear
space. As we shall see in Section 6, treefix computations
can greatly simplify the description of many parallel graph algorithms
in the literature, and moreover, treefix computations can
be performed by conservative algorithms.


We begin with a definition of treefix computation. 


Definition 11 Let D be a domain with a binary associative operation
⊗ and an identity element ε. Let T be a rooted, binary
tree in which each vertex i ∈ T has an assigned value x_i ∈ D. The
rootfix problem is to compute for each vertex i ∈ T with parent j,
the value y_i = y_j ⊗ x_i, where y_j = ε if i is the root. The leaffix
problem is to compute for each vertex i ∈ T with left child j and
right child k, the value y_i = x_i ⊗ y_j ⊗ y_k, where y_j = ε if i has
no left child and y_k = ε if i has no right child.


Simple examples of treefix problems are computing the depth of 
each vertex in a rooted binary tree and computing the size of each 
subtree. These and other examples appear in the next section. 
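
The two definitions can be made concrete with a sequential reference sketch. It is not the conservative parallel algorithm; it only evaluates the recurrences of Definition 11 directly. The tree encoding (children[v] is a pair of left and right child, with None for a missing child) and the function names are assumptions of this illustration.

```python
def rootfix(children, root, x, op, identity):
    """y[v] = y[parent(v)] (+) x[v], with the identity above the root."""
    y = {}
    stack = [(root, identity)]
    while stack:
        v, y_parent = stack.pop()
        y[v] = op(y_parent, x[v])
        for c in children[v]:
            if c is not None:
                stack.append((c, y[v]))
    return y

def leaffix(children, root, x, op, identity):
    """y[v] = x[v] (+) y[left(v)] (+) y[right(v)]."""
    y = {}

    def visit(v):
        left, right = children[v]
        y_l = visit(left) if left is not None else identity
        y_r = visit(right) if right is not None else identity
        y[v] = op(op(x[v], y_l), y_r)
        return y[v]

    visit(root)
    return y

def add(a, b):
    return a + b

# A small example tree: 0 has children 1 and 2, and 1 has left child 3.
children = {0: (1, 2), 1: (3, None), 2: (None, None), 3: (None, None)}
depth = rootfix(children, 0, {0: 0, 1: 1, 2: 1, 3: 1}, add, 0)
size = leaffix(children, 0, {v: 1 for v in children}, add, 0)
```

The two calls compute the depth of each vertex and the size of each subtree, the two simple examples mentioned in the text.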
Like the prefix computation on lists, treefix computations can 
be performed directly on the contraction tree. To simplify the de- 
scription here, however, we describe a recursive version. We exe- 
cute one contraction step and then recursively perform a treefix 
computation on the new tree. The treefix values for the input 
tree can be computed immediately from the treefix values for the 
new tree. The recursion level of a node in the contraction tree 
can be maintained by recording the step in which it was created. 


Theorem 12 A rootfix or leaffix computation can be performed
by a conservative randomized algorithm which, with high probability,
takes O(lg n) steps, or by a conservative deterministic
algorithm which takes O(lg n lg* m) steps, where m is the number
of processors in the DRAM.


Proof: We first describe the computation for rootfix. First the
input binary tree T is transformed to a new tree T′ by one contraction
step. Each new node u in T′ resulting from the pairing
of parent p and child c in T passes input value x_c to the child in
T′ that u inherited from c. Node u passes ε to its other child.
Each unpaired node v in T′ passes ε to each of its children. Each
node in T′ receives a value from its parent, call it z. Each new
node u computes x′_u ← z ⊗ x_p. Each unpaired node v computes
x′_v ← z ⊗ x_v. A rootfix computation is performed recursively on
T′ using the x′ values as input and yielding y′ values as output.
The contraction step from T to T′ is then undone. Each new
node u passes y′_u to p and c. Node p computes y_p ← y′_u, and c
computes y_c ← y′_u ⊗ x_c. Each unpaired node v computes y_v ← y′_v.

We now describe a computation of which leaffix is a special
case. Each node i in T is assigned input values x_i, l_i, and r_i.
Node i with left child j and right child k computes output value
y_i = x_i ⊗ y_j ⊗ l_i ⊗ y_k ⊗ r_i, where y_j and l_i are ε if i has no left child
and y_k and r_i are ε if i has no right child. For the special case of
leaffix, l_i and r_i are both ε. First, a contraction step transforms T
to T′. Consider each new node u in T′ resulting from the pairing
of parent p and left child c in T with input values x_p, l_p, r_p and
x_c, l_c respectively. (The cases where c is a right child, or where
c has a right child or is a leaf, are similar.) Node u computes
x′_u ← x_p ⊗ x_c, l′_u ← l_c ⊗ l_p, and r′_u ← r_p. Each unpaired node
v computes x′_v ← x_v, l′_v ← l_v, and r′_v ← r_v. The computation
is performed recursively on T′ using the x′ values as input and
yielding y′ values as output. Each node passes its y′ value to its
parent in T′. Each node receives values from its left and right
children, call these values z_l and z_r. Each new node u passes z_l
to c and both z_l and z_r to p. The contraction step from T to T′
is undone. Node c computes y_c ← x_c ⊗ z_l ⊗ l_c. Node p computes
y_p ← x_p ⊗ x_c ⊗ z_l ⊗ l_c ⊗ l_p ⊗ z_r ⊗ r_p. Each unpaired node v
computes y_v ← y′_v.


6. Conservative algorithms 


This section presents a collection of conservative DRAM algo- 
rithms, all of which use treefix computations. The algorithms 
use two processors per edge of an input graph G = (V, E) and require
constant extra space in each processor. Since the algorithms
are based on shortcutting pointers in the input data structure, 
they are independent of the underlying DRAM or embedding of 
the data structure. 

We represent each vertex in an undirected graph G = (V, E) by 
a doubly linked incidence ring of processors, one for each edge.
Each element of the incidence ring contains pointers to the next 
and previous elements in the ring, and one pointer for a graph 
edge. For each edge (u,v) € E the element in the incidence ring 
for u contains a pointer to an edge element in the incidence ring 
for v, and vice versa. A directed graph is represented in the same 
doubly linked fashion, but the graph edges are labeled with their 
directions. 
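
The incidence ring representation can be sketched as an array of records. The field names (vertex, twin, next, prev) and the function build_incidence_rings are hypothetical; the text specifies only that each element has ring pointers to the next and previous elements and one pointer for its graph edge.

```python
def build_incidence_rings(edges):
    """Return a list of ring elements, two per undirected edge.

    Each element is a dict with the vertex it belongs to, the index of
    the element at the other end of its graph edge ("twin"), and the
    indices of the next and previous elements in its vertex's ring.
    """
    elems = []
    by_vertex = {}
    for u, v in edges:
        a, b = len(elems), len(elems) + 1
        elems.append({"vertex": u, "twin": b})
        elems.append({"vertex": v, "twin": a})
        by_vertex.setdefault(u, []).append(a)
        by_vertex.setdefault(v, []).append(b)
    for ring in by_vertex.values():
        for k, e in enumerate(ring):
            elems[e]["next"] = ring[(k + 1) % len(ring)]
            elems[e]["prev"] = ring[k - 1]
    return elems

triangle = build_incidence_rings([(0, 1), (0, 2), (1, 2)])
```

For the triangle, each vertex gets a ring of two elements and the structure uses six elements in total, two per edge.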

We represent trees with arbitrary vertex degrees by an inci- 
dence ring structure as well. If the tree is directed, each ring has 
a unique principal element that points toward the root. Breaking 
the incidence ring before the principal element yields the stan- 
dard binary tree representation of the tree [10, pp. 332-333]. 

We now present brief descriptions of the algorithms. The performance
is given in terms of the number of steps on a DRAM when
the input representation has size n. We assume the implicit tree
contractions in the algorithms are performed by the randomized
Algorithm TC. Deterministic bounds can be obtained by multiplying
the number of steps by O(lg* m), where m is the number
of processors. An upper bound on the actual performance can be 
obtained by multiplying the number of steps by the load factor 
of the input. 

Generalized treefix. Perform a treefix operation on a directed
tree with arbitrary vertex degree. The input values {x_i} are
stored in the principal elements of the tree, which is where the
output values {y_i} are to be placed. The leaffix value at a node i
whose children have values y_1, y_2, ..., y_c is y_i = x_i ⊗ y_1 ⊗ y_2 ⊗
... ⊗ y_c. Each element that is not principal stores the identity
element ¢ for its value. A binary treefix computation performed 
on the binary tree representation underlying the tree computes 
the desired values. Performance: O(lgn). 

Tree functions. Given a directed tree, compute for each node
the number of descendents, its height, or its depth. The number
of descendents for each node can be computed by a leaffix computation
with ⊗ as integer addition and x_i = 1 for all nodes. The
height of a node can also be computed by a leaffix computation
where a ⊗ b = max(a + 1, b + 1), the identity is ε = −∞, and
x_i = 0 for all nodes. The depth of a node can be computed by a
rootfix computation with ⊗ as addition and x_i = 1 for all nodes
except the root which has value 0. Performance: O(lgn).


Rooting an undirected tree. Pick a root of a tree with 
undirected graph pointers and orient the graph pointers toward
the root. Form an “Eulerian tour” of the pointers of the repre- 
sentation [20] by directing each element of the tree to link its 
incoming ring pointer with its graph edge directed outward and 
its graph edge directed inward with its outgoing ring pointer. 
Each graph edge is used twice in the tour, once in each direction, 
but each ring pointer is used only once. Using the variant of 
Algorithm LC which works for circular lists, form a contraction 
tree of the tour. Choose the root of the contraction tree to be 
the root of the tree, and break the tour so that it begins with the 
root. Use parallel prefix to number each node according to its 
first occurrence in the tour. Use contraction trees to distribute 
the smallest value in each incidence ring to the elements of the 
ring. Orient each graph edge from the larger value to the smaller. 
Performance: O(lgn). 

Rerooting a directed tree. Given a directed tree and another
distinguished vertex k, reorient the graph edges of the tree
to point to k. The algorithm for rooting a tree can be used by
picking k as the root instead of the root of the contraction tree,
but a single treefix computation suffices. Perform a leaffix computation
with x_k = 1 and x_i = 0 if i ≠ k, and use Boolean OR
for ⊗. Each principal element whose leaffix value is 1 lies on the
path from k to the root. Reverse the direction of the graph
pointers of these elements. (Note: rerooting a tree changes the 
principal elements.) Performance: O(lgn). 

Tree-walk numberings of a binary tree. Number the nodes
of a binary tree according to the order they would be visited in
a preorder/inorder/postorder tree walk. For each of the walks,
we will compute y_k, the number of nodes visited before the left
subtree of k. Use a leaffix computation to compute the size
size_k of the subtree rooted at k. We first compute the preorder
numbering. (For the purposes of these numbering algorithms, we
consider the root to be a left child.) If node k is a left child, set
x_k to 1. If node k is a right child, set x_k to 1 plus the size of its
sibling subtree. A rootfix computation with + yields y_k, which
is the preorder numbering of node k. The inorder numbering can
be computed similarly. If node k is a left child, set x_k to 0. If
k is a right child, set x_k to 1 plus the size of its sibling subtree.
Compute y_k for each node using a rootfix computation with +.
The inorder numbering of node k is 1 plus y_k plus the size of its
left subtree. The postfix numbering can be computed by setting
x_k to 0 if node k is a left child, and by setting x_k to the size of its
sibling subtree if k is a right child. After computing y_k using a
rootfix computation with +, the postfix numbering of node k is 1
plus y_k plus the sizes of its two subtrees. Performance: O(lg n).
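
The three numbering rules can be checked with a sequential sketch that evaluates the subtree sizes and the rootfix sums directly. It is a reference computation, not the conservative parallel algorithm. The tree encoding (children[v] is a pair of left and right child, with None for a missing child) and the name walk_numberings are assumptions of this illustration.

```python
def walk_numberings(children, root):
    """Return (preorder, inorder, postorder) numberings, starting at 1."""
    size = {}

    def compute_size(v):
        if v is None:
            return 0
        left, right = children[v]
        size[v] = 1 + compute_size(left) + compute_size(right)
        return size[v]

    def sz(v):
        return size[v] if v is not None else 0

    compute_size(root)
    pre, ino, post = {}, {}, {}
    # Each stack entry carries the rootfix sums of the parent for the
    # three walks, whether the node is a right child, and its sibling.
    stack = [(root, 0, 0, 0, False, None)]
    while stack:
        v, y_pre, y_in, y_post, is_right, sibling = stack.pop()
        left, right = children[v]
        # The x values from the text; the root counts as a left child.
        y_pre += 1 + sz(sibling) if is_right else 1
        y_in += 1 + sz(sibling) if is_right else 0
        y_post += sz(sibling) if is_right else 0
        pre[v] = y_pre
        ino[v] = 1 + y_in + sz(left)
        post[v] = 1 + y_post + sz(left) + sz(right)
        if left is not None:
            stack.append((left, y_pre, y_in, y_post, False, right))
        if right is not None:
            stack.append((right, y_pre, y_in, y_post, True, left))
    return pre, ino, post

children = {
    0: (1, 2), 1: (3, 4), 2: (None, 5),
    3: (None, None), 4: (None, None), 5: (None, None),
}
pre, ino, post = walk_numberings(children, 0)
```

In the example tree the preorder walk visits 0, 1, 3, 4, 2, 5, the inorder walk visits 3, 1, 4, 0, 2, 5, and the postorder walk visits 3, 4, 1, 5, 2, 0.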

Prefix/postfix numbering of a directed tree. Number the
edges of an arbitrary directed tree according to the order they are
visited in a preorder/postorder tree walk. The problem reduces to
prefix/postfix numbering on the underlying binary tree represen- 
tation. Performance: O(lgn). 

Diameter and center of a tree. The diameter is the length
of the longest path in the tree. A center is a vertex v such that
the longest path from v to a leaf is minimal over all vertices in
the tree. The diameter can be determined by rooting the tree and
using rootfix to find the furthest leaf from the root. Reroot the
tree at this leaf. The distance from the new root to the furthest 
leaf is the diameter. (Based on an analog algorithm attributed to 
J. Wennmacker [6].) A center of the tree can be determined by 
finding a median element of the path that realizes the diameter. 
Performance: O(lgn). 

Centroid of a tree. A centroid is a vertex v such that the
largest subtree with v as a leaf is minimal over all vertices in
the tree. A centroid can be determined by rooting the tree and
computing the size of each subtree. By broadcasting the size m
of the tree from the root, each graph edge in each incidence ring 
can determine the number of elements on the other side of the 
edge. For each incidence ring, compute the maximum of these 
values. A vertex with the minimum of these maximum values is 
a centroid. Performance: O(lgn). 
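
As a sequential illustration of the same computation (subtree sizes, then the largest piece on the other side of each incident edge), consider the sketch below. It assumes the tree is given as a dictionary of neighbor lists; the function name `tree_centroid` is hypothetical and the traversal is an ordinary depth-first search rather than a leaffix/rootfix computation.

```python
def tree_centroid(adj, root):
    """Return a centroid of the tree {vertex: [neighbors]} rooted at root."""
    parent = {root: None}
    order = []
    stack = [root]
    while stack:                       # iterative DFS, parents before children
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v != parent[u]:
                parent[v] = u
                stack.append(v)

    size = {u: 1 for u in adj}
    for u in reversed(order):          # children are processed before parents
        if parent[u] is not None:
            size[parent[u]] += size[u]
    m = size[root]                     # the tree size "broadcast" from the root

    def largest_piece(u):
        # sizes of the subtrees hanging off u across each incident edge
        pieces = [size[v] for v in adj[u] if v != parent[u]]
        if parent[u] is not None:
            pieces.append(m - size[u])
        return max(pieces, default=0)

    return min(adj, key=largest_piece)
```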

Separator of a tree. A separator [15] is a partition of the
vertices of an m-vertex tree into three sets A, B, and C, with
|A| ≤ (2/3)m, |B| = 1, and |C| ≤ (2/3)m, such that no edge of the
tree goes between a vertex in A and a vertex in C. Determine
a centroid of the tree. This vertex is the separator vertex in B.
It remains to partition the remaining vertices between A and C.
For each graph edge in the incidence ring, count the number of
vertices in the subtree on the other side of the edge. Put the
largest subtree in A. Use parallel prefix on the incidence ring to
compute a running sum of the sizes of the other subtrees. Put
all subtrees whose prefix value is at most (1/3)m in C, and put the
remainder in A. Performance: O(lg n).


Subexpression evaluation. Given a directed tree in which
each leaf has a value and each internal node has an operator from
{+, -, ×, ÷}, compute for each internal node the subexpression
rooted at that node. A single leaffix computation suffices using
the ideas of Brent [2] and Miller and Reif [17]. Performance:
O(lg n).

Minimum cost spanning tree. Given an undirected input
graph G = (V, E) and a cost function w : E → R, determine
a set F ⊆ E of edges such that each vertex in V is incident
on an edge of F, and the sum of the weights of the edges in F
is minimal. We give a conservative DRAM implementation of
Sollin's algorithm [8, section 5.5]. We assume without loss of
generality that the edge weights are distinct; otherwise, we can
assign the weight of a graph edge e between two incidence ring
elements with addresses a and b to be (w(e), max(a, b), min(a, b))
and then compare weights lexicographically. We determine F by
marking edges in G. Initially, no edges are marked. At each
step of the algorithm, the currently marked graph edges form a
subforest of F. Break each incidence ring by removing a single
ring pointer and direct the resulting linear list. At each step
of the algorithm, the marked graph edges and the ring pointers
form a set {T_i} of rooted trees, where the index i of the tree is
the address of the root. The algorithm proceeds as follows. For
each tree T_i, use a rootfix computation to broadcast i to all of
the elements in T_i. Use a leaffix computation on T_i to determine
an edge e ∈ E with the smallest weight w(e) connecting an edge
element u ∈ T_i with an edge element v ∈ T_j, where i ≠ j. If no
such edge exists, the algorithm terminates. If T_i picks the same
edge as T_j, the tree with smaller index does nothing. Otherwise,
mark e as a member of F, directing it into T_j, and reroot T_i with
u as the new root. Repeat this procedure until the algorithm
terminates. Performance: O(lg^2 n).
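
The round structure of Sollin's algorithm can be sketched sequentially as follows. This is only an illustration under stated assumptions: the trees T_i are tracked with a union-find structure instead of rootfix and leaffix computations on incidence rings, vertices are numbered 0..n-1, and the name `sollin_mst` and the `(w, a, b)` edge format are hypothetical. Ties are broken with the lexicographic key (w(e), max(a, b), min(a, b)) described above.

```python
def sollin_mst(n, edges):
    """edges: list of (w, a, b). Returns the set of indices of chosen edges."""
    parent = list(range(n))

    def find(x):                       # root (index) of the tree containing x
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def key(i):                        # lexicographic weight, distinct per edge
        w, a, b = edges[i]
        return (w, max(a, b), min(a, b))

    chosen = set()
    while True:
        best = {}                      # tree index -> lightest outgoing edge
        for i, (_, a, b) in enumerate(edges):
            ra, rb = find(a), find(b)
            if ra == rb:
                continue               # both endpoints already in one tree
            for r in (ra, rb):
                if r not in best or key(i) < key(best[r]):
                    best[r] = i
        if not best:                   # no tree has an outgoing edge
            return chosen
        for i in best.values():
            _, a, b = edges[i]
            ra, rb = find(a), find(b)
            if ra != rb:               # two trees may have picked the same edge
                parent[ra] = rb
                chosen.add(i)
```

Each round at least halves the number of trees that still have an outgoing edge, which is where the logarithmic number of rounds comes from.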


Connected components. Given an undirected input graph
G = (V, E), determine a labeling l : V → Z such that
l(v) = l(v') if and only if v and v' are in the same connected
component of G. The algorithm is the same as the minimum
spanning tree algorithm, choosing the weight of a graph edge e
between incidence ring elements with addresses a and b to be
(max(a, b), min(a, b)). The label of a vertex is the index of its tree.
Performance: O(lg^2 n).

Biconnected components. Two edges of an undirected graph
G = (V, E) are in the same biconnected component if they lie on
a common simple cycle. Determine a labeling l : E → Z such that
l(e) = l(e') if and only if e and e' are in the same biconnected
component of G. We give a conservative DRAM implementation
of the biconnectivity algorithm of Tarjan and Vishkin [20]. We
assume that the reader has some familiarity with that algorithm.
Find a (directed) minimum spanning tree T = (V, F) of G. Number
the vertices in the minimum spanning tree in preorder. Use
leaffix computations to compute for each vertex v three values:
nd(v), low(v), and high(v). Here nd(v) is the number of descendants
of v, while low(v) and high(v) are the lowest and highest
vertices (with respect to the preorder numbering of T) that are
either a descendant of v or adjacent to a descendant of v by an
edge of E - F. Build a new graph G' where the edges of F
are the vertices of G'. Let e be an edge from u to p(u), where
p(u) is the parent of u in F. The adjacency ring for u in G acts
as the adjacency ring for e in G'. Add two kinds of edges to
G'. For each edge {w, v} in E - F such that v + nd(v) < w,
add an edge {{v, p(v)}, {w, p(w)}} to G'. For each edge (v, p(v))
of F such that v ≠ 1 and p(v) ≠ 1, and low(v) < low(p(v)) or
high(v) > p(v) + nd(p(v)), add an edge {{v, p(v)}, {p(v), p(p(v))}}
to G'. It can be verified that the representation of G' is conservative
with respect to the representation of G. Find the connected
components of G'. Two edges of F are in the same block if as
vertices in G' they are in the same connected component. Finally,
for each edge e = {w, v} in E - F, let l(e) = l({w, p(w)}).
Performance: O(lg^2 n).

Eulerian cycle. An Eulerian cycle of an undirected graph
G = (V, E) is a cycle containing each edge in E exactly once. If
any vertex has odd degree, then no Eulerian cycle exists. Form
a set of disjoint cycles of the pointers of the representation of
G as in the algorithm for directing a tree. The cycles can be
merged using an algorithm similar to the minimum spanning tree
algorithm. Performance: O(lg^2 n).


7. Concluding remarks 


The efficiency of a DRAM algorithm depends on how well its 
input is embedded in the DRAM, but this embedding problem 
must be faced by algorithm designers in any bandwidth-limited 
distributed network. In general, the problem of determining the 
best embedding is NP-complete, but for many common situa- 
tions, good embeddings can be found. For example, any of our 
polylogarithmic-step conservative algorithms can be run on pla- 
nar graphs properly embedded in an area-universal fat-tree, and 
they will exhibit polylogarithmic time performance. 

As another example, a subproblem in switch-level simulation of 
a VLSI circuit is the finding of electrically equivalent portions of 
the circuit. A naive divide-and-conquer embedding of the circuit 
on an area-universal fat-tree [12] yields small load factors for ev- 
ery cut. Thus, our conservative connected components algorithm 
will never cause undue congestion in communicating messages in 
the underlying network, and the algorithm will run effectively as 
fast as on an expensive, high-bandwidth network. 

Many other classes of algorithms can be implemented in a con- 
servative fashion on a DRAM. Any algorithm that communicates 
only across pointers in an input data structure is conservative.
Passing a single datum between two processors, however, can re- 
quire time linear in the diameter of the data structure, whereas 
our algorithms all run in a polylogarithmic number of steps. As 
another example, systolic array algorithms for matrix problems
[11,14] can be implemented efficiently if the matrices are properly
embedded. In general, any fixed-connection network algorithm 
will run well on a DRAM if the communication required by the 
network can be supported by the underlying DRAM network. 

As a final comment, it may well be that the notion of a con- 
servative algorithm is too conservative. A contraction tree is not 
conservative with respect to its input tree (though the levels of 
the contraction tree are), but the load factor of the contraction 
tree is at most O(lgm) times the input load factor. As a practi- 
cal matter, it is probably not worth worrying whether every set 
of memory accesses is conservative with respect to the input, as 
long as the load factor of memory accesses is polylogarithmically 
bounded. Algorithms with this looser bound are somewhat eas- 
ier to code because of the relaxed constraint, and they should 
perform comparably.


EFFICIENT PARALLEL ALGORITHMS FOR GRAPH PROBLEMS


Clyde P. Kruskal


ABSTRACT New efficient techniques for manipulating
linked structures in parallel are presented. These techniques 
work on the EREW machine, which is the weakest shared 
memory machine. Using these techniques we develop algo- 
rithms for connected components, spanning trees, minimum 
spanning trees, and other graph problems. All of these algo- 
rithms achieve linear speed-up for all but the sparsest graphs. 
We also present a new parallel radix sort algorithm that is 
optimal for keys taken from a small range. 


1. INTRODUCTION 


We present new parallel algorithms for several important 
graph problems including spanning forest, connected com- 
ponents, biconnected components, and (undirected) minimum 
cost spanning tree (MST). All of the algorithms are improve- 
ments over the best results known in several respects: they 
assume the weakest model of shared memory parallel compu- 
tation, i.e. the EREW (exclusive read, exclusive write) model; 
they are deterministic; they are fast on all graph densities. In 
addition, the input is not required to be in any prearranged 
order, such as adjacency lists; the input can consist of an 
unordered list of edges. When the problem size is large rela- 
tive to the number of processors and the graphs are not 
extremely sparse (i.e. as long as n=o(e)), the algorithms 
achieve a linear speedup, which means they have optimal 
processor-time product. As a by-product, we obtain a parallel
version of radix sort that is optimal as long as the range of 
the elements sorted is at most polynomially larger than the 
number of processors. 


The algorithms employ a set of useful techniques for 
efficient parallel manipulation of data structures to avoid 
conflicts between processors. The most powerful and original 
technique is a recursive two-step, block-based algorithm. It is 
used in an algorithm for solving the "parallel prefix" problem
on a linked list of elements. In quite a different way, it is 
used to convert an edge list representation of a graph into an 
adjacency lists representation. The former representation is 
how a graph is often defined; the latter representation is often 
required for efficient parallel access to edges of the graph. 
Variants of the technique are shown to be useful for perform- 
ing table lookup, update without conflicts, and radix sort. 


Sections 2, 3, and 4 describe basic building blocks. Sec- 
tion 2 reviews the model and some work on solving linked list 
problems in parallel. In Section 3, we show how a graph can 
be converted from an edge list representation to an adjacency 
list representation efficiently in parallel. We also present the 
parallel radix sort algorithm. Section 4 shows how tree problems
can be solved efficiently in parallel. We then use the
basic building blocks in Section 5 to obtain parallel algorithms 
for finding spanning trees, connected components, biconnected 
components, and minimum spanning trees. A chart at the 
end of the paper summarizes our results and shows how they 
compare to the current best. 


2. PRELIMINARIES 


2.1. The Model 


We assume the PRAM computation model: a PRAM 
consists of p autonomous processors, all having access to a 
common memory. At each step, each processor performs one 
operation from its instruction stream. Instructions accessing 
shared memory are also assumed to be accomplished in one 
cycle. Borrowing the notation of [12], we distinguish three 
variants of the PRAM family: In a Concurrent Read, Con- 
current Write (CRCW) model, processors may simultaneously 
access the same memory location; there are various schemes 
for resolving write conflicts. In a Concurrent Read, Exclusive 
Write (CREW) model, several processors may simultaneously 
read the value of a memory location, but exclusive access is 
required for write. Finally, in the Exclusive Read, Exclusive 
Write (EREW) model, a memory location cannot be simul- 
taneously accessed by more than one processor. Our algo- 
rithms can be implemented on the weakest of these three 
models: the EREW PRAM. 


2.2. Linear Linked List Problems 


We now review our algorithm for solving problems on 
linear linked lists [7]; it will be extended in the following sec- 
tions to more general linked structures. Although our results 
for linear linked lists have since been improved by Cole and 
Vishkin [4], their technique does not seem to extend to more 
general linked structures. 


The product computation problem is to compute the product
$a_0 \circ a_1 \circ \cdots \circ a_{n-1}$, given n elements
$a_0, a_1, \ldots, a_{n-1}$,
and a binary, associative operation, denoted $\circ$, e.g. ordinary,
matrix, and Boolean addition and multiplication. The initial
prefix problem is to compute all n initial prefixes $a_0$, $a_0 \circ a_1$,
..., $a_0 \circ a_1 \circ \cdots \circ a_{n-1}$. The initial prefix problem when
solved in parallel is known as parallel prefix. When the elements
are laid out in known memory locations there are well
known methods to solve both problems in time
O(n/p + log p) [9].

The problem is harder to solve when the elements are
stored in a linear linked list. The locations of the elements are
given, but it is not known which element is which. Only the
location of the first element is given along with a map Succ
from the ith element to the (i+1)st element, and a map
Pred from the ith element to the (i-1)st element (the second
mapping can be computed from the first in O(n/p) time).


It is easy to solve the product and initial prefix problems 
sequentially in O(n ) time by starting at the first element and 
following the links. Wyllie [15] shows that this problem can 
be solved with p<n processors in O(nlogn / p) time, 
using a recursive doubling technique. The algorithm never 
achieves linear speedup. 
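
For reference, recursive doubling (pointer jumping) can be simulated round by round. The sketch below is a sequential simulation under stated assumptions, not Wyllie's original formulation: the function name `wyllie_prefix` is hypothetical, the list is given by predecessor indices, and each pass of the inner loop stands for one synchronous parallel step over all cells.

```python
def wyllie_prefix(val, pred, op):
    """Initial prefixes of a linked list by recursive doubling.

    val[i]  : value stored in memory cell i
    pred[i] : index of the predecessor of cell i, or None for the first cell
    Returns (prefixes, number_of_rounds).
    """
    val, pred = list(val), list(pred)
    rounds = 0
    while any(q is not None for q in pred):
        new_val, new_pred = list(val), list(pred)
        for i in range(len(val)):      # conceptually one parallel step
            q = pred[i]
            if q is not None:
                # val[q] covers the elements just before those in val[i]
                new_val[i] = op(val[q], val[i])
                new_pred[i] = pred[q]  # jump twice as far back next round
        val, pred = new_val, new_pred
        rounds += 1
    return val, rounds
```

Every cell stays active in every round, so the total work is n per round over about log n rounds; this is why the method cannot reach linear speedup.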


An efficient parallel solution to this problem has the fol- 
lowing general form. 


General Parallel Prefix Algorithm
if the list contains more than one element then
begin
    pick a set S of nonadjacent elements in the list;
    for each a ∈ S do
    begin
        {replace a pair of adjacent elements by one element}
        val(succ(a)) := val(a) ∘ val(succ(a));
        succ(pred(a)) := succ(a);
        pred(succ(a)) := pred(a)
    end;
    apply algorithm recursively to the list;
    for each a ∈ S do
    begin
        {expand element back into a pair}
        val(a) := val(pred(a)) ∘ val(a);
        succ(pred(a)) := a;
        pred(succ(a)) := a
    end
end
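
The compaction and expansion phases can be sketched in executable form. The version below is sequential and makes simplifying assumptions that are not part of the algorithm above: the set S is chosen by walking the list and taking every second element (so each chosen element has a successor), the first element has no predecessor (`None`), and the name `list_prefix` is hypothetical. The block-based selection of S is not reproduced.

```python
def list_prefix(val, succ, pred, head, op):
    """Overwrite val[i] with the product of all elements up to and including i."""
    if succ[head] is None:
        return                          # a one-element list is already solved
    # Pick S: every second element starting at the head (nonadjacent,
    # and each chosen element has a successor to merge into).
    S = []
    a = head
    while a is not None and succ[a] is not None:
        S.append(a)
        a = succ[succ[a]]
    for a in S:                         # replace a pair by one element
        s, q = succ[a], pred[a]
        val[s] = op(val[a], val[s])
        if q is not None:
            succ[q] = s
        pred[s] = q
    list_prefix(val, succ, pred, succ[head], op)   # head is always in S
    for a in S:                         # expand the element back into a pair
        s, q = succ[a], pred[a]
        if q is not None:
            val[a] = op(val[q], val[a])
            succ[q] = a
        pred[s] = a
```

After the call the `succ` and `pred` arrays are restored to their original contents, and the operation need only be associative, not commutative.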


The first part of the algorithm solves the product prob- 
lem, successively compacting the list until only one item is 
left; the second part, where the recursion unfolds, expands the 
list back, and computes the missing partial products. Any 
parallel algorithm that solves the product problem by succes- 
sive compaction can be used to create a parallel prefix algo- 
rithm, that expands the list back by matching step by step 
the compression operations done by the parallel product algo- 
rithm. The resulting algorithm will have twice the running 
time of the original algorithm. We shall henceforth consider 
only the product part. 


It is not obvious how to pick at each iteration a set S of
O(p) nonadjacent elements, while devoting only constant time
per element. Assume w.l.o.g. that both n and p are powers
of 2 (this affects the result only by a constant factor), and
that p < n/2. The product algorithm consists of O(log n)
iterations. At each iteration the elements are partitioned into
n/p blocks, each block containing p elements. The processors
visit the blocks one by one; the ith processor always
visits the ith element within a block. It pairs this element
with its successor, only if the element is not marked as
deleted, and its successor belongs to another block. It takes
O(n/p) time to process all n/p blocks.


After visiting the blocks, every element has formed the
product with a neighbor, provided it and its neighbor were in
different blocks; all pairings remaining to be done at this
phase are internal to a block. Since there are n/p blocks
and p processors, $p^2/n$ processors can be assigned to each
block.


If $p^2/n > 1$, then we recursively apply the same routine
within each block. If $p^2/n \le 1$, assign each processor $n/p^2$
blocks, and pair elements sequentially within each block. Note
that p and $p^2/n$ are powers of 2.

After this, if an element has not been paired with one of
its neighbors, then both of its neighbors must be paired
(except for the first and last elements of the list), so at most
2n/3 + O(1) elements are left.


Let $H_p(n)$ be the number of steps required by one iteration
on a list of length n.

$$H_p(n) = O(n/p) + H_{p^2/n}(p) \quad \text{if } p^2/n > 1,$$

$$H_p(n) = O(n/p) \quad \text{otherwise.}$$

It is easy to check that

$$H_p(n) = O\left(\frac{n}{p} \cdot \frac{\log n}{\log(n/p)}\right).$$


After each iteration, the remaining elements are packed
to the top of the array, being careful to maintain the integrity of
the pointers (which can be done by first determining the location
of each element, then updating pointers to point to the
new locations, and finally moving the elements). Once there
are fewer than 2p elements, the O(n log n / p) = O(log p)
parallel algorithm is used to complete the algorithm.


Let $T_p(n)$ be the number of steps it takes to solve a
problem of size n, i.e. the execution time of the procedure
Product. If n < 2p, $T_p(n) = O(\log n)$ using the standard
parallel algorithm. Otherwise,

$$T_p(n) = O\left(H_p(n) + \frac{n}{p} + \log p\right) + T_p\left(\frac{2n}{3}\right).$$

It is not hard to show that

$$T_p(n) = O\left(\frac{n}{p} \cdot \frac{\log n}{\log(2n/p)}\right).$$

We obtain

Theorem 2.1: The parallel prefix problem for items
represented by a linked list can be solved on an EREW
machine in $O\left(\frac{n}{p} \cdot \frac{\log n}{\log(n/p)}\right)$ time for p < n.

Note: This yields, for any constant 0 < ε < 1, a linear
speedup when $p < n^{1-\epsilon}$ (or equivalently when the time
$T_p(n) > n^{\epsilon}$).

This algorithm can be used to solve the linked list packing
problem: Given a list of n items, a subset of which are
active, pack the active elements to contiguous locations of
memory (starting at some specified location). We can use the
parallel prefix algorithm to rank the active elements, and then
move each one to the location indicated by its rank plus the
starting location. As a by-product, the elements are packed
in order so that the ith element is contiguous to the (i-1)st
and (i+1)st elements.


A slight variation of the parallel prefix algorithm can be
used to assign to each node on the list the value
$a_0 \circ a_1 \circ \cdots \circ a_{n-1}$. The second (list expansion) phase is modified
so as to broadcast the product computed in the first phase.
We call this the product broadcast algorithm. For example,
we can use this algorithm to label each node in a list with the
name of the first node in the list, or with the name of the
least node in the list. The product broadcast algorithm can
also be run on a linked circular list, provided that $\circ$ is a
commutative operation.


If the parallel prefix algorithm is applied to a union of
disjoint lists, then it will compute initial prefixes in parallel in
each list. Similarly, if the product broadcast algorithm is
applied to a union of disjoint lists (circular lists), then it will
assign to each node the product of the elements in the list it
belongs to.


3. GRAPH ADJACENCY LIST CONSTRUCTION


The adjacency lists representation of a (directed) graph
is particularly useful for many parallel graph algorithms.
Unfortunately, this is not always the input format of a graph.
The graph adjacency list construction problem has as input
the list of the edges of the graph (presented as pairs of nodes).
The edges are stored in an arbitrary order in an array. One
constructs an adjacency lists representation of the graph, that
is, an array L[1..n] of linked lists, where L[i] is the list of
nodes adjacent to node i.


Constructing the adjacency lists for a graph with n
nodes and e edges requires only O(e+n) serial time and
O(e+n) space. The array L[1..n] is initialized to nil in
time O(n). The adjacency lists are then constructed in time
O(e) by inserting edges successively in the adjacency lists.
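
The serial construction is short enough to write out. In this sketch Python lists stand in for the linked lists, nodes are numbered 1..n so that index 0 is unused, and the function name `build_adjacency_lists` is an assumption for illustration.

```python
def build_adjacency_lists(n, edges):
    """Return L with L[i] = list of nodes adjacent to node i (nodes 1..n)."""
    L = [[] for _ in range(n + 1)]   # O(n): initialize every list to empty
    for u, v in edges:               # O(e): insert the edges one by one
        L[u].append(v)
    return L
```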


The parallel version using p processors is not so simple.
When the graph is dense, i.e. $e = \Theta(n^2)$, a matrix can be used
to avoid interference. When the graph is sparse one must be
careful to ensure that multiple processors do not try to add
elements to the same adjacency list at the same time.


Assume w.l.o.g. that p and e are powers of 2, and that
p < n/2. We create an adjacency list representation, where
each entry in the array L[i] has a pointer to the head and a
pointer to the tail of the ith list. Each entry in the edge list
will point to the corresponding entry in an adjacency list.
Each adjacency list is doubly linked.


If $p^2 \le e$ then the edges are split into p blocks each of
size e/p. Each processor is assigned a block and it sequentially
creates adjacency lists for the edges in its block. This
has to be done with some care: since e/p might be much less
than n, one cannot afford to pay the O(n) overhead of
initializing the array entries to nil. Instead we initialize
correctly only those entries that are going to be used. There
are two passes over the data. In the first pass no insertions
are made; for each edge, (the header of) the list the edge is to
be inserted into is initialized to nil. In the second pass the
insertions are actually made. We create a structure where
nonempty adjacency lists have valid format, whereas empty
adjacency lists may contain garbage. This phase requires
O(e/p) time and O(pn + e) space.


Next all the partial adjacency lists are linked up: We 
start with an empty structure and serially link to it the adja- 
cency lists of the p blocks, one at a time. All p processors 
always process the same block at the same time. As above 
this is done in two passes. In the first pass each processor 
picks an edge, and if this edge is the first in its list then the 
processor initializes the corresponding array entry in the new 
structure to nil. In the second pass each processor picks an 
edge, and if this edge is the first in its list then the processor 
links that list at the head of the corresponding adjacency list 
in the new structure. Since each block structure contains one 
entry per node there are no conflicts.


The time to process a block in the second phase is
$O(e/p^2)$. There are p blocks, so the total time for the
second phase is O(e/p). The total time for the algorithm is
O(e/p) and the space is O(pn + e).

If $p^2 > e$ then the edges are split into e/p blocks of size
p. The algorithm is recursively applied to each block in
parallel, with $p^2/e$ processors per block. Note that $p^2/e$ and
p are powers of 2. Next the e/p partial adjacency lists are
linked up, with each block being processed one at a time by
the p processors. The same two-pass method is used to avoid
initialization overheads. This part of the algorithm takes
O(e/p) time and O(pn + e) space.


Let $T_p(e,n)$ be the running time of this algorithm, and
$S_p(e,n)$ be the space used by the algorithm. Then if
$p^2 \le e$,

$$T_p(e,n) = O(e/p).$$

If $p^2 > e$, we get

$$T_p(e,n) = T_{p^2/e}(p,n) + O(e/p).$$

This recursion solves to

$$T_p(e,n) = O\left(\frac{e}{p} \cdot \frac{\log e}{\log(e/p)}\right).$$

It is easy to modify the above algorithm to set the empty list
headers to nil at an additional expense of O((n+e)/p) time.


There are at most p graph adjacency list construction
tasks running in parallel, with a total number of edges e.
Thus, the space used is O(pn + e).


Theorem 3.1: The edge list to adjacency list conversion
problem can be solved on an EREW machine with p < e/2
processors in $O\left(\frac{e}{p} \cdot \frac{\log e}{\log(e/p)} + \frac{n}{p}\right)$ time and O(pn + e)
space.

Note: This yields a linear speedup when $p < e^{1-\epsilon}$ for any
constant 1 > ε > 0; the space requirement is no more than for
an adjacency matrix.


Corollary 3.2: A list of e numbers in the range $1..n^k$ can
be sorted with p < e/2 processors in time
$O\left(k\left(\frac{e}{p} \cdot \frac{\log e}{\log(e/p)} + \frac{n}{p}\right)\right)$ and space O(np + e).

Proof: Assume k = 1. For each key value i in the range
1..n we create a linked list containing the items with key
value i. This is done in the same manner an adjacency list
was created in the previous algorithm. It requires time
$O\left(\frac{e}{p} \cdot \frac{\log e}{\log(e/p)} + \frac{n}{p}\right)$.
We then pack the list of nonempty buckets, in time
O(n/p + log p). A sorted list can now be created in time
O(n/p).


The result for general k is obtained by running k phases
of a radix sort algorithm, with radix n. □
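
The k-phase structure of the proof can be seen in a sequential sketch. It makes assumptions beyond the text: keys are taken from 0..n^k - 1 rather than 1..n^k, Python lists play the role of the bucket lists, and the name `radix_sort` is hypothetical. Each phase distributes the items into n buckets by one base-n digit and concatenates the buckets in order, which is stable.

```python
def radix_sort(keys, n, k):
    """Sort integers in range(n**k) with k stable bucket phases of radix n."""
    items = list(keys)
    for phase in range(k):                 # least significant digit first
        buckets = [[] for _ in range(n)]   # one list per key value 0..n-1
        div = n ** phase
        for x in items:
            buckets[(x // div) % n].append(x)
        # concatenating the buckets in order yields the list for the next phase
        items = [x for bucket in buckets for x in bucket]
    return items
```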
Corollary 3.3: A list of m numbers in the range 1,...,R can
be sorted with p < m processors in time
$O\left(\frac{m}{p} \cdot \frac{\log R}{\log(m/p)}\right)$.

Proof: We use the previous corollary, with $R = n^k$. The
time is minimized for a suitable choice of the radix n, whereby the
sort takes time $O\left(\frac{m}{p} \cdot \frac{\log R}{\log(m/p)}\right)$.

The sorting time is optimal when the range R is at most
polynomially larger than the number of processors p.


Corollary 3.4: Given an adjacency lists representation of a
graph with e edges and n nodes, the adjacency lists
representation of the reversed graph (i.e. the graph where the
direction of each edge is reversed) can be computed in time
$O\left(\frac{e}{p} \cdot \frac{\log e}{\log(e/p)} + \frac{n}{p}\right)$.

These results can also be used to compute the composition
of mappings. Let f be a function with domain 1...m
and range 1...n, and let g be a function with domain 1...n.
Assume that f is given as a (not necessarily sorted) list of
pairs <i, f(i)>, and g is represented by an array of values,
with the ith entry storing g(i). We compute a representation
of g∘f as a list of pairs <i, g(f(i))> using the following
steps.

(1) The pairs <i, f(i)> are stored in bucket lists,
indexed by the value of the second coordinate f(i).

(2) The first element of each list is marked.

(3) Each marked element <i, j> (j = f(i)) performs a
look-up for the value of g(j), and stores this value.

(4) The value of g(j) is broadcast to all elements in the
bucket, using the product broadcast algorithm.


Empty buckets are never accessed, so that it is not
necessary to initialize them. We obtain

Theorem 3.5: The composition problem can be solved with
p processors on an EREW machine in time
$O\left(\frac{m}{p} \cdot \frac{\log m}{\log(m/p)}\right)$
and space O(np + m).
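
The four steps can be followed in a sequential sketch. It is illustrative only: a dictionary replaces the bucket lists (so, as in the text, empty buckets are never touched), the broadcast in step (4) is a plain loop rather than the product broadcast algorithm, and the name `compose` and the 1-indexed array convention for g are assumptions.

```python
def compose(f_pairs, g):
    """f_pairs: list of (i, f(i)); g: array with g[j] stored at index j.

    Returns a list of pairs (i, g(f(i))).
    """
    buckets = {}
    for i, j in f_pairs:                 # step (1): bucket by f(i)
        buckets.setdefault(j, []).append(i)
    result = []
    for j, members in buckets.items():
        gj = g[j]                        # steps (2)-(3): one look-up per bucket
        for i in members:                # step (4): broadcast g(j) in the bucket
            result.append((i, gj))
    return result
```

The point of the bucketing is that each entry of g is read by only one element, which avoids concurrent reads of the same location.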


If the function g is also represented by an unsorted list
of pairs, then the composition can be computed in time
$O\left(\frac{m+n}{p} \cdot \frac{\log(m+n)}{\log((m+n)/p)} + \frac{n}{p}\right)$,
using the composition algorithm given in [11].


4. TREE PROBLEMS 


4.1. Tree Recursions 


The parallel prefix algorithm can be extended to “paral- 
lel prefix” computations on a tree. This has been indepen- 
dently noted for mesh connected computers [1,13]. 


Let T be a tree represented by its adjacency list (each
child has an edge directed to its parent). For each node u of
T let val(u) be a value stored at that node. Let

$$F(u) = \sum_{v \text{ descendant of } u} \mathrm{val}(v)$$

(where a node is taken to be a descendant of itself). The addition
is an arbitrary associative and commutative operation.
The function F can also be defined by the recursion

$$F(u) = \sum_{v \text{ child of } u} F(v) + \mathrm{val}(u).$$

We wish to compute the value of F at each node of the tree.
The parallel prefix problem is a particular case of this general
problem, when the tree degenerates into a linked list. Other
interesting cases are


(1) Computing d(u), the number of descendants of each
node, including itself. We take val(u) = 1, and + is
usual addition.


(2) Computing the path length of the tree. We take
val(u) = d(u), and + is usual addition.


(3) Computing the height of each node in the tree. We
have val(u) = 1, and + is the minimum operation.
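
To make the recursion concrete, the sketch below evaluates F sequentially from the leaves upward. It is not the parallel algorithm: it assumes the tree is given as an array of parent pointers (with `None` at the root), and the name `tree_recursion` is hypothetical. The operation passed in must be associative and commutative, as required above.

```python
def tree_recursion(parent, val, op):
    """Return F with F[u] = val[u] combined with F[v] over all children v of u."""
    n = len(parent)
    pending = [0] * n                  # children not yet folded into each node
    for p in parent:
        if p is not None:
            pending[p] += 1
    F = list(val)
    ready = [u for u in range(n) if pending[u] == 0]   # start from the leaves
    while ready:
        u = ready.pop()                # F[u] is final at this point
        p = parent[u]
        if p is not None:
            F[p] = op(F[p], F[u])
            pending[p] -= 1
            if pending[p] == 0:
                ready.append(p)
    return F
```

With val(u) = 1 and ordinary addition this returns d(u), the number of descendants of each node including itself, which is case (1).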


An efficient parallel algorithm for solving tree recursions
has the following form.

General Parallel Tree Recursion Algorithm
if Tree contains more than one element then
begin
    pick a set S of nodes such that
        (1) no two nodes in S are adjacent or have
            the same parent
        (2) each node of S has at most one child;
    for each node u ∈ S do
    begin
        add the value of u to the value of parent(u);
        delete u;
        if u has a child then
            link the child to parent(u)
    end;
    execute algorithm recursively;
    for each node u ∈ S do
    begin
        link u to parent(u);
        if u has a child then
        begin
            add the value of child(u) to u;
            link child(u) to u
        end
    end
end


It is easy to see that this algorithm solves the tree recur- 
sion problem correctly. One has to show that a set S con- 
taining a fixed fraction of the nodes can be picked at each 
iteration, while spending only constant time per node. 


Assume that the tree T has degree bounded by two.
We shall divide each iteration in the execution of the algo- 
rithm in two stages, one where left children are active, the 
second where right children are active. This guarantees that 
children of the same parent do not conflict. Conflicts between 
a child and its parent are handled exactly as in the parallel 
prefix algorithm, by dividing the edges into blocks that are 
handled one at a time. 


At least half of the nodes of T have no more than one
child. At least 1/3 of the nodes with no more than one child
are paired at each iteration. We thus obtain that the recursion
can be solved with p processors for a binary tree with n
nodes in time $O\left(\frac{n}{p} \cdot \frac{\log n}{\log(n/p)}\right)$ in the EREW model.

Let T now be a general tree. The tree T can be
mapped onto a binary tree $T^b$, as described in [6, § 2.3.2].
The mapping is readily computed from the adjacency list
representation of T, with edges going from parent to child:
v := left(u) if v is the first child in the adjacency list of
u; v := right(u) if v follows u in the adjacency list of their
common parent.


Let $F^b$ be the function computed on the binary tree $T^b$
by solving the tree recursion problem on this tree. Then
$F^b(u)$ is the sum of val(v), taken over all descendants of u
and descendants of elder siblings of u. We have

$$F(u) = \begin{cases} F^b(u) & \text{if } \mathrm{right}(u) = \mathrm{nil} \\ F^b(\mathrm{left}(u)) + \mathrm{val}(u) & \text{otherwise} \end{cases}$$

Thus, F can be easily computed from $F^b$. We obtain

Theorem 4.1: Given an adjacency list representation of a
valued tree with n nodes, the tree recursion problem for the
tree can be solved on an EREW machine with p processors in
time $O\left(\frac{n}{p} \cdot \frac{\log n}{\log(n/p)}\right)$ for p < n.


4.2. Tree Traversals 


Given a tree with n nodes represented by an adjacency
list (with edges going from parent to child) a linear linked list
of the nodes arranged in preorder can be produced by p processors
in time O(n/p). A similar construction is given by
Wyllie [15] for binary trees.


Let $u_L$ and $u_R$ be two copies of the node u in the tree.
Define the mapping next as follows.

$$\mathrm{next}(u_L) = \begin{cases} v_L & \text{if } v \text{ is first child of } u \\ u_R & \text{if } u \text{ has no children} \end{cases}$$

$$\mathrm{next}(u_R) = \begin{cases} v_L & \text{if } v \text{ follows } u \text{ in the adjacency list of } \mathrm{parent}(u) \\ \mathrm{parent}(u)_R & \text{if } u \text{ is last child of } \mathrm{parent}(u) \end{cases}$$

This mapping corresponds to a traversal of the tree in the
order Root-Children-Root ($u_L$ represents the first traversal of
u, and $u_R$ represents the second traversal).
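
The next mapping can be built directly from the adjacency lists. In the sketch below the two copies of a node u are represented as the pairs `(u, 'L')` and `(u, 'R')`; this encoding, the dictionary input format (every node, including each leaf, appears as a key) and the function names are assumptions for illustration. Following next from the left copy of the root and keeping only the left copies gives the preorder list, here by a sequential walk rather than by the parallel product algorithm.

```python
def euler_next(children):
    """Build the next mapping from {node: [children in adjacency-list order]}."""
    nxt = {}
    for u, kids in children.items():
        # next(u_L): first child's left copy, or u_R if u has no children
        nxt[(u, 'L')] = (kids[0], 'L') if kids else (u, 'R')
        for idx, v in enumerate(kids):
            if idx + 1 < len(kids):
                nxt[(v, 'R')] = (kids[idx + 1], 'L')   # following sibling
            else:
                nxt[(v, 'R')] = (u, 'R')               # last child: parent's R
    return nxt


def preorder(children, root):
    """Walk the next list from root_L and keep the left copies."""
    nxt = euler_next(children)
    order, node = [], (root, 'L')
    while node is not None:
        if node[1] == 'L':
            order.append(node[0])
        node = nxt.get(node)           # next(root_R) is undefined: stop there
    return order
```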


Each node of the tree (with the exception of the root) 
occurs exactly once in an adjacency list. The linked list 
defined by the relation next can be created by p processors 
in the time it takes to traverse a linked list in parallel: Each 
processor is assigned n/p entries from the adjacency list; the 
processor creates those nodes of the list corresponding to the 
nodes of the tree it was assigned. The linked list traversal is 
required for each last child to determine its parent. 


A preorder list of the nodes can be obtained in time
$O\left(\frac{n}{p} \cdot \frac{\log n}{\log(n/p)}\right)$ from this list by deleting all the
$u_R$ nodes, and shrinking the list accordingly. This is done by
running the parallel product algorithm.


Applying parallel prefix to this list will provide the
preorder number of each node with linear speedup for
$p < n^{1-\epsilon}$.


If we apply this algorithm to a forest then we shall 
obtain a preorder list for each tree in the forest, and compute 
the preorder number of each node in the tree it belongs to. 
Postorder on trees or forests and inorder on binary trees can 
be handled in a similar manner. Note too that the adjacency 
list of a tree can be recreated from the list defined by the 
next relation in time O(n /p ). 
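The preorder numbering step can be sketched as a prefix sum over the traversal list. The sketch below is sequential: `itertools.accumulate` stands in for the parallel prefix computation, and the input format (a list of `(node, side)` pairs in next order) is an assumption of this example.

```python
from itertools import accumulate


def preorder_numbers(tour):
    """Map each node to its preorder number, given the Root-Children-Root list."""
    # Weight 1 for a first visit (u_L), 0 for a second visit (u_R).
    weights = [1 if side == "L" else 0 for _, side in tour]
    prefix = accumulate(weights)
    return {node: rank for (node, side), rank in zip(tour, prefix) if side == "L"}
```

Giving weight 1 to the second visits instead yields postorder numbers in the same way.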


4.3. Unicycular Graphs 


In this section we show how a “unicycular” graph can be
converted to a forest with the same connected components.
Although algorithms on unicycular graphs are not by themselves
particularly important, this routine is useful in later
sections. A unicycular graph is a tree with one cycle; it consists
of a set of n nodes, each with a parent pointer to
another node in the set. Since each node in a unicycular
graph has a unique parent, one can define on such a graph the
mapping next used in the last section.


Lemma 4.2: Let G be a unicycular graph, and let next be
the mapping defined above. Let $G_{next}$ be the graph defined
by this mapping: The nodes of $G_{next}$ consist of two copies of
each node of G, and uv is an edge of $G_{next}$ iff v = next(u).
Then

(1) The graph $G_{next}$ consists of exactly two cycles.

(2) The nodes $u_L$ and $u_R$ occur in distinct cycles of $G_{next}$
iff u is on the cycle of G.

Proof: Each node has exactly one successor and one predecessor
in $G_{next}$, so that this graph is a union of cycles. The
node u is on the cycle of G iff a traversal starting at u
reaches u again before it backtracks. Hence u is on the cycle
of G iff the path in $G_{next}$ starting from $u_L$ returns to $u_L$
before it reaches $u_R$. The graph G can be transformed into a
tree by deleting one edge from its cycle. This causes the deletion
of one edge, and the change of one edge in $G_{next}$, thereby
obtaining a linear graph. It follows that $G_{next}$ contains
exactly two cycles. □


Let G be a graph that is the union of disjoint unicycular
components. The graph G' is a forest decomposition of G if
G' is obtained from G by deleting one edge from each cycle,
thereby replacing each unicycular component with a tree.


Theorem 4.3: Let G be a graph with n nodes that is the
disjoint union of unicycular components, represented by its
adjacency list. The adjacency list of a forest decomposition of
G can be computed with p processors in time
$O\left(\frac{n}{p}\cdot\frac{\log n}{\log(n/p)}\right)$ in the EREW model.


Proof: The following algorithm will do.


(1) Construct the linked structure defined by the next
relation.


(2) Assign a distinct label to each circular list (e.g. the
least name, in lexicographic order, of an item in the
list); label each item with the label of the list it
belongs to.


(3) Mark those items $u_L$ such that $u_R$ occurs in a list
distinct from the list containing $u_L$.


(4) Mark on each list the least item $u_L$ that was marked
in the previous phase.


(5) Delete from the original adjacency list the edge leading
to u, for each item $u_L$ that was marked at the previous
stage.


Each of these phases can be implemented to run within the
required time. □
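The five phases above can be sketched sequentially as follows. The input format is an assumption of this example: `parent[u]` is the parent pointer of node u, every node has one, and the adjacency list of a node is taken to be its children in increasing order. Items $u_L$ and $u_R$ are encoded as the integers 2u and 2u + 1, and a deleted edge is reported by setting the parent pointer to `None`.

```python
def forest_decomposition(parent):
    """Delete one cycle edge from every unicycular component."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for u, p in enumerate(parent):
        children[p].append(u)

    # (1) next relation on items 2u (u_L) and 2u + 1 (u_R).
    nxt = [0] * (2 * n)
    for p, kids in enumerate(children):
        nxt[2 * p] = 2 * kids[0] if kids else 2 * p + 1
        for i, u in enumerate(kids):
            is_last = i + 1 == len(kids)
            nxt[2 * u + 1] = 2 * p + 1 if is_last else 2 * kids[i + 1]

    # (2) label every circular list with its least item.
    label = [None] * (2 * n)
    for start in range(2 * n):
        item = start
        while label[item] is None:
            label[item] = start
            item = nxt[item]

    # (3) mark u_L when u_R lies on a different list (u is on a cycle).
    marked = [u for u in range(n) if label[2 * u] != label[2 * u + 1]]

    # (4) keep the least marked item of each list (marked is increasing).
    least = {}
    for u in marked:
        least.setdefault(label[2 * u], u)

    # (5) delete the edge leading to each selected node.
    new_parent = list(parent)
    for u in least.values():
        new_parent[u] = None
    return new_parent
```

With `parent = [1, 2, 0, 0, 3, 6, 5]` (a 3-cycle on nodes 0, 1, 2 with a tail 4 → 3 → 0, and a 2-cycle on nodes 5, 6), the edges leading to nodes 0 and 5 are removed, leaving two trees rooted at those nodes.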


5. SPANNING FOREST AND CONNECTED COMPONENTS


Connected components can be derived from a spanning
forest by creating a preorder list for each tree, and marking
each node with the label of the first node in its list. This can
be done with p < n/2 processors in the EREW model in time
$O\left(\frac{n}{p}\cdot\frac{\log n}{\log(n/p)}\right)$.

Our algorithms for finding a spanning forest are in the
spirit of most of the previous algorithms for this problem:
(super) nodes are continually combined into super nodes.
When no more combining is possible, each super node will
represent a connected component, and a spanning tree will
also exist for each component.


Eckstein [5] has shown that a spanning forest can be
found in O(e/p + n) time using either depth-first or
breadth-first search. This is optimal when the number of
processors is small relative to the denseness of the graph, i.e.
p = O(e/n).

Let G be an (undirected) graph with e edges and n
nodes. Our algorithms use two auxiliary graphs: the graph
$G_f$ is a forest that is a subgraph of G; the graph $G_s$ consists
of the supernodes of G. Each connected component (tree) of
$G_f$ is represented by one supernode in $G_s$. Two supernodes
are connected if the corresponding trees are connected by an
edge in G. Initially $G_f$ contains all the nodes of G and no
edges; $G_s$ is initially identical to G. At each iteration supernodes
that are connected in $G_s$ are combined into a new
supernode; edges are added to $G_f$ to merge the corresponding
trees into a new tree. The algorithm terminates when $G_s$
does not contain any more edges. At that point $G_f$ is a
spanning forest for G.


The graph $G_f$ is represented by marking in the adjacency
list of G those edges that belong to $G_f$. Separate
adjacency lists represent $G_s$. In order to achieve the desired
running time, the number of accesses to each edge entry in
this list must be bounded by a constant. Therefore, it is not
possible to update the edge entries at each iteration. Instead,
each edge in an adjacency list of a supernode of $G_s$ is
represented by an entry with the name of the incident node in
G. Some of the edges in this adjacency list may be self loops,
i.e. entries in the list of a supernode with the name of a node
belonging to this supernode.


We keep a membership list for each supernode, i.e. a list 
of the nodes belonging to each supernode. Each supernode 
has a weight count, which is the number of nodes belonging to 
it. We also keep an inverse directory, i.e. an array that indi- 
cates for each node the name of the supernode it belongs to. 


At the end of each iteration we pack the adjacency list
of $G_s$ so that nodes with nonempty lists of incident edges are
in the first locations. As we do not want to modify the edge
entries, the packing is actually done on a separate array of
pointers to the supernodes.


The initial adjacency list for $G_s$ can be created in
O((n+e)/p) steps; it can be packed in time
O(n/p + log p). The membership list and inverse directory
for supernodes can be created in O(n/p) time.


We terminate the “graph compaction” iterations when
the number of nonisolated supernodes is $O(\max(n/p,\ p^{1+\epsilon}))$.
We can then use Eckstein’s algorithm to terminate in time
$O(n/p + p^{1+\epsilon} + e/p)$.


Assume that the number of supernodes with nonempty
adjacency lists is at least $p^{1+\epsilon}$. At each iteration we serially
perform the following steps.


(a) Pick the first edge from each nonempty adjacency list of
$G_s$. Delete these edges from the graph $G_s$.


(b) Delete from this set of edges those edges that are self 
loops. 


(c) Create an adjacency list for the graph consisting of the 
remaining edges, and their endpoints. This graph con- 
sists of a union of unicycular components. 


(d) Create a generating forest F for the graph, by deleting 
edges from the new adjacency list. Each tree in this 
forest will be replaced by one new supernode. 


(e) Mark the edges of F as belonging to $G_f$.


(f) Select in each tree of F the supernode of largest weight
(this is the new supernode), and label all the supernodes
in the tree with the name of the selected supernode.
Mark all supernodes with the exception of the new
supernodes as inactive.


(g) Update the membership list and inverse directory for
supernodes.


(h) Link the adjacency lists of supernodes marked inactive
in (f) to the adjacency list of the new supernode they
belong to.


(i) Pack the list of supernodes with nonempty adjacency
lists.
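Steps (a) through (i) can be sketched sequentially as below. This is an illustration under stated assumptions rather than the EREW implementation: the iteration runs until no edges remain instead of switching to Eckstein's algorithm, a small union-find structure stands in for the unicycular-graph routine of Theorem 4.3, and the names `owner`, `members`, and `adj` are hypothetical names for the inverse directory, the membership lists, and the adjacency lists of $G_s$.

```python
def spanning_forest(n, edges):
    """Return spanning forest edges of an undirected graph on nodes 0..n-1."""
    owner = list(range(n))              # inverse directory: node -> supernode
    members = [[u] for u in range(n)]   # membership list of each supernode
    adj = [[] for _ in range(n)]        # G_s lists; entries keep original names
    for u, v in edges:
        adj[u].append((u, v))
        adj[v].append((v, u))
    forest = []                         # edges marked as belonging to G_f
    active = [s for s in range(n) if adj[s]]
    while active:
        # (a) pick and delete one edge per active supernode; (b) drop self loops.
        picked = []
        for s in active:
            u, v = adj[s].pop()
            if owner[v] != s:
                picked.append((u, v))
        # (c)-(e) keep a forest F of the picked edges.
        rep = {}

        def find(s):
            while rep.get(s, s) != s:
                s = rep[s]
            return s

        for u, v in picked:
            a, b = find(owner[u]), find(owner[v])
            if a != b:
                rep[b] = a
                forest.append((u, v))
        # (f) the heaviest supernode of each tree of F survives.
        trees = {}
        for s in rep:
            trees.setdefault(find(s), {find(s)}).add(s)
        for group in trees.values():
            new = max(group, key=lambda s: len(members[s]))
            for s in group - {new}:
                for x in members[s]:    # (g) update the inverse directory
                    owner[x] = new
                members[new] += members[s]
                adj[new] += adj[s]      # (h) link the adjacency lists
                members[s], adj[s] = [], []
        # (i) pack the list of supernodes with nonempty adjacency lists.
        active = [s for s in active if adj[s]]
    return forest
```

Because edge entries keep the names of the original nodes, they are never rewritten; self loops are detected only when an entry is picked, by looking up `owner`.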


It is easy to check that this iteration performs a valid 
compaction of the graph. We shall now estimate the running 
time of each step. Let a be the number of active supernodes 
(i.e. supernodes with nonempty adjacency lists) at the start of 
the iteration. 


The edges picked in (a) are <supernode,node> pairs.
We compute the <supernode,supernode> edge represented
by this pair by composing the mapping represented by the
pairs picked in (a) with the mapping represented by the
supernode inverse directory. This is done in time
O(a/p). Self loops can then be deleted
in time O(a/p).

Thus, step (b) can be performed in time O(a/p). It is
easy to see that each of the remaining steps, with the possible
exception of step (g), can also be executed within the same
time bound. As a edges are deleted from $G_s$ at this iteration,
it follows that the total time spent executing these steps
is O(e/p). We shall now estimate the time spent in updating
membership lists and inverse directories for supernodes.


Let m be the number of nodes that change supernode 
during the iteration. The number of affected supernodes is 
bounded by a. Using the membership list it is possible to 
create a list of affected nodes in time O(a/p). This can be 
used to update the inverse directory in time O(m/p). The 
membership list (and the weights associated with it) is 
updated by moving the member lists of inactivated super- 
nodes in time O(a /p) 


Let $u_i$ be the number of times node i changes supernode
membership. The last discussion shows that the total amount
of time spent while executing step (g) is

$$O\left(\frac{e}{p} + \frac{1}{p}\sum_{i=1}^{n} u_i\right).$$

Since smaller supernodes are merged into larger ones, the
size of the last supernode node i belongs to is at least $2^{u_i}$.
When the last merge is performed we have more than n/p
components. This implies the inequality

$$\sum_{i=1}^{n} \frac{1}{2^{u_i}} \ge \frac{n}{p}.$$

The maximum of $\sum_{i=1}^{n} u_i$ under the above constraint is equal
to n lg p. Thus

$$\sum_{i=1}^{n} u_i \le n \log p.$$

Hence, the total time spent performing step (g) is
$O\left(\frac{e}{p} + \frac{n \log p}{p}\right)$.


It follows that the total time taken by the algorithm is
$O\left(\frac{e}{p} + \frac{n \log p}{p} + p^{1+\epsilon}\right)$.


Theorem 5.1: A spanning forest for a graph with n nodes
and e edges can be computed by p < n/2 processors in the
EREW model in time

$$O\left(\frac{e}{p} + \frac{n \log p}{p} + p^{1+\epsilon}\right),$$

and space O(pn + e).


Corollary 5.2: The connected components of a graph with
n nodes and e edges can be computed with p < (n+e)/2
processors in the EREW model in time
$O\left(\frac{n+e}{p}\cdot\frac{\log(n+e)}{\log((n+e)/p)}\right)$ and space O(pn + e).


5.1. Biconnected Components 


Tarjan and Vishkin [14] developed a parallel algorithm 
to compute the biconnected components of a connected graph. 
Their algorithm, however, required that the graph be dense, 
the input already be in the form of an adjacency list, and 
that the model of parallel computation allow concurrent reads 
and writes. Given our algorithms to compute the spanning
forest of a graph as well as the ability to construct preorder
and postorder numberings and lists of a tree, we can use their
method but improve the result:


Corollary 5.3: The biconnected components of a graph can
be computed in $O\left(\frac{e}{p} + \frac{n \log p}{p} + p^{1+\epsilon}\right)$ time, provided
p ≤ n.


Proof: The algorithm of Tarjan and Vishkin requires the
computation of a spanning tree for the graph, the computation
of the preorder number of each node in this spanning
tree, and the solution of several tree recurrences. We can perform
each of these operations within the claimed time bound. □


5.2. Minimum Spanning Tree 


The spanning forest algorithm can be modified to give a
minimum spanning tree algorithm. A sequential algorithm for
minimum spanning tree merges supernodes iteratively. At
each iteration a supernode is chosen; the least cost edge leaving
this supernode is used to merge it with another supernode.
The order in which the supernodes are chosen is arbitrary (see
[3]).

We can merge supernodes in parallel provided that the
merging is consistent with some serial execution of the algorithm
above. Let $G_s$, the supernode graph, and $G_f$, the
spanning forest graph, be defined as in the previous algorithm.
A general parallel algorithm that finds a minimum
cost spanning tree for a connected graph G = <V,E> has
the following form.


General parallel algorithm for MST
G_f = <V, empty>; G_s = G;
while G_s has more than one node do
begin
1. pick least cost outgoing edge from each node of G_s;
   let U be the subgraph of G containing these edges
   (U is the disjoint union of unicycular graphs);
2. delete from each cycle of U an edge with largest
   cost; let F be the resulting graph (F is a forest);
3. add the edges of F to G_f;
4. combine supernodes of G_s that belong to the same
   tree of F into one supernode
end


The number of supernodes is at least halved at each
iteration, so that at most lg n iterations are performed.
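The general algorithm can be sketched sequentially as follows. The sketch makes several assumptions that are not in the text: edges are given as `(cost, u, v)` triples, ties between equal costs are broken by edge index so that the picked edges never form a cycle longer than two, and a union-find structure plays the role of the supernode bookkeeping. The function name `general_mst` is hypothetical.

```python
def general_mst(n, edges):
    """Return minimum spanning forest edges; edges are (cost, u, v) triples."""
    comp = list(range(n))               # supernode of every node (union-find)

    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x

    forest = []
    while True:
        # Step 1: least cost outgoing edge of every supernode.
        best = {}
        for idx, (cost, u, v) in enumerate(edges):
            a, b = find(u), find(v)
            if a == b:
                continue                # self loop in the supernode graph
            key = (cost, idx)           # index breaks ties consistently
            for s in (a, b):
                if s not in best or key < best[s][0]:
                    best[s] = (key, u, v)
        if not best:
            break                       # no outgoing edges remain
        # Steps 2-4: with distinct keys the only cycles of U are edges picked
        # from both ends; the set removes them, the check guards the rest.
        for key, u, v in sorted(set(best.values())):
            a, b = find(u), find(v)
            if a != b:
                comp[a] = b
                forest.append((key[0], u, v))
    return forest
```

Every supernode with an outgoing edge is merged with at least one other supernode in each pass of the outer loop, which is the halving argument used above.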
We implement this general algorithm using data structures
similar to those used for the spanning tree algorithm:
We keep an adjacency list representation of $G_s$, and
represent $G_f$ by marking edges in the adjacency list of G.
We assume that p < n, e.


(1) A least cost edge is picked in each adjacency list of
$G_s$ by running a product broadcast algorithm on the
lists of edges. This can be done in time
$O\left(\frac{e}{p}\cdot\frac{\log e}{\log(e/p)}\right)$.

(2) A trivial modification of the algorithm of Theorem 4.3
can be used to delete from each cycle of U an edge of
highest cost (rather than an edge with least index, as
done in the original algorithm). This is done in time
$O\left(\frac{n}{p}\cdot\frac{\log n}{\log(n/p)}\right)$.

(3) The new edges can be added to $G_f$ in time O(e/p).


(4) A supernode is selected from each tree of F, and each
node of the tree is marked with the label of this supernode,
by running a product broadcast algorithm in
time $O\left(\frac{n}{p}\cdot\frac{\log n}{\log(n/p)}\right)$. The edges can be now updated
to point to the new supernodes in time
$O\left(\frac{e}{p}\cdot\frac{\log e}{\log(e/p)}\right)$. This is done by computing the
composition of two mappings: the mapping that maps an
edge to the old incident nodes, and the mapping that
maps these nodes to the new supernode they belong to.
A new adjacency list representation is built from
these updated edges in time $O\left(\frac{e}{p}\cdot\frac{\log e}{\log(e/p)}\right)$. While
the new list is created self-loops can be deleted.

Each iteration takes time
$O\left(\frac{e}{p}\cdot\frac{\log e}{\log(e/p)} + \frac{n}{p}\cdot\frac{\log n}{\log(n/p)}\right)$.
Since there are at most lg n iterations we obtain


Theorem 5.4: A minimum cost spanning tree of a connected
graph can be computed with p < e, n processors on an
EREW machine in time

$$O\left(\frac{e+n}{p}\,\lg n\cdot\frac{\log(e+n)}{\log((e+n)/p)}\right).$$


Note: This algorithm is efficient relative to Kruskal’s algorithm
provided that $p = O((e+n)^{1-\epsilon})$.


A minimum cost spanning forest for a general graph can
be computed within the same time bounds: the algorithm is
modified so that supernodes with no outgoing edges are not
visited. This is done by packing the list of nodes at each
iteration.


6. CONCLUSION 


Many simple and fast serial algorithms are often hard to 
parallelize. In many cases there is enough work that can be 
performed in parallel, but the challenge is to ensure that the 
processors do not conflict when performing this work, and 
that different processors do not replicate the same computa- 
tion. The problem is harder when sparse structures are han- 
dled: If a compact data representation is used then the data 
layout is irregular, and it is hard to distribute work efficiently 
among processors. If, on the other hand, a regular data struc- 
ture is used then superfluous work is performed. 


We have presented techniques for working with sparse, 
irregular structures, and used these techniques to solve several 
important graph problems. We believe these techniques are 
generally applicable to other sparse problems. We still do not
know whether large, extremely sparse graph problems can be
solved efficiently, i.e. when p << n and e = O(n).
A PARALLEL ALGORITHM FOR DOMINATORS †


Shaunak Pawagi 
Department of Computer Science 
SUNY at Stony Brook 
Stony Brook, NY 11794 


Abstract -- We present a fast parallel algorithm for computing
the dominators of a directed acyclic graph. The
model of computation used is a parallel random access
machine which allows simultaneous reads but prohibits
simultaneous writes into the same memory location. Let
$P_t(n)$ be the processor complexity of computing the transitive
closure of an n-vertex directed acyclic graph on this
model. Our algorithm for computing the dominators
requires $O(\log^2 n)$¹ time using $O(P_t(n))$ processors. The only
known parallel algorithm for this problem [6] requires
$O(nP_t(n))$ processors. Our algorithm therefore improves the
processor complexity of this algorithm by a factor of n, but
has the same time complexity.


1. Introduction 


Computing the dominators of a directed acyclic graph
(DAG) is a very important code optimization step in compilers
(see [2] for details). Consequently, this problem has
attracted widespread attention and several excellent sequential
algorithms have been developed for it (see [1]). Of late,
growing interest in parallel computation has led to a proliferation
of parallel algorithms for graph problems on a
model of synchronous parallel computation [6,7]. Surprisingly,
despite the importance of the dominator problem, the
only known parallel algorithm for it is due to Savage [6].
This algorithm computes dominators using a parallel random
access machine which allows simultaneous reads but
prohibits simultaneous writes into the same memory location.
We refer to this model as R-PRAM. A more powerful variation
of this model, which allows simultaneous writes into the
same memory location by more than one processor, is
referred to as W-PRAM.


Let G = (V, E) be a DAG rooted at r, with |V| = n and
|E| = m. We say that vertex i is a dominator of vertex j if i
is on every path from r to j. For each vertex k, Savage first
constructs a graph $G_k$ by deleting from G all edges leaving
k. Next, the transitive closure of each of these graphs is
computed in parallel. Let $G^*$ and $G_k^*$ denote the transitive
closures of G and $G_k$ respectively. Dominators are then


† The support of the first author by the Air Force Office of Scientific
Research under Contract F-49620-85-K-0009, and of the second author by grants
from the NSF to the Machine Intelligence and Pattern Analysis Laboratory, and of
the third author by the Office of Naval Research under Contract N00014-84-K-0530,
and by the National Science Foundation under grant ECS-84-04399, is gratefully
acknowledged.


¹ Throughout this paper, we use log n to denote ⌈log₂ n⌉.
computed by using the simple observation that if i is reachable
from r in $G^*$ and not in $G_k^*$, then k is a dominator for i.
Savage’s algorithm requires $O(\log^2 n)$ time and uses $O(nP_t(n))$
processors, where $P_t(n)$ is the processor complexity of computing
the transitive closure. The two known parallel algorithms
for doing so are due to Hirschberg [4] and Chandra
[3], and require $O(n^3/\log n)$ and $O(n^{2.81}/\log n)$ processors
respectively.
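Savage's observation can be checked with a small sequential sketch. It replaces the parallel transitive closure computations by a breadth-first search for each deleted vertex k, and it assumes vertices are numbered 0..n-1 and edges are `(u, v)` pairs; the function names are hypothetical. A vertex is recorded as a dominator of itself, following the definition used in this paper.

```python
from collections import deque


def reachable(n, adj, r, blocked=None):
    """Vertices reachable from r once all edges leaving `blocked` are deleted."""
    seen = [False] * n
    seen[r] = True
    queue = deque([r])
    while queue:
        u = queue.popleft()
        if u == blocked:
            continue                    # edges leaving the blocked vertex are gone
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                queue.append(v)
    return seen


def dominators_by_deletion(n, edges, r):
    """dom[k][i] is True iff k is a dominator of i."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    in_g = reachable(n, adj, r)
    dom = [[False] * n for _ in range(n)]
    for k in range(n):
        in_gk = reachable(n, adj, r, blocked=k)
        for i in range(n):
            # k dominates i if i is reachable in G but not in G_k.
            dom[k][i] = in_g[i] and (i == k or not in_gk[i])
    return dom
```

The n separate reachability computations in the loop over k correspond to the factor of n in the processor bound of Savage's algorithm.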


In the following section we will describe our algorithm
for computing the dominators on a R-PRAM. Our approach
is radically different from both Savage’s as well as those
used in sequential algorithms. The time complexity of our
algorithm is $O(\log^2 n)$, and it uses $O(P_t(n))$ processors.
Observe here that the processor requirement of our algorithm
has decreased by a factor of n over Savage’s algorithm.


It is interesting to note that computing the transitive
closure seems to be an important step in parallel algorithms
for various properties of directed graphs, and hence their
processor complexities are determined by that of the transitive
closure algorithm. Our algorithm does achieve this
bound for processor complexity. This is similar to computing
properties of undirected graphs in parallel, where the
processor complexities are determined by the connected
component algorithm [7].


2. Preliminaries 
We begin with a review of graph-theoretic terms. 


Let G=(V,E) denote a graph where V is a finite set of
vertices and E is a set of pairs of vertices called edges. If
the edges are unordered pairs then G is undirected, else it is
directed. Throughout this paper we assume that V consists
of the set of vertices {1,2,...,n} and |E|=m. We denote the
undirected edge joining the vertices u and v by (u,v) and the
directed edge from u to v by <u,v>. An adjacency matrix
A of G is an n×n boolean matrix such that A[u,v]=1 if and
only if (u,v) ∈ E. A directed path in G joining two vertices $i_0$
and $i_k$ is defined as a sequence of vertices $(i_0, i_1, i_2, \ldots, i_k)$ such
that all of them are distinct and for each 0 ≤ p < k,
$\langle i_p, i_{p+1} \rangle$ is an edge of G. An undirected path is defined
similarly. If $i_0 = i_k$ then the path is called a cycle. A
directed acyclic graph (DAG) has no cycles. We denote a
directed path from u to v by [u→v]. We say that an
undirected graph G is connected if for every pair of vertices
u and v in V, there is a path in G joining u and v. A tree is
a connected undirected graph with no cycles in it.


A rooted directed tree has a distinguished vertex called root 
from which every other vertex is reachable via a directed 
path. We say that vertex u is an ancestor of vertex v if u 
is on the path from the root to v. The father of a vertex is 
its immediate ancestor. A descendant of a vertex is defined 
similarly. The lowest common ancestor (LCA) of vertices x 
and y in T is the vertex z such that z is a common ancestor 
of x and y, and any other common ancestor of x and y in T 
is also an ancestor of z in T. 


A directed graph is rooted at r if there is a path from r
to every vertex in V. For the rest of this paper, without loss
of generality we shall assume that G is a directed acyclic
graph rooted at r. We say that vertex i is a dominator of
vertex j if i is on every path from r to j. In particular, for
every i in V, r and i are dominators of i. Dominators exhibit
transitivity, that is, for vertices i, j and k in V, whenever i is
a dominator of j and j is a dominator of k, then i is a dominator
of k. Therefore it is easy to see that the set of dominators
of a vertex j can be linearly ordered by their order of
occurrence on a path from r to j. The dominator of j closest
to j (other than j) is called the immediate dominator of j. It
follows from the definition that the immediate dominator of
every vertex is unique. We can now express the dominator
relation as a directed tree $T_d$ rooted at r, called the dominator
tree. If u is the immediate dominator of v then <u,v>
is an edge of $T_d$. Note now that i is a dominator of j if i is
an ancestor of j in $T_d$.


The transitive closure matrix A* is a boolean matrix
such that A*[i,j] = 1 if there is a directed path from i to j in
G, else A*[i,j] = 0. For completeness, we first describe the
parallel algorithm for the transitive closure.


It has been observed in [4] that the transitive closure of
a directed graph can be computed in $O(\log^2 n)$ time on a R-PRAM
by straightforward parallelization of the known sequential
algorithm that is based on repeated multiplication of the
adjacency matrix. In this parallel algorithm (see Algorithm
2.1 below) and and or operations replace multiplication and
addition operations of an inner product step. We refer to
this as the and-or multiplication of two matrices. The algorithm
initializes the transitive closure matrix A* to the adjacency
matrix A and then performs (log n) iterations of the
and-or multiplication of A* by itself. The matrix DD is used
as temporary storage for clarity.


// All steps involving i and j are executed for all i, j,
1 ≤ i ≤ n and 1 ≤ j ≤ n //


1.  A*[i,j] := A[i,j]    // Initialize //
2.  for t := 1 to log(n-1) do
    2a.  DD[i,j] := or over k = 1..n of { A*[i,j], A*[i,k] and A*[k,j] }
    2b.  A*[i,j] := DD[i,j]


Algorithm 2.1
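A sequential rendering of Algorithm 2.1 is sketched below. It assumes the adjacency matrix is a list of lists of booleans; the name `transitive_closure` is chosen for this example, and the iteration count `(n - 1).bit_length()` is used as a convenient upper bound on ⌈log₂(n-1)⌉ (an extra squaring does not change the result).

```python
def transitive_closure(adj):
    """Repeated and-or squaring of a boolean adjacency matrix."""
    n = len(adj)
    closure = [row[:] for row in adj]               # step 1: A* := A
    for _ in range(max(1, (n - 1).bit_length())):   # step 2
        # Step 2a builds DD from the old A*; step 2b installs it.
        closure = [
            [
                closure[i][j]
                or any(closure[i][k] and closure[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return closure
```

After t iterations the matrix records every path of length at most 2^t, so a logarithmic number of squarings covers all simple paths.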
Lemma 2.1: The above algorithm computes the transitive
closure A* in $O(\log^2 n)$ time using $O(n^3/\log n)$ processors.


Proof: The proof is immediate from the steps (1) and (2) of
Algorithm 2.1. If Chandra’s [3] parallel algorithm for matrix
multiplication is used in step (2a), then the processor complexity
of this algorithm is $O(n^{2.81}/\log n)$. Throughout this
paper we refer to the processor complexity of this algorithm
as $O(P_t(n))$. □


We now describe our algorithm for computing domina- 
tors. 


3. The Algorithm 


In order to compute the dominator tree we first construct
a spanning tree for G that is rooted at r. We then
compute the set of dominators for each vertex in a matrix
DOM such that DOM[i,j] = 1 if i is a dominator of j, otherwise
DOM[i,j] = 0. The computational steps are as follows.


1. Compute the transitive closure matrix A* for G. By
Lemma 2.1, this computation requires $O(\log^2 n)$ time
and uses $O(P_t(n))$ processors.


2. Construct a directed spanning tree $T_s$ from the adjacency
matrix A and the transitive closure matrix A*.
This is done by specifying, as the father of each vertex i,
the smallest vertex j such that <j,i> is an edge of G.
This selection can be done in O(log n) time by assigning
n processors to each vertex. Since there are n vertices
in G, we need $n^2$ processors for this step.


3. For every vertex, mark all its ancestors in $T_s$ as its
dominators. That is, set DOM[i,j] = 1 if i is an ancestor
of j, else set DOM[i,j] to 0. Ancestor computation can
be done in O(log n) time using $O(n^2)$ processors (see
[7]). Initialization of the matrix DOM requires constant
time and $O(n^2)$ processors.


4. For every vertex v, consider the non-tree edges incident
on it. For all such edges <x,v>, compute the lowest
common ancestor of x and v in $T_s$. Among these LCAs,
select h(v) = LCA(u,v) to be the vertex closest to the root
r (h stands for highest). The lowest common ancestors
for all vertex pairs can be computed in O(log n) time
using $O(n^2)$ processors (see [7]). For each vertex v, h(v)
can be determined in O(log n) time using n processors.


5. Now, the edge <u,v> provides a path from h(v) to v
passing through u, other than the path present in $T_s$.
Therefore all vertices on the path [h(v)→v] are not
dominators of v. For all such vertices j, set DOM[j,v] =
0. Since the number of vertices on any path is at most
n, we need $O(n^2)$ processors to do this step in constant
time.


6. For every vertex y, if i is not a dominator of y then i is
not a dominator of any vertex reachable from y.
Therefore for every vertex j, i is not a dominator for j
if there exists at least one vertex x, such that i is not a
dominator for x and j is reachable from x. Therefore
set DOM[i,j] = 0, if

$$\mathop{\mathrm{or}}_{x=1}^{n} \{\, (\mathrm{DOM}[i,x] = 0) \ \mathrm{and}\ (A^*[x,j] = 1) \,\}$$

evaluates true. This computation is equivalent to and-or
multiplication of DOM and A*. Therefore this step
requires O(log n) time and $O(P_t(n))$ processors.


This completes the description of our algorithm for 
dominators. We now provide the proof of its correctness. 


Lemma 3.1: If vertex u is not a dominator of vertex v then
at the end of our algorithm DOM[u,v] will be set to 0.


Proof: If u is not on the path from r to v in $T_s$ then
DOM[u,v] is set to 0 in step 3 of our algorithm and it stays 0
for all following steps. If u is on the path from r to v in $T_s$
but it is not a dominator of v then there must exist a path
from r to v that does not pass through u. There are two
types of paths to be considered. First, v has a non-tree edge
<x,v> incident on it such that u is on the path from the
LCA(x,v) to v. In step 4 of our algorithm, we select h(v) to
be the closest LCA to the root r. Therefore u must be on the
path [h(v)→v], and DOM[u,v] will be set to 0. Second, v is
reachable from some vertex y such that y has a non-tree
edge incident on it, providing another path to y from r that
does not use u. In step 5 of our algorithm DOM[u,y] will be
set to 0, and since v is reachable from y, in the next step,
DOM[u,v] will be set to 0. Hence the Lemma. □


Theorem 3.1: The above algorithm computes the dominator
matrix DOM in $O(\log^2 n)$ time using $O(P_t(n))$ processors.


Proof: The correctness of our algorithm is proved in
Lemma 3.1 and the processor and time complexities are
immediate from steps 1 to 6 of our algorithm. □


Given the matrix DOM, the dominator tree can be con- 
structed by determining the immediate dominator for each 
vertex. Recall that the immediate dominator of a vertex is 
unique, and it is the closest dominator of that vertex. In the 
dominator tree, the closest dominator of a vertex is its 
father. The steps for construction of the dominator tree are 
as follows. 


1. For every vertex i, count the number of dominators
of i by summing the entries in the i-th column of
DOM. This summation requires $n^2$ processors and O(log
n) time.

2. For every vertex i, determine the immediate dominator
of i. If i has d dominators then the immediate dominator
of i is a dominator of i that has (d-1) dominators.
This can be done in constant time using $n^2$ processors.


3. The father of every vertex i in the dominator tree is its 
immediate dominator. The root of the tree is r. 


The above procedure constructs the dominator tree
from the matrix DOM in O(log n) time using $n^2$ processors.
The time and processor complexities are immediate from the
steps given above, and the correctness is direct from the following
lemma.
Lemma 3.2: Let d be the number of dominators of i. A
dominator j of i is the immediate dominator of i iff j has
(d-1) dominators.


Proof: The number of ancestors of any vertex in the dominator
tree is equal to the number of its dominators. Since the
dominator tree is unique, the immediate dominator, which is the father
of i in the dominator tree, must have (d-1) dominators. □
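The construction of the dominator tree from DOM, as justified by Lemma 3.2, can be sketched sequentially. The sketch assumes `dom[k][i]` is a boolean that is true when k is a dominator of i (with every vertex dominating itself) and that vertices are numbered from 0; the function name `dominator_tree` and the use of `None` for the father of the root are choices made for this example.

```python
def dominator_tree(dom, r):
    """Return father[i], the immediate dominator of i (None for the root r)."""
    n = len(dom)
    # Step 1: number of dominators of i = sum of the i-th column of DOM.
    count = [sum(1 for k in range(n) if dom[k][i]) for i in range(n)]
    father = [None] * n
    for i in range(n):
        if i == r:
            continue
        # Step 2: the dominator of i that has exactly count[i] - 1 dominators.
        for j in range(n):
            if j != i and dom[j][i] and count[j] == count[i] - 1:
                father[i] = j
    return father
```

For the DAG with edges 0→1, 0→2, 1→3, 2→3, 3→4 rooted at 0, the dominator sets are {0}, {0,1}, {0,2}, {0,3} and {0,3,4}, and the sketch returns the fathers `[None, 0, 0, 0, 3]`.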


4. Conclusions 


We have described an $O(\log^2 n)$ time algorithm for dominators
of a DAG that uses $O(P_t(n))$ processors, where $P_t(n)$ is
the processor complexity of computing the transitive closure
of a directed graph. This improves the processor complexity
of Savage’s [6] algorithm by a factor of n. Finally, it is
worth mentioning that the algorithm presented here would
require O(log n) time on a W-PRAM that allows simultaneous
writing in the same memory location by more than one processor.
The only step in our algorithm that requires
$O(\log^2 n)$ time is computing the transitive closure matrix.
All other steps require O(log n) time. Since the transitive
closure can be computed in O(log n) time on a concurrent
write model [5], our algorithm would therefore require O(log
n) time on a W-PRAM.
Abstract


The design of parallel computations involves numerous
decisions whose effect on execution efficiency is not immediately
clear. A Petri Net based model is presented which
has the representational power to model a wide range of 
computations while retaining the flexibility to alter con- 
figuration parameters with ease. The model can be used to 
predict optimal values for configuration parameters such as 
degree of parallelism and task granularity on the efficiency 
of the computation. 


1 Introduction 


This paper describes a model for representing and 
evaluating the performance of parallel computations. An im- 
plementation of the model is being used to analyze the be- 
havior of programs expressed using the Computation Struc- 
tures Language (CSL), a language designed and imple- 
mented at the University of Texas. 


The model can be used as a vehicle for studying the 
configuration space of a computation. A parallel computa- 
tion [1] consists of a collection of programs (called tasks) 
and the specification of dependencies between them. The 
dependencies could be simple synchronization or sequencing 
relationships, constraint relationships such as mutual exclu- 
sion for shared resources, or communication relationships 
signifying data transfer between two or more tasks. The per- 
formance of a parallel computation is therefore architecture 
dependent, since different architectures would support these 
dependencies with varying degrees of efficiency. The basis 
for a model for parallel computations is to be able to predict 
the efficiency of execution of a parallel computation on a 
given architecture, and to study the effect of varying dif- 
ferent configuration parameters of the computation. The 
configuration space of interest includes the following 
parameters: 


• shared memory / message model: the interaction
  between the tasks of a computation can be
  specified using a message model, a shared
  memory model, or a mixture of the two. This
  choice is based on the support provided by the
  host architecture and the volume and size of the
  information being exchanged.

• granularity of a unit: the efficiency of the
  computation is directly dependent on the size of each
  schedulable unit of computation. Larger units
  would usually reduce the amount of parallelism
  possible, while smaller units could entail
  additional overheads of data movement.

• computation structure: this includes the
  specification of synchronization and sequencing
  of the units composing the computation.

• underlying or host architecture: all the earlier
  parameters are implicitly dependent on the
  choice of the host architecture.


*
This research was sponsored by the Department of Energy
under grant number DE-AS05-81ER10987 and grant number
DE-FG05-85ER25010.


The model described here was initially designed to 
represent computations specified using the Computation 
Structures Language (CSL) ([2], [3]). CSL is a language that
allows specification and programming of multitype, mul- 
tiphase parallel computations. It supports dynamic structur- 
ing of computations through multiple phases, each of which 
may display different types and degrees of parallelism and 
differing requirements for sharing of data and interprocess 
communication. The model is a direct abstraction from 
CSL. Each instance of a computation expressed in CSL has 
an equivalent representation using our model. CSL programs 
are directly derivable from the Petri Net model and vice 
versa. 


The proposed model is based on Petri Nets with exten- 
sions to allow the representation and performance evalua- 
tion of parallel computations. The extensions are predicated 
upon a performance evaluation viewpoint rather than a 
theoretical view of Petri Nets. Performance statistics for a 
computation are achieved by first expressing the computa- 
tion in terms of the model, and then simulating the behavior 
of this net instance. The choice of a Petri Net based model 
allows most of the communication and synchronization 
primitives of the computation to be directly modeled. Any 
non-determinacy in the computation can also be easily cap- 
tured in the model. Extensions are provided to allow 
dynamic structuring and scalability of the modeled com- 
putation. 


The rest of this paper is organized as follows.
Section 2 gives a brief introduction to Petri Nets and
highlights some of the previous work in this area. Section 3
contains the definition and details of the proposed model, while
the implementation is briefly described in Section 4. The
utility of the model is demonstrated by means of an example
in Section 5.


2 Petri Nets 


Informally, a Petri Net [4] consists of two sets of ver- 
tices (called places and transitions) in a bipartite, directed 
graph. Arcs connect places to transitions or transitions to 
places. If an arc exists from a place to a transition, the place 
is an input place for that transition. Similarly, if an arc ex- 
ists from a transition to a place, the place is an output place 
for that transition. Places can contain tokens. One inter- 
pretation given to Petri Nets is that places represent con- 
ditions, transitions represent actions, and the presence of a 
token in a place represents the presence of that condition. 
The set of input places for a transition signifies the con- 
ditions that should be present for that event to occur. The 
occurrence of an event is represented by the firing of a 
transition. For a transition to fire, all its input places must 
contain tokens, and upon firing, the transition removes one 
token from each of its input places and puts one token each 
in its output places. The state of the net is simply the dis- 
tribution of tokens in the places of the net. Petri Nets are 
represented graphically as shown in Fig. 1 with the places 
shown as circles, the transitions as bars, and the tokens as 
dots. 
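
The firing rule described above can be sketched in a few lines of Python. This is only a minimal illustration and not part of the PCM implementation; the class name `PetriNet`, its fields, and the three-place example net are assumptions made for the example.

```python
class PetriNet:
    """Minimal place/transition net; every arc carries exactly one token."""

    def __init__(self, inputs, outputs, marking):
        self.inputs = inputs            # transition -> list of input places
        self.outputs = outputs          # transition -> list of output places
        self.marking = dict(marking)    # place -> number of tokens (the state)

    def enabled(self, t):
        # A transition may fire only if all its input places contain tokens.
        return all(self.marking.get(p, 0) >= 1 for p in self.inputs[t])

    def fire(self, t):
        if not self.enabled(t):
            raise ValueError(f"transition {t} is not enabled")
        for p in self.inputs[t]:        # remove one token from each input place
            self.marking[p] -= 1
        for p in self.outputs[t]:       # put one token in each output place
            self.marking[p] = self.marking.get(p, 0) + 1


def example_net():
    # Hypothetical net: t1 consumes from p1 and p2 and produces into p3;
    # t2 consumes from p3 and produces into p1.
    return PetriNet(
        inputs={"t1": ["p1", "p2"], "t2": ["p3"]},
        outputs={"t1": ["p3"], "t2": ["p1"]},
        marking={"p1": 1, "p2": 1, "p3": 0},
    )
```

In this sketch the state of the net is just the `marking` dictionary, which mirrors the statement that the state is the distribution of tokens over the places.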


Figure 1: A Petri Net (legend: token, place, transition)


Extensive work has been done to include timing information
in Petri Nets. The most common addition is that of
associating delays with transitions. When a transition begins
to fire, it removes its input tokens as before, but places
tokens in its output places only after a period of time equal
to its delay value. E-nets [5] had fixed delays associated
with transitions, Timed Petri Nets [6] allowed arbitrary
delays, and extensions to that model in [7] allowed
state-dependent delays. While these models are suitable for
modelling parallel systems, they lack the ability to model
data-dependent control flow with ease.


An extension was proposed by Keller [8] to help model
this control flow. His model included a set of program
variables. Each transition had a transition procedure and
predicate based on the program variables. In addition to
requiring tokens on all input places, a transition was now
considered enabled only if its predicate was true. Also, when a
transition fired, its procedure was executed, thus altering
the program variables. This model was designed for verifying
correctness properties of parallel algorithms, and did not
include the notion of time.


The model proposed in this paper attempts to combine 
these ideas, while also including additions to support fea- 
tures such as dynamic structuring and scalability of com- 
putations. 


3 PCM: A model for parallel computations 


This section contains a description of the model, along 
with some details of the simulation procedure used to do the 
performance evaluation. 


3.1 Definition 


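The Parallel Computation Model (PCM) is defined by
the three-tuple

    $PCM = (PN, V, TA)$

where

    $PN = (P, T, I, O)$, a Petri Net, where
        $P = \{p_1, p_2, \ldots, p_n\}$, a set of places, $n > 0$
        $T = \{t_1, t_2, \ldots, t_m\}$, a set of transitions, $m > 0$
        $I \subseteq P \times T$, the transition input function
        $O \subseteq T \times P$, the transition output function
        where P and T are disjoint

    $V = \{v_1, v_2, \ldots, v_k\}$, a set of program variables

    $TA$ is a three-tuple $\langle \Pi, \Phi, \mathrm{T} \rangle$, where
        $\Pi(V) = \{\pi_1(V), \pi_2(V), \ldots, \pi_m(V)\}$, a set of predicates,
        $\Phi(V) = \{\phi_1(V), \phi_2(V), \ldots, \phi_m(V)\}$, a set of procedures, and
        $\mathrm{T}(V) = \{\tau_1(V), \tau_2(V), \ldots, \tau_m(V)\}$, a set of delays.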


The basic underlying structure is the standard Petri
Net PN. In addition, the model allows a set of program
variables (V), and attributes associated with each transition.


Each transition $t_i$ in the model defined above has a
corresponding predicate $\pi_i(V)$ and a transition procedure
$\phi_i(V)$. The predicates are defined on program variables, and
are used to enable transitions under the modified firing
rules. The transition procedures also operate on elements of
V and can be used to modify them. The function $\tau_i(V)$
specifies the delay associated with transition $t_i$, and can also
be a function of the set of program variables.


3.2 Firing Rules 


The definition of the model is completed by specifying
the modified firing rules. Each place in the net is
conceptually divided into an enable and a hold region (Fig. 2).
This division helps in describing the behavior of the net. A
transition can be in one of three states: disabled, enabled,
and firing. The rules governing transition state changes are
as follows.

1. A disabled transition $t_i$ is enabled if each of its
   input places contains at least one token in its
   enable region and the predicate $\pi_i(V)$ corresponding
   to the transition is true.

2. An enabled transition $t_i$ enters the firing state
   by moving one token in each of its input places
   from the enable to the hold region. If this transition
   shares input places with other enabled transitions,
   they must satisfy rule 1 to remain enabled,
   failing which they become disabled.

3. A transition $t_i$ remains in the firing state for a
   period of time specified by its delay function
   $\tau_i(V)$. At the end of this period, it removes one
   token from the hold region of each of its input
   places, places a token in the enable region of
   each output place, and executes its associated
   procedure $\phi_i(V)$. It then returns to the disabled
   state.


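Figure 2: Execution of Net (each place is divided into an enable
region and a hold region; the two branch transitions carry the
predicates i<=5 and i>5, each with a procedure and a predicate)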


Transition firings are no longer instantaneous, as in
the case of standard Petri Nets. When a transition starts
firing, it removes its enabling tokens, which can no longer
contribute to the firing of any other transition. Tokens
appear on the transition's output places only after a certain
delay. There are two ways in which the firing of a transition
can affect other transitions. When a transition starts firing,
the removal of tokens from its input places could cause
other enabled transitions to become disabled. Also, when a
transition completes firing, it modifies the marking as well
as the state of the program variable vector, thereby causing
other transitions to be either enabled or disabled.


Transition predicates, procedures and program variables
provide the means to represent standard looping and
branching constructs with ease. Figure 2 shows the branching
construct where the enabling of the two transitions is
dependent on the current value of the program variable i;
both transitions cannot be enabled simultaneously.


The state of this system at any time is given by the
current marking along with the vector of program variables
and the vector of remaining firing times (RFT) for the transitions.
The marking expresses the 'control' state, while the program
vector specifies the 'data' state. If a transition is in the
process of firing, its remaining firing time specifies the
period of time after which firing will be complete. If the
transition is not firing, its RFT is zero.
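
The modified firing rules and the notion of state can be illustrated with a small event-driven sketch. It is not the PCM simulator described later; the names `Transition`, `PCMNet`, and `branching_demo`, the choice to start every enabled transition as soon as possible, and the two-transition loop (modelled loosely on the branch of Figure 2) are all assumptions made for the illustration.

```python
class Transition:
    def __init__(self, name, inputs, outputs, predicate, procedure, delay):
        self.name = name
        self.inputs = inputs          # input place names
        self.outputs = outputs        # output place names
        self.predicate = predicate    # pi(V): dict -> bool
        self.procedure = procedure    # phi(V): mutates the dict
        self.delay = delay            # tau(V): dict -> firing time


class PCMNet:
    def __init__(self, transitions, marking, variables):
        self.transitions = transitions
        self.enable = dict(marking)   # tokens in enable regions
        self.hold = {}                # tokens in hold regions
        self.vars = dict(variables)   # program variable vector V
        self.rft = {}                 # remaining firing time of firing transitions
        self.clock = 0

    def is_enabled(self, t):
        return (t.name not in self.rft
                and all(self.enable.get(p, 0) >= 1 for p in t.inputs)
                and t.predicate(self.vars))

    def start_firing(self, t):
        for p in t.inputs:            # move tokens from enable to hold region
            self.enable[p] -= 1
            self.hold[p] = self.hold.get(p, 0) + 1
        self.rft[t.name] = t.delay(self.vars)

    def complete_firing(self, t):
        for p in t.inputs:            # tokens leave the hold region
            self.hold[p] -= 1
        for p in t.outputs:           # tokens appear in output enable regions
            self.enable[p] = self.enable.get(p, 0) + 1
        t.procedure(self.vars)        # procedure runs at the end of firing
        del self.rft[t.name]

    def step(self):
        """Start all enabled transitions, then advance to the next completion."""
        for t in self.transitions:
            if self.is_enabled(t):    # re-checked one by one: a start may disable others
                self.start_firing(t)
        if not self.rft:
            return False              # nothing is firing: the net is dead
        dt = min(self.rft.values())
        self.clock += dt
        for name in self.rft:
            self.rft[name] -= dt
        for t in self.transitions:
            if self.rft.get(t.name) == 0:
                self.complete_firing(t)
        return True


def branching_demo():
    def bump(v):
        v["i"] += 1
    loop = Transition("loop", ["p"], ["p"], lambda v: v["i"] <= 5, bump, lambda v: 2)
    leave = Transition("leave", ["p"], ["done"], lambda v: v["i"] > 5,
                       lambda v: None, lambda v: 3)
    net = PCMNet([loop, leave], {"p": 1}, {"i": 5})
    while net.step():
        pass
    return net
```

The triple (`enable`/`hold` marking, `vars`, `rft`) plays the role of the system state described above; a transition absent from `rft` corresponds to an RFT of zero.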


3.3 Hierarchical Modelling 


PCM supports hierarchical modelling by allowing transitions
to be designated as subnets. A subnet transition is
constrained to begin and end with special places called the
start and stop places (Figure 3a). When the subnet transition
begins firing, a token is placed in its start place, and
the subnet becomes active. When a token appears in the
stop place, the subnet ceases to be active. Only transitions
in active subnets can fire.

In addition, subnets can be parameterized in two ways. A
subnet can have a set of parameters which cause the entire
subnet to be replicated upon its activation. Subnets can also
contain special places which have parameters associated
with them. When the subnet becomes active, these places,
along with their connecting arcs, are expanded to produce
simple places. In Figure 3a, P2(i,j) is a parameterized place
with j ranging from 1 to i. When this net becomes active, all
parameters are applied to the nodes of the net, resulting in
the net shown in Figure 3b. Dynamic structuring of the
graph can be achieved by using program variables as
parameter ranges.
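
As a small illustration of place expansion, the following sketch expands a parameterized place such as P2(i,j), with j ranging from 1 to i, into simple places. The function names and the string encoding of place names are hypothetical; the arcs attached to each place, which the model expands as well, are omitted here.

```python
def expand_place(name, i):
    """Expand the parameterized place name(i, j), j = 1..i, into simple places."""
    return [f"{name}({i},{j})" for j in range(1, i + 1)]


def expand_with_variables(name, variables, range_var):
    # The upper bound of the range is read from a program variable at
    # activation time, so the expanded structure can differ between activations.
    return expand_place(name, variables[range_var])
```

For example, `expand_place("P2", 3)` yields the three simple places P2(3,1), P2(3,2) and P2(3,3).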


4 Model Behavior 


The PCM simulator package has been implemented in
C on a SUN workstation. A menu-driven graphics front-end
is provided to create and modify net instances. The current
implementation supports a wide range of net attributes.
Transition delays can be specified either as instantaneous
(zero delay), fixed delay, or random based on exponential or
uniform distributions. In addition, the delays can be
expressed as functions of the program variables. The transition
procedures and predicates are specified as standard C
functions, which are precompiled. Places have types
associated with them to specify the kind of performance
values expected from them. For example, a place modelling
a queue would yield figures on throughput, mean wait and
service times, etc. The length of the simulation run is
specified by the user.


Figure 3: Subnets (panels (a) and (b))


The simulation proceeds by moving
time-stamped tokens through the net and collecting timing
information at the places. In addition, a state history of
the execution of the net is maintained, which can be used to
determine the sequence of events in the computation. The
simulator has been validated using analytical results
obtained using the Stochastic Petrinet Analyzer [9].


5 An Example 


In this section, the utility of the model is
demonstrated by means of an example. The computation,
suggested by J. Dongarra and D. Sorensen of Argonne
National Laboratories, is the solution of a lower triangular
system:

    $T x = b$

where $T$ is an $n \times n$ lower triangular matrix, and $x$ and $b$
are n-vectors. The algorithm used is to decompose the matrix
into blocks as illustrated in Figure 4. There are three types
of tasks in the computation: init, solve and matvect. The
tasks are described next:


init(i)         initializes one block row of the system.


solve(i)        triangular solver for the ith triangular
                diagonal block. It solves for $x_i$ in the
                system:

                    $T_{ii} x_i = b_i$

                This can be done only after all matvect
                tasks for row i have completed. Observe
                that solve(1) (solve for block 1) can begin
                immediately after initialization, and does
                not depend on any matvect tasks.


matvect(j, i)   executes the transformation

                    $b_j \leftarrow b_j - T_{ji} x_i$

                on the jth block in column i. This step
                can only be executed if solve(i) has been
                completed.


Figure 4: A Triangular Solver (N = No. of Blocks, s = size of each block)
Figure 5 shows the CSL program which specifies the above
computation. The CONSTRUCT statement contains
declarations of the tasks, shared variables and communication
channels involved in the computation. The declaration
of Solve, for example, specifies N tasks Solve(1),
Solve(2), ..., Solve(N). All Solve tasks are identical, and their
compiled (object) code is found in file C2. A list of shared
objects accessible to each task completes the specification of
the Solve tasks. The Matvect tasks process the
non-diagonal blocks of the matrix. Associated with each Matvect
is a boolean task condition which is set after each
execution of the task.

The program starts off N parallel streams, as denoted by
the outer COBEGIN. Each parallel stream begins by
executing a diagonal (solve) task first, and then all
non-diagonal (matvect) tasks in that column in parallel (inner
COBEGIN). The WAIT statement ensures that the execution of
each solve task begins only after all matvect tasks in its row
signal their completion by setting their task conditions.


    JOB Triangle;

    VAR N : integer;

    BEGIN
      N := 4;
      CONSTRUCT
        TASKS
          Init(i) : C1 [ T(i,j), X(i) ]
              RANGE j = 1 to i,
                    i = 1 to N;
          Solve(i) : C2 [ T(i,i), X(i) ]
              RANGE i = 1 to N;
          Matvect(i,j) : C3 [ T(i,j), X(i), X(j) ]
              CONDITION C(i,j)
              RANGE j = 1 to i-1,
                    i = 1 to N;
      END; { CONSTRUCT }

      WITH T(i,j), X(i) DO
        EXECUTE Init(i) RANGE j = 1 to i,
                              i = 1 to N;

      COBEGIN
        ( // WAIT C(i,j) RANGE j = 1 to i-1;
             WITH X(i), T(i,i) DO
               EXECUTE Solve(i);
             COBEGIN
               ( // WITH T(j,i), X(j) : X(i) DO
                    EXECUTE Matvect(j,i)
               ) RANGE j = i+1 to N;
             COEND
        ) RANGE i = 1 to N;
      COEND;
    END.


Figure 5: CSL Program for Triangular Solver


The statement:


    WITH T(j,i), X(j) : X(i) DO
      EXECUTE Matvect(j,i);


specifies that the task matvect requires exclusive access to
shared objects T(j,i) and X(j), and wishes to use X(i) in
read-only, or non-exclusive, mode. Finally, the CSL program
demonstrates the power of the RANGE statement as a
construct for parameterizing the computation.


The equivalent representation of the computation in
the PCM model, shown in Figure 6, consists of four subnets.


Figure 6: PCM solution (labels recoverable from the figure:
Solution, C(i,j), T(j,i), X(i), Solve(i), Column(i), panel (c),
j = i+1..N)


Figure 6(a) shows the top level decomposition of the
computation into the initialization phase (Initarray), and the
solution phase (Solution). Initarray (Figure 6(b)) models N
parallel streams, with each stream containing one execution
of Init. The place T(i,j) is parameterized, and is expanded
when the subnet becomes active. For example, if i is 3, T(i,j)
would be expanded to T(3,1), T(3,2) and T(3,3). The subnet
Solution, shown in Figure 6(c), models the rest of the
computation. Each solve task, solve(i), requires exclusive access
to T(i,i) and X(i), and can execute only after conditions
C(i,1), C(i,2), ..., C(i,i-1) are set. After completion of solve(i),
subnet Column is activated, which models the matvect task
executions. The transition Acquire is used to gain exclusive
access to shared objects T(j,i) and X(j), and non-exclusive
access to X(i). The non-exclusive access is modelled here by
merely copying the contents of X(i) into local memory. A
more elaborate reader-writer scheme could be modelled by
means of auxiliary program variables and appropriate
transition predicates and procedures.


To calculate the delays in the model, the following ar- 
chitectural model is assumed. The system contains several 
independent processors, each with sufficient local memory. 
In addition, all processors can access a shared memory, and 
can set locks on parts of this memory. Access times are as- 
sumed to be equal for both local and remote accesses. The 
only overhead for remote accesses is for setting or releasing 
locks. Figure 7 shows the delays assumed for some opera- 
tions. Also shown are delays associated with the transitions 
of the model, in terms of the size of each block. 


    Operation                              No. of Cycles
    multiply + add (V)                     30
    acquire/release shared memory (c)      10
    local store in memory (l)              5/element
    copy from shared memory (r)            10/element

    For block size s,

    matvect                  s^2 V + 2c
    Solve                    s^2 V / 2 + 4c
    Acquire                  rs + 4c
    Init(i) for ith row      ls(s(i + 1/2) + 1) + 2(i+1)c


Figure 7: Delays associated with some operations


The simulation results presented here are for a 48x48
system. The execution time is measured for different values
of N, the number of blocks. Figure 8 shows the execution
times as N is varied from 1 to 6. The structure of
the computation is such that the parallelism is mostly
contained in the concurrent executions of the matvects for a
column. This explains the limited speedup when N is
increased from 1 to 2. The small speedup that does appear
seems to be due to the parallelism in the initialization tasks.
As N is increased to 3 and 4, a substantial decrease in
execution time is observed. Eventually, the bottleneck is due
to conflicts for access to X(i) by all matvect tasks in column
i.
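
To see how the transition delays scale with the number of blocks, the delay expressions of Figure 7 can be evaluated directly. The sketch below is an assumption-laden reading of that figure: it takes the block size to be s = n/N, reads the garbled exponent in the matvect and Solve rows as s squared, and uses the cycle counts V = 30, c = 10, l = 5 and r = 10 from the table. The function name `delays` is hypothetical.

```python
V, C, L, R = 30, 10, 5, 10   # cycle counts taken from Figure 7


def delays(n, N, i=1):
    """Transition delays (in cycles) for an n x n system split into N blocks."""
    s = n / N                                # size of each block
    return {
        "matvect": s * s * V + 2 * C,
        "solve": s * s * V / 2 + 4 * C,
        "acquire": R * s + 4 * C,
        "init": L * s * (s * (i + 0.5) + 1) + 2 * (i + 1) * C,   # ith block row
    }
```

Under these assumptions the matvect delay falls quadratically with N, which is consistent with the observation that finer partitions shorten the individual tasks until contention for X(i) dominates.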

Using the powerful parameterization features provided 
in this model, the simulations can be repeated for various 
values of N with no changes to the net itself. The model is 
also capable of evaluating resource utilizations for com- 
munication channels and throughputs at various points in a 
computation, which can be used to determine bottlenecks in 
the computation. 


Figure 8: Execution Time vs No. of Partitions
(y-axis: Time (cycles); x-axis: No. of Partitions)


6 Conclusions 


In this paper, we have presented a model for represen- 
tation and performance evaluation of parallel computations. 
In addition to modelling the algorithm behavior, the model 
incorporates features to study the execution overheads such 
as synchronization and communication delays. The model 
provides a vehicle for studying the configuration space of a 
given parallel computation, and deciding which configura- 
tion is most suitable for a given computation and architec- 
ture. 


The parameterization mechanism described here leads
to a powerful and compact representation which is
extremely flexible. Dynamic graph structures can be
represented using this parameterization scheme. Only small
changes need to be made to a model instance to vary
parameters like the granularity of a task or the degree of
parallelism.


The model has been designed to provide a straightforward
mapping from CSL programs to net instances. This
mapping could be automated, providing a useful tool for
analyzing and debugging CSL programs. A unique element
of this modelling system is that it models the execution of
programs, thus capturing the effects of different
representations of algorithms. The Petri Net model is directly
coupled to CSL constructs and vice versa.


Parallel Parsing on a One-Way Array of Finite-State Machines


Jik H. Chang and Oscar H. Ibarra
Department of Computer Science


Abstract. We show that a one-way two-dimensional
iterative array of finite-state machines can recognize and 
parse strings of any context-free language in linear time. 
What makes this result interesting and rather surprising is 
the fact that each processor of the array holds only a fixed 
amount of information (independent of the size of the input) 
and communicates with its neighbors in only one direction. 
This makes for a simple VLSI implementation. Although it 
is known that recognition can be done on such an array, 
previous parsing algorithms require the processors to have 
unbounded memory, even when the communication is two- 
way. 


1. Introduction 


The parallel recognition of context-free languages by 
iterative arrays of finite-state machines was first considered 
in [11], where it was shown that context-free languages can 
be recognized by two-way two-dimensional iterative arrays 
in linear time. The construction in [11] is a parallel imple- 
mentation of the well-known Cocke-Younger-Kasami 
(CYK) dynamic programming algorithm for recognizing the 
strings generated by a context-free grammar in Chomsky 
normal form [1]. Later in [10], the problem of computing 
the cost of an optimum binary search tree was shown to be 
solvable by a one-way two-dimensional array of 
unbounded-memory processors in linear time. It follows 
from the construction in [10] that context-free language 
recognition can also be done on a one-way two-dimensional 
iterative array (see, e.g., [3]). A recognition algorithm akin 
to the CYK algorithm is Earley’s algorithm [1]. The advan- 
tage of the latter algorithm is that it does not require the 
context-free grammar to be in Chomsky normal form. A 
parallel implementation of Earley’s algorithm was recently 
reported in [2]. Both recognition and parsing were 
considered, and detailed VLSI implementations were 
described. The structure of the iterative array implementa- 
tion of the recognition algorithm is similar to that in [10] 
with each processor of the array being finite-state. How- 
ever, the array implementation of the parsing algorithm 
required the processors to store and manipulate data which 
grow with the length of the input string being parsed. This
is because of the way the parse is generated: the "indices"
specifying the decompositions of the input string in the
dynamic programming method are stored in the processors.
Moreover, each processor must be capable of comparing
index values, thus making the "control mechanism" (for
transmitting and distributing data) of the array more
complex. Thus, the parallel parser in [2] is not finite-state and
its VLSI hardware implementation is not as simple as the
one for recognition.


* This research was supported in part by NSF Grant MCS
83-04756. O. H. Ibarra was also supported by a John
Simon Guggenheim Memorial Foundation Fellowship.


In this paper, we show that we can produce the parse 
without storing the “indices”. In fact, we show that both 
recognition and parsing can be carried out on a one-way 
two-dimensional iterative array of finite-state machines 
(2DIA) in linear time. Because the array is finite-state, the 
processor’s logic is simple and independent of the input size. 
This, together with the one-way communication, makes 
VLSI implementation simpler. 


A 2DIA [4] is shown in Figure 1-(a). It consists of an
$n \times n$ array of identical finite-state machines (nodes) that
operate synchronously at discrete time steps by means of a
common clock (not shown in the figure). The node at the
upper right hand corner receives the serial input
$a_1 a_2 \cdots a_n \$$ beginning at time 0. The serial output
$b_1 b_2 \cdots b_n \$$ is observed from the node at the lower left
hand corner, and is output beginning at time 2n-1. Thus, $b_i$
occurs at time 2n-2+i. (Note that all nodes generate
outputs, but only the output of the node at the lower left hand
corner is observed.) The $a_i$'s and $b_i$'s come from finite input
and output alphabets $\Sigma$ and $\Delta$, respectively. Both input
and output strings are terminated by a special endmarker
$\$$, which is not in $\Sigma \cup \Delta$. We assume that, unlike the
$a_i$'s, $\$$ is not "consumed" when read by the array, and is always
available for rereading. The state and outputs of a node at
time t are functions of its state and the states of its
neighbors (or the external input in the case of the node at the
upper right hand corner) at time t-1. At time 0, each node
is in a distinguished quiescent state $q_0$, with its outputs set
to the blank symbol $\lambda$. It remains in this state until one of
its neighbors enters a non-quiescent state. The 2DIA has
time complexity T(n) on input $a_1 a_2 \cdots a_n \$$ if it outputs $\$$
after at most T(n) steps. Clearly, $T(n) \geq 3n-1$. We note
that the 2DIA and the triangular array of Figure 1-(b) are
equivalent. Clearly, the former can simulate the latter. For
the converse, the triangular array simulates in parallel the
computations of the nodes that overlap when the 2DIA is
folded along its diagonal. Our algorithms will be carried
out on 2DIA's, since they are more convenient to use in our
constructions. It is clear, however, that they can also be
carried out on triangular arrays.


The 2DIA, being finite-state and one-way, is not so
easy to program. Concurrency and synchronization of
processes are not easy to handle. Thus, for the proofs, we
use an equivalent formulation of the 2DIA in terms of a
restricted type of uni-processor sequential machine called
2DSM [7,8] (see Figure 2). The 2DSM consists of a
finite-state control, with an input terminal from which it receives
the serial input $a_1 a_2 \cdots a_n \$$ and an output terminal from
which the serial output $b_1 b_2 \cdots b_n \$$ is observed. As in a
2DIA, we assume that $\$$ is not consumed when read by the
machine and is always available for rereading. The 2DSM
operates on a two-dimensional worktape of $n \times n$ cells.
Initially, all cells of the worktape are set to $\lambda$. The 2DSM
operates in sweeps as shown in Figure 3. A sweep begins
with the machine in a distinguished start state $q_0$ and the
read-write head (RWH) scanning the rightmost cell of the
first row. The machine then reads an input symbol and
scans the first row from right to left. For each cell scanned,
the machine rewrites the cell, outputs a symbol, and
changes state (except into $q_0$). When the machine has
processed the leftmost cell, the RWH is reset to the rightmost
cell of the next row in state $q_0$. The operation is repeated
for all other rows. After scanning the bottom row, the
RWH is reset to the rightmost cell of the first row to begin
the next sweep. (Thus, a sweep begins at the rightmost cell
of the first row and ends at the leftmost cell of the bottom
row.) The RWH has the following capability: when
scanning a cell in some row, it can also read (but not rewrite)
the content of the cell directly above the cell currently
scanned. The output symbols $b_1, \ldots, b_n, \$$ are the outputs
produced by the leftmost cell of the bottom row. The
2DSM has sweep complexity S(n) on input $a_1 a_2 \cdots a_n \$$ if it
outputs $\$$ after at most S(n) sweeps. Clearly, $S(n) \geq n+1$.
In [7,8], the following result was shown:


Theorem 1. Let $T(n) \geq 3n-1$. If $M_1$ is a 2DIA with time
complexity T(n), then we can effectively construct an
equivalent 2DSM $M_2$ with sweep complexity T(n)-2n+2, and
conversely.


Thus, we need only show the constructions on a 2DSM, 
since automatic conversion to the corresponding 2DIA is 
guaranteed by the constructive proof of this result. 


2. Context-free Language Recognition 
and Parsing on a 2DIA 


Our construction is based on the CYK algorithm [1]. 
The same techniques should apply to Earley’s algorithm [1]. 


Let $G = (N, \Sigma, S, P)$ be a context-free grammar
(CFG), where $N$ and $\Sigma$ are finite nonterminal and terminal
alphabets, respectively, $S \in N$ is the start symbol, and $P$ is
a finite set of productions (or rules) in Chomsky normal
form. Thus, the productions in $P$ are of the form $A \to BC$ or
$A \to a$, where $A, B, C \in N$ and $a \in \Sigma$. Let $L(G)$ denote the
language generated by $G$. For any string $x = a_1 a_2 \cdots a_n$,
$n \geq 1$ and $a_i \in \Sigma$, let

    $Q(i,j) = \{A \mid A \in N \text{ and } A \Rightarrow^* a_i a_{i+1} \cdots a_j\}$,  $1 \leq i \leq j \leq n$.

Then,

    $Q(i,i) = \{A \mid A \in N \text{ and } A \to a_i \in P\}$,  $1 \leq i \leq n$, and

    $Q(i,j) = \bigcup_{i \leq k < j} Q(i,k) \cdot Q(k+1,j)$,  $1 \leq i < j \leq n$,

where

    $X \cdot Y = \{A \mid (\exists B, C)(B \in X,\ C \in Y, \text{ and } A \to BC \in P)\}$.

The CYK algorithm builds table Q by dynamic programming.
Then, $x \in L(G)$ if and only if $S \in Q(1,n)$.
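
The recurrence for Q translates directly into a short sequential program. The sketch below is the ordinary CYK dynamic program, not the array construction of this paper. Since the grammar of Figure 4-(a) is not reproduced in the text, the grammar `GRAMMAR` used here is an assumption: it is a well-known textbook grammar in Chomsky normal form that is consistent with the rules appearing in the right parse given later for baaba.

```python
def cyk_table(grammar, x):
    """Return Q with Q[(i, j)] = set of nonterminals deriving x[i..j] (1-based)."""
    n = len(x)
    Q = {}
    for i in range(1, n + 1):
        # Q(i,i): nonterminals A with a rule A -> a_i
        Q[(i, i)] = {A for A, rhs in grammar if rhs == (x[i - 1],)}
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            cell = set()
            for k in range(i, j):            # split point, i <= k < j
                for A, rhs in grammar:
                    if (len(rhs) == 2 and rhs[0] in Q[(i, k)]
                            and rhs[1] in Q[(k + 1, j)]):
                        cell.add(A)
            Q[(i, j)] = cell
    return Q


def recognizes(grammar, x, start="S"):
    return start in cyk_table(grammar, x)[(1, len(x))]


# Assumed example grammar (not taken from the figure, which is not available).
GRAMMAR = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]
```

With this grammar, `recognizes(GRAMMAR, "baaba")` is true because S appears in Q(1,5).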


The above algorithm can be modified so as to also
output a parse of x if $x \in L(G)$. This is accomplished by
associating with each Q(i,j) another set R(i,j) containing "rules" of
the form $[A \to BC]$ or $[A \to a]$, $A, B, C \in N$ and $a \in \Sigma$. The
R(i,j)'s are computed as follows. Suppose that A becomes a
member of Q(i,j), $i < j$, as a result of the rule $A \to BC$, $B \in Q(i,k)$
and $C \in Q(k+1,j)$, for some $i \leq k < j$. Then $[A \to BC]$ is
in R(i,j). Similarly, if A becomes a member of Q(i,i) as a
result of the rule $A \to a_i$, then $[A \to a_i]$ is in R(i,i). Note that,
in fact, the Q(i,j)'s need not be computed, since they can be
derived from the R(i,j)'s. (Q(i,j) is simply the set of all
nonterminal symbols appearing on the left-hand sides of rules in
R(i,j).)

Figure 4-(a) gives an example of a CFG G. For input 
x = baaba, the table R of R(i,j)’s is shown in Figure 4-(b) 
(ignore the *’s for now). Observe that baaba is in L(G) since 
R(1,5) contains at least one rule whose left-hand side is S. 
The same table, with the entries relabeled, is shown in Fig- 
ure 4-(c). 

If x ∈ L(G), a parse tree of x can be obtained by back-
tracking. That is, one starts by choosing from R(1,n) a rule
of the form [S→AB]. Then, for k = 1,..., n-1, R(1,k) and
R(k+1,n) are searched for rules of the form [A→CD] and
[B→EF], respectively. (If the search is successful for more
than one value of k, one is chosen arbitrarily.) The process
is repeated for each newly found successor until the R(i,i)’s
are reached. Figure 4-(d) gives a parse tree of the string x
= baaba, obtained by backtracking on the table R of Figure
4-(b). In each node, the number below the rule represents
the set from which the rule was chosen. (The chosen rules
are also marked * in Figure 4-(b).) The parse itself can be
specified in several ways. For our purposes, we shall output
the right parse of the input string, which is obtained by
traversing the parse tree in reverse preorder. For the given
example, the right parse is [S→BC], [C→AB], [B→CC],
[C→a], [C→AB], [B→b], [A→a], [A→a], [B→b]. Thus, the
right parse is obtained from the sequence of nodes 15-14-
12-5-8-4-3-2-1.
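
The backtracking step can be sketched in Python as follows. The grammar is again an assumed stand-in for Figure 4-(a), and, unlike the description above, the sketch records the split point k inside each table entry rather than searching for it during backtracking; this is a convenience of ours, not part of the construction. Reverse preorder is realized by emitting a node's rule, then its right subtree, then its left subtree.

```python
# Assumed grammar (hypothetical stand-in for Figure 4-(a)).
BINARY = [("S", "B", "C"), ("S", "A", "B"), ("A", "B", "A"),
          ("B", "C", "C"), ("C", "A", "B")]
UNARY = [("A", "a"), ("B", "b"), ("C", "a")]


def build_rule_table(x):
    """R[(i, j)] lists (lhs, rhs1, rhs2, k) entries, or (lhs, terminal) if i == j."""
    n = len(x)
    R = {(i, j): [] for i in range(1, n + 1) for j in range(i, n + 1)}
    for i in range(1, n + 1):
        R[(i, i)] = [(lhs, a) for lhs, a in UNARY if a == x[i - 1]]
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):
                left = {rule[0] for rule in R[(i, k)]}
                right = {rule[0] for rule in R[(k + 1, j)]}
                for lhs, b, c in BINARY:
                    if b in left and c in right:
                        R[(i, j)].append((lhs, b, c, k))
    return R


def right_parse(R, i, j, symbol, out):
    """Append rules in reverse preorder: node, right subtree, left subtree."""
    rule = next(r for r in R[(i, j)] if r[0] == symbol)   # arbitrary choice
    if i == j:
        out.append((rule[0], rule[1]))
        return
    lhs, b, c, k = rule
    out.append((lhs, b, c))
    right_parse(R, k + 1, j, c, out)
    right_parse(R, i, k, b, out)


def parse(x):
    """Return the right parse of x as a list of rules, or None if x is not in L(G)."""
    R = build_rule_table(x)
    n = len(x)
    if not any(rule[0] == "S" for rule in R[(1, n)]):
        return None
    out = []
    right_parse(R, 1, n, "S", out)
    return out


print(parse("baaba"))
```

With the rule ordering chosen here, the sketch yields the same nine rules, in the same order, as the right parse quoted above; a different arbitrary choice among the candidate rules could yield another valid parse.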

We now show that the above parsing algorithm can be 
carried out on a 2DSM operating in O(n) sweeps. It then 
follows that the corresponding 2DIA also operates in linear 
time. The construction is complicated by the fact that the 
2DSM is finite-state, since in this case it cannot store 
numbers whose values grow with n. As we shall see, this 
restriction makes the generation of the parse quite difficult. 
To simplify the presentation, we only illustrate the con-
struction by means of an example, using table R of Figure
4-(b).

Let M be a 2DSM. Without loss of generality, assume 
that M has a worktape consisting of 3n×3n cells. (The
worktape can always be reduced to n×n by simulating a
3×3 subarray in one cell.) Each row of the worktape is
divided into three tracks. M operates in several phases: 
Table Construction, Tape Reversal, Regeneration, Parse- 
Tree Extraction, and Outputting of the Right Parse. 


Table Construction. M starts by constructing table R on 
the first track of the worktape. The operation of M during 
this phase is a modification of that given in [8] (where only 
context-free language recognition was considered). The idea
is to compute the entries on the ith diagonal of the table
(Figure 4-(c)) during the ith sweep of M. Figure 5 shows the
pattern we wish to achieve. The jth row of sweep i, 1≤j≤i,
computes the jth entry of diagonal i of the table (starting
with the topmost entry). For instance, during sweep 3, M
computes 3, 7 and 10 in rows 1, 2 and 3 of the worktape,
respectively. Note that an entry is computed only after 
forming the "convolving" pairs on which it depends. For 
example, in the third sweep, 10 (represented by the pair 
(10,10)) is computed by first forming the pairs (7,1) and 
(3,6) on the first two cells of the row. 
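
The sweep schedule can be checked with a short Python sketch. It assumes a relabeling of the table entries that is inferred from the numbers quoted in the text (Figure 4-(c) is not reproduced here): R(i,j) is numbered first by span length j-i+1 and then by i, so that for n = 5 the entries R(i,i) are 1..5 and R(1,5) is 15. The function names are ours.

```python
def label(i, j, n):
    """Assumed relabeling: number R(i,j) by span length, then by i."""
    span = j - i + 1
    return (span - 1) * n - (span - 1) * (span - 2) // 2 + i


def sweep_entries(s, n):
    """Labels computed in rows 1..s of sweep s: R(s,s), R(s-1,s), ..., R(1,s)."""
    return [label(s - row + 1, s, n) for row in range(1, s + 1)]


def convolving_pairs(i, j, n):
    """Pairs (R(k+1,j), R(i,k)), i <= k < j, on which R(i,j) depends."""
    return [(label(k + 1, j, n), label(i, k, n)) for k in range(i, j)]


n = 5
for s in range(1, n + 1):
    print("sweep", s, "computes", sweep_entries(s, n))
print("pairs for entry 10:", convolving_pairs(1, 3, n))
```

Under this assumed numbering, sweep 3 computes entries 3, 7 and 10, and entry 10 depends on the pairs (7,1) and (3,6), in agreement with the example above.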

A careful study of the profile will reveal how M should 
generate the pairs. Except for the leftmost pair of each 
row, the left element of each pair is propagated diagonally 
downwards (i.e., one row down and one cell to the left). 
The right elements of the pairs are obtained by shifting 
them one row down during each subsequent sweep. The 
leftmost pair of each row is simply the value of the entry 
computed on this row. All of these actions can be per- 
formed by M. However, a problem arises when generating 
the left element of the first cell of each row. For instance, 
consider rows 3 and 4 of sweep 4 of the profile. The pair 
(11,1) in row 4 can only be formed by propagating 11 of the 
pair (11,11) in row 3 into this pair. However, M cannot 
pass information from left to right since it changes states 
only during a right-to-left row scan. This problem is over- 
come by using the following "folding technique". The idea is
to fold the profile of each sweep along the dotted lines 
shown in Figure 5. The folded profile is shown in Figure 6. 
As a result of the folding, all the necessary propagation and 
shifting of elements can now be performed by M. For 
instance, in sweep 4 of the profile, pair (11,1) is now directly 
below pair (11,11). Hence, 11 can be propagated from the 
latter pair to the former during a right-to-left row scan of 
M. Figure 7 gives the rewriting rules necessary to achieve 
the folding. In the figure, the boxed cell is the cell currently 
being rewritten; the solid (dotted) arrow means: copy the 
current (previous) contents. Thus, the right element of a 
pair is obtained from the previous right element of a pair in 
the row above it. The second track is used to hold these 
previous values. 


While performing the above steps, M also creates, on 
the third track, a table T whose ith row, 1≤i≤n, is simply
the bottom row of sweep i of M (see Figure 8). Thus, at the 
end of n sweeps M has the last sweep profile of Figure 6 on 
the first track and table T on the third track of the work- 
tape. When $ is first read (on the (n+1)st sweep), M can 
then check whether 15 has a rule whose left-hand side is S 
(the start symbol). If no such rule exists, M outputs "error" 
and halts. Otherwise, it chooses one such rule, marks it by 
* and proceeds to the next phase. 


Tape Reversal. During this phase, M reverses the tape 
contents by first reversing the rows, then the columns of the 
worktape. To reverse the rows, M proceeds as follows. 
When $ is first read (on the (n+1)st sweep), M rewrites the 
rightmost cell of the first row by ‘-’ and stores the previous 
contents of this cell on the left end of the row (i.e., after the
last ‘B’). (M in fact stores the previous contents in reverse.
That is, if the previous contents were (a,b)(c,d), then M
stores (d,c)(b,a).) M also marks the left end of the row by
@. The same is done for all other rows. In succeeding
sweeps and for each row, M moves past the ‘-’s and rewrites 
the first cell encountered by ‘-’. It then stores the previous 
contents of this cell in the cell marked @, while shifting the 
tape contents to the left. M repeats the process until the 
next cell to be rewritten is the one marked @. The tape 
profile of M after reversing the rows is shown in Figure 9- 
(b). Next, M reverses the columns. The steps carried out 
are the same except that they are done vertically (see Fig- 
ure 9-(c)). M also marks the bottom row of the table by @. 
This phase takes 2n sweeps to complete. The configuration 
of the worktape after reversal is shown in Figure 10. 
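
The net effect of this phase can be sketched in Python. The sketch does not model the cell-by-cell procedure that M carries out over 2n sweeps; it only shows that reversing the rows and then the columns amounts to a 180-degree rotation of the worktape. The reversal of a cell's contents is shown as a separate helper, since the text does not settle whether it is applied in both passes; all names are ours.

```python
def reverse_cell(cell):
    """Map (a,b)(c,d) to (d,c)(b,a): reverse the pair order and each pair."""
    return tuple(tuple(reversed(pair)) for pair in reversed(cell))


def reverse_rows(tape):
    """Reverse the order of the cells within every row."""
    return [list(reversed(row)) for row in tape]


def reverse_columns(tape):
    """Reverse the order of the cells within every column."""
    return list(reversed(tape))


def reverse_tape(tape):
    """Rows first, then columns; the net effect is a 180-degree rotation."""
    return reverse_columns(reverse_rows(tape))


tape = [[1, 2, 3],
        [4, 5, 6]]
print(reverse_tape(tape))
print(reverse_cell((("a", "b"), ("c", "d"))))
```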


Once the tape contents have been reversed, M gen- 
erates a parse tree of the input string by backtracking. It 
does this by performing the next two phases simultaneously. 


Regeneration. During this phase, M regenerates the 
sweep profiles shown in Figure 6 in reverse order starting 
with sweep 5. In general, M can regenerate the profile of 
sweep i from the profile of sweep i+1 and from table T. For
instance, to regenerate the profile of sweep 4, M starts by 
erasing the top row of the current profile and replacing the 
right elements of the pairs in the second row by the 
corresponding elements in the second row of table T. The 
left elements of the second row and the pairs of the remain- 
ing rows are obtained by shifting and propagating the 
entries according to the rules given in Figure 12, which are 
essentially the "inverses" of the rules given in Figure 7.
Again, the boxed cell represents the cell currently being 
rewritten; the solid (dotted) arrow means: copy the current 
(previous) contents. The same steps are carried out to 
regenerate the remaining sweeps. In each case, the top row 
of the current profile is erased, and the right elements of 
the next row are replaced by the right elements in the
corresponding row of table T. The left elements of the row 
and the pairs of the remaining rows are formed using the 
rules in Figure 12. The regeneration phase terminates after 
the bottom row of the table has been erased. To know 
when this happens, M does the following. At the start of 
the regeneration phase, M initializes a marker on the right- 
most cell of the top row ((15,15)(15,15) in Figure 11). In 
each subsequent sweep, it shifts the marker one cell to the
left. The regeneration phase terminates when the marker
moves past the leftmost column of the table. 


Parse-Tree Extraction. While executing the regeneration 
phase, M simultaneously performs a marking phase through 
which it extracts a parse tree of the input string. Recall 
that at the end of the table construction phase, M has 
marked by * a rule in 15 which has S (the start symbol) on 
its left-hand side. This rule is the root of the parse tree 
generated by M. The rest of the tree is generated as fol- 
lows. When scanning a row, M checks if the rightmost 
(non-‘B’) entry has a rule marked *. If no rules are marked, 
M writes ‘-’ at the left end of the row. Otherwise, the rule 
is either of the form [A→a] or [A→BC]. In either case, M
remembers the rule in its state and writes the rule at the
left end of the row. In addition, if the rule is of the form
[A→BC], M does the following. While moving left on the
worktape, M searches for the first pair of entries (X,Y) in
which B appears on the left-hand side of a rule in X and C
appears on the left-hand side of a rule in Y. Let the two
rules be [B→DE] and [C→FG]. M then marks both
[B→DE] and [C→FG] by *.

During the regeneration phase, the entries (along with 
any marked rules) are shifted and propagated according to 
the transition rules given in Figure 12. Thus, at some point 
an entry with a marked rule will reach the rightmost (non- 
‘B’) cell of some row. 


The execution of the above steps results in the genera- 
tion of a matrix of rules, as shown in the last sweep profile 
of Figure 11 (to the left of the dotted vertical line). This 
represents essentially a parse tree of the input string (see 
Figure 4-(d)). This phase terminates at the same time as 
the regeneration phase. 


Outputting the Right Parse. What remains to be done is 
to output the right parse itself. Intuitively, this can be 
done by scanning the matrix of rules one column at a time, 
starting with the rightmost column. Within each column, 
the rules are output one at a time, starting with the top- 
most one. However, a problem arises because M cannot tell
when it has exhausted one column and should proceed to the next.
To overcome this problem, M performs a "shift-and-
accumulate" phase during which the matrix is shifted down-
wards and the rules are accumulated on the bottom row of 
the matrix. (The bottom row of the matrix is marked @ at 
the end of the column-reversal phase.) The details of this 
phase are shown in Figure 13. The bottom cells can hold at 
most two rules. The cell acts as a queue so that when a 
rule is shifted into a cell which is already full, the first rule 
shifted in is taken out and passed to the next cell. Note 
that there are exactly 2n-1 rules in the matrix, and these 
can all be accommodated in the bottom cells. This "shift-
and-accumulate" phase can be performed by M in n sweeps.
Once completed, M can then output the right parse of the 
string by outputting the rules from right to left. (Note that 
the output is observed at the bottom row of the worktape, 
which is different from the bottom row of the matrix. In 
this case, M simply outputs the same rule in each of the 
remaining rows of the tape.) We leave it to the reader to 
verify that outputting the rules in this order does indeed 
give the right parse of the input string. 


M takes 7n sweeps to complete the computation. The 
corresponding 2DIA has time complexity 9n-2. Thus, we
have 


Theorem 2. Context-free language recognition and parsing 
can be carried out on a 2DIA in linear time.


Remark 1. The 2DSM described above can be made to
operate in time (n+1)+εn for any real number ε>0, by
simulating k sweeps in one sweep and allowing the machine
to output k symbols at a time. It follows from the charac-
terization that the 2DIA can be made to operate in time
(3n-1)+εn.

Remark 2. Strings of length less than n can also be 
recognized/parsed by the 2DSM. This follows from the fact 
that the length of the input determines the size of the table 
built by the 2DSM during the table construction phase. 
Moreover, during the outputting of the parse, the answer is 
propagated to all rows lying below the table. 

3. Conclusion


We have shown that context-free language 
recognition/parsing can be carried out on a 2DIA in linear 
time. One can show that a (one-way) 2DIA operating in 
linear time can be simulated by an unbounded two-way 
2DIA (Figure 14) also operating in linear time. Moreover, 
the computation of the two-way 2DIA can be sped up to 
operate in time (n+1)+εn, for any real number ε>0 [7]. It
follows that CFL recognition/parsing can be carried out on
a two-way 2DIA in (n+1)+εn time.


Our techniques are applicable to other problems 
involving dynamic programming, e.g., to the problem of 
finding approximate patterns in strings [14,15], the string- 
to-string correction problem [12,16], the longest common 
subsequence (LCS) problem [5,6,12,16], dynamic time warp- 
ing, optimum generalized alignment, error-correction, etc. 
[13]. For these problems, we are interested in the "parse", 
rather than the value, of an optimal solution. For example, 
for the LCS problem, what we require as output is an LCS 
of the two given strings, not its length. If only the value of 
an optimal solution is required, the problems can be carried 
out in linear time on a one-dimensional array of non-finite- 
state processors [9]. 

Figure 3. Computation profile of a 2DSM
for 3 complete sweeps.

Figure 5. Possible "layout" of the computation

Figure 6. Computation profile of M obtained by folding
the profile of Figure 5 along the dotted lines.

Figure 7. Rewriting rules for M during
the table construction phase.

Figure 9. (a) Profile of M after n sweeps;
(b) Profile after reversing the rows;
(c) Profile after reversing the columns.
(If i = (a,b)(c,d) then i’ = (d,c)(b,a).)

Figure 14. A two-way unbounded 2DIA.

the shift-and-accumulate phase.


ABSTRACT


The recent advances in computer architec- 
ture for image processing will provide unpre- 
cedented systems capabilities in the upcoming 
decade. Among the most significant aspects of 
the architectures will be their increasing ability to 
process image data in a parallel fashion. This 
paper presents a new computing architecture 
called "DRAFT" (Dynamically Reconfigurable 
Architecture for Factoring Things) used in paral- 
lel processing of image data structures. Further- 
more, we show that quadtree data structures can 
be processed efficiently on this new parallel com- 
puting architecture. Algorithms are given for 
constructing and pruning the quadtree structure 
and for finding neighbors in a parallel fashion.
The computational requirements of the "DRAFT" 
system in image processing are examined and 
their analysis is presented in detail. 


INTRODUCTION 


The quadtree has received much attention
in recent years as an efficient data structure for a
variety of image processing applications. Recur-
sively defined, a quadtree may be empty or it may
consist of a root with either none or exactly four
sons consisting of quadtrees. For this paper a
region will be the BLACK portion of a 2**n x
2**n array made up of unit square pixels
colored BLACK or WHITE. A sample region is
presented in Fig. 1 and its quadtree is given in
Fig. 2. We define a node in a quadtree to be a
record containing the following fields. If P is a
(pointer to a) node and D is in the set of direc-
tions {NW,NE,SW,SE}, then we may define the
fields as follows. COLOR(P) has value WHITE
or BLACK for a leaf, GRAY for an interior node.
SON(P) is a (pointer to a) collection of four nodes
which are the sons of P in each direction; NIL if
no such node exists. FATHER(P) is a (pointer to)
the father of P; NIL if P is the root. PATH(P) is a
path code recording the position of the node as an
encoded series of directions from the root to the
node as proposed by Jones, Raman and Iyengar
[1,2]. The (pointer to the) root of the quadtree
will be denoted by ROOT. In our figures, the
offspring of a node are drawn in the canonical
order NW-NE-SW-SE.
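
The node record can be sketched in Python as below. This is a minimal sketch, not the DRAFT representation: the field names follow the text, but the path code is represented simply as a tuple of direction names (the encoding of [1,2] is not given here), and the helper split is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

DIRECTIONS = ("NW", "NE", "SW", "SE")   # canonical son order


@dataclass
class Node:
    color: str                                   # "WHITE", "BLACK" or "GRAY"
    # father is excluded from repr/compare to avoid recursing through cycles
    father: Optional["Node"] = field(default=None, repr=False, compare=False)
    son: Optional[List["Node"]] = None           # four sons, or None (NIL)
    path: Tuple[str, ...] = ()                   # directions from the root


def split(node, colors):
    """Turn a leaf into a GRAY interior node with four sons coloured `colors`."""
    node.color = "GRAY"
    node.son = [Node(c, father=node, path=node.path + (d,))
                for d, c in zip(DIRECTIONS, colors)]
    return node.son


ROOT = Node("WHITE")
nw, ne, sw, se = split(ROOT, ["BLACK", "WHITE", "WHITE", "BLACK"])
inner = split(se, ["WHITE", "BLACK", "BLACK", "BLACK"])
print(inner[1].path)
```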


Many algorithms have been written to pro- 
cess patterns stored as quadtrees. Important to 
these algorithms are procedures which allow 
determination of the neighbors of any given non- 
gray node. These "neighbors" are the closest 
nodes in any of four directions, referred to as 
N,E,S, and W. An algorithm for traversing a 
quadtree to find these neighbors is given in 
Samet[3]. For a broader treatment of quadtree 
related research see [4, 5, 6, 7, 8]. 


This paper presents a scheme for perform-
ing quadtree functions in parallel on a new type
of architecture called the DRAFT architecture.
We begin with a brief overview of the architec-
ture in section two. Section three outlines several
programming concerns designed to maximize
parallelism. Next a parallel data structure for
representing quadtrees in a DRAFT machine is
presented in section four, followed by an algo-
rithm for building these structures from raster
input in section five. Section six demonstrates
parallel operation in the DRAFT environment
with a sample neighbor find algorithm. An
analysis of the expected performance improve-
ment for common quadtree operations is provided
in section seven.


FIG. 1

FIG. 2


A HORIZONTALLY RECONFIGURABLE
ARCHITECTURE (DRAFT)


The DRAFT is an extended word length
machine, 256 bits in the current implementation.
This 256-bit word is constructed of 32-bit slices.
A switching network is placed between each of
the slices and is controlled by an 8-bit field in the
microinstruction word. By setting or resetting the
bits in this field the microprogrammer can at the
micro-instruction level join one or more adjacent
slices into a single processing unit or separate
them into independently operating parallel pro-
cessors. Thus the machine is horizontally
reconfigurable along its word length into any
combination of processing elements which can be
constructed by joining adjacent slices. Possibili-
ties range from a single 256-bit processor to eight
32-bit processors. For clarity we will refer to
processing elements constructed in this manner as
segments and the underlying 32-bit building
blocks as slices. Figures 3 - 5 illustrate the details
of the DRAFT architecture machine.


Figure 3

Figure 4

Figure 5



As stated, the slices and hence the segments
configured from the slices are independently pro- 
grammable. The major components of each slice 
include a control RAM and instruction pipeline, a 
32-bit ALU built from VLSI bit-slice devices, a 
64k-by-32-bit data RAM, and the switching cir-
cuitry necessary for constructing segments. The
local control and data RAMs provide each seg- 
ment with an instruction and data stream separate 
from the other segments in a given configuration. 
When combined into a segment, the control 
RAMs of each slice are simply programmed with 
the same instruction. The data RAMs of adjacent 
slices combine into higher- and lower-order 32- 
bit pieces of the segment data word. 


Two components of the DRAFT machine 
which are global to the slices are the microin- 
struction sequencer and the global condition mul- 
tiplexer. The microinstruction sequencer broad- 
casts a common next instruction address to all 
segments and is itself programmed with a 
separate sequencer instruction. The resulting pro-
gram structure is unique in that a single control
structure encloses parallel operations taking place
within different processors of the same machine. 


To provide for conditional execution of the 
segments, a slice level control mechanism is 
added by the global condition multiplexer.
Between the slices and the global sequencer is a 
hierarchical arrangement of condition codes. At 
the bottom of this hierarchy is the slice level con- 
dition code bit. This bit is set by a combination 
of inputs, including the status output from the 
ALU, a slice level condition mask which selects 
the appropriate status output, and the segmenta- 
tion word which propagates condition codes from 
the high-order to low-order slices within a seg- 
ment. 


The sequencer requires a single binary con- 
dition code to control branching within the global 
control structure. The function of the global con- 
dition mux is to generate the logical AND and 
logical OR combination of the local condition 
codes and to select under program control the 
appropriate combination for use by the sequencer. 
Thus, transfer of control instructions for the 
sequencer have, instead of the conventional con- 
ditional and unconditional analogs, unconditional, 
conditional all slices, and conditional any slice 
versions. The slice level control structure is
implemented as conditional and unconditional
analogs of all slice operations. Unconditional
versions operate as normal instructions, while
conditional analogs execute only so long as the 
local condition code is false. Once the local con- 
dition becomes true, these instructions become 
NOPs, and the slice essentially drops out of any 
parallel computations occurring within the global 
iterative or conditional structure. It is also not- 
able that each segment has the ability to directly 
set or reset the local condition code independent 
of the actual status. This mechanism completes 
the condition code structure by adding the capa- 
bility to mask out a given slice from consideration 
in a conditional all or conditional any decision for 
the global control structure. For a broader treat- 
ment of the DRAFT machine see [9].
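
The interaction of the local condition codes with the global sequencer can be illustrated by a small Python simulation. It is only a behavioral sketch: the class Slice, the function global_condition and the countdown example are our own inventions, and the Python loop over slices stands in for the lockstep execution of the hardware.

```python
class Slice:
    """One processing element with a local condition code."""

    def __init__(self, value):
        self.value = value
        self.cond = False

    def execute(self, op, conditional=False):
        """A conditional instruction becomes a NOP once the local code is true."""
        if conditional and self.cond:
            return
        op(self)


def global_condition(slices, mode):
    """AND or OR of the local condition codes, as selected by the program."""
    codes = [s.cond for s in slices]
    if mode == "all":
        return all(codes)
    if mode == "any":
        return any(codes)
    raise ValueError("mode must be 'all' or 'any'")


def countdown(values):
    """Every slice decrements its value to zero; loop until 'conditional all'."""
    slices = [Slice(v) for v in values]

    def test(s):
        s.cond = (s.value == 0)

    def decrement(s):
        s.value -= 1

    steps = 0
    for s in slices:
        s.execute(test)
    while not global_condition(slices, "all"):
        for s in slices:
            s.execute(decrement, conditional=True)   # NOP in finished slices
            s.execute(test)
        steps += 1
    return steps, [s.value for s in slices]


print(countdown([3, 1, 2]))
```

The slices that reach zero early drop out of the computation while the common loop keeps running, so the number of iterations is governed by the slowest slice.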


DRAFT PROGRAMMING PARADIGMS 


What most distinguishes DRAFT based 
parallel algorithms from other types of parallel 
algorithms is the unique mix of global and local 
control structures. Because of the common 
instruction word address, each segment must wait 
until all segments have finished with a block of 
code before continuing. The conditional execute 
option allows a slice to shut itself off while wait- 
ing. At the programming level, this provides an 
excellent environment for converting sequential 
algorithms into parallel algorithms. The parallel 
algorithm is written as though it were sequential, 
with the added specification on conditional 
branching instructions indicating whether the 
branch is dependent on the condition testing true 
for all of the segments or just one of the seg- 
ments. 


Obviously it is possible to lose parallelism 
if some of the segments are shut off. This is the 
cost one must pay for the simplicity of the 
DRAFT programming environment. However, 
there are several ways in which parallelism
can be maximized by the programmer.


Paradigm 1: Since each slice processor includes 
its own local slice of the DRAFT memory word, 
the arrangement of data structures in memory is 
key to parallel operation. As the slices combine 
into segments of longer word length so do the 
memories. If a data item is to be considered as a
256 bit item, it should be right justified in the
memory of the 256 bit segment to be used. Con- 
versely if eight items are to be used in parallel 
they should be placed adjacent to one another in 
the slice level store of each segment. 


Paradigm 2: Whenever possible, replace a block 
of code involving a conditional branching opera- 
tion with a functionally equivalent block without 
conditional branching. Conditional branches 
depend on the status of all segments either 
ANDed or ORed together to generate global 
transfers of control. It is possible that when a 
conditional structure is executed, some of the seg- 
ments will invoke conditional execution and shut 
themselves off. Since the objective is to maxim- 
ize parallelism and minimize non-executing con- 
ditional code, the substitution of logical opera- 
tions for conditional control structures such as
if-then and case is advised wherever possible. If the
function can be implemented without any type of
conditional branching, this ensures that all seg-
ments will remain active throughout. Although
this seems to be a restriction on the programmer,
examination of code and judicious selection of
internal representations for data will reveal some
branches that are the result of programming style 
and which can be eliminated without undue cost. 
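
As a small illustration of Paradigm 2, consider computing the color of a parent node from its four sons. The Python sketch below assumes a two-bit color encoding of our own choosing (it is not taken from the DRAFT implementation); with it, the if-then structure collapses into a single logical OR that every segment can execute unconditionally.

```python
from itertools import product

# Assumed encoding (ours): one bit per pure colour, GRAY has both bits set.
WHITE, BLACK, GRAY = 0b01, 0b10, 0b11


def parent_color_branching(sons):
    """Conventional version using conditional control structures."""
    if all(c == BLACK for c in sons):
        return BLACK
    if all(c == WHITE for c in sons):
        return WHITE
    return GRAY


def parent_color_branch_free(sons):
    """Functionally equivalent version using only a logical operation."""
    nw, ne, sw, se = sons
    return nw | ne | sw | se


# The two versions agree on every combination of son colours.
agree = all(parent_color_branching(s) == parent_color_branch_free(s)
            for s in product((WHITE, BLACK, GRAY), repeat=4))
print(agree)
```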


Paradigm 3: DRAFT micro-code allows parallel 
programming of the global micro-sequencer con- 
currently with the operation of the slices. The
programmer has the opportunity to test condition
bits at each instruction and branch out of redun-
dant or useless operations. Although this branch-
ing is usually done at the end of a block of code
on a sequential machine, it is possible to speed up 
an algorithm on the DRAFT by redundant testing 
in the microprogram sequencer. 


A PARALLEL DATA STRUCTURE 
FOR QUADTREES 


The data structure used by Samet and oth- 
ers [4,1,2,7,8] for image quadtrees contains a 
color value and pointers to each of the four sons 
representing the NW,NE,SW, and SE quadrants 
of the image section. The structure proposed here 
is similar but includes additional fields for a 
parent pointer and a path code to be used in 
neighbor finding. The major difference in the 
DRAFT implementation is that the son quadtree 
nodes will be processed in parallel and reside
adjacent to one another in a single segmented
location of DRAFT memory. Thus the parent
requires only a single pointer to this location.
Figures 6 and 7 show a 4 x 4 raster scan and its
corresponding full (worst case) quadtree
representation. Figure 8 illustrates how the quad-
tree representation can be partitioned in the pro-
totype DRAFT machine.


Figure 8 


The layout parallels the segmentation to be used 
in processing at each level: 256 bit uni-processing 
at the root, four segments of 64 bits each at the 
next lowest level, etc. Since the current DRAFT 
prototype is an eight-slice machine, nodes at lev- 
els three and below must be arranged as might be 
expected in a single processor environment. This 
amounts to a data partitioning between the slices, 
so that one eighth of the total area represented is 
handled by each processor and each of eight 
image octants will be processed in parallel. How- 
ever, the DRAFT architecture imposes no limit on 
the number of slices. The current prototype has 
eight, but since only neighboring interconnections 
are required, it is feasible to string together a 
large number of slices to meet the requirements 
of some particular imaging task. 


Note how this arrangement of the quadtree
into DRAFT memory greatly increases the avail-
able parallelism in quadtree algorithms. Assum-
ing that the number of slices is sufficient to hold
all leaf nodes in a quadtree of depth N, it is pos-
sible by successive reconfigurations to access
each node in a full quadtree in N steps, compared
with the 4**N-1 steps required for traversals in a
sequential environment. If the number of slices is
not sufficient to hold every leaf node, the access
time degenerates to that of a sequential access in
all levels below log4(P) where P is the number of
slices. In such a case, each slice is executing a
sequential traversal with all slices running in
parallel.


Other quadtree operations work well in this
data partitioned format. Rotation of the image in
90 degree increments can be achieved by chang-
ing the order in which the child nodes are
accessed; the pointer to the NW quadrant
becomes the pointer to the SW quadrant, etc. No
actual movement of data is required and pointer
updates proceed in parallel. The same is true for
a transposition of the image across some axis 
inside the plane of the image. Since superimpos- 
ing one quadtree image upon another is a task 
which requires traversing two quadtrees simul- 
taneously while constructing a third, one can 
expect timing improvements for this operation to 
be proportional to the improvements in traversal 
time. The time complexity for finding the inter- 
section of two images should also be improved 
from 4**N to N, assuming a processor slice is 
available for every pixel. 
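
Rotation by reordering the sons can be sketched in Python as follows. The sketch represents a quadtree as nested lists in NW-NE-SW-SE order with color strings at the leaves (a representation of ours, not the DRAFT data layout), and it assumes a counter-clockwise rotation, so that the old NW quadrant ends up in the SW position as in the example above. Unlike the scheme described, it builds a new nested list rather than changing the access order in place.

```python
def rotate_ccw(tree):
    """Rotate a quadtree 90 degrees counter-clockwise by permuting the sons."""
    if not isinstance(tree, list):
        return tree                        # a leaf colour is unchanged
    nw, ne, sw, se = (rotate_ccw(t) for t in tree)
    return [ne, se, nw, sw]                # new NW, NE, SW, SE


def to_raster(tree, size):
    """Expand a quadtree into a size x size list of rows (for checking only)."""
    if not isinstance(tree, list):
        return [[tree] * size for _ in range(size)]
    half = size // 2
    nw, ne, sw, se = (to_raster(t, half) for t in tree)
    top = [left + right for left, right in zip(nw, ne)]
    bottom = [left + right for left, right in zip(sw, se)]
    return top + bottom


def rotate_raster_ccw(raster):
    """Reference rotation of a square raster: new[r][c] = old[c][n-1-r]."""
    n = len(raster)
    return [[raster[c][n - 1 - r] for c in range(n)] for r in range(n)]


tree = ["B", ["W", "B", "B", "W"], "W", ["B", "B", "W", "W"]]
print(to_raster(rotate_ccw(tree), 4) == rotate_raster_ccw(to_raster(tree, 4)))
```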


QUADTREE CONSTRUCTION FROM 
RASTER INPUT 


For testing on the DRAFT prototype, two 
implementations of the quadtree structure were 
proposed: a vertical storage, where each field is 
placed in a different data location with the color 
at the base of the data page, and a horizontal 
packing where all fields are kept in a single data 
location. Vertical storage assigns one node to 
every data page; therefore horizontal packing is 
16 times more memory efficient. However, hor- 
izontal packing has a disadvantage that extra code 
is required to extract each field from the data
word before it can be used. The procedures below 
were coded assuming horizontal packing. 


Having chosen a representation scheme for
the quadtree nodes, the next problem is loading
image data from an input device, such as a raster
scanner, into the DRAFT memory. This operation
is important since the arrangement of pixels in the 
proper slice order is critical in constructing the 
parallel quadtrees as shown in Figure 8. For a 
DRAFT machine implementation sufficiently 
wide to provide a slice for each pixel this problem 
is relatively trivial. More difficult is the arrange- 
ment of nodes for the case of a limited number of 
slices, each processing a sub-quadrant of the 
image. Such was the case with the following test
algorithms designed for the eight slice prototype. 
For loading an N x M pixel raster the data must 
be divided into eight octants of size N/2 x M/4. 
As the raster is scanned pixels from lines 1 to M/4 
are placed in slice one for n < N/2 and slice three 
for n >=N/2. The next quarter for m are placed in 
slices two and four, then five and seven, and the 
remaining quarter loads into slices six and eight. 
Shown below is a sample implementation of this 
algorithm for the DRAFT machine host proces- 
sor. The routine get_pixel reads the next pixel
from the scanner and store_pixel places the pixel 
into the memory of the second parameter "slice". 


/* sample algorithm for loading an N x M
   raster into an 8 slice DRAFT machine */

int slice[4] = {0,1,4,5};

for m = 1 to max_lines
{
    /* quarter of the raster (0..3) containing line m */
    q = int((m-1)/(max_lines/4));

    for n = 1 to pixels_per_line/2
    {   /* lower half of line */
        pixel = get_pixel;
        store_pixel(pixel,slice[q]);
    }
    for n = 1 to pixels_per_line/2
    {   /* upper half of line */
        pixel = get_pixel;
        store_pixel(pixel,slice[q]+2);
    }
}
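
For readers who wish to experiment, the distribution of pixels over the eight slices can be sketched in Python. The sketch assumes that the number of lines is divisible by four and the line length is even, uses 0-based slice numbers, and simply appends pixels to per-slice lists in arrival order (it does not apply the translation raster discussed next). The function name load_raster is ours.

```python
def load_raster(raster):
    """Distribute an M-line by N-pixel raster over eight slices (0-based)."""
    m_lines, n_pixels = len(raster), len(raster[0])
    base = [0, 1, 4, 5]                    # slice for the lower half of each quarter
    slices = [[] for _ in range(8)]
    for m, line in enumerate(raster):
        q = m // (m_lines // 4)            # quarter of the raster containing line m
        for n, pixel in enumerate(line):
            half = 0 if n < n_pixels // 2 else 2
            slices[base[q] + half].append(pixel)
    return slices


# Each pixel is tagged with its (line, column) position for inspection.
raster = [[(m, n) for n in range(4)] for m in range(8)]
for number, contents in enumerate(load_raster(raster)):
    print(number, contents)
```

The first quarter of the lines goes to slices 0 and 2 (slices one and three in the 1-based numbering of the text), the next quarter to slices 1 and 3, and so on.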


The store_pixel routine also performs an 
additional function to ease the construction of 
quadtrees for each of the octants in the next step. 
For this construction step the sequence of pixels 
in sequential memory locations will need to be 
the ordering from left to right of the leaves in the 
full quadtree to be constructed. This is not the 
order in which the pixels are received from the 
scanner. To achieve the required shuffling of the 
pixel locations the loading program uses an M/4 
by N/2 translation raster. Each location in the 
translation raster holds the offset from raster base 
into which the pixel is to be stored. Figure 9 
illustrates a translation raster for an eight-by-four 
octant processed at the slice level in a 16-by-16 
raster. 
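
One way to generate such a translation raster is by interleaving the bits of the row and column indices. The Python sketch below covers only a square block whose side is a power of two, which is an assumption of ours; the eight-by-four octant of Figure 9 is not reproduced here and may be laid out differently.

```python
def leaf_offset(row, col, bits):
    """Interleave row and column bits so that leaves fall in NW-NE-SW-SE order."""
    offset = 0
    for b in range(bits):
        offset |= ((col >> b) & 1) << (2 * b)        # column bit: W/E choice
        offset |= ((row >> b) & 1) << (2 * b + 1)    # row bit: N/S choice
    return offset


def translation_raster(size):
    """Offsets from the raster base for a size x size block (size a power of 2)."""
    bits = size.bit_length() - 1
    return [[leaf_offset(r, c, bits) for c in range(size)] for r in range(size)]


for row in translation_raster(4):
    print(row)
```

With this layout, the four pixels of every 2 x 2 block, and recursively the four sub-blocks of every larger block, occupy consecutive memory locations, which is the ordering the construction step relies on.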


Figure 9 (panels: Input Raster, Translation Raster)

The final step in preparing a raster for DRAFT
quadtree operations is to construct and color the 
parent nodes ascending upwards to the root. 
Phase one scans in parallel the pixels of each of 
the eight octants. Every fourth pixel generates a 
new node colored black if all four pixels are 
black, white if all four are white, and grey if a 
mixture of black and white exists. Also during 
this scan the parent field of each pixel is filled 
with the address of the new node and the son field 
of the new node is set to the base address of the 
group of four. After scanning the pixel level
nodes the process is repeated for each of the new
parent nodes created, until in the final iteration
a single root node is generated for the octant
quadtree processed by each slice. The procedure
below shows the code for each slice. The actual
code is implemented in DRAFT microcode.


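/* sample slice level algorithm for generating
   quadtrees for raster octants */
nodecount = n*m/8;
while (nodecount > 1)
{
    for i = 1 to nodecount by 4
    {
        new_parent = make_parent_node();
        son(new_parent) = i;
        color(new_parent) = color(i);
        for j = i to i+3
        {
            parent(j) = new_parent;
            /* conditional execution: the assignment is a nop
               in slices where the condition is true */
            color(new_parent) == color(j) : color(new_parent) = GREY;
        }
    }
    /* the next pass scans the parent nodes just created */
    nodecount = nodecount/4;
}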

Note that the conditional testing of child colors
appears to be backwards in this algorithm. In fact
the statement is an example of conditional execution.
The program sequencer will cause the
assignment statement to execute in all slices; however,
in those slices for which the local condition
code is true, the assignment statement is converted
to a nop and the parent color is unchanged.
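
As a sequential illustration of the coloring rule in phase one, the following Python sketch builds the levels of one octant quadtree bottom-up. It is a sketch under stated assumptions, not the DRAFT microcode: the function name `build_levels` and the color constants are hypothetical, the leaves are assumed to be already stored in quadtree order with a count that is a power of four, and parent and son pointers are left implicit (node k of a level has children 4k to 4k+3 in the level below). An ordinary `if` replaces the conditionally executed assignment.

```python
BLACK, WHITE, GREY = "B", "W", "G"


def build_levels(leaves):
    """Return the quadtree levels; levels[0] is the leaf level, levels[-1] the root."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = []
        for i in range(0, len(current), 4):
            color = current[i]              # start from the first child's color
            for j in range(i, i + 4):
                if current[j] != color:     # any disagreement makes the parent grey
                    color = GREY
            parents.append(color)
        levels.append(parents)
    return levels
```

For sixteen leaves the loop runs twice, producing four parents and then the root, which mirrors the repeated scan described above.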


Phase two of quadtree construction com- 
pletes the upper levels of the image quadtree as 
shown in Figure 7. In the eight-slice prototype 
this will be two additional levels. Since all of the
octant quadtree roots lie in adjacent slices at the
same location, the completion of the address
fields is trivial. Each level is executed using a
successively larger segmentation until the root
executes in single segment mode. Having constructed
the quadtree, the next step is the implementation
of common quadtree operations. A
complexity analysis of most such operations
will inevitably be dependent on the complexity of
quadtree traversal. As stated, a sufficiently wide
DRAFT implementation reduces such traversals
to complexity proportional to the depth. When
DRAFT segments must traverse entire subtrees in
parallel, the single instruction address forces the
complexity to the worst case of the eight (for the
prototype) subtrees. This is because all other segments
must wait for the "slowest" segment to
complete its traversal before continuing.


FINDING NEIGHBORS 


One essential quadtree operation that has 
received a good deal of attention [2,10] is the 
neighbor-find operation. The DRAFT implemen- 
tation of this algorithm is a good example of how 
varying segmentation can not only assist in quad- 
tree traversal but can also return information such 
as the size of a neighboring block. The algorithm 
proceeds recursively with an N bit parameter 
holding the path code of the node for which the 
neighbors are to be located. N in this case is the 
number of bits in the current segment. At each 
level of the recursion the machine reconfigures to 
divide the current segment into four new seg- 
ments which will simultaneously search the four 
children. Each of these segments is given an N/4 
bit copy of the original parameter arranged across 
the parameter word as shown in Figure 10. 


Figure 10. Parameter word before and after reconfiguration.


An identical mechanism is used in reverse for the
return parameter. In this case the segment locating
a neighbor places an N bit copy of the path
code for that neighbor into the corresponding segment
of the return parameter word. When the
recursion returns to the root the resulting word
contains a string of four values representing the
four neighbors of the node in question. The segment
size of each of these values determines the size
of the neighbor block returned. A recursive slice
level algorithm for nearest neighbor is shown
below. The actual microcode for nearest neighbor
on the DRAFT prototype is a non-recursive
implementation using the parent pointers. This is
due to the limited size of the stack memory for
the sequencer chip used.


find_neighbor(node, quadtree)
{
    /* locates the neighbors of "node"
       in "quadtree" */

    if (color(quadtree) <> GREY)
        then if (is_neighbor(node, quadtree))
            then return(quadtree);
            else return(0);
    else {
        re_seg(4);
        return(find_neighbor(node, son(quadtree)));
    }
}


The re_seg procedure above resets the segmenta- 
tion and regenerates the parameter as above. 
Is_neighbor is a boolean function which com- 
pares the path codes of "node" and "quadtree" to 
determine if they are neighbors. 


Algorithms have also been devised for
union and intersection functions of two image
quadtrees. For these functions the result quadtree
is the union or intersection of the black regions
of the two inputs. Both algorithms are based on a
top down traversal like the one discussed in section
three and thus have complexity proportional
to the depth of the inputs.


ANALYSIS 


Analysis of DRAFT algorithms must be
made based on two cases. In case one the slice
width of the DRAFT implementation is equal to the
number of nodes in the worst case quadtree to be
constructed. For this case quadtree construction,
traversal, and neighbor finding reduce to O(N),
since each step processes an entire level and N is
equal to the depth of the quadtree. While this is
not an unrealistic assumption based on current
VLSI capabilities, no such machine for non-trivial
rasters has been built. For operations on
the existing eight slice DRAFT prototype the
analysis divides into two phases. Phase one is the
analysis above for the upper levels of the quadtree.
Phase two is an eight processor parallel version
of essentially sequential operations on the
eight octant subtrees. The proposed data structure
divides all work evenly so that the time required
to execute in parallel is the same as the equivalent
sequential algorithm divided by p, where p is
some even number of slices. The cost paid for this
improvement is log4(p) steps representing those
levels which can be fit into adjacent slices. For
quadtree construction each quadtree is constructed
in O[(4(N/p)-1)/3], where N is the
number of pixels in the raster. The entire quadtree
for rasters of N pixels is thus constructed in
O[log4(p) + (4(N/p)-1)/3]. Assuming an N
log4(N) complexity for sequential neighbor
finding yields an identical analysis for the
DRAFT neighbor find algorithm, of complexity
O[log4(p) + (N/p) log4(N/p)]. Traversals and
traversal oriented operations improve from O(N)
for the sequential case to O[log4(p) + N/p] in the
DRAFT algorithm.


In Iyengar and Moitra [SJ], a paradigm is
given for measuring EPU, the Effective Processor
Utilization, which is defined as (complexity of the
fastest sequential algorithm for a problem) / (the
number of processors used * time complexity of
the parallel algorithm). Using this measurement,
DRAFT algorithms have an EPU of 1.
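
A small numeric sketch may make the EPU figure concrete for the traversal case. It assumes, purely for illustration, that the complexity expressions can be read as exact step counts with unit constants (N steps sequentially, log4(p) + N/p steps on p slices); the function name `epu_traversal` is hypothetical.

```python
import math


def epu_traversal(n_pixels, p):
    """EPU for traversal: sequential cost N over p * (log4(p) + N/p)."""
    sequential = n_pixels
    parallel = math.log(p, 4) + n_pixels / p
    return sequential / (p * parallel)
```

Under these assumptions the ratio is N / (N + p log4(p)), which is below 1 for any finite raster and tends to 1 as N grows relative to p, so the stated EPU of 1 is best read as an asymptotic value.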


FUTURE RESEARCH 


The DRAFT machine has great potential
for use in an image processing environment. Its
quasi-SIMD structure makes it possible to create
algorithms which make good use of parallel processing
while retaining the familiar programming
structures of a sequential environment. In addition
to the work on new algorithms for this
machine, two other areas are currently being
given attention. The first is an attempt to reduce
the essential switching functions to custom VLSI.
The existence of such a chip set would greatly
enhance the ability to apply this new processor
design to a wide range of special purpose
machines. The second is the development of a
high-level programming environment. Whereas the
current microcode environment is well suited to
optimizing performance, a high-level language to
support such operations as the segmented recursion
discussed in this paper would be of great
advantage.


Abstract


Given a sequence of numbers which can be mapped into an
m x n array, sorting rows and columns is shown to yield an overall
sorted sequence. This unusually simple procedure is proven to
require O(log_2 m) iterations by analysing the data movement in the
array under successive row and column sorts. An efficient bubble
sort network suitable for VLSI implementation, with near-optimal
area-time^2 performance, is a direct application of the row-column
sorting technique. The ease in implementation for practical VLSI
chips is also demonstrated.


I. Introduction 


The problem of sorting numbers on a two-dimensional array
has been studied by various researchers ([1], [2], [10]) and more
recently by Leighton[3], Lang et al.[8] and Tseng et al.[12]. This
problem involves routing of each data item to a distinct position of
the array predetermined by some indexing scheme. Three different
schemes have been considered by Thompson & Kung[1]: row major,
shuffled row major, and snake-like row major (see Figure 1). The
corresponding column-major forms can be considered equivalent to
the schemes above. Most of the above referenced algorithms are
based on two dimensional adaptations of very efficient sorting
algorithms for linear sequences, like bitonic sort[9] in [1] and [2], and
odd-even merge sort[9] in [10], and these are essentially recursive in
nature. Even though they perform optimally within a constant factor
of the lower bound of O(n) for an n x n array, these algorithms
spend most of the time in routing data to appropriate processors.
Consequently, the complexity of the execution time is dominated by
the number of unit routing steps in a SIMD model. Thompson and
Kung[1] derived a 4(n-1) lower bound from an initial configuration
where elements on the opposite corners of an n x n array have to
be exchanged.


However, mere figures of time complexity appear to be
superficial when we consider the complexity of the control structure
for the complicated data routing during the successive stages of
recursion. One hardly needs to overemphasize the cost of the
communication overheads in a VLSI implementation. In this context one
can be more enthusiastic about the algorithm in [12], which in spite
of being recursive in nature gets away with minimal control and
achieves a bound within O(log n)^(a) of the optimal. In the realm of
sorting a two-dimensional array of numbers, a seemingly "nice" way
would be to sort rows and columns (since it involves sorting on
smaller problems of approximately sqrt(n) size) and "hope" that
somehow a combination of these two operations will terminate in a
sorted sequence. Unfortunately, such a procedure doesn't seem to
work when implemented in a straightforward manner, and indeed
Leighton[3] observes that "... if the matrix were square, we would
essentially be sorting rows and columns which is well known to leave
entries arbitrarily away from the correct sorted position". This was
in obvious reference to the row-major indexing scheme. Paradoxically,
things fall into place when one sorts the rows in a snake-like
row-major form, without increasing the complexity of the procedure.
This simple algorithm, which we call shear-sort, will be formally
introduced in the next section. Section III will provide the analysis
of the algorithm. In section IV we discuss a very efficient and simple
VLSI implementation of the algorithm using only a bubble-sort
network, and in section V we discuss a method to optimize the
algorithm by simple manipulations.


Figure 1. Indexing schemes (panels: row-major, snake-like row-major).


II. Row-column sort


Let Q = [q_{i,j}] be an m x n matrix onto which we have
mapped a linear integer sequence S. Sorting the sequence S is then
sorting the elements of Q in some predetermined indexing scheme.
We suggest an iterative algorithm in which every iteration consists of
the two basic operations:

(1) Row-sort - Sort independently all the row vectors of Q such that
adjacent rows are sorted in opposite directions (alternate rows in
the same direction). In a normal snake-like row-major indexing
scheme, sort the first row from left to right (increasing). At the end
of this step, $q_{i,j} \le q_{i,j+1}$ for all i = 1, 3, 5, ..., 2p+1 and
$q_{i,j} \ge q_{i,j+1}$ for all i = 2, 4, 6, ..., 2p.

(2) Column-sort - Sort independently, in an ascending order from
top to bottom, all column vectors of Q. After this step
$q_{i,j} \le q_{i+1,j}$ for all j = 1, 2, ..., n.

The shear-sort algorithm is defined as a repetitive application 
of steps 1 and 2 until one of the following conditions is satisfied : 

(a) all the columns are sorted, i.e. no element has moved in the 
present column-sort after a row-sort, or 

(b) no element has moved in the present row-sort after a column
sort.

A step by step application of the algorithm is shown in Figure 2. 
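
The two operations and the stopping rule can be sketched sequentially in Python as below. This is a minimal illustration rather than the parallel implementation discussed later: the names `shear_sort` and `snake_order` are hypothetical, rows and columns are sorted with Python's built-in `sorted`, and only termination condition (a) is tested, since a row-sort that moves nothing after a column-sort is always followed by a column-sort that moves nothing.

```python
def shear_sort(q):
    """Sort a rectangular array into snake-like row-major order.

    Returns (sorted_array, iterations), one iteration being a row-sort
    followed by a column-sort.
    """
    q = [list(row) for row in q]
    m, n = len(q), len(q[0])
    iterations = 0
    while True:
        iterations += 1
        # Row-sort: rows 1, 3, 5, ... ascending, rows 2, 4, 6, ... descending.
        rows = [sorted(row, reverse=(i % 2 == 1)) for i, row in enumerate(q)]
        # Column-sort: every column ascending from top to bottom.
        cols = [sorted(rows[i][j] for i in range(m)) for j in range(n)]
        after = [[cols[j][i] for j in range(n)] for i in range(m)]
        if after == rows:          # condition (a): the column-sort moved nothing
            return after, iterations
        q = after


def snake_order(q):
    """Read an array in snake-like row-major order."""
    out = []
    for i, row in enumerate(q):
        out.extend(row if i % 2 == 0 else reversed(row))
    return out
```

Reading the result with `snake_order` should give the elements of the input in nondecreasing order.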


(a) Throughout this paper log will be assumed to be to the base 2 unless otherwise mentioned. 
Figure 2. A step-by-step application of shear-sort.
(a) Rows and columns are sorted in row-major form but the array as a whole is unsorted.
(b) After the first row-sort in snake-like row-major indexing scheme.
(d) Row-sort of iteration 2.
(e) At the end of iteration 2 all elements are in their final sorted rows.
(f) A snake-like row-major sorted sequence.


The reader may note that the algorithm will terminate into a
sorted snake-like row-major sequence from the definition of such an
indexing scheme. It is trivial to observe that, if the rows and
columns are sorted in the directions corresponding to the algorithm,
the array is sorted. Thus conditions (a) and (b) are equivalent and
are necessary and sufficient conditions.


To give the reader an insight into the 'mechanism' of this sorting
technique, and throw some light on how such a simple procedure
results in a sorted sequence, we will obtain a very loose upper bound
using an informal analysis. In the next section we will derive a tight
upper bound using an indirect technique.


Consider the n smallest elements in Q and assume they are
randomly distributed over the rows and columns of the array. The
first column-sort will move these elements to the first row if initially
they all happen to be in different columns. However, if all n smallest
elements are in the same column (only possible for m >= n), by virtue
of alternating sorting direction on rows, a row-sort will have the
effect of moving the elements on odd rows to the leftmost column of
the array and the remaining half to the rightmost column. This
phenomenon is analogous to normal shear of a column, hence the
name of this algorithm. Clearly, at the beginning of the second
iteration the n smallest elements have moved up into the upper half
of Q. The second row-sort will then pair the elements on the two
leftmost columns for the odd rows and on the two rightmost
columns for the even rows. Following the same reasoning, it is not
difficult to see that the column sort of the p-th iteration will move the
n smallest elements of Q to the band defined by rows 1 .. n/2^p.
Without loss of generality we can assume n to be a power of 2 and
conclude that the n smallest elements of Q will move to their sorted
positions in at most log n iterations.


It follows from the above discussion that for m < n, the n
smallest elements move into their sorted position in at most log m
iterations. It is evident that once the first row is in place, it will
remain so throughout the remaining iterations. Therefore we
now face a problem of sorting the reduced array Q_{-1} of dimension
(m-1) x n. Each time a row is in its place, we reduce the problem to a
smaller array on which the smallest elements are brought to the
'first' row in $\lceil \log(m-k) \rceil$ iterations, where k is the number of
previously discarded 'first' rows. The total number of iterations to sort
the array is thus

$$\sum_{k=0}^{m-1} \lceil \log(m-k) \rceil$$

which is bounded from above by m log m.
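
The loose bound can be checked numerically with a few lines of Python. The sketch assumes base-2 logarithms, as stated in footnote (a), and that the summation index k runs from 0 to m-1; the function name `loose_bound` is hypothetical.

```python
import math


def loose_bound(m):
    """Sum of ceil(log2(m - k)) for k = 0 .. m-1."""
    return sum(math.ceil(math.log2(m - k)) for k in range(m))
```

For m = 8 the sum is 3+3+3+3+2+2+1+0 = 17, against m log m = 24.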


In this simple analysis we assumed that after the 'first' row of
array Q_{-k} is in place all the remaining elements are randomly
distributed in the reduced array Q_{-(k+1)}. This is not the case, as we will
show in the next section, which gives a much better bound of O(log
m) iterations.


III. Analysis of the algorithm


We shall prove that shear-sort converges in O(log m) iterations,
for an m x n array, by showing that each element comes within a
specified distance of its final sorted row in every successive iteration.
It will be seen that an arbitrary element (and hence all elements)
comes within m/2^p rows of its final sorted row after p iterations, from
which the O(log m) bound follows immediately. We will use here an
application of the [0-1] principle (Knuth[6]) (for a direct
combinatorial proof see Scherson & Sen[13]). For the purpose of applying
the [0-1] principle, we will visualize our sorting algorithm as a
sorting network of log m + 1 stages, where in each stage we sort all the
rows in a row-major snake-like form followed by sorting all the
columns (Figure 3).


Consider the simple case of a 2 x n array containing an arbitrary
number of 0's and 1's.
(a) After the rows are sorted, the 0's in the first row will be packed
to the left, while 0's in the second row will be pushed to the right.
Let us denote the number of 0's in the first and second rows by n_1
and n_2, respectively (Figure 4).
(b) Depending on the value of n_1 + n_2, sorting the columns (in this
case a simple compare exchange) will result in one of the following:
Case 1 (n_1 + n_2 < n): the bottom row will contain only 1's and
the top row will be a mixture of 0's and 1's.
Case 2 (n_1 + n_2 = n): the top row will contain only 0's and the
bottom row will have only 1's.
Case 3 (n_1 + n_2 > n): the top row will contain only 0's and the
bottom row will contain an unordered sequence of 0's and 1's.

Out of these cases, Case 2 is the most favorable, since both rows are
already sorted; in the other cases we have one sorted row (the second
in Case 1 and the first in Case 3). Also, the rows are mutually ordered,
i.e., all the elements of the top row are less than or equal to the
elements of the bottom row. In the next row-sort the remaining row is
ordered and sorting is complete. Henceforth we shall refer to a row
as being 'clean' if it contains identical elements (only 0's or only
1's). Analogously, a row consisting of both 0's and 1's will be called
'dirty'.

Figure 3. The algorithm as a sorting network (row sort followed by column sort).
Figure 4a. The top row is sorted from left to right.
Figure 4b. After the column sort, one of the two rows (second in this case)
contains only 0's or 1's. In a more favorable case, both rows are clean.


The above discussion makes it clear that in a pair of adjacent
rows, after the first iteration at least one of the rows is 'cleaned',
i.e. it consists of only 0's or only 1's. We will now extend the above
phenomenon to an m x n array. Without loss of generality assume
m to be a power of 2. After the row-sort of the first iteration we
have 0's squeezed to the left on the odd-numbered rows and packed
to the right on the even-numbered rows. Consider that a column
sort consists of the following two stages:
(1) For an element q_{i,j}, do a compare exchange with the element
q_{i+1,j}, for i = 1, 3, ..., m-1. For i = 2, 4, ..., m, compare exchange
q_{i,j} with q_{i-1,j}. (This is equivalent to sorting m/2 pairs of adjacent
rows independently.)

(2) Sort the columns normally, i.e. across the entire columns.


It should be clear that decomposition of a column sort in the
above manner is not going to affect the ordering resulting from a
normal column sort. The extra (hypothetical) step makes the
analysis much simpler. From the previous discussion, we know
that following step 1 there is at least one clean row among every
pair of odd-even rows. We may now consider an entire 'clean' row
as a single entity during the column sort, since a 'clean' row continues
to be clean throughout the subsequent iterations. This follows
from the observation that, when a 'clean' row of 0's is compared
with a 'dirty' row across the columns, the clean row bubbles to the
top, remaining 'clean'. Similarly a 'clean' row of 1's sinks to the
bottom. Clearly, two clean rows, when compared across the
columns, cannot 'dirty' themselves. It may be easier to visualize this
process of column sort as a bubble sort in which all elements of
a row vector are simultaneously compared to the corresponding
elements in their adjacent row. The reader should note that the actual
algorithm used for sorting columns does not necessarily have to be a
bubble-sort, since the ordering obtained is independent of the
algorithm used. Row-sorts do not affect the clean rows.

It follows that after the column-sort step of iteration 1 we
have at least half (m/2) of the rows 'clean', with the clean rows of 0's at
the top and the clean rows of 1's at the bottom. The 'dirty' rows will
occupy a band of at most m/2 rows in the middle of the array
(Figure 5). Actually we may find fewer than m/2 'dirty' rows, since two
'dirty' rows may mutually 'clean' each other during the column sort.
In the course of the second iteration, half of the m/2 'dirty' rows
will become clean by applying the same argument to the contiguous
band of dirty rows, while the clean rows continue to be clean. By a
repetitive application of this phenomenon of cleaning half the
number of dirty rows with every successive iteration, we may
conclude that there can be at most one dirty row after log m
iterations. An additional row-sort orders this dirty row, which actually
separates the clean rows of 0's from the clean rows of 1's in the
sorted array. Thus, for an initial array consisting of an arbitrary
number of 0's and 1's, the shear-sort algorithm terminates
successfully in log m + 1 iterations.
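
The shrinking of the dirty band can be observed directly on 0/1 arrays. The following Python sketch is illustrative only: the names `dirty_rows`, `shear_iteration` and `dirty_history` are hypothetical, and the rows and columns are sorted with the built-in `sorted` rather than with a sorting network.

```python
def dirty_rows(q):
    """Count the rows that contain both 0's and 1's."""
    return sum(1 for row in q if 0 in row and 1 in row)


def shear_iteration(q):
    """One snake-like row-sort followed by one column-sort."""
    m, n = len(q), len(q[0])
    rows = [sorted(row, reverse=(i % 2 == 1)) for i, row in enumerate(q)]
    cols = [sorted(rows[i][j] for i in range(m)) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(m)]


def dirty_history(q):
    """Dirty-row counts before the first iteration and after each iteration,
    stopping once at most one dirty row remains."""
    history = [dirty_rows(q)]
    while history[-1] > 1:
        q = shear_iteration(q)
        history.append(dirty_rows(q))
    return history
```

For a random 0/1 array with m = 16 rows, the count after the first iteration should be at most 8, and each later entry at most half of the one before it (rounded up), so the history has at most log m = 4 iterations.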


The classification of rows as 'clean' and 'dirty' in constructing
the complexity analysis was extremely helpful, since it reduced the
burden of keeping track of every element in the array to only m
rows. At this point the implications for a generalized array, consisting
of elements which may not be only 0's or 1's, are not clear.
Nevertheless, we can state the following important result:

Theorem 1: An m x n array of elements can be sorted using
shear-sort in time proportional to
$(\lceil \log m \rceil + 1)\,[\,m \cdot (\text{time for row-sort}) + n \cdot (\text{time for column-sort})\,]$.

The proof follows immediately from the previous discussion and the
[0-1] principle. The tightness of this bound should be obvious from
the informal discussion at the end of the previous section. In the
next section we will present an efficient implementation of this
algorithm on a mesh-connected processor array.


The average-case performance analysis can also be simplified
using the [0-1] principle. An important corollary following from the
discussion of clean and dirty rows can be summarized as follows:
Corollary 1: A 0/1 array consisting of d dirty rows initially can
be sorted using $\lceil \log d \rceil$ iterations of shear-sort.
Theorem 1 can be stated as a special case, since all m rows can be
dirty initially. Consider O(m) 0's uniformly distributed in the
m x n array, the other elements being 1's. Because of the uniform
distribution, we may conclude that, on the average, m/2 rows will
be dirty. From Corollary 1, the average number of iterations is log m
- 2, or O(log m). This shows that the algorithm is optimal within
minor variations of the basic shear-sort paradigm.


IV. VLSI Implementation of shear-sort 


In spite of the terrible performance of a normal single processor
bubble sort, efforts have been directed towards obtaining
efficient VLSI implementations ([5], [6]) because of the inherent
simplicity of the algorithm. For this purpose a parallel version of
bubble sort, viz. odd-even transposition sort ([6]), has been adopted. By
using crossing sequence techniques, several researchers have shown
that the optimal AT^2 bound for sorting n elements is O(n^2) in the
word model and O(n^2 log^2 n) in the bit model ([3], [5], [6]). The normal
n/2 processor bubble sort, where each processor performs one
compare-exchange operation during each of the n iterations, behaves
horribly (O(n^3)) with respect to the AT^2 measure. This remains
unchanged even by using completely pipelined bit-parallel
compare-exchange modules to sort more than one problem instance.
The pipelined scheme consists of O(n^2) comparators, which reduces
the effective area by a concurrency factor (n), the time remaining
unchanged ([5]).
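
For readers unfamiliar with odd-even transposition sort, a sequential Python simulation is sketched below. In the parallel setting every compare-exchange within a round is performed simultaneously by a separate comparator; here the inner loop simply visits the disjoint pairs one after another. The function name is hypothetical.

```python
def odd_even_transposition_sort(values):
    """Sort by n rounds of compare-exchange on alternating adjacent pairs."""
    a = list(values)
    n = len(a)
    for step in range(n):
        # Even rounds use pairs (0,1), (2,3), ...; odd rounds use (1,2), (3,4), ...
        start = step % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

The pairs touched within one round do not overlap, which is what allows roughly n/2 comparators to work in parallel for n rounds.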


Theorem 2: The AT^2 performance of shear-sort implemented with
a bubble-sort network is O(n^4 log^3 n) for n^2 elements.


Figure 6c shows the implementation of shear-sort using a
pipelined scheme where each of the n rows (columns) is pipelined
through this sorting network. The 'Transpose/Detranspose' network
aligns the array properly for the next column (row) sort. Following
Leighton's[3] argument, the Transpose/Detranspose network
needs n non-unit length wires and hence occupies O(n^2) area, where
the transposition is performed in n parallel stages by hardwiring the
rows to the corresponding columns (and vice versa, see Figure 6b).
The bubble-sort network consists of n^2 comparators and thus the
total area of the network is O(n^2 log n). We will need 2n word steps
to sort all n rows (columns). Each comparator is capable of performing
a compare-exchange operation of two O(log n) bit numbers
in O(1) time. As observed previously, the Transpose/Detranspose
network also needs O(n) time for each iteration. Since we need log n
iterations, the AT^2 performance for this scheme will be

$$O(n^2 \log n) \times O((n \log n)^2) = O(n^4 \log^3 n)$$

This is only O(log^3 n) away from the lower bound. A similar result
can be obtained for the bit model by using bit-serial compare
exchange modules. Each compare-exchange module can perform a
compare exchange operation every O(log n) time units and can fit
into an O(1) by O(log n) unit rectangle. Thus the bubble-sort circuit
occupies an area of O(n^2 log n) units. The
Transpose/Detranspose circuit consists of n non-unit length wires
which occupy an area of O(n^2) units. Each of these wires routes
O(n log n) bits of data and thus takes O(n log n) units of time to
complete the operation. The total time is O(n log^2 n) and so the
AT^2 measure for this scheme is O(n^4 log^5 n). This is only a factor of
log^3 n away from the optimal, which is O(n^4 log^2 n) for O(n^2)
O(log n) bit numbers ([3]).

Figure 6b. Transpose/Detranspose network. This permutes the rows and columns so
that the rows are sorted in snake-like manner by the bubble-sort network.
Figure 6c. Block diagram of shear-sort using a bubble-sort network.
log n + 1 passes through this network will sort any sequence.


As noted by Thompson[5], this network needs very little in the
way of control, as no complicated operations are involved, and may
be more attractive than its AT^2 performance indicates (being a couple
of log n factors away from the optimal). There is hardly any
need to overemphasize that this scheme of sorting, which exploits the
powerful property of the algorithm, has made bubble-sort comparable
to some more sophisticated VLSI sorting networks as far as the
area-time^2 trade-off is concerned.


Figure 7 shows a more regular network for accomplishing the
required compare exchanges, though it needs slightly more area (a
factor of log n more than the previous approach). We have managed
to incorporate the comparators for carrying out row and column
sorts within the mesh-connected network, which is very similar to
an ILLIAC IV like machine. The horizontal and the vertical
compare-exchange modules are used during the row-sort and column
sort stages respectively. A ROW/COLUMN control line decides
which set of the comparators is being used. A ROWDIR control line,
which controls the direction of compare exchange for the horizontal
comparators, achieves alternating sorting directions in rows. A single
row/column of elements 'oscillates' between two adjacent
rows (columns) n/2 times, each of which carries out the required
compare-exchanges needed for two iterations of odd-even
transposition sort. The ROWDIR is changed during every clock
cycle of the row-sort stage, so that adjacent rows are sorted in
opposite directions. The overlapped comparators in the figure actually
represent single comparators multiplexed between the row and
column sorts. Each cell is drawn with two inputs, but only one of
them is selected according to the row or column sort stages. After
O(log m + 1) such iterations the array is sorted in a snake-like
ordering.

A few comments may be expedient at this point to emphasize
the ease in implementation and the expandability of this scheme,
which are of major concern to any VLSI chip designer. The simplicity
in the control structure, which is the result of the purely iterative
nature of the algorithm, and the regularity of the layout due to use
of only nearest-neighbor type operations, make it ideally suitable
for systolic implementation. Each chip containing a k x k subset of
the n x n components will be connected to four similar chips in the
four (North, East, South, West) directions, thus presenting no
additional complications for chip interconnections. To account for the
pin limitations in a chip, and to package a maximum number of
components in a single chip, we can use bit serial communications
between the nearest neighbors. This would not change the asymptotic
performance, since a compare-exchange operation is of the
same complexity as the bit-serial communication time of a b-bit
word (O(b) time). For example, in an 80 pin package, we need at
most 10 pins for the global control signals, leaving us with 70 pins
for the inter-chip communication. Rounding this down to the nearest
power of two, i.e. 64, we can integrate a 16 x 16 mesh of connected
compare-exchange modules in a single chip (for a k x k array, we
need 4k data pins).


While this scheme is theoretically inferior to Lang et al.[8] by 
a factor of logn, it may prove to be more feasible for reasonable 
sized arrays by sacrificing a little time-performance in the bargain. 

Figure 7. A regular VLSI layout for shear-sort.


V. Optimization of the shear sort 


A further reduction in execution time for shear-sort is possible
for a 'brick-wall' type sorting using a simple manipulation in the
dimension of the array of n^2 elements. Recall from section III that
the band of 'dirty' elements keeps on shrinking by a factor of 2 in
every iteration. This is the basis for an O(log m) (for an m x n
array) iteration convergence. Thus, during the successive stages of
the algorithm, it is enough to sort the elements in a column that fall
within the 'dirty' band, instead of sorting an entire column during
the column-sort stage. Another way of looking at it is that, since all
the elements move within m/2^p rows after p iterations (discussed
more in [13]), we need to perform only m/2^p steps of the brick-wall
sort after p iterations.


For an m x n array, this means we perform m, m/2, m/4, ...
iterations of brick-wall sort in successive column-sort stages.
However, no such optimization seems possible for the row-sort stage
since the algorithm is non-adaptive, so that we keep on performing
all the n compare-exchange steps throughout the course of execution.
This yields a complexity of n(log m + 1) + 2(m - 1) compare
exchange steps. For n^2 elements, this can be expressed as


(logm + 1) + 2(m — 1) 


By choosing $m = n\sqrt{(\log n)/2}$, we get a complexity of

$$2\sqrt{2}\, n\sqrt{\log n} + O\!\left(\frac{n \log\log n}{\sqrt{2\log n}}\right)$$

The second term can be approximated to o(n) since $\frac{\log\log n}{\sqrt{\log n}} = o(1)$,
thereby yielding the following result:

Theorem 3: A rectangular array of $n^2$ elements can be sorted using
shear-sort in time $O(n\sqrt{\log n})$.

The behavior of this function is very close to linear, since $\sqrt{\log n}$ is
less than 5 for n up to tens of millions, i.e. more than the practical
size of any sorting chip. In fact it outperforms most of the well-known
algorithms when the number of elements is below a couple of tens of
thousands, which is not rare in practical situations. The cautious
reader may have noticed that an extra compare-exchange step is
needed during the column-sort stage depending on the boundary of
the dirty band. If the dirty band starts from an even row during all
the iterations of column sort, the complexity function will be increased
by an additional log m steps, which is negligible.
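
To get a feel for the effect of the array shape, the step count quoted in this section, n(log m + 1) + 2(m - 1) for an m X n array, can be tabulated for different shapes holding the same number of elements. The sketch below is illustrative only: the function names are hypothetical, the logarithm is assumed to be base 2, and only power-of-two row counts are tried.

```python
import math


def shear_sort_steps(rows, cols):
    """Step count cols*(log m + 1) + 2*(m - 1) with m = rows (base-2 log assumed)."""
    return cols * (math.log2(rows) + 1) + 2 * (rows - 1)


def best_shape(total):
    """Try each power-of-two row count dividing `total`; return (steps, rows, cols)."""
    best = None
    rows = 2
    while rows < total:
        if total % rows == 0:
            cand = (shear_sort_steps(rows, total // rows), rows, total // rows)
            if best is None or cand < best:
                best = cand
        rows *= 2
    return best


print(shear_sort_steps(256, 256))
print(best_shape(2 ** 16))
```

Under these assumptions, for 2^16 elements the square 256 X 256 array costs 2814 steps while the taller 512 X 128 array costs 2302, which is the kind of gain the choice of m in this section is after.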


This improvement can be very easily incorporated into the 
VLSI implementation shown in Figure 7. The number of iterations 
of odd-even transposition sort during row and column sort stages 
can be easily controlled by the number of 'oscillations' between
adjacent rows (columns) of comparators.


VI. Conclusions 


The feasibility of sorting a rectangular array of elements by
sorting rows and columns was demonstrated. The algorithm was
shown to execute in at most O(log m) iterations, and in section III
we traced the data movement during each iteration. Recall that we
showed that all elements move within $O(m/2^p)$ rows of their final
destination row in O(p) iterations. By taking advantage of this
phenomenon it is also possible to optimize a simple algorithm like bubble
sort. Sorting a row (column) is actually sorting $\sqrt{n}$ elements in an
n-element sequence, which may be expensive. However, this basic operation
may be optimized by using a sorting network, as was demonstrated
in section IV. The complexity of this algorithm is more
appropriately expressed as $O(\sqrt{n} \times k \times \log n)$ for an n-element
sequence organized as a square array, where k is the time for sorting
$\sqrt{n}$ elements, which is $O(\sqrt{n}\log n)$ for the single-processor sort
that was used as a basis for the multiprocessor implementation.
For the sorting network allowing pipelining this turns out to be
$O((\sqrt{n} + k)\log n)$, which at best will give us a time of
$O(\sqrt{n}\log n)$.

This algorithm can also be executed effectively on any MIMD
type machine which allows independent access to rows and columns
in a shared memory model. A multiprocessor architecture which
can implement the algorithm elegantly because of its orthogonal
access to the memory banks, and which was originally conceived as a
high-performance graphics system able to draw very fast vectors of
any orientation ([11]), is discussed in Scherson & Sen [13]. In a similar
architecture, Tseng et al. [12] propose a combination of bitonic sort
and a single-processor O(n log n) sort, which can achieve identical
complexity figures when implemented on mesh-connected processors.
That process also sorts rows and columns in successive stages,
though it requires a more complicated control structure, being recursive
in nature. The direction of sorting in rows changes dynamically
in successive levels of recursion, as does the number of independent
sequences in the columns.


In an effort to optimize the algorithm further, by trying to
reduce the cost of sorting entire rows and columns through partial sorting,
an interesting variation is mapping a two-dimensional odd-even
transposition sort onto this row-major snake-like indexing scheme.
Each iteration consists of performing compare-exchange on elements
$(x_{2i-1,j}, x_{2i,j})$ on all rows independently, followed by the
same procedure on all the columns $(x_{j,2i-1}, x_{j,2i})$, and then repeating
the same with the elements $(x_{2i,j}, x_{2i+1,j})$ in all rows and elements
$(x_{j,2i}, x_{j,2i+1})$ in all columns. It is not difficult to see that
such an algorithm will result in a sorted sequence. It was speculated
in [13] that the upper bound for the number of iterations may be
$\sqrt{n}$, giving an optimal and even simpler algorithm. However, it
turns out to be totally inefficient, running into O(n) iterations like a
normal bubble-sort, which makes us hopeful about the optimality of
shear sort within similar variations.
Abstract


A methodology to derive trace-driven simulations on a 
uniprocessor for evaluating the performance of parallel algo- 
rithms in multiprocessors is described. Based on the desired 
degrees of accuracy and speed of the evaluation, we interleave 
the independent traces, which are obtained by actually run- 
ning each of the cooperating processes on a uniprocessor and 
capturing global events. Since only global events are captured, 
data reduction is very high. Using this trace, we can extract 
parameters which are applied to a variety of progressively 
refined analytical and hybrid models to study the impact of al- 
gorithmic and architectural features on performance. Finally, 
design spaces of architectures and algorithms can be explored 
through successive refinements of the simulations. 


1. Introduction 


Multiprocessor systems are emerging in the commercial 
world as cost-effective systems for satisfying throughput 
demands of the mini and supermini markets [ARC85]. It is 
foreseen that such systems can deliver not only throughput, 
but also speed-up. Many languages have been developed to 
take advantage of multiprocessor systems [AND83, BAR83, 
TED84]. They introduce the notion of a process to the users 
of these machines. The execution of a program can be imple- 
mented by a set of concurrent and cooperating processes run- 
ning simultaneously on different processors. Speed-up can thus 
be obtained for a given program and not only for a batch of 
programs. 


Many results have been derived on the computation 
complexity of parallel algorithms [HEL78, KRU83, GAN84]. 
However, the interaction between software and hardware is so 
complex in MIMD machines that simple complexity studies are 
often inadequate. Some attempts at modeling some algorithms 
on specific multiprocessors are presented in [DUB82a-b, 
VRS84, VRS85]. However, concurrency control mechanisms in
the algorithm, such as scheduling and synchronization, and
data allocation effects are usually neglected for reasons
of analytical tractability. Moreover, it is not clear how
the parameters used in the models can be estimated. Simulation
is often the method of choice to verify analytical
approximations. Of course, the most accurate evaluation
method is measurement on an actual prototype [GEH82].
Some major simulators have been implemented to
predict the performance of algorithms on multiprocessors. The
first one, developed at Lawrence Livermore
Laboratory [AXE84], uses a trace-driven approach but traces
every instruction with the detailed model of the S-1 multiprocessor.
Therefore, the simulation is complex and must contain
hardware timing information which is specific to the S-1. The
NYU simulator is also an instruction-level simulator [ABR84],
written in CDC Cyber assembly language. The Cedar
simulator [ABU85], written to take advantage of
Parafrase [KUC84], allows the user to select statements whose
execution may be simulated. A parallel algorithm simulator,
called PSIMUL, was written at IBM [SOK86]. This is also an
instruction-level simulator, but with no timing facility. The
trace is produced by interleaving blocks of records of the individual
traces of each process, where the size of each block is
determined by the time slice interval on the host machine.


In this paper, we present a different methodology for 
developing trace-driven simulations of the execution of algo- 
rithms in multiprocessor systems. It is not specific to any par- 
ticular architecture and hence can be used to evaluate different 
machines. The only limitation we impose is on the computa- 
tional model which is described in the next section. The com- 
plexity of the trace-driven simulations as presented here is in- 
termediate between the complexity of analytical models and 
detailed, cycle-by-cycle simulations. In our methodology, every 
instruction of the algorithm is executed. However, only global 
events are traced. Depending on the accuracy desired, the in- 
terleaving of the individual traces is derived from virtual or 
real time components. The trace is then applied to the perfor- 
mance prediction. In our technique, we extract values for algo- 
rithmic and architectural features from the trace and apply 
them to analytical or hybrid simulation models to predict the 
performance. Of course, the detailed simulation with real timing
is also accommodated and is necessary for the validation of
the more approximate models.


The paper is structured in two parts. In the first part,
the computational model and the basic multiprocessor environment
we assume are described. In the second part, the architecture
of the simulator is analyzed and different levels of complexity
of simulations are identified through a careful analysis
of the tasks involved in the trace-driven simulation of multiprocessor
programs.


2. Computational Model and General Hypotheses


Parallel and distributed algorithms can be classified
broadly in two classes, with respect to the complexity of
analyzing their performance: data-independent and data-dependent
algorithms. In data-independent algorithms the
performance is independent of the values of the data. These
algorithms include many numerical algorithms. The performance
of data-dependent algorithms depends on the values of
the data. Such algorithms, which include most symbolic algorithms,
are more difficult to analyze than data-independent algorithms
because of the large number of configurations to consider.
Compounding this difficulty is the possibility that the
data-dependent algorithm may be non-stationary; that is, the
behaviour of the algorithm execution varies with time.


The parallel programming model we will use is based
on practical concurrent programming language constructs
[AND83]. The model allows the user to specify the concurrency
and cooperation between the processes of the program. This
technique is different from compiler-generated parallelism,
which has been shown to be effective for numerical
algorithms [KUC84], but performs somewhat poorly on nonnumerical
algorithms [LEE85]. The poor performance may be due
to the fact that most nonnumerical algorithms are data-dependent
and often non-stationary. Therefore, the programming
model adopted here is all the more applicable to nonnumerical
algorithms.


It is assumed that each processor has a large instruc- 
tion cache or memory so that accesses to program code and lo- 
cal variables are done locally and do not generate global 
traffic. This assumption is justified by the fact that the pro- 
gram size for the algorithms considered here is usually very 
small compared to the total number of data accessed. If an al- 
gorithm does not fit this description, then the modeling tech- 
nique must be refined to include instruction fetches. 


In the type of machine envisioned in this study, the
operating system is minimal. The multiprocessor is used as
an attached multiprocessor. The basic operating system mechanisms
supported by the methodology described here are flexible
enough to support most, if not all, possible algorithms. These
mechanisms are for process scheduling, synchronization, and
communication [LIN81, DUB82b]. In addition to these mechanisms,
data partitioning and allocation, and process granularity
may also affect the performance of the algorithm [JON80,
GAJ85]. The effects of these mechanisms can be derived from
the global trace events by scanning the event attributes in order
to extract statistics for the multiprocessor. The computational
homogeneity or heterogeneity of the set of concurrent
processes is also an important feature of the algorithm.
Analyzing heterogeneous computations analytically is often
impractical.


In a multiprocessor system, scheduling may be static or 
dynamic. Both scheduling mechanisms are therefore support- 
ed by the methodology. In the static scheduling policy, the 
binding between processors and processes is determined at 
compile time. Static scheduling has low runtime overhead but 
is less flexible and often results in poor load balancing. In the 
dynamic scheduling policy the time of binding of processors to 
processes is delayed until runtime. A process can be executed 
by any processor that becomes available. Dynamic scheduling 
results in more runtime overhead but also in better load 
balancing. 


In the attached processors under study, scheduling support
is minimal. Therefore, the system relies on self-scheduling
by the application program, as performed in the
HEP multiprocessor via a local or global ready list [GAJ85,
SMI81]. In this case, the processes participating in the algorithms
execute code to fork other processes and to access the
job queues. In static scheduling, each process is forked on a
specific processor, determined at compile time, and the process
runs to completion on the designated processor. Each processor
has a run list. When a process forks a new process on a
different processor, it puts a descriptor for the process in the
queue of the target processor and executes a static_fork instruction.
The effect of this instruction is simply to wake up
the target processor in case it is idling. The target processor
automatically reads the top of its run list, loads the descriptor
and starts execution.


In the more complex, dynamic scheduling case, a run 
list is shared by several processors. The dynamic_fork pro- 
cedure is somewhat similar to the static_fork procedure but, 
the job descriptor is put in the shared run list if an attempt to 
bind the new process to a processor fails because there is no 
free processor. 


In both static and dynamic scheduling, a processor that 
terminates a process attempts to get a new descriptor from the 
private or shared list before entering the idle state. The termi- 
nation of a process is indicated by the execution of a terminate 
primitive. 
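
The run-list behaviour described above can be sketched in a few lines. This is a minimal illustration, not the mechanism of any particular machine: the class and its fields are hypothetical, a process descriptor is any Python value, and the rule that a terminating processor tries its private list before the shared one is an assumption made here so that both policies fit in one sketch.

```python
from collections import deque


class Machine:
    """Hypothetical model of processors with private and shared run lists."""

    def __init__(self, num_processors):
        self.running = [None] * num_processors               # None means idle
        self.private = [deque() for _ in range(num_processors)]
        self.shared = deque()

    def static_fork(self, descriptor, target):
        """Queue the descriptor on the target's run list; wake the target if idle."""
        self.private[target].append(descriptor)
        if self.running[target] is None:
            self.running[target] = self.private[target].popleft()

    def dynamic_fork(self, descriptor):
        """Bind to a free processor if there is one, else use the shared run list."""
        for p, current in enumerate(self.running):
            if current is None:
                self.running[p] = descriptor
                return p
        self.shared.append(descriptor)
        return None

    def terminate(self, p):
        """Processor p ends its process and looks for work before going idle."""
        if self.private[p]:
            self.running[p] = self.private[p].popleft()
        elif self.shared:
            self.running[p] = self.shared.popleft()
        else:
            self.running[p] = None


m = Machine(2)
m.static_fork("a", 0)
m.static_fork("b", 0)      # processor 0 is busy, so "b" waits on its run list
m.dynamic_fork("c")        # bound to the free processor 1
m.dynamic_fork("d")        # no free processor, parked on the shared list
m.terminate(0)
m.terminate(1)
print(m.running)
```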


Our methodology is applicable to most practical mul- 
tiprocessor architectures. These include shared-memory mul- 
tiprocessors and distributed systems with or without cache 
memories. For multiprocessors with cache memories, the im- 
pact of software or hardware-enforced coherence on algorithm 
performance can be evaluated since only global events may 
cause cross-interrogates [DUB82c]. 


The basic set of language constructs we have used to
support multiprocessing is the set of process creation and control
primitives: fork/join, suspend, activate, create, terminate and
resume. In addition, we provide simple synchronization primitives
such as spin-lock, suspend-lock and unlock. Knowing
that each synchronization primitive has its own advantages
and disadvantages, we also provide other data-level primitives
such as fetch-and-add [GOT83] and full/empty bit [SMI81]. For
distributed systems, message-based primitives such as
send/receive [SHA84] are used.


3. Modeling Methodology 


The methodology uses a hybrid and hierarchical ap- 
proach to modeling by incorporating trace-driven simulations 
and analytical models[SCH78, KUM80]. In a multiprocessor, 
the events to trace are accesses to shared data or message- 
passing through the network. However, the methodology al- 
lows for the tracing of any selected set of events. 


3.1. Interleaving of the traces 


In order to evaluate the performance of arbitrary paral- 
lel and distributed algorithms on proposed or existing mul- 
tiprocessors, the traces must be generated through the in- 
terpretation of the multiprocessor algorithm. This is especially 
important for algorithms whose behavior is highly sensitive to 
data values. Therefore the traces are derived while the multi- 
tasked algorithm is executed. 


The events of importance in a multiprocessor system 
are accesses to shared data (and, in particular, synchroniza- 
tion), message-passing operations, and the creation and purg- 
ing of processes. In between the execution of such events, pro- 
cessors execute locally and do not affect each other. One prob- 
lem is to interleave the traces of events generated by each pro- 
cessor. This interleaving is needed because the simulation tech- 
nique described here is a technique for a uniprocessor system. 
This section analyzes three levels of accuracy in trace interleaving:
interleaving based on rank, on a virtual time approximation,
or on physical time.


3.1.1 Rank Interleaving 


A first approach is to interleave the P traces according
to the rank or position of each event in each individual processor
trace. Assume that the P processors are ordered in an arbitrary
sequence and numbered from 0 to P-1. In the interleaved
trace, global event k in processor j occupies position kP + j.
This interleaving technique is said to be based on rank. This
approach is acceptable for simple multitasked algorithms
where the computation is homogeneous, and the real elapsed
time between two successive events is approximately the same
for all processors at any given time. This is the case for the
simple PDE example given in section 5 of this paper. It is an
unacceptable approximation for more complex algorithms such
as quicksort, in which the execution behavior of the processors
is not homogeneous all the time.
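
As an illustration of the position rule kP + j, the following sketch merges P traces by rank. The function name and the representation of a trace as a Python list are assumptions of this example, and equally long traces are assumed for simplicity.

```python
def rank_interleave(traces):
    """Place event k of processor j at position k*P + j of the merged trace."""
    P = len(traces)
    length = len(traces[0])
    if any(len(t) != length for t in traces):
        raise ValueError("this sketch assumes equally long traces")
    merged = [None] * (P * length)
    for j, trace in enumerate(traces):
        for k, event in enumerate(trace):
            merged[k * P + j] = event
    return merged


print(rank_interleave([["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]))
```

The merged trace simply cycles through the processors in order, which is why the approach is only reasonable when all processors advance at about the same rate.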


3.1.2 Virtual-Time Interleaving 


To improve the reliability of the simulation, the traces
are interleaved according to the rank of each event in its trace
and according to a timing estimate such as the instruction
count between two events generated by a process. We call this
interleaving an interleaving based on a virtual time approximation
[LAM78, JEF83]. The reason is that the timing estimate is
not a physical time. In this technique, the trace for each processor
is a set of records of the type

< pn, e, δt, parameters >.

δt is an estimate of the time complexity of the work (in
number of instructions, or in number of instruction cycles) performed
by the processor since the last global event from the
processor. In our facility, we derive this number of instruction
cycles from a mechanism first developed at AT&T Bell
Laboratories [WEI84] and modified at Rice University [JUM85].
pn is the number of the processor causing the event, e is the
event type and parameters are the parameters of the event.
Examples of event types include fetch, store, test_and_set,
block, fork and terminate. Parameters may be, for example, a
shared-memory address, a process descriptor, or a run list
pointer. Table 1 indicates the various trace events with a
definition of their parameters. The virtual time approximation
defines a partial ordering of the global events produced by the
processors as follows. For processor j there are a total of $n_j$
events numbered $E_{j,1}, E_{j,2}, \ldots, E_{j,n_j}$, and the time between
events $E_{j,i}$ and $E_{j,i+1}$ is $dt_{j,i} = \delta t + t_b$, where δt is the
virtual time increment given in the trace record and $t_b$ is the
virtual time that the processor has been blocked since the last
global event. Then, the virtual time for the kth event of
processor j is denoted by

$$T_{j,k} = \sum_{i=0}^{k-1} dt_{j,i},$$

where $dt_{j,0}$ is taken to be the increment preceding the first event.
By definition, the partial ordering is such that $E_{l,m} \le E_{j,n}$ if and only if
$T_{l,m} \le T_{j,n}$. A partial ordering of events is not sufficient for
the simulation because events having the same virtual time
can only be scheduled one at a time in the multiprocessor
simulator implemented on a uniprocessor. In order to define a
total ordering of events, a selection algorithm is used to select
among trace events having the same virtual-time stamp.
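
A minimal sketch of this interleaving is given below. It is an illustration under stated assumptions rather than the facility described in the paper: each trace record is reduced to a hypothetical (δt, event) pair, the blocked time is taken to be zero, and ties between equal virtual-time stamps are broken by processor number, which is only one possible selection algorithm.

```python
import heapq
from itertools import accumulate


def virtual_time_interleave(traces):
    """Merge per-processor traces of (dt, event) records by virtual time.

    Returns (T, j, event) triples; equal stamps are ordered by processor
    number j (an assumed fixed-priority selection rule).
    """
    stamped = []
    for j, trace in enumerate(traces):
        times = accumulate(dt for dt, _ in trace)     # running sum gives T
        stamped.append([(T, j, ev) for T, (_, ev) in zip(times, trace)])
    return list(heapq.merge(*stamped))


traces = [
    [(3, "fetch"), (2, "store")],     # processor 0: stamps 3 and 5
    [(1, "fetch"), (4, "block")],     # processor 1: stamps 1 and 5
]
for record in virtual_time_interleave(traces):
    print(record)
```

Both processors produce an event with stamp 5 here; the tie-break decides that processor 0 is served first, which is exactly the point where a real selection algorithm would be consulted.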


The interleaving based on the virtual time approximation
is attractive because it is independent of the actual multiprocessor
organization, data allocation strategies, and technological
parameters. It only depends on the algorithm and on the
uniprocessor architecture. If it is valid, the interleaved trace
may be reused multiple times to study various system
configurations and technology trade-offs. This aspect has been
a major advantage of trace-driven simulations for the analysis
of uniprocessor systems. In the analysis of multiprocessor systems
this advantage may be lost in some cases because of a
phenomenon called trace-shifting, explained below.


3.1.3 Trace-Shifting Phenomena 


In the virtual time approximation, the effect of the
multiprocessor environment is not included in the algorithm to
interleave the traces. There are many ways in which the system
under study may affect the interleaving of the traces. For
example, in a cache-based system a cache miss in a processor
delays the occurrence of the next global event in this processor,
and therefore affects the interleaving of events. If this effect is
symmetric with respect to all the processors (i.e. if the same
type of delay is encountered by all processors at some random
times), then the virtual time approximation is valid. When the
effect is not symmetric over a long period of the simulation, a
long-term phenomenon called trace-shifting may occur, in which
the traces get out of phase as compared to the order of execution
on the real machine. For example, in the cache-based system,
some processor may be in a phase in which misses occur
in bursts because it is accessing a new data structure; in a
distributed global memory system, such as the Cm* [GEH82],
trace shifting may occur if the data is allocated asymmetrically,
even if all the processors are performing the same work.
For example, if all data are stored in the memory of processor
0, processor 0 will access memory faster and therefore, in the
real system, it will execute faster than the other processors. In
the simulation, however, all processors would execute at the
same speed. Therefore, the trace of processor 0 has shifted
with respect to the P-1 other traces. Trace shifting may result
in considerable errors in some cases. For data-dependent algorithms
and dynamic scheduling, it may result in a totally
different execution sequence as compared to the execution on
the real machine.
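
The effect can be made concrete with a toy example. In the sketch below, two processors perform identical work, but one of them pays a hypothetical fixed penalty on every access, as it would if all data resided in the other processor's memory. The function name, the penalty values and the tie-break by processor number are assumptions of this illustration.

```python
def global_order(dts, penalty):
    """Order in which processors issue events, as (processor, event index) pairs.

    dts[j] lists the virtual-time increments of processor j; penalty[j] is an
    assumed extra delay added to every access made by processor j.
    """
    stamped = []
    for j, trace in enumerate(dts):
        t = 0
        for k, dt in enumerate(trace):
            t += dt + penalty[j]
            stamped.append((t, j, k))
    return [(j, k) for _, j, k in sorted(stamped)]


work = [[2, 2, 2], [2, 2, 2]]
print(global_order(work, [0, 0]))   # virtual time only: strict alternation
print(global_order(work, [0, 2]))   # processor 1 pays a remote-access penalty
```

With no penalty the two traces alternate; with the asymmetric penalty processor 0 runs ahead, so an interleaving derived from virtual time alone no longer matches the order on the machine being modeled.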


3.1.4 Physical-Time Interleaving 


In general, the only way to avoid trace shifting is to include
physical (real) timings in the trace-interleaving algorithm.
We call this type of interleaving an interleaving based
on physical time. First, the δt provided in the trace record
of the individual processor traces must be a physical time estimate
that is as accurate as possible. Second, a penalty must be added
to account for data access delays such as cache miss penalty or
remote memory access penalty. The penalty depends on various
factors including the multiprocessor configuration, data
allocation and technology parameters. A fixed penalty for each
access type may also result in trace shifting, because of
conflicts for shared-resource accesses. If some processors are
systematically conflicting while accesses from other processors
are conflict-free, the traces will shift in the case where a fixed
penalty is added for each access. The penalty should in this
case be variable and include conflict delays. A full-fledged
simulation of the multiprocessor may be needed to determine
accurately the total access delays through the interconnection
network and through the memory system.


In some cases, when the algorithm interpretation is not
affected by the shifting, the problem of trace shifting can be
compensated for during the simulation. That is, traces can be
realigned periodically: a total penalty is accumulated for each
processor during the simulation; when the difference between
the penalties becomes too large, the traces are synchronized by
subtracting their total penalty from the time stamp of their
next event.


3.1.5 On-the-fly and Off-line Tracing 


Finally, tracing may be done on-the-fly or off-line. If it
is done on-the-fly, then the interpretation of the algorithm, the
interleaving of the traces, the simulation and the collection of
statistics are done in one single run. No trace file is derived, saved and
reused. This mode of operation is desirable if the interleaving
must be based on physical time, or if the overhead of handling
large trace files (on tape or on disk) is large compared to the
overhead of rederiving the trace each time. Note that the
trace-driven approach is still applicable in this case even if the
trace cannot be reused. Indeed, partitioning the simulation
into two parts (the derivation of the trace and the simulation)
removes all local events from the global event list and results
in increased efficiency. Off-line tracing is done in two phases.
In the first phase, the interleaved trace is derived and stored in
a trace file. The trace file may then be reused several times for
multiple simulations of different system configurations, data
allocation and/or technological parameters. The off-line interleaving
of traces only makes sense when the trace is reusable,
i.e., when the interleaving is based on rank or on the virtual
time approximation.


3.2. Description of Simulator Architecture 


The simulator is divided into three types of modules 
(see Figure 1): the trace-generator, trace-scheduler, and MP- 
simulator. The function of a trace-generator module is to gen- 
erate the trace records as they are requested by the scheduler 
module. For each algorithm and each partitioning strategy, 
one must write a set of trace generator modules. The trace- 
scheduler is the driver of the whole simulator: it determines 
the next process that will generate a trace record according to 
the trace interleaving strategy and a selection algorithm, it re- 
quests the next trace record from the selected process, and 
submits the record to the MP simulator. This last module 
may be a detailed simulation or a program to gather statistics 
of the dynamic behavior of the algorithm on the architecture. 


[Figure 1 (flowchart; block labels recovered from the original):
TRACE GENERATOR; SIMULATOR DRIVER; SIMULATION; EVALUATE STATISTICS
ON BUS AND MEMORY TRAFFICS; REDUCED STATISTICS; ANALYTICAL MODELS
(complexity model, throughput bounds, queueing model, other
analytical approximations)]

Figure 1. Activity flowchart of trace-driven simulation.

These statistics are then used as parameters in simple analyti- 
cal models such as complexity, throughput or queueing models. 


An activity flowchart for the overall performance 
analysis is also outlined in Figure 1. Several "cuts" are possi- 
ble in the flowchart. A cut means that the simulation is run in 
two independent parts, and that intermediate results from the 
first part must be stored. A cut at (a) requires storing one or
several files on tapes or disks for each processor; a cut at (b)
requires storing one single thread of trace records; a cut at (c)
reduces the information that has to be stored to statistics such
as mean traffic values, variance of memory service time, and the
histogram of execution time between two successive global events.


Cuts at (a) and (b) result from off-line tracing, while a
cut at (c) results from on-the-fly tracing. Depending on the
interleaving strategy, on the algorithm being traced, and on the
overhead of file manipulation, a cut at (a) or (b) may be
desirable in order to reuse the trace.


3.2.1 Trace-Generator Module 


A trace-generator module of the simulator executes the 
process and produces each global event from that process. The 
process is written as a procedure specifying the task to be per- 
formed augmented by statements which log global-trace event 
records as these events are generated by the process. For each 
"forked" process there exists a trace generator. A trace gen- 
erator is completely specified by a process and by a descriptor 
defining the data set to which the process is applied. 


The implementation of a trace generator is based on
the coroutine principle. A few basic functions are provided to
facilitate the "forking", activation and suspension of the trace
generators. The first function is called fork (proc, p1, p2, ..., pn).
This function creates and activates a process
of type "proc" with the associated parameters p1, p2, ..., pn,
and returns a unique process identification number, pid. Subsequently,
the process can start execution when the trace-scheduler
performs a transfer (pid) to the process. The trace
generator suspends itself and transfers control to the scheduler
by executing a suspend (). This is essentially a coroutine call to
the trace scheduler. In the trace generator, a trace event is
logged into the event list of the scheduler by log_event
(pn, event_type, time_incr, parameters). The "time_incr" attribute
is the virtual or physical time increment from the last
event logged by that processor. The next time the trace
scheduler schedules the trace generator for event production it
again executes a transfer (pid). These and other primitives
were packaged into a Concurrent C preprocessor [MAD86].
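
The coroutine principle can be illustrated with Python generators, used here in place of the Concurrent C preprocessor mentioned above. This is a sketch under stated assumptions: the worker process, its instruction count and the record layout are hypothetical, the generator object plays the role of the pid, and a single yield stands in for the pair log_event followed by suspend.

```python
def worker(pn, addresses):
    """Trace generator for one process: reads each shared address in turn."""
    for x in addresses:
        local_work = 5                        # hypothetical instruction count
        yield (pn, "READ", local_work, x)     # log the event, then suspend
    yield (pn, "TERMINATE", 1, pn)


def fork(proc, *params):
    """Create a trace generator; it runs only when the scheduler transfers to it."""
    return proc(*params)


def transfer(generator):
    """Resume a trace generator until its next event (None once it has finished)."""
    return next(generator, None)


pid = fork(worker, 0, [100, 104])
record = transfer(pid)
while record is not None:
    print(record)
    record = transfer(pid)
```

Between two transfers the generator keeps its local state, which is what lets the process body be written as an ordinary procedure augmented with logging statements.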


3.2.2 Trace-Scheduler Module 


The trace scheduler is designed to invoke the trace gen- 
erators. Its first function is therefore the interpretation of the 
multitasked algorithm. The second function is to produce a 
single thread of totally ordered events from the partially or- 
dered set resulting from the interpretation. 


The trace scheduler maintains four types of lists: the
event list, the ready list, the run list and the blocking lists. The run
list, which could be local or shared, contains the set of forked
processes waiting for a processor. The blocking lists support
blocking synchronization primitives such as suspend-locks and
synchronous message primitives. The event list contains at
all times as many nodes as there are active processors in the
multiprocessor. A node of the event list includes a
processor_number field, an event descriptor field, a time stamp
field, and other parameters. The time stamp is computed on
the basis of the time increment provided in the trace record
and the system time at the arrival of the event. Nodes in the
list are ordered according to increasing time stamp values. The
scheduler selects the processor whose node is at the top of the
list as the producer of the next global event. If several events
at the head of the event list have equal time stamps, then
these events are moved to the ready list, where a selection
algorithm is invoked to schedule the events in the ready list in
turn. Possible selection algorithms are round-robin, priority, or
random; the choice of an algorithm should reflect the interconnection
network arbitration protocol.


At initialization of the scheduler, the first event,
"start", and the process identification corresponding to the
algorithm to be run are placed in the event list. While the ready
list is not empty, one event at a time is selected for event
processing based on the event type in the multiprocessor simulator.
The processor which issued that event is then resumed to
generate the next trace event, which is logged into the event
list. When the ready list is exhausted, the scheduler starts a
new cycle by moving events with equal time stamps to the
ready list for scheduling. Note that if the application has been
correctly interpreted, the algorithm terminates when the event
list is empty.


The possible event types for the scheduler are given in 
Table 1. In the following, the implementation of some of these 
primitives in the trace scheduler is discussed. On a read, 
write, test_and_set or a reset event type, the scheduler simply 
submits the request to the MP simulator and then invokes the 


Table 1. Possible trace events and their parameters 


Events                          Parameters 
READ, WRITE,                    x 
TEST_AND_SET, RESET 
BLOCK                           x 
FORK                            p_id 
TERMINATE                       pn 
START_PROCESS                   p_id 
STOP_PROCESS                    p_id 
SYNC_SEND, SYNC_RECEIVE         m, p_id 
ASYNC_SEND, ASYNC_RECEIVE       m, p_id 
SYNC_BROADCAST                  m 
ASYNC_BROADCAST                 m 
READF, WRITEE                   x 
FETCH_AND_ADD                   x, d 


x: shared memory address; p_id: process identification; pn: 
processor number; m: message; d: increment. 


next trace element. Programs for spin_locks and suspend_locks 
are given in Figure 2. A suspend_lock is more difficult to im- 
plement and requires the additional primitive block (x) to indi- 
cate to the trace-scheduler that the execution of the process is 
to be suspended. The scheduler puts the processor in a block- 
ing list associated with x, and thereby suspends it from invok- 
ing its trace generator. On executing an unblock (x) in a trace 
generator, the scheduler randomly selects one of the processors 
blocked on x for reactivation by placing it in the event list. 


To illustrate the use of these functions in implementing 
a trace generator, we consider the simple example of a parallel 
and dynamic quicksort performed on an array A[1...n]. A 
description of this algorithm is given in [DEM82]. In this paper 
/***********************************/ 
/* Microcode for spin_lock (x)     */ 
/* and suspend_lock (x)            */ 
/***********************************/ 

spin_lock (x) 
{ 
    repeat 
        y := test_and_set (x) ; 
    until (y = 0) ; 
} 

suspend_lock (x) 
{ 
    y := test_and_set (x) ; 
    if (y = 1) block (x) ; 
} 

/***********************************/ 
/* Trace generators for            */ 
/* spin_lock (x) and suspend_lock (x) */ 
/***********************************/ 

gen_spin_lock (x) 
{ 
    y := x ; 
    if (y = 0) x := 1 ; 
    log_event (pn, "tas", 2, x) ; 
    suspend () ; 
    while (y = 1) 
    { 
        y := x ; 
        if (y = 0) x := 1 ; 
        log_event (pn, "tas", 2, x) ; 
        suspend () ; 
    } 
} 

gen_suspend_lock (x) 
{ 
    y := x ; 
    if (y = 0) x := 1 ; 
    log_event (pn, "tas", 2, x) ; 
    suspend () ; 
    if (y = 1) 
    { 
        log_event (pn, "block", b_time, x) ; 
        suspend () ; 
    } 
} 

Figure 2. Trace generators for spin_lock and suspend_lock. 


we present a skeleton of the algorithm in Figure 3 and illus- 
trate a segment of the implementation of the trace scheduler 
with dynamic forks in Figure 4. For simplicity, we have ex- 
cluded all the necessary declarations and calls to the MP simu- 
lator. The quicksort routine picks up a "descriptor" from the 
run list via a spin-lock and splits up the array into two subar- 
rays about a pivot element A[v] [BAA78]. A new quicksort 
process is forked to sort one of the subarrays while the current 
process sorts the other subarray. An array with a number of 
elements less than "threshold" is sorted by insertion sort. 
During the forking of the new process, a "descriptor" is also 
placed on the run list via a spin-lock. 


The trace generator for the quicksort will include two 
statements wherever a reference to global data A[*] is made in 
the algorithm. These are 
log_event (pn,event_type,time_incr,parameters) and suspend (). 
Note that the trace generator for spin-lock, shown in Figure 2, 
will replace the spin-lock in the quicksort process. Complete 
analysis of the quicksort in multiprocessors with distributed 
global memory system can be found in [BAL86, PAT86]. 
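
The interplay between trace generators and the trace scheduler can be sketched with Python generators, where each `yield` stands for the pair log_event(...) and suspend(). This is only an illustrative sketch, not the Concurrent C implementation: the names `worker`, `mem` and `hold_time` are hypothetical, ties between equal time stamps are broken in first-come order rather than by a ready-list policy, and, as a simplification of Figure 2, the effect of a test_and_set is applied when the scheduler resumes the generator.

```python
import heapq
from itertools import count

def gen_spin_lock(pn, mem, x):
    """Trace generator for spin_lock(x): one "tas" event per attempt."""
    while True:
        yield (pn, "tas", 2, x)      # log_event(pn, "tas", 2, x); suspend()
        y = mem[x]                   # effect applied when the scheduler resumes us
        if y == 0:
            mem[x] = 1               # lock acquired
            return

def worker(pn, mem, x, hold_time):
    """Hypothetical process: take the lock, hold it, then reset it."""
    yield from gen_spin_lock(pn, mem, x)
    yield (pn, "reset", hold_time, x)
    mem[x] = 0

def trace_scheduler(generators):
    """Merge per-processor events into a single totally ordered trace."""
    event_list = []                  # heap ordered by (time stamp, arrival order)
    arrival = count()
    for g in generators.values():
        ev = next(g, None)
        if ev is not None:
            heapq.heappush(event_list, (ev[2], next(arrival), ev))
    trace = []
    while event_list:
        stamp, _, (pn, kind, _incr, x) = heapq.heappop(event_list)
        trace.append((stamp, pn, kind, x))
        nxt = next(generators[pn], None)          # resume the issuing processor
        if nxt is not None:
            # new time stamp = system time at arrival + time increment
            heapq.heappush(event_list, (stamp + nxt[2], next(arrival), nxt))
    return trace

mem = {"lock": 0}
gens = {pn: worker(pn, mem, "lock", 5) for pn in (0, 1)}
for event in trace_scheduler(gens):
    print(event)
```

The scheduler terminates when the event list is empty, which mirrors the termination condition stated for the trace scheduler above.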


3.2.3 Multiprocessor Simulator Module 


The complexity of the multiprocessor simulator 
depends on the class of multiprocessor and the level of detail 
of the simulation. We illustrate these complexities by consider- 
ing single-bus multiprocessor architectures. The simulator col- 
lects statistics on the bus and memory traffic generated by the 
global events transmitted by the trace scheduler. The statistics 
collection for distributed systems and for systems with distri- 
buted global memory merely consists in the updating of traffic 
counters. The simulator for cache-based machines is by far 
the most complex. It must keep track of the cache contents 
and implement cache replacement and coherence algorithms. 
The cross-interrogate and traffic statistics may then be used in 
analytical models to compute the degradation of system per- 
formance during the execution of the algorithm. 


quicksort (A, l, u) 
{ 
    pn = getmyPE () ; 
    dequeue (global_run_list) ; 
    while ((u - l) > threshold) 
    { 
        v = split (A, l, u, pn) ; 
        forkprocess (quicksort, A, l, v-1, pn) ; 
        l = v + 1 ; 
    } 
    insort (l, u) ; 
    terminate (pn) ; 
} 

forkprocess (proc, A, l, u, pn) 
{ 
    pd = fork (proc, A, l, u) ; 
    enqueue (global_run_list) ; 
    log_event (pn, "fork", fork_time, pd) ; 
    suspend () ; 
} 

terminate (pn) 
{ 
    log_event (pn, "term", delta_t) ; 
    suspend () ; 
} 

dequeue (global_run_list) 
{ 
    spin_lock (global_run_list) ; 
    dummy_dequeue_to_ensure_serial_access_to_global_run_list () ; 
    unlock (global_run_list) ; 
} 

enqueue (global_run_list) 
{ 
    spin_lock (global_run_list) ; 
    dummy_enqueue_to_ensure_serial_access_to_global_run_list () ; 
    unlock (global_run_list) ; 
} 

Figure 3. Dynamic Quicksort program skeleton. 


4. Hybrid Simulation Models 


The traffic estimates obtained through enumeration of 
bus access or shared memory access events in the simulation 
trace can be used in analytical models. The analytical model 
can be as simple as throughput bounds or queueing models. 
Many models have been developed to analyze the performance 
of multiprocessors once the traffic and latency values are 
known [HOO77, DUB82a, MUD84]. 


In general, the more precise an analytical model is, the 
more system parameters are needed. Also, the model may be- 
come quite complex to exploit. We feel that a great deal of in- 
formation on the performance of multitasked algorithms can 
be obtained by simple throughput bounds. 


Models in which the parameters are estimated as 
averages over the entire run of the application are only valid 
when the computation is approximately stationary in time. By 
stationary, we mean that its execution behavior with respect 
to causing the global events remains constant during the 
overall processing. If the behavior varies (such is the case for 
the sorting algorithm, which is non-stationary because the 
decomposition varies during the execution), then the models 
presented above must be applied successively to the different 
phases of the algorithms. During each phase, it is considered 
that the characteristics of the computation remain constant. 


trace_scheduler () 
{ 
    Initialize ; 
    pd = fork (quicksort, A, l, u) ; 
    log_event (NULL, "start", 0, pd) ; 
    while (event_list NOT NULL) 
    { 
        EVENTS_TO_READY_LIST ; 
        rdy_ptr = get_next_ready_event ; 
        while (rdy_ptr NOT NULL) 
        { 
            pd_old = bind[rdy_ptr -> pn] ; 
            CASE (rdy_ptr -> type) 
            { 
            "start":   pn = get_PE_pool () ; 
                       pd = rdy_ptr -> parameter ; 
                       bind[pn] = pd ; 
                       Q_global_run_list (rdy_ptr) ; 
                       transfer (pd) ; 
                       break ; 
            "fork":    Q_global_run_list (rdy_ptr) ; 
                       pn = get_PE_pool () ; 
                       if (pn NOT NULL) 
                       { 
                           pd = rdy_ptr -> parameter ; 
                           bind[pn] = pd ; 
                           transfer (pd) ; 
                       } 
                       transfer (pd_old) ; 
                       break ; 
            "term":    transfer (pd_old) ; 
                       if (global_run_list NOT NULL) 
                       { 
                           pn = rdy_ptr -> pn ; 
                           rdy_ptr = D_Q_global_run_list ; 
                           pd = rdy_ptr -> parameter ; 
                           bind[rdy_ptr -> pn] = pd ; 
                           transfer (pd) ; 
                       } 
                       else 
                       { 
                           bind[rdy_ptr -> pn] = NULL ; 
                           ret_PE_pool (rdy_ptr -> pn) ; 
                       } 
                       break ; 
            "block":   x = rdy_ptr -> parameter ; 
                       block (rdy_ptr, x) ; 
                       break ; 
            "unblock": x = rdy_ptr -> parameter ; 
                       rdy_ptr = random_select (x) ; 
                       pd = search (rdy_ptr -> pn) ; 
                       transfer (pd) ; 
                       transfer (pd_old) ; 
                       break ; 
            } 
            M_P_Simulator (rdy_ptr) ; 
            rdy_ptr = get_next_ready_event ; 
        } 
    } 
} 

Figure 4. Skeleton of trace scheduler. 


4.1 Example 


Below, a grid relaxation algorithm for the numerical solution 
of partial differential equations is described and run through 
the hybrid simulation. This example is an illustration of the 
methodology for a simple algorithm on a complex machine (i.e. 
a cache-based system). A complete analysis for different cache 
and coherence mechanisms can be found in [DUB85]. 


Grid relaxation is an iterative technique to solve partial 
differential equations (PDEs) numerically. The traditional ex- 
ample chosen to compare algorithms and implementations for 
PDE solvers is Laplace's equation on a square domain of $R^2$, 
and is referred to as the model problem [YOU71]. The problem 
is first discretized on a square grid. In the Jacobi iteration, the 
computation consists in repeatedly computing the average of 
the four N-, E-, W-, and S-neighbors at each grid point. There- 
fore, the (K+1)th iterate is computed as 

$$u_{i,j}^{K+1} = \frac{1}{4}\left(u_{i-1,j}^{K} + u_{i+1,j}^{K} + u_{i,j-1}^{K} + u_{i,j+1}^{K}\right)$$

Two copies of the grid, U and V, are maintained. In 
each iteration, a copy is updated by using the iterate values of 
the other copy. Then the two copies are interchanged. This is 
an example of a synchronized algorithm. This algorithm can 
also be implemented as an asynchronous algorithm in which 
the latest iterate values are used [KUN76]. Let $N^2$ be the size 
of the square grid. A partition can be described in terms of P 
blocks, where P is the number of processors and $P = p^2$, where 
p is the number of blocks on each side. The boundary condi- 
tions add a total of 4N fixed iterate values to each grid, so 
that the total number of values for each grid is actually 


$N^2 + 4N$. 
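
The synchronized two-copy scheme can be illustrated with a short sketch. It assumes a plain list-of-lists grid of size (N+2) x (N+2), so that the 4N fixed boundary values surround the N x N interior (the four corner cells are allocated but never read); the function names and the boundary value 1.0 are illustrative choices, not part of the original simulator.

```python
def make_grid(n, boundary=1.0, interior=0.0):
    """Grid with n*n interior points surrounded by fixed boundary values."""
    g = [[boundary] * (n + 2) for _ in range(n + 2)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            g[i][j] = interior
    return g

def jacobi_sweep(src, dst, n):
    """One Jacobi sweep: dst gets the average of the four neighbors in src."""
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                + src[i][j - 1] + src[i][j + 1])

def jacobi(n, sweeps):
    """Alternate between the two copies U and V, as in the synchronized algorithm."""
    u, v = make_grid(n), make_grid(n)
    for _ in range(sweeps):
        jacobi_sweep(u, v, n)
        u, v = v, u              # interchange the two copies
    return u

print(jacobi(2, 1)[1][1])
```

In a partitioned run, each of the P processors would execute the inner loops over its own block only, with a synchronization point before the two copies are interchanged.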


This algorithm is an example of a data-independent al- 
gorithm with homogeneous decomposition. A simple inter- 
leaving based on rank was applied for its simulation. 


The trace was then used to study the performance of a 
bus-oriented multiprocessor system with caches. For the write- 
through system, the memory is updated on each write, and no 
space is allocated in the cache on a write miss. Each write cy- 
cle on the bus is read by all the processor nodes and caches are 
invalidated when they contain a copy of the modified 
block [PAT82]. 


There are two types of events causing accesses to the 
bus: read misses and writes. Let M be the total number of 
misses and W the total number of writes in all the caches for 
one iteration. Note that $W = N^2$. If $C_m$ and $C_w$ are the 
number of bus cycles required per cache block read and word 
write respectively, the total bus traffic due to global data 
accesses per Jacobi sweep is the sum of the miss traffic and the 
write traffic and can be written as 

$$\mathrm{Traf}(p,q) = M C_m + N^2 C_w \qquad (4.1)$$

This traffic places an upper bound on the execution speed be- 
cause of the limited bus bandwidth. 


Let $T_m$ and $T_w$ be the processor wait time for a read 
miss and for a write in the absence of conflicts, respectively. 
Also, let $m_i$ be the total number of misses per iteration in pro- 
cessor i. A lower bound on the iteration time is obtained by 
neglecting all the bus conflicts. 

$$T_{iter}(p,q) = t_0 + \frac{N^2}{P}\,t_u + \max_i(m_i)\,T_m + \frac{N^2}{P}\,T_w \qquad (4.2)$$

In (4.2), $t_u$ is the time to update one iterate when all operands 
are in cache, and $t_0$ is a fixed time, independent of the decomposi- 
tion. The maximum of $m_i$ is used because the process with 
the worst case computation time determines the duration of 
the iteration in a synchronized algorithm [KUN76]. 


From the trace-driven simulator we can evaluate M 
and m; for different values of P and various cache 
configurations. The values obtained through simulation can 
then be substituted in (4.1) and (4.2). 
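
As a hedged illustration of this substitution step, the sketch below evaluates the two bounds from simulated counts. It assumes that (4.1) is the miss traffic plus the write traffic, and that (4.2) has the form fixed time + per-point update time + worst-case miss wait + write wait, with each processor updating N^2/P points; the parameter names and the numeric values are hypothetical, not measured results.

```python
def bus_traffic(M, N, C_m, C_w):
    """Bus cycles per Jacobi sweep: miss traffic plus write traffic (cf. 4.1)."""
    return M * C_m + N * N * C_w

def iteration_time_lower_bound(m, N, P, t_0, t_u, T_m, T_w):
    """Conflict-free lower bound on the iteration time (assumed form of 4.2).

    m is the list of per-processor miss counts obtained from the simulator.
    """
    points_per_processor = N * N / P
    return (t_0
            + points_per_processor * t_u     # updates with operands in cache
            + max(m) * T_m                   # slowest processor's miss waits
            + points_per_processor * T_w)    # write waits

# Hypothetical counts for an 8 x 8 grid on P = 4 processors.
misses = [10, 12, 9, 11]
print(bus_traffic(sum(misses), 8, C_m=4, C_w=1))
print(iteration_time_lower_bound(misses, 8, 4, t_0=5, t_u=2, T_m=6, T_w=1))
```

Since the bus can serve only one access at a time, the traffic figure is itself a second lower bound on the iteration time, expressed in bus cycles.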


5. Conclusion 


Understanding the behavior and the performance 
trade-offs of multitasked algorithms in multiprocessor 
machines is a very difficult task. Many authors have tried to 
develop stochastic models. But such models are too general to 
be applicable to specific algorithms. Performance issues such 
as the impact of synchronization techniques, scheduling 
methods, and partitioning and allocation strategies cannot 
easily be mapped onto simple analytical models. This is espe- 
cially true for data dependent algorithms, heterogeneous 
decompositions, or dynamic scheduling. Full-fledged cycle-by- 
cycle simulation of the multiprocessor is very tedious, especial- 
ly if the goal of the analysis is to understand the performance 
of the multiprocessor for very large configurations of multipro- 
cessors and very large problem sizes. 


In this paper, we have presented an intermediate ap- 
proach, namely a trace-driven, approximate simulation of the 
problem on the multiprocessor. The simulation yields results 
such as the total number of events of given types. These event 
counts are then introduced in throughput bounds or approxi- 
mate analytical models to evaluate the approximation to the 
desired performance measure of the algorithm. The approach 
presented in this paper provides a general methodology, with 
which simulations and analytical models may be refined and 
validated progressively, from the complexity models to the 
cycle-by-cycle simulation. 


The methodology has been applied to the grid relaxa- 
tion and static and dynamic quicksort problems. These exam- 
ples were presented here to illustrate the methodology. Exten- 
sive results can be found in the referenced technical reports. 
Abstract 


We introduce a technique for transforming certain 
dynamic programming algorithms into ones which are in 
some sense equivalent, but require asymptotically less space 
and time under a logarithmic cost criterion. When mapped 
into silicon, these transformed algorithms result in VLSI 
processor arrays which are significantly smaller and faster. 
We illustrate the discussion with a general case of serial 
dynamic programming and briefly describe other, nonserial, 
applications in pattern matching. 


Introduction 


VLSI processor arrays are often application-driven; the 
structure of the algorithm and the domain of the problem 
dictate the complexity of the hardware needed. Algorithmic 
structure has been used to simplify communication 
requirements. We show how structure in the problem can be 
exploited to simplify the computation itself. 


Throughout the following complexity analysis we shall 
use the logarithmic cost criterion [1], an appropriate 
measure for algorithms to be implemented in VLSI. The cost 
of storing a non-negative integer i under the logarithmic 
criterion is 

$$l(i) = \begin{cases} 1 & \text{if } i = 0 \\ \lceil \log_2(i+1) \rceil & \text{if } i > 0 \end{cases}$$

and the cost of evaluating a primitive function f (e.g. 
addition or comparison) is the sum of the costs of its operands 


$$l(f(x_1, x_2)) = l(x_1) + l(x_2)$$
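
A small sketch of this cost measure follows, under the assumption that the criterion charges one unit per binary digit needed to write the integer (so that 0 costs 1). The function names are illustrative.

```python
def log_cost(i: int) -> int:
    """Logarithmic cost of storing a non-negative integer i (bits needed)."""
    if i < 0:
        raise ValueError("the criterion is defined for non-negative integers")
    return 1 if i == 0 else i.bit_length()   # equals ceil(log2(i + 1)) for i > 0

def primitive_cost(x1: int, x2: int) -> int:
    """Cost of a primitive operation such as x1 + x2 or a comparison."""
    return log_cost(x1) + log_cost(x2)

print(log_cost(255), log_cost(256), primitive_cost(3, 4))
```

Under this measure, an addition on path costs that grow with the number of stages becomes more expensive as the values grow, which is the effect the transformations in this paper aim to remove.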


The Serial Dynamic Programming Problem 


The general case of a problem solvable by serial dynamic 
programming can be expressed as 


$$\min_X f(X) = \min_X \sum_{i \in T} f_i(X^i)$$

where 


$$X = \{x_1, x_2, \ldots, x_n\}$$


is a set of discrete-valued variables, $S_i$ is the definition set of 
$x_i$ ($|S_i| = \sigma_i$), and 


$$T = \{1, 2, \ldots, n-1\}$$

are the time instances of the problem. We wish to minimize 
the objective function f over all schedules X subject to the 
serializing constraint 


$$X^i = \{x_i, x_{i+1}\}$$


The $f_i(X^i)$ are components of the objective function [2]. 


† This research was supported in part by DARPA under contract 
N00014-82-K-0549. 

It may be helpful to observe that this formulation is fully 
equivalent to the problem of finding a shortest (i.e. lowest 
cost) path through a multistage graph with edge weights 
determined by 


$$c_{i,j,k} = f_i(x_{i,j}, x_{i+1,k}) \qquad 1 \le i \le n-1,\ 1 \le j \le \sigma_i,\ 1 \le k \le \sigma_{i+1}$$

where $x_{i,j}$ is the jth value $x_i$ assumes in some standard 
ordering. We interpret $c_{i,j,k}$ as the cost of traversing the edge 
from the jth vertex in the ith stage to the kth vertex in the 
(i+1)st stage. 


The Problem Restricted 


For the development which follows, we make two 
assumptions which restrict this most general problem. The 
first, strictly for analytical convenience, is that 


$$\sigma_i = 1 \quad i = 1, n \qquad \text{and} \qquad \sigma_i = m \quad 2 \le i \le n-1$$

so that we consider only graphs which have n stages, with 
one vertex (or “node” or “state”) in the first and last stages 
and m vertices in each of the intermediate stages. Such a 


graph is shown in figure 1. 


figure 1. a multistage graph (stages 1 through 4) 


The second assumption is a necessary condition for the 
transformations we present. It is that 


$$c_{i,j,k} \in \mathbb{N} \quad \text{and} \quad 0 \le c_{i,j,k} \le C$$

for all $c_{i,j,k}$. That is, we assume the edge weights can be 
represented as elements chosen from a finite set of non- 
negative integers, and we know in advance their maximum 
value. This is a highly restrictive assumption, but one which 
reflects real-world conditions. Since programs and processor 
arrays have limited resources, we often know (or at least 
must require) bounds on the edge weights. 


A Dynamic Programming Solution 


The shortest path through a multistage graph can be 
found using a backward recurrence with initial conditions 
$$r_{n-1,j} = c_{n-1,j,1} \qquad 1 \le j \le \sigma_{n-1}$$

and iterative step 

$$r_{i,j} = \min_{1 \le k \le \sigma_{i+1}} \left[ r_{i+1,k} + c_{i,j,k} \right] \qquad 1 \le i \le n-2,\ 1 \le j \le \sigma_i$$

where $r_{i,j}$ is the length of the shortest path from the jth 
vertex in the ith stage to the lone vertex in the nth stage. We 
shall refer to this recurrence as the standard serial dynamic 
programming algorithm. 
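
The backward recurrence can be sketched in a few lines. The data layout is an assumption made for illustration: `costs[s][j][k]` holds the weight of the edge from vertex j of one stage to vertex k of the next, with stages and vertices numbered from 0, so the last matrix has a single column.

```python
def shortest_path_cost(costs):
    """Standard serial dynamic programming on a multistage graph."""
    # Initial conditions: cost from each vertex of stage n-1 to the lone last vertex.
    r = [row[0] for row in costs[-1]]
    # Iterative step, from stage n-2 back to stage 1.
    for stage in reversed(costs[:-1]):
        r = [min(r[k] + row[k] for k in range(len(row))) for row in stage]
    return r[0]          # cost from the single vertex of the first stage

# Four stages with m = 2 vertices in each intermediate stage.
costs = [
    [[1, 4]],              # stage 1 -> stage 2
    [[2, 7], [1, 1]],      # stage 2 -> stage 3
    [[3], [5]],            # stage 3 -> stage 4
]
print(shortest_path_cost(costs))
```

Only the costs of one stage are kept at a time, which matches the remark that earlier stage costs can be forgotten once they have been used.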


Space and Time Complexity 


Since the edge weights of the graph lie between 0 and C, 
the cost of a path from a vertex in the ith stage to the vertex 
in the last stage can range from 0 to at most $(n-i)C$. We 
remark that after the optimal costs for the (i+1)st stage are 
used to determine those in the ith stage they can be 
forgotten. Thus, the storage required is approximately 


$$m\,l((n-3)C) + m\,l((n-2)C)$$

bits, which means the space complexity is $O(m \log n)$. 


For the time complexity, we note that an addition in the 
ith stage requires time $l((n-i-1)C) + l(C)$ and a comparison 
requires $2\,l((n-i)C)$. Since determining the cost of the 
optimal path for each of the m nodes in the second through 
the (n-2)nd stages requires m additions and m-1 
minimizations, the intermediate stages contribute 


$$\sum_{i=2}^{n-2} \left\{ m^2\left[l((n-i-1)C) + l(C)\right] + m(m-1)\left[2\,l((n-i)C)\right] \right\}$$

which is $O(m^2 n \log n)$. Determining the optimal path from 
the vertex in the first stage requires m additions and m-1 
minimizations, and so contributes 


$$m\left[l((n-2)C) + l(C)\right] + (m-1)\left[2\,l((n-1)C)\right]$$

which is $O(m \log n)$. Hence the time complexity of this 
algorithm is $O(m^2 n \log n)$. 


Mapping the Recurrence into Silicon 


Li and Wah [3] present a VLSI processor array for serial 
dynamic programming which is quite general, at the 
expense of failing to exploit all of the potential parallelism. 
The architecture itself is based on the observation that the 
inner loop step of the algorithm is equivalent to an inner- 
product in the closed semiring $(\mathbb{N}, \min, +, \infty, 0)$. Hence, the 
problem can be solved as a sequence of matrix-vector 
multiplications, where the additive step is minimization and 
the multiplicative step is addition. The array is depicted in 
figure 2. 


To determine the shortest paths for vertices in the 
(n-2)nd stage, the multiplicand vector is shifted through the 
array while the cost matrix for that stage is applied and the 
product vector accumulates. For the next stage, the product 
vector is shifted while the multiplicand vector remains 
stationary. This alternation continues; on the last pass the 
cost vector corresponding to the first stage is input and the 
final value of the shortest path is accumulated in the first 
processor. 


figure 2. VLSI array for serial dynamic programming 


This parallel architecture requires m processors. The 
first must be able to store values which can be as large as 
(n—1)C, the others as large as (n—2)C. Hence, each 
processor requires area O(logn), so the area of the entire 
array is O(mlogn). 


The time for the array to complete the computation is 
determined by that of the first processor, since it performs 
both the first and last operations. In processing the ith stage, 
it must perform m additions which require time 
$l((n-i-1)C) + l(C)$ and (m-1) comparisons which take 
$l((n-i)C)$. Thus, the time required is 


$$\sum_{i=1}^{n-2} \left\{ m\left[l((n-i-1)C) + l(C)\right] + (m-1)\,l((n-i)C) \right\}$$


which yields a time complexity of $O(mn \log n)$. 


The Delta Transformation 


We now proceed with the main result of this paper, the 
development of which we illustrate with the serial dynamic 
programming problem. The knowledge that the edge 
weights in the multistage graph are known to have a fixed 
range is powerful; it permits us to transform the standard 
algorithm into one which requires asymptotically less area 
and time under the logarithmic cost criterion. 


The Residue Reduction 


We first observe that the bound on edge weights 
constrains the maximum difference between terms which 
appear in a minimization. 


Lemma 1: the difference between any two terms which 
appear in a minimization in the standard serial dynamic 
programming algorithm is at most 2C. That is, 


$$\max_{i,j,k_1,k_2} \left| \left(r_{i+1,k_1} + c_{i,j,k_1}\right) - \left(r_{i+1,k_2} + c_{i,j,k_2}\right) \right| \le 2C$$


(for proofs of this and following lemmas and theorems, see 
[4]). 


This fact can be exploited to make determining the 
minimum simpler. In particular, it is only necessary to 
examine low order bits. 


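Lemma 2: given three non-negative integers $x_1$, $x_2$, and $\delta$ 
such that 


$$x_1 - \delta \le x_2 \le x_1 + \delta$$

choose any $\Delta$ greater than or equal to $2\delta + 1$ and let 
$y_1 = x_1 \bmod \Delta$ and $y_2 = x_2 \bmod \Delta$. If we define 


$$\min^\Delta[y_1, y_2] = \begin{cases} \min[y_1, y_2] & \text{if } \max[y_1, y_2] - \min[y_1, y_2] \le \delta \\ \max[y_1, y_2] & \text{otherwise} \end{cases}$$

then 


$$\min^\Delta[y_1, y_2] = \min[x_1, x_2] \bmod \Delta$$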

Now we replace the standard dynamic programming 
recurrence with the following 


$$r^\Delta_{n-1,j} = c_{n-1,j,1} \qquad 1 \le j \le \sigma_{n-1}$$

and 

$$r^\Delta_{i,j} = \min^\Delta_{1 \le k \le \sigma_{i+1}} \left[ r^\Delta_{i+1,k} +_\Delta c_{i,j,k} \right] \qquad 1 \le i \le n-2,\ 1 \le j \le \sigma_i$$


(where $+_\Delta$ is addition modulo $\Delta$) for any $\Delta$ greater than or 
equal to 4C+1. Whence we arrive at the main result of this 
section. 


Theorem 1: for $r^\Delta_{i,j}$ defined above 


$$r^\Delta_{i,j} = r_{i,j} \bmod \Delta \qquad 1 \le i \le n-1,\ 1 \le j \le \sigma_i$$


That is, the new recurrence performs precisely the same 
computation as the old, only using values modulo $\Delta$. We call 
$r^\Delta_{i,j}$ so defined a residue reduction; it is a distillation of the 
original algorithm. 


The Path Augmentation 


The residue-reduced algorithm does not give us the true 
cost of the shortest path through the graph, although it does 
determine its remainder modulo A correctly. We observe, 
however, that the costs for the n—1st stage (the initial 
conditions) are determined precisely as before. Consider 
picking $r^\Delta_{n-1,1}$ as an initial sample and inductively 
walking a path through the residue values to determine the 
true cost, $r_{1,1}$. To this end, we state 
Lemma 3: the difference between the costs of the shortest 
paths from vertices in any two consecutive stages is at most 
C. More specifically, 


$$r_{i+1,k} - C \le r_{i,j} \le r_{i+1,k} + C$$


Now we make a second general observation; it concerns 
determining the true value of a non-negative integer given 
only the remainder of that integer modulo some value and 
the full value of another integer which is known to be close 
in magnitude. 


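Lemma 4: given three non-negative integers $x_1$, $x_2$, and $\delta$ 
such that 


$$x_1 - \delta \le x_2 \le x_1 + \delta$$

choose any $\Delta$ greater than or equal to $2\delta + 1$ and let 
$y_1 = x_1 \bmod \Delta$ and $\mu = \min^\Delta[y_1, x_2 \bmod \Delta]$. If we define 


$$\mathrm{aug}^\Delta[y_1, x_2] = \begin{cases} x_2 - (x_2 \bmod \Delta \; -_\Delta \; y_1) & \text{if } \mu = y_1 \\ x_2 + (y_1 \; -_\Delta \; x_2 \bmod \Delta) & \text{if } \mu = x_2 \bmod \Delta \end{cases}$$


(where $-_\Delta$ is subtraction modulo $\Delta$) then 


$$\mathrm{aug}^\Delta[y_1, x_2] = x_1$$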


By Lemma 3 we can augment the reduced recurrence 
along any path from the last stage to the first. If we want to 
retain the general interconnection scheme of the original 
matrix-vector multiplication array, we should make the 
augmenting values available in an end processor. A logical 
choice is the path along the top of the recurrence, 
corresponding to $r_{i,1}$, as these values are always available in 
the first processing element. 


Now, to the residue recurrence given earlier we add 


$$a_{n-1} = r^\Delta_{n-1,1}$$

and 

$$a_i = \mathrm{aug}^\Delta\left[r^\Delta_{i,1}, a_{i+1}\right] \qquad 1 \le i \le n-2$$

We then assert 


Theorem 2: for the recurrence defined above and any $\Delta$ 
greater than or equal to 4C+1 


$$a_i = r_{i,1} \qquad 1 \le i \le n-1$$


That is, this new recurrence correctly determines the values 
$a_i = r_{i,1}$ and in particular the cost of the shortest path 
through the graph, $a_1 = r_{1,1}$. We call the combination of a 
residue reduction and a path augmentation a delta 
transformation. 
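
The delta transformation can be checked numerically with the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the cost layout `costs[s][j][k]` (stages and vertices numbered from 0), the function names, and the choice of using the same margin 2C for both the reduced minimization and the augmentation are all assumptions made here; the modulus is the smallest one allowed by Theorems 1 and 2.

```python
def min_delta(y1, y2, delta):
    """Lemma 2: minimum of two residues whose true values differ by at most delta."""
    lo, hi = min(y1, y2), max(y1, y2)
    return lo if hi - lo <= delta else hi      # a large gap means one value wrapped

def aug_delta(y1, x2, delta, mod):
    """Lemma 4: recover x1 from y1 = x1 mod `mod` and a nearby full value x2."""
    y2 = x2 % mod
    if min_delta(y1, y2, delta) == y1:         # x1 <= x2
        return x2 - (y2 - y1) % mod
    return x2 + (y1 - y2) % mod                # x1 > x2

def delta_shortest_path(costs, C):
    """Residue-reduced recurrence plus path augmentation along vertex 1 of each stage."""
    delta = 2 * C                              # Lemma 1 bound on term differences
    mod = 2 * delta + 1                        # 4C + 1
    r = [row[0] % mod for row in costs[-1]]    # residues for stage n-1
    a = costs[-1][0][0]                        # a_{n-1}: known exactly
    for stage in reversed(costs[:-1]):
        nxt = []
        for row in stage:
            best = (r[0] + row[0]) % mod
            for k in range(1, len(row)):
                best = min_delta(best, (r[k] + row[k]) % mod, delta)
            nxt.append(best)
        r = nxt
        a = aug_delta(r[0], a, delta, mod)     # only one full-width value is kept
    return a

def exact_shortest_path(costs):
    """Reference: the standard recurrence on full-width values."""
    r = [row[0] for row in costs[-1]]
    for stage in reversed(costs[:-1]):
        r = [min(r[k] + row[k] for k in range(len(row))) for row in stage]
    return r[0]

chain = [[[1, 1]]] + [[[1, 1], [1, 1]]] * 8 + [[[1], [1]]]
print(delta_shortest_path(chain, C=1), exact_shortest_path(chain))
```

In the example, every stored residue stays below 5 while the augmented value grows to the true path cost, which is the separation of narrow and wide arithmetic that the transformation relies on.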


Space and Time Complexity of the Delta Algorithm 


The residue-reduced recurrence $r^\Delta_{i,j}$ requires storage of 
$l(\Delta)$ bits for each value. As before, the space complexity is 
determined by that required to compute one stage of costs. 
There are m vertices in a stage, so the space complexity is 
$O(m \log \Delta)$ which is $O(m)$ since $\Delta$ is a constant. For the 
augmentation path, we observe that a value at the ith stage 
can be as large as $(n-i)C$. Since only two augmentation 
values need be retained at any one time, the storage 
required is at most $l((n-1)C) + l((n-2)C)$ bits. Thus, the 
space complexity of the entire delta-transformed algorithm 
is $O(m + \log n)$. This is asymptotically less than the 
$O(m \log n)$ required by the standard dynamic programming 
algorithm. 


The time required for a $\min^\Delta$ at any stage of the 
computation is $O(\log \Delta)$, which is to say constant. Hence, 
the complexity contributed by the reduced recurrence is 
$O(m^2 n)$. The time required for an $\mathrm{aug}^\Delta$ at the ith stage is 
proportional to $l((n-i)C)$, hence the augmentation path 
contributes $O(n \log n)$ to the time complexity. Thus, the 
complexity for the entire algorithm is $O(m^2 n + n \log n)$, which 
is asymptotically faster than the $O(m^2 n \log n)$ required by the 
original. 


Mapping the Delta Algorithm into Silicon 


We can again use Li and Wah's matrix multiplication 
scheme, with a slight addition. Recall that the value $r_{i,1}$ is 
available at $PE_1$ after the ith stage has been processed; on 
odd numbered iterations it is the stationary value, on even 
numbered iterations it is the first value to shift in. Hence, 
we add a processor external to the array on the left side, as 
depicted in figure 3. This processor is initialized with the 
value $c_{n-1,1,1}$. Then, every m clock cycles, it alternates 
between sampling $PE_1$'s stored value and the first value of 
the next pass to shift in. 


figure 3. VLSI array for the delta-transformed algorithm 


This scheme requires a total of m+1 processors. Each of 
the m processors computing the residue recurrence stores values 
which can be at most $\Delta$; hence each requires area $O(\log \Delta)$, which is 
a constant, so their combined area is $O(m)$. The augmenting 
processor must manipulate values as large as $(n-1)C$, hence 
it requires area $O(\log n)$. Thus, the area of the entire array is 
$O(m + \log n)$ as opposed to the $O(m \log n)$ that the original 
required. 


The reduced processors perform inner product steps on 
data values of constant size, which require constant time. 
Since each processor handles $O(mn)$ of these steps, their time 
complexity is $O(mn)$. In the same time that these processors 
compute an entire stage, the augmenting processor must 
perform operations which require time at most $O(\log n)$. 
Since there are $O(n)$ stages, the augmenting processor 
requires a total time $O(n \log n)$. Hence the time complexity of 
the entire array is $O(n \max[m, \log n])$ as opposed to the 
$O(mn \log n)$ the original array required. 


Conclusions 


We have demonstrated a technique to simplify VLSI 
processor arrays for a general case of serial dynamic 
programming. Extensions to other dynamic programming 
algorithms, both serial and nonserial, are straightforward; 
the necessary and sufficient condition for application of a 
delta transformation is that all values to be minimized (or 
maximized) are close in magnitude. Unfortunately, the 
transformations as presented here will not yield space/time 
savings for every problem [4]. They can be applied, however, 
in many important instances, including approximate string 
matching, two-dimensional image comparison, and dynamic 
time warping of speech. For example, we have designed and 
fabricated a linear systolic array for comparing DNA 
sequences (essentially strings over a small alphabet) [5]. 
Using a delta transformation, we were able to place 30 
processors on a single nMOS chip which runs at six 
megahertz, whereas if we had implemented the unmodified 
algorithm, we would have been fortunate to fit four 
processors running at a much lower speed. 


This paper is a condensed version of [4]. 
Abstract 


The problem of merging two large 
ordered sets A and B is con- 
sidered. An efficient parallel algorithm 
is developed. The shared memory model 
of computation is used. 
Key words: Parallel algorithm, SIMD, 
parallel merge, external sorting. 


1. Introduction 

The merging and sorting problems 
have received considerable attention 
in the literature [1] - [5], [7] - 
[18]. While the merge operation 
serves as a basic operation in many algo- 
rithms [10] it is usually treated in 
the context of sorting. With the grow- 
ing use of databases, the problem of 
merging becomes more important. Much 
effort has been devoted to solving this 
problem on parallel processors [9], 
[12], [13], [17]. In this paper we 
present an optimal parallel algorithm 
for merging two sets of records. 

Classically the merge becomes 
external when the sets are too large to 
be contained in the memory of the pro- 
cessor. In parallel models input is 
considered to be external if it cannot 
fit in the internal memory of the PEs. 
For this particular problem, the two 
sets can be stored on a disk. Disks with 
multiple read/write heads can be 
regarded as EREW P-RAMs. We associate 
with each read/write head a PE and con- 
sider the disk medium as the shared 
memory. The PEs can communicate with 
each other in O(1) time by leaving mes- 
sages in predetermined memory locations. 


2. Parallel External Merge Algorithm. 


Two sorted files A and B with m and 
n records respectively are to be merged. 
The result of the merge is a third sorted 
file C, with n+m records. The serial 
merge of the two input files (given 
correct buffering and overlapped I/O 
and computation) can be performed in 
O(m+n) time. In this paper we present 
a solution to the external merge problem 
when a fixed number, p, of PEs are avail- 
able. 


Ideally with p PE's, where 
p < m+n, the files can be merged in 
O((n+m)/p) time. Our algorithm performs 
the merge within this optimal time. Con- 
ceptually the algorithm has three stages. 
In the first stage we identify pairs 
of subfiles from files A and B that could 
be independently merged. In the second 
stage, we merge all these subfiles in 
parallel. Each of these pairs is merged 
serially by one of the p available PE's. 
In the last stage, the results of the 
individual merges are concatenated to 
produce the result file C. 
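
The three stages can be sketched on in-memory lists as follows. This is a deliberately simplified illustration, not the algorithm of this paper: only file A is cut into p subfiles, the matching cut points in B are found by binary search with `bisect_left`, and the independent merges are run one after another in a loop where the paper assigns each pair to its own PE. The function name `partitioned_merge` is hypothetical.

```python
from bisect import bisect_left
from heapq import merge

def partitioned_merge(A, B, p):
    """Merge sorted lists A and B via p independently mergeable pairs of subfiles."""
    if not A:
        return list(B)
    step = -(-len(A) // p)                         # ceil(m / p) records per subfile
    cuts_a = list(range(0, len(A), step)) + [len(A)]
    # Stage 1: the cut in B matching each cut in A is the insertion point
    # of the first record of the following A subfile.
    cuts_b = [0] + [bisect_left(B, A[i]) for i in cuts_a[1:-1]] + [len(B)]
    # Stage 2: each pair of subfiles is merged serially (one PE per pair).
    pieces = [
        list(merge(A[cuts_a[t]:cuts_a[t + 1]], B[cuts_b[t]:cuts_b[t + 1]]))
        for t in range(len(cuts_a) - 1)
    ]
    # Stage 3: concatenate the partial results.
    return [record for piece in pieces for record in piece]

A = [1, 4, 6, 9, 12]
B = [2, 3, 5, 7, 8, 10, 11, 13]
print(partitioned_merge(A, B, 3))
```

Because only A is partitioned here, the B subfiles may be very uneven in size; partitioning both files, as described in the text, is what bounds the work of each PE.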


Let the key of record q in file X be 
$K^X_q$. Record i in file A has insertion 
point j in file B if $K^B_j < K^A_i < K^B_{j+1}$. If 
$K^A_i < K^B_1$ (the key of the first record in 
B) then it has insertion point 0. If 
$K^B_n < K^A_i$ then record i has insertion 
point n+1 (n is the index of the last 
record in B). 


We first partition each of our input
files into p subfiles: subfile $A_i$ with
at most $\lceil m/p \rceil$ records, $1 \le i \le p$, and
subfile $B_j$ with at most $\lceil n/p \rceil$ records,
$1 \le j \le p$. Define the partition records to
be the records with indexes
$1, \lceil m/p \rceil, 2\lceil m/p \rceil, \ldots, (p-1)\lceil m/p \rceil, m$
in A and
$1, \lceil n/p \rceil, 2\lceil n/p \rceil, \ldots, (p-1)\lceil n/p \rceil, n$
in file B. Next, for each of the 2p+2
partition records, we read into memory
its key and mark whether it came from
file A or file B.


Observe that the p+1 keys from each file
are sorted. We now merge the two sets
of keys, using a bitonic merge [11] to
form a sorted sequence. Let the sorted
sequence be $S = \{K_h^X,\ 1 \le h \le 2p+2\}$, where
X can be either A or B. Let $J_h$, $1 \le h \le 2p+2$,
be the original index of $K_h$ in
file X. Having created this sequence, it
is easy to see that there is no need to
search the whole file in order to find
an insertion point for some record i. Let
$K_h^A$ be a key for partition record i in A,
i.e. $J_h = i$. Let $K_e^B$ be the last key
from file B in S such that $K_e^B < K_h^A$.
Also let $K_f^B$ be the first key from file B
in S such that $K_h^A < K_f^B$. Clearly, when
looking for the insertion point of
record i of A in file B we can limit
our search in file B to records
between the record indexes $J_e$ and $J_f$.


By performing the bitonic merge 
on the keys of the partition records we 
reduced both the number of disk accesses 
and also the number of read conflicts. 
When looking for an insertion point for 
record i (j) of A (B) in B (A) we have
to search a subfile with at most
$\lceil n/p \rceil$ ($\lceil m/p \rceil$) records. Since the
search range is clearly partitioned, no
record has to be accessed more than
once. Clearly the number of disk 


accesses is reduced. Although we did not 
eliminate read conflicts, we definitely 
reduced the probability of one occurring. 


Although we showed how to find 
insertion points in file B corresponding 
to the records in A, the same method can 
be used to find insertion points in file 
A corresponding to the records in file B. 
To identify the partition records in A 
that are candidates for creating read 
conflicts, we mark all instances in S 
where there are j > 1 keys from file A in 
between two given keys from B. Let us 
consider one instance of j > 1 keys from 
A falling in between two consecutive 
keys of partition records in B. More
formally we look at the following
subsequence of S:
$K_h^B, K_{h+1}^A, \ldots, K_{h+j}^A, K_{h+j+1}^B$.
As explained earlier, this translates to
a search for insertion points in B for
records $J_{h+1}, J_{h+2}, \ldots, J_{h+j}$ of file
A. The search for the j points has to be
conducted on records $J_h$ to $J_{h+j+1}$ in file
B. Since in the search step we assign
one PE for each partition record in A,
we can expect j PEs to be assigned to
search in the interval $J_h$ to $J_{h+j+1}$ in
B. These j PEs can partition the
interval into j subfiles in B. Each of
the new subfiles can have at most
$\lceil \lceil n/p \rceil / j \rceil$ records. The new j
partition points from B can next be merged
with the j keys from A. Now, there are 
two possible cases. Either there is one
point from A between two consecutive
points from B, or there are several. In
the first case, the PE can conduct the
search independently; namely, there is no
danger of a read conflict. In the
latter case the process can be repeated
until a situation with one PE searching a
partition is arrived at (i.e. no read
conflicts occur). In the worst case
the search without read conflict problem
will be trivially solved when a
partition with only one record is
created.


The introduction of the first level
of bitonic merge has a cost of O(log p)
when using p PEs. The search step,
still with read conflicts, can be
performed in O(log(n/p)) time on file B.
On file A the search step takes
O(log(m/p)). To eliminate read conflicts
we continue to partition intervals
and perform bitonic merges until we can
ensure searches without read conflicts.
This process can be done in
O((log p)·(log(n/p))) for file B and
in O((log p)·(log(m/p))) for file A. The
complexity of the partition stage of our
new algorithm is readily seen to be
O(log m + log n) as before.


In order to simplify the discussion
above we assumed that a PE "knows"
whether it is the only one to be
searching the interval. To perform the
bitonic merge effectively each PE should
also have the boundaries of the search
interval and the number of PEs sharing
that interval. This information can be 
obtained in O(log p) time. A detailed 
discussion of this step is given in Dekel 
and Ozsvath [7]. 


3. Conclusions. 
In this paper we describe a parallel
merge algorithm. We show how


to eliminate the possibility of read 
conflicts without changing the complexity 
of the algorithm. We chose to develop 
an algorithm for an important operation, 
file merging. The need for algorithms 
like our merge algorithm will increase 
dramatically as more parallel computers 
are in use. Indeed, such computers are 
likely to become quite common in the near 
future. These computers will come with a 
fixed number of processors and will be
expected to perform operations on varied
sizes of data.


Parallel Algorithms for Bucket Sorting and
the Data Dependent Prefix Problem


Abstract


The data dependent prefix problem is to compute all
the n initial products $x_1 \circ x_2 \circ \cdots \circ x_k$, $1 \le k \le n$,
where the order is specified by a linked list. A parallel
algorithm for the data dependent prefix problem is presented.
This algorithm has time complexity
$O\left(\frac{n}{p} + \log n \log\frac{n}{p}\right)$
using p processors on the exclusive-read exclusive-write
computation model. A bucket sorting algorithm is also
developed to be used as a component of the prefix
algorithm. This bucket sorting algorithm sorts n numbers in
the range $\{1, 2, \ldots, m\}$ using p processors in
$O\left(\left\lceil \frac{\log m}{\log(n/p + \log p)} \right\rceil \left(\frac{n}{p} + \log p\right)\right)$ time.


Index words -- Prefix, bucket sorting, parallel algo- 
rithms, graph algorithms, optimal algorithms. 


1. Introduction 


Parallel algorithms have been studied in many
different areas. Some serial algorithms are easily parallel- 
ized while some other serial algorithms seem difficult to 
parallelize. The study of parallel algorithms reveals many 
properties of the problems which were previously unknown. 
In the serial case, time complexity is usually the most 
important complexity measure for algorithms. In the 
parallel case, the processor complexity also comes into 
play. To measure how much parallelism of the problem
has been exploited by an algorithm the parameter of
speedup is defined as $T_1 / T_p$, where $T_p$ is the time
complexity of the algorithm using $p$ processors and $T_1$ is the time
complexity of the fastest known sequential algorithm for
the same problem. It is not difficult to see that $T_1 / T_p \le p$.
When $T_p = O(T_1 / p)$ we say the parallel algorithm achieves
linear speedup with p processors. In the parallel environment
algorithms with maximum speedup and algorithms
with optimal performance (linear speedup) are both sought.
As the number of processors grows the utilization of the 
processors tends to decrease and linear speedup algorithms 
become difficult to obtain. The contribution of the 
research presented in this paper is a new parallel algorithm 
for the data dependent prefix computation. This algorithm 
achieves linear speedup allowing asymptotically more
processors than previously known algorithms. The time
complexity of our algorithm improves the best previous result.
A bucket sorting algorithm used as a component of the 
prefix algorithm is also presented. This sorting algorithm 
shows better performance than known algorithms. 


The shared memory computation model is assumed in
this paper. In a shared memory model both memory cells
and processors are linearly ordered and each memory cell 
can be addressed by any processor. Three shared memory 
models are distinguished according to the ways they 
resolve read or write conflicts[10]. The Concurrent-Read 
Concurrent-Write model(CRCW), which is the strongest 
model among these three, allows simultaneous read and 
write of one memory cell by several processors. The con- 
tent resulting after a cell is written by more than one pro- 
cessor can be determined in a variety of ways. In the 
Concurrent-Read Exclusive-Write(CREW) model several 
processors can simultaneously read one memory cell, but 
simultaneous write is prohibited. The weakest model 
among three, the Exclusive-Read Exclusive-Write(EREW) 


model, prohibits both simultaneous read and write. Our 
algorithms are EREW and CREW algorithms. 

Let $\circ$ denote an associative operation. Given
$x_1, x_2, \ldots, x_n$, the (data independent) prefix problem is
to compute $x_1 \circ x_2 \circ \cdots \circ x_k$, $1 \le k \le n$. The data
dependent prefix problem is to compute
$x_1 \circ x_2 \circ \cdots \circ x_k$, $1 \le k \le n$, for a given linked list
containing $x_1, x_2, \ldots, x_n$, with $x_i$ following $x_{i-1}$. For both
cases, the product computation problem is to compute
$x_1 \circ x_2 \circ \cdots \circ x_n$.

The prefix problem is a fundamental problem in 
parallel computation. It finds applications in many prob- 
lems such as processor allocation and reallocation, data 
alignment, data compaction, sorting and polynomial
evaluation. The data dependent prefix problem is espe- 
cially important since linked lists have been used widely as 
a data structure in many algorithms and prefix computa- 
tion is probably the most basic and fundamental computa- 
tion for linked list manipulation. One example is to use 
the prefix algorithm to number the data items in a linked 
list so that each item knows how far it is from the first ele- 
ment of the linked list (compute list rank for each ele- 
ment). This numbering operation is extremely useful when 
one intends to apply certain parallel operations to the
linked list.


Studies have been conducted on the parallel prefix
problem [2][5][7][8][9]. These studies assume that the input
data is data independent in the sense that $x_i$ is in memory cell
$i$. For the data dependent case, where input data forms a
linked list and a map specifies which element follows
which, optimal algorithms are also pursued. The best
previous result is due to Kruskal, Rudolph and Snir [6],
which achieves time complexity
$O\left(\frac{\log n}{\log(2n/p)} \cdot \frac{n}{p}\right)$ using
p processors on the EREW model. In this paper we show
that time complexity $O\left(\frac{n}{p} + \log n \log\frac{n}{p}\right)$* can be obtained
on the EREW model.

Footnotes: Logarithms in this paper are to the base 2.
* We note here that $O\left(\frac{n}{p} + \log p\right) = O\left(\frac{n}{p} + \log n\right)$ and
$O\left(\frac{n}{p} + \log p \log\frac{n}{p}\right) = O\left(\frac{n}{p} + \log n \log\frac{n}{p}\right)$ for $p \le n$.


Our result achieves linear speedup
whenever $p = O\left(\frac{n}{\log n \log\log n}\right)$. This shows improvement
over the best previous result [6] which achieves linear
speedup only when $p \le n^{1-\epsilon}$, $0 < \epsilon < 1$. When our algorithm
uses the maximum number of processors for which it can
achieve linear speedup its time complexity becomes
$O(\log n \log\log n)$, a polynomial in $\log n$. In contrast the
best previous algorithm [6] can achieve a time complexity of
$O(n^{\epsilon})$ using the maximum number of processors for
which the algorithm obtains linear speedup. Our algorithm
is the first result which shows that linear speedup can be
achieved for the data dependent prefix problem with
polylogarithmic time complexity.


The parallel sorting problem has been studied extensively.
See [1] for a survey of this area. For our purpose, a sorting
algorithm is required for sorting n numbers in the range
$\{1, 2, \ldots, \lfloor \log n \rfloor\}$ with p processors in time $O\left(\frac{n}{p} + \log p\right)$.
Parallel sorting algorithms mentioned in [1] do not meet
our requirement. Parallel comparison sorting algorithms
are not useful here because they will take at least
$\Omega\left(\frac{n}{p}\log n\right)$ time with p processors. Hirschberg's bucket
sorting algorithm [1][4] also implies a time bound of
$O\left(\frac{n}{p}\log n\right)$. Kruskal, Rudolph and Snir claimed a sorting
algorithm which sorts n numbers in the range $\{1, 2, \ldots, n\}$.
The time complexity of their algorithm is
$O\left(\frac{\log n}{\log(2n/p)}\left(\frac{n}{p} + \log p\right)\right)$ with p processors. Application
of their sorting algorithm to our problem would imply a
time bound of $O\left(\frac{\log n}{\log(2n/p)}\left(\frac{n}{p} + \log p\right)\right)$, which is also too
slow. We have developed our own bucket sorting
algorithm. Our algorithm sorts n numbers in the range
$\{1, 2, \ldots, m\}$ with p processors. The time complexity of our
sorting algorithm is
$O\left(\left\lceil \frac{\log m}{\log(n/p + \log p)} \right\rceil \left(\frac{n}{p} + \log p\right)\right)$.
Compared to the results of Hirschberg and Kruskal et al., our
result is more flexible in the sense that it adjusts itself
better to the range of the numbers and the number of
processors used.


2. Basic Fact and Definition


Product circuits, which are directed acyclic graphs,
are also widely used as a parallel computation model.
Some results may be obtained relatively easily using this
model. We allow the indegree of any node in the graph to be
no more than 2. A node with indegree 2 represents a
product of its two inputs. The nodes with indegree 0 are the
input nodes. The depth of a circuit is the longest directed
path in the graph. Fig. 1 shows two circuits for product
computation. For the prefix computation the following
fact is important to us.


Fact: 


Any circuit for product computation with the
property that every node of the circuit represents a
product of $\bigcirc_{k=i}^{j} x_k$, $1 \le i \le j \le n$, can be used to
construct a circuit for the prefix problem, and the
depth of the prefix circuit is within a constant
factor of the depth of the original product circuit.


For the product computation, the circuits can be
viewed as trees, which we will call product trees, with sons
pointing to fathers. For the class of product trees
satisfying the condition stated in the above fact we can define,
for each non-input node, its left son as the node which
feeds $\bigcirc_{k=i}^{m} x_k$ to it and its right son as the node which feeds
$\bigcirc_{k=m+1}^{j} x_k$ to it.


The fact comes from the following observation. To
compute the prefix, each node representing
$\bigcirc_{k=1}^{j} x_k$, $1 \le j \le n$, passes its product to its father. Each of
these fathers then passes the product received to its right
son. Each interior node getting a product from its father
will transfer it to its left son and then compute a new
product by combining the product passed from its father with
the product accumulated by its left son during the process
of computing the product. This new product will be
passed to the interior node's right son. Fig. 2 shows this
process.
=1 


This fact essentially shows that, for the serial and 
parallel (data independent) case, the prefix problem is no 
harder than the product problem. This fact is noted by 
Schwartz [9] and many other researchers and is used to construct
parallel (data independent) prefix algorithms. This
fact also applies to the data dependent prefix problem as it 
was used by Kruskal et. al. to construct their algorithm. 
Our algorithm also uses this fact, i.e. we are going to con- 
struct a parallel product algorithm with the product tree 
which the algorithm implicitly builds satisfying the condi- 
tion stated in the fact. Thus our algorithm can be easily 
transformed to compute the prefix. This is accomplished 
by, when the product computation finishes, reversing the 
computation and popping out the product stored during 
the process of product computation. 


Now we give definitions for some concepts related to
linked lists.


Definition: 


A linked list with n elements is a data struc- 
ture whose n elements are associated with n con- 
secutive memory locations, where these n memory 
locations are linked one after another by arcs. The 
last arc in the list points to a special symbol nil 
denoting the end of the list. An arc (a, b) indicates
that the element associated with location a is
followed by the element associated with location b.
This arc is represented by value b in location a. a
is the tail of the arc and b is the head of the arc.


3. Previous Results for the Prefix Problem 


It is advantageous here to mention some of the previ- 
ous work on the prefix problem, since our algorithm uses 
some of the previous known algorithms as components and 
our work represents a continuation of the effort of these 
researchers.


The following well-known algorithm can be used to
solve the prefix problem.


PREFIX_INDEPENDENT(X[0..n-1]);
begin
    for k := 0 to ⌈log n⌉-1 do
        forall i: 2^k ≤ i < n do
            X[i] := X[i-2^k] ∘ X[i];
end.
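
As an illustration, the doubling scheme above can be simulated sequentially in Python. This is only a sketch: the synchronous `forall` is imitated by reading from a snapshot of the previous round, and the function name and the `op` parameter are assumptions of the example rather than part of the original algorithm.

```python
def prefix_independent(x, op):
    """Simulate the doubling prefix computation on a list x."""
    x = list(x)
    n = len(x)
    shift = 1                       # plays the role of 2^k
    while shift < n:
        prev = list(x)              # snapshot: all reads happen before writes
        for i in range(shift, n):   # "forall i: 2^k <= i < n"
            x[i] = op(prev[i - shift], prev[i])
        shift *= 2
    return x
```

Each round touches up to n elements and there are about log n rounds, which is where the O(n log n) total operation count discussed below comes from.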


Algorithm PREFIX_INDEPENDENT has a
corresponding version for the data dependent case, which
is expressed by Wyllie [11]:


PREFIX_DEPENDENT(X[0..n-1], NEXTX[0..n-1], HEAD);
forall i: 0 ≤ i < n do
begin
    temp := HEAD;
    if NEXTX[i] ≠ nil then NEXTX[NEXTX[i]] := i
    else HEAD := i;
    NEXTX[temp] := nil;
    for k := 1 to ⌈log n⌉ do
        if NEXTX[i] ≠ nil then
        begin
            X[i] := X[i] ∘ X[NEXTX[i]];
            NEXTX[i] := NEXTX[NEXTX[i]];
        end
end.
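
The pointer-jumping loop at the core of this procedure can be sketched in Python as follows. The sketch is a sequential simulation under stated assumptions: `None` stands for nil, the link-reversal preamble is omitted, and the function therefore returns, for every location, the product of the elements from that location to the end of the list. The name `pointer_jumping` is hypothetical.

```python
def pointer_jumping(x, nxt, op):
    """For each location i, combine x[i] with everything that follows it."""
    x, nxt = list(x), list(nxt)
    n = len(x)
    rounds = max(1, (n - 1).bit_length())    # ceil(log2 n) rounds
    for _ in range(rounds):
        old_x, old_nxt = list(x), list(nxt)  # snapshot for synchronous update
        for i in range(n):
            j = old_nxt[i]
            if j is not None:
                x[i] = op(old_x[i], old_x[j])
                nxt[i] = old_nxt[j]          # jump over the successor
    return x
```

With all values set to 1 and `op` set to addition, the result is the distance of each element from the end of the list, which is the list-ranking use mentioned in the introduction.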


If n processors are available both algorithms will take 
O (logn ) time. However, in no situation will they achieve 
linear speedup. This occurs because the total number of 
operations N involved in these algorithms is O(n logn ). 


With p processors the best one can hope for is $O\left(\frac{n \log n}{p}\right)$.


Kruskal et al. [6] discussed an improved version of
PREFIX_INDEPENDENT which achieves linear speedup.
The elements are partitioned into p blocks where block i
is $X[i \cdot \frac{n}{p} \,..\, (i+1) \cdot \frac{n}{p} - 1]$, $0 \le i < p$. Processor $P_i$ is assigned
to block i to calculate the prefix (and therefore the
product $p_i$) of block i. Then PREFIX_INDEPENDENT can
be used to obtain the prefix $p_i^*$ for the p products. Finally
$p_i^*$ is added to each element of block i+1 by processor $P_i$.
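
The blocked scheme can be sketched sequentially in Python. The block boundaries use a ceiling division and the three steps run one after another here; in the parallel setting each block would belong to one processor. `blocked_prefix` is an illustrative name only.

```python
from itertools import accumulate


def blocked_prefix(x, p, op):
    """Prefix computation in three steps over at most p blocks."""
    if not x:
        return []
    size = -(-len(x) // p)                       # ceil(n / p)
    blocks = [x[i:i + size] for i in range(0, len(x), size)]
    # Step 1: every processor computes the prefix of its own block.
    local = [list(accumulate(block, op)) for block in blocks]
    # Step 2: prefix over the block products (the last entry of each block).
    totals = list(accumulate((block[-1] for block in local), op))
    # Step 3: combine the preceding blocks' product into each later block.
    out = list(local[0])
    for i in range(1, len(local)):
        out.extend(op(totals[i - 1], value) for value in local[i])
    return out
```

Steps 1 and 3 cost O(n/p) per processor and step 2 works on only p values, which is why this variant can reach linear speedup.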


However, no technique similar to the one mentioned
above is known for algorithm PREFIX_DEPENDENT. A
handy improvement can be made to achieve $O\left(\frac{n}{p}\log p\right)$
time with p processors. Fig. 3 shows that a linked list can
be transformed into a linked tree of p branches, each with
length no more than $\frac{n}{p} + 1$. This can be done in
$O\left(\frac{n}{p}\log p\right)$ time with p processors by repeatedly assigning
NEXTX[i] := NEXTX[NEXTX[i]]. This is equivalent to
executing PREFIX_DEPENDENT for ⌈log p⌉ loops.
Thereafter each processor works on one branch. The
timing will be $O\left(\frac{n}{p}\log p\right)$.


Kruskal, Rudolph and Snir's result shows that with p
processors $O\left(\frac{\log n}{\log(2n/p)} \cdot \frac{n}{p}\right)$ time can be achieved for the
data dependent prefix computation [6]. The hard part of
the problem is to process arcs in an order that guarantees
that the arcs being simultaneously processed are not
chained together. By dividing the n elements into $\frac{n}{p}$
blocks, they apply p processors to visit the $\frac{n}{p}$ blocks one
at a time. During this phase, only arcs within a block are
processed. Since no processor will look at the head of the
arcs leaving the block which the processors are visiting, it
is ensured that arcs simultaneously processed are not
chained together. After this all the inter-block arcs are
treated so the whole problem is divided into $\frac{n}{p}$ equal size
subproblems. By distributing processors evenly to all the
subproblems and applying recursion all the arcs will be
processed. This is one pass of the so-called pair-off. After
pair-off each remaining element must have both of its
neighbors eliminated, thus the number of remaining
elements is at most $\frac{2}{3}n$. Elements are compacted and
another pass of pair-off is applied. The number of
elements left after each pass will form a geometric series and
consequently, time complexity of $O\left(\frac{\log n}{\log(2n/p)} \cdot \frac{n}{p}\right)$ is
obtained.


4. A Parallel Bucket Sorting Algorithm 


In this section a parallel bucket sorting algorithm is
presented. This algorithm sorts n numbers in the range
$\{1, 2, \ldots, m\}$ using p processors. The sorting requires time
$O\left(\left\lceil \frac{\log m}{\log(n/p + \log p)} \right\rceil \left(\frac{n}{p} + \log p\right)\right)$.


To sort n elements in the range $\{1, 2, \ldots, m\}$ with p
processors, we divide the elements into $\lceil n/p \rceil$ blocks. These
blocks will be visited by the p processors sequentially.
Suppose element $j$ valued $v_j$ is visited by processor $P_i$;
the element will be dropped into the $((v_j - 1) \cdot p + i)$-th
bucket. After all the elements are dropped into buckets,
packing the elements will yield the sorted sequence. The
details of the algorithm are shown below.


BUCKETSORT(A[1..n])   /* 1 ≤ A[i] ≤ m, 1 ≤ i ≤ n. */

processors: P_i, 1 ≤ i ≤ p;

array: S[1..m·p],   /* Buckets */
       C[1..n],     /* Counter for each element to
                       calculate its rank. */
       G[1..m];     /* G[i] is the total number
                       of occurrences of elements
                       valued i. */

forall P_i: 1 ≤ i ≤ p
begin
    /* Initialization. Takes O(m) time. */
    S[1..m·p] := 0;

    /* Throwing elements into buckets.
       Takes O(n/p) time. */
    for k := 1 to ⌈n/p⌉ step 1
    begin
        C[(k-1)·p+i] := S[(A[(k-1)·p+i]-1)·p+i];
        S[(A[(k-1)·p+i]-1)·p+i] :=
            S[(A[(k-1)·p+i]-1)·p+i] + 1;
    end

    /* G[i] is going to be returned to the calling
       procedure. Takes O(m + log p) time. */
    forall k: 1 ≤ k ≤ m do G[k] := Σ_{i=(k-1)·p+1}^{k·p} S[i];

    /* Packing. Takes O(n/p + m + log p) time. */
    PREFIX_INDEPENDENT(S[1..m·p]);
    for k := 1 to ⌈n/p⌉ step 1
    begin
        C[(k-1)·p+i] := C[(k-1)·p+i]
            + S[(A[(k-1)·p+i]-1)·p+i-1];
    end
end
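
A sequential Python simulation may help in following the index arithmetic of BUCKETSORT. It is a sketch under stated assumptions: indices are 0-based, processor i handles positions i, i+p, i+2p, ..., and the parallel prefix over the buckets is replaced by `itertools.accumulate`. The name `bucket_sort_ranks` is hypothetical.

```python
from itertools import accumulate


def bucket_sort_ranks(a, m, p):
    """Return (rank, g): the final position of each a[j] and per-value counts."""
    n = len(a)
    s = [0] * (m * p)                  # bucket for (value v, processor i)
    c = [0] * n                        # position of a[j] inside its bucket
    for j, v in enumerate(a):
        bucket = (v - 1) * p + j % p
        c[j] = s[bucket]               # how many arrived in this bucket earlier
        s[bucket] += 1
    # g[v-1] is the number of occurrences of value v.
    g = [sum(s[(v - 1) * p:v * p]) for v in range(1, m + 1)]
    # Exclusive prefix sums give the first output slot of every bucket.
    start = [0] + list(accumulate(s))[:-1]
    rank = [c[j] + start[(a[j] - 1) * p + j % p] for j in range(n)]
    return rank, g
```

Since no two processors ever share a bucket, each bucket counter is updated by a single processor, which is what lets the dropping phase avoid write conflicts.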


For n elements ranging from 1 to m, BUCKETSORT
takes $O\left(\frac{n}{p} + m + \log p\right)$ time units with p processors.
When $m > \frac{n}{p} + \log p$, we can use the idea of radix
sorting. The $\log m$ bits representing the numbers are
divided into $\left\lceil \frac{\log m}{\log(n/p + \log p)} \right\rceil$ blocks. Starting from the
least significant block of bits, BUCKETSORT can be
applied. After applying BUCKETSORT
$\left\lceil \frac{\log m}{\log(n/p + \log p)} \right\rceil$
times, once for each block of bits, the sorting is
accomplished. The total time will be
$O\left(\left\lceil \frac{\log m}{\log(n/p + \log p)} \right\rceil \left(\frac{n}{p} + \log p\right)\right)$ since the m in the
BUCKETSORT is $\frac{n}{p} + \log p$ now.
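
The radix idea can be sketched in Python with a stable counting pass standing in for BUCKETSORT. This is an illustration, not the paper's procedure: the block width `bits_per_block` is a free parameter here (the text chooses it as roughly log(n/p + log p)), and the pass is a plain sequential counting sort chosen because each pass must keep equal keys in their previous order.

```python
def counting_pass(keys, items, radix):
    """Stable distribution of items by key, 0 <= key < radix."""
    count = [0] * (radix + 1)
    for k in keys:
        count[k + 1] += 1
    for d in range(radix):
        count[d + 1] += count[d]       # count[d] = first slot for key d
    out = [None] * len(items)
    for k, item in zip(keys, items):
        out[count[k]] = item
        count[k] += 1
    return out


def radix_sort(a, m, bits_per_block):
    """Sort values in 1..m, one block of bits per pass, least significant first."""
    total_bits = max(1, (m - 1).bit_length())
    blocks = -(-total_bits // bits_per_block)   # ceil(total / width)
    mask = (1 << bits_per_block) - 1
    out = list(a)
    for blk in range(blocks):
        shift = blk * bits_per_block
        keys = [((v - 1) >> shift) & mask for v in out]
        out = counting_pass(keys, out, mask + 1)
    return out
```

For example, with m = 16 and two bits per block the sort needs two passes, each over a key range of only four values.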


We note here that this sorting algorithm can also be 
used to pack data. Assigning 1 to each marked element 
and 2 to each unmarked element, execution of this sorting 
algorithm will pack the marked elements to the beginning 
of the array and unmarked elements to the end of the 
array. 


5. The Main Algorithm 


Our algorithm for the data dependent prefix problem
is presented in this section. This algorithm uses the
observation stated before, i.e. an appropriate product
computation algorithm can be used to construct an algorithm for
prefix computation. Thus we are focusing on constructing
a suitable algorithm for product computation. The idea of
Kruskal et al. that appropriate pairs of elements are
combined such that one pass of combining (or 'pair-off') will
reduce the number of elements by at least one third is
followed here. However, instead of using p processors to cut
n elements into $\frac{n}{p}$ blocks we observed that the n arcs in
the linked list can be divided into $O(\log n)$ groups and all
the arcs in one group can be eliminated simultaneously by
one 'pair-off' pass. The details of this idea are expressed
below.


We divide arcs into groups by the following rule: arc
(a, b) is in group $2k - a_k$, where
$k = \min\{\, i \mid \text{the } i\text{-th bit of } a \text{ XOR } b \text{ is } 1 \,\}$,
XOR is the exclusive-or operation, bits are counted from
the least significant bit starting at 1, and $a_k$ is the k-th bit of
a. We prove the following lemma.


Lemma:


a) Each arc belongs to one and only one group.

b) For any group G, if two different arcs (a, b) ∈ G and
   (c, d) ∈ G, then a, b, c, d are all distinct.


Proof:


Part a) is obvious from the way we form groups.
Now we prove part b).


Assume G is an arc group indexed by $2i - a_i$. $a \ne c$
since no two arcs can originate from one element, and
$b \ne d$ since no two arcs can have the same target. Since
both arcs are in G, $a_i = c_i$. Now by the definition of
grouping, $b_i = d_i = (a_i \text{ XOR } 1)$. Hence a, b, c, d are all
distinct.


Q.E.D.
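
The grouping rule and part b) of the lemma can be checked on a concrete list with a short Python sketch. The helper names `arc_group` and `check_lemma` are hypothetical, locations are 0-based integers, and `None` stands for nil.

```python
def arc_group(a, b):
    """Group index 2k - a_k for arc (a, b); bits are numbered from 1."""
    diff = a ^ b
    low = diff & -diff             # lowest bit in which a and b differ
    k = low.bit_length()           # its 1-based position
    a_k = (a >> (k - 1)) & 1       # k-th bit of the tail a
    return 2 * k - a_k


def check_lemma(nxt):
    """True if the arcs of every group have pairwise distinct endpoints."""
    groups = {}
    for a, b in enumerate(nxt):
        if b is not None:
            groups.setdefault(arc_group(a, b), []).append((a, b))
    for arcs in groups.values():
        ends = [v for arc in arcs for v in arc]
        if len(ends) != len(set(ends)):
            return False
    return True
```

For instance, the arc (5, 13) differs first in bit 4 and bit 4 of 5 is 0, so it falls into group 8, while the reversed arc (13, 5) falls into group 7.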


By the above lemma, we know pairs of elements
connected by arcs in a group can be combined simultaneously
without interfering with each other. Since at most
$\lfloor \log n \rfloor + 1$ bits are used to represent the elements in the
linked list, at most $2\lfloor \log n \rfloor + 2$ groups are needed. To
determine, for arc (a, b), which group it belongs to, we
compute


c := a XOR b;
c := (((c-1) XOR c)+1)/2;


This will set all the bits of c to zeros except the lowest bit
valued 1. Now we use value c to index into a table T to
determine which bit of c is 1. Table T can be built by
the following simple procedure.


TABLE(n, p)
begin
    for j := 0 to log n step 1
        T[2^j] := j+1;
end.
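
The bit trick and the table lookup can be tried out with a small Python sketch. A dictionary stands in for the array T so that only the power-of-two entries are stored; that choice, and the function names, are assumptions of the example.

```python
def build_table(n):
    """T[2^j] = j + 1 for j = 0 .. floor(log2 n)."""
    return {1 << j: j + 1 for j in range(n.bit_length())}


def lowest_set_bit(c):
    """Keep only the lowest 1 bit of c, using the arithmetic from the text."""
    return (((c - 1) ^ c) + 1) // 2


T = build_table(16)
c = lowest_set_bit(5 ^ 13)     # 5 and 13 differ only in bit 4
bit_position = T[c]            # table lookup instead of a bit scan
```

Here `(c - 1) ^ c` produces a run of ones up to and including the lowest set bit, so adding 1 and halving leaves exactly that bit.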


It is possible that several processors attempt to read the 


same entry of the above table simultaneously. We can use 


this table on the CREW model anyway. The time for table 
building will not dominate the overall time complexity. 


The procedure for determining the arcs' group membership is
as follows.


MEMBER(X[1..n], NEXTX[1..n])
forall P_i: 1 ≤ i ≤ p
begin
    for k := 1 to n/p step 1
    begin
        M[i] := X[(i-1)·n/p+k] XOR NEXTX[(i-1)·n/p+k];
        M[i] := (((M[i]-1) XOR M[i])+1)/2;
        A[(i-1)·n/p+k] := 2·T[M[i]];
        if (M[i] AND X[(i-1)·n/p+k]) > 0 then
            A[(i-1)·n/p+k] := A[(i-1)·n/p+k]-1;
    end
end.


After the group of each arc is calculated, the bucket
sorting algorithm is used to move arcs so that arcs in the
same group are gathered together. Now, one group at a
time, all the arcs are visited. This will take no more than
$\sum_{i=1}^{2\lfloor \log n \rfloor + 2} \left\lceil \frac{|G(i)|}{p} \right\rceil = O\left(\frac{n}{p} + \log n\right)$
time, where $|G(i)|$ is the number of arcs in group i. Specifically,
when the head of an arc visited by a processor is already
deleted due to the previous processing of that arc's
predecessor then the processor does nothing, otherwise it
combines the head and the tail of the arc, marks the tail
deleted, and updates the arc to point to the successor of its
successor. After one pass of this 'pair-off', any remaining
element must have both its predecessor and successor
marked deleted. So at most $\frac{2}{3}n$ elements can survive.
The remaining elements are packed and after $O\left(\log\frac{n}{p}\right)$
passes, there are no more than p elements left. Finally the
remaining elements are combined by algorithm
PREFIX_DEPENDENT. The following gives the details of
our algorithm for the CREW model.


PRODUCT_PREFIX(X[1..n], NEXTX[1..n], HEAD)
begin
    TABLE(n, p);
    COMBINE(X[1..n], NEXTX[1..n], HEAD);
end


COMBINE(X[1..n], NEXTX[1..n], HEAD);
forall P_i: 1 ≤ i ≤ p
begin
    DELETED[1..n] := false;
    if n ≤ p then
        PREFIX_DEPENDENT(X[1..n], NEXTX[1..n], HEAD);
    else
    begin
        /* Calculating group membership. */
        MEMBER(X[1..n], NEXTX[1..n]);

        /* Bring elements of the same group together. */
        /* BUCKETSORT returns each element's rank
           in array C. */
        BUCKETSORT(A[1..n]);
        for k := 1 to n/p step 1
            B[C[(i-1)·n/p+k]] := (i-1)·n/p+k;

        /* Visiting arcs one group at a time. */
        /* Array G is returned by BUCKETSORT.
           G[k] is the total number of elements
           in group k. */
        j := 1;
        for k := 1 to 2 log n + 2 step 1
        begin
            while G[k] > p do
            begin
                PAIR-OFF(B[j..j+p]);
                G[k] := G[k]-p;
            end
            if G[k] ≠ 0 then PAIR-OFF(B[j..j+G[k]]);
            j := j+G[k];
        end

        /* Now pack. PACK is a simplified version
           of BUCKETSORT. PACK returns the size
           of the packed array. */
        n' := PACK(A[1..n]);
        COMBINE(X[1..n'], NEXTX[1..n'], HEAD);
    end
end


PAIR-OFF(B[a..b])
begin
    if b-a < p then Disable P_i: i > b-a+1;
    if NOT DELETED[B[a+i-1]] then
    begin
        X[B[a+i-1]] := X[B[a+i-1]] ∘
                       X[NEXTX[B[a+i-1]]];
        DELETED[NEXTX[B[a+i-1]]] := true;
        NEXTX[B[a+i-1]] :=
            NEXTX[NEXTX[B[a+i-1]]];
    end
    Enable P_i: 1 ≤ i ≤ p;
end
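
The overall effect of repeated pair-off passes can be followed with a sequential Python simulation. It is a sketch under several assumptions: locations are 0-based, every location belongs to the list, `None` stands for nil, the sorting and packing steps are replaced by a dictionary of groups rebuilt in each pass, and the loop simply runs until only the head remains instead of switching to PREFIX_DEPENDENT. All function names are hypothetical.

```python
def arc_group(a, b):
    """Group index 2k - a_k of arc (a, b), bits numbered from 1."""
    low = (a ^ b) & -(a ^ b)           # lowest bit where a and b differ
    k = low.bit_length()
    return 2 * k - (1 if a & low else 0)


def pair_off_product(x, nxt, head, op):
    """Combine the whole list into the head element by pair-off passes."""
    x, nxt = list(x), list(nxt)
    deleted = [False] * len(x)
    while nxt[head] is not None:
        # Group the surviving arcs by their tail location.
        groups = {}
        for a, b in enumerate(nxt):
            if not deleted[a] and b is not None:
                groups.setdefault(arc_group(a, b), []).append(a)
        # Visit the groups one at a time; arcs inside a group are disjoint.
        for g in sorted(groups):
            for a in groups[g]:
                if not deleted[a]:
                    b = nxt[a]
                    x[a] = op(x[a], x[b])   # absorb the successor
                    deleted[b] = True
                    nxt[a] = nxt[b]         # point past the absorbed element
    return x[head]
```

Using string concatenation as the operation is a convenient way to confirm that the elements are combined in list order, since concatenation is associative but not commutative.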


Each execution of procedure MEMBER takes $O\left(\frac{n}{p}\right)$
time and each execution of procedure BUCKETSORT
takes $O\left(\frac{n}{p} + \log n\right)$ time, for the n numbers range from 1
to $2\log n + 2$. As mentioned before, visiting groups of arcs
takes $O\left(\frac{n}{p} + \log n\right)$ time, and packing can be easily verified
to take $O\left(\frac{n}{p} + \log p\right)$ time. Thus one pass of the
combining process, i.e. an execution of procedure COMBINE,
takes $O\left(\frac{n}{p} + \log n\right)$ time. After executing COMBINE no
more than $\frac{2}{3}n$ elements are left, and hence the following
recursive formula for time complexity $T_p$ can be
established.

$T_p(n) = O\left(\frac{n}{p} + \log n\right) + T_p\left(\frac{2}{3}n\right)$.

Since $O\left(\log\frac{n}{p}\right)$ passes of COMBINE are used and the
remaining less than p elements can be combined in
$O(\log p)$ time units, we have $T_p(n) = O\left(\frac{n}{p} + \log n \log\frac{n}{p}\right)$.


All the operations in the algorithm except the ones for
calculating the group membership of the arcs are EREW
operations.
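
The recurrence can be unrolled as follows. This is a sketch that assumes $p \le n/2$, so that $\log\frac{n}{p} \ge 1$ and the final $O(\log p)$ term is absorbed, and it writes $c$ for the hidden constant; both are assumptions of this illustration.

```latex
\begin{align*}
T_p(n) &\le c\left(\frac{n}{p} + \log n\right) + T_p\!\left(\tfrac{2}{3}n\right) \\
       &\le \sum_{i=0}^{k-1} c\left(\left(\tfrac{2}{3}\right)^{i}\frac{n}{p} + \log n\right) + O(\log p),
          \qquad k = \left\lceil \log_{3/2}\frac{n}{p} \right\rceil \\
       &\le 3c\,\frac{n}{p} + c\,k\log n + O(\log p) \\
       &= O\!\left(\frac{n}{p} + \log n \log\frac{n}{p}\right).
\end{align*}
```

The geometric series contributes the $\frac{n}{p}$ term, while the $\log n$ cost paid in each of the $k = O\left(\log\frac{n}{p}\right)$ passes contributes the product term.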


For the EREW model, one way to determine the
group membership for an arc (a, b) is to continuously
bisect the bits of a XOR b and ask if the lower half of
the bits are 0's. In this binary splitting fashion, we can
calculate the group membership for all the arcs in time
$O\left(\frac{n}{p}\log\log n\right)$. The $\log\log n$ factor can be eliminated by
using p copies of table T, one copy for each processor.
This table TE, which is of size $n \cdot p$, has only $(\log n + 1) \cdot p$
entries which could possibly be indexed into. These entries
are TE[i][j], $1 \le i \le p$, $j = 2^k$, $0 \le k \le \log n$. Table
TE[1..p][1..n] can be built by copying relevant entries of
table T. Distribute processors such that $P_i$,
$(k-1) \cdot \frac{p}{\log n + 1} + 1 \le i \le k \cdot \frac{p}{\log n + 1}$, $1 \le k \le \log n + 1$, are
assigned to entries TE[1..p][$2^{k-1}$]. These processors will
make p copies of T[$2^{k-1}$] and put them in
TE[1..p][$2^{k-1}$]. The total copying operation will take
$O(\log n)$ time. The procedure for determining arcs' group
membership is obtained by merely replacing the statement

A[(i-1)·n/p+k] := 2·T[M[i]]

in procedure MEMBER by

A[(i-1)·n/p+k] := 2·TE[i][M[i]].


The table storage technique due to Fredman, Komlós
and Szemerédi [3] can be used here to reduce the size of the
index table to $O(p \cdot \log n)$ while retaining constant time for
table lookup, i.e. $O(\log n)$ space is enough for one copy of
the index table. Although the sequential time complexity of
the algorithm [3] for building a single copy of the index
table is $O(\log^4 n)$, $O\left(\frac{n}{p} + \log n \log\frac{n}{p}\right)$ time is more than
enough when p processors are employed to build a single
copy of the index table. Thus, we can first create a single
copy of the index table with the help of p processors, then
duplicate the table to p copies. The time complexity of
$O\left(\frac{n}{p} + \log n \log\frac{n}{p}\right)$ still holds. The algorithm just obtained
is an EREW algorithm.


As mentioned before, for the prefix computation we
execute our algorithm first and then reverse the computa- 
tion and pop out products stored at each node of the pro- 
duct tree. 


6. Conclusion 


Parallel algorithms allowing a maximum number of 
processors to achieve linear speedup for the data indepen- 
dent prefix problem have already been obtained, even on 
weaker models such as ultracomputers [9]. This is not the
case in the data dependent case. We conclude this paper
by raising the question: is an $O\left(\frac{n}{p} + \log n\right)$ time bound
achievable for the data dependent prefix problem on the
EREW model?


Fig. 1. Two product trees. $\circ$ is addition here.

Fig. 2. Use a product tree to construct a prefix
algorithm. $x_{i,j}$ denotes $\bigcirc_{k=i}^{j} x_k$.

Fig. 3. A linked list of n items can be
transformed into a tree of maximum depth
$\frac{n}{p} + 1$ in $O\left(\frac{n}{p}\log p\right)$ time, using p processors.


ABSTRACT


The problem of range search arises in many applications 
in areas such as information retrieval, database, robotics and 
computational geometry. There are a good number of sequen- 
tial algorithms for this problem based on data structures such 
as k-d tree, quad tree, range tree, d-fold tree, super B-tree, 
overlapping k-ranges, non-overlapping k-ranges, etc. In this 
paper we present a parallel algorithm for a Single Instruction 
Multiple Data (SIMD) computing system with p processing 
elements. We make use of the linearized multiple attribute 
tree as the underlying data structure. Our algorithm has the 
complexity of O(kN/p), p ≤ N, where k is the dimensionality
and N is the number of points of the data space.


0. INTRODUCTION 


The problem of range search arises in many application 
areas such as information retrieval, database, robotics, and 
computational geometry. There have been many sequential 
algorithms for the range search problem using data structures
such as k-d tree, quad tree, range tree, overlapping k-ranges, 
nonoverlapping k-ranges, d-fold tree, super B-tree, k-fold tree, 
and multiple attribute tree [1,3,5-7]. 


In this paper, we present a parallel algorithm for the 
range search problem. We make use of the Multiple Attribute 
Tree (MAT) as the underlying data structure. Our model of 
computation is a Single Instruction Multiple Data (SIMD) 
computing system, consisting of p processing elements. The 
system has a single shared memory that supports simultaneous 
reads. The time complexity of our algorithm is given by 
O(kN/p), p ≤ N.

The organization of this paper is as follows: In Section 1, 
we discuss the MAT data structure and the corresponding 
directory. In Section 2 the basic range search algorithm is discussed,
and its complexity is estimated. First, an O(k) algorithm is obtained
in a straightforward manner using a processor array of size N. Then an
O(kN/p), p ≤ N, algorithm is obtained using a processor array of size p
and an augmented directory.


1. THE MULTIPLE ATTRIBUTE TREE 


The MAT data structure was first introduced and
analyzed by Kashyap et al. [2]. In [4], the MAT is shown to
outperform the inverted file structure for partial match and
complete match queries, both when the directory resides in
main memory and when it resides in secondary memory. An
exhaustive treatment of various structural aspects of the MAT
can be found in [5].
The k-dimensional MAT on k attributes $A_1, A_2, \ldots, A_k$ for
a set of records is defined as a tree of depth k, with the following
properties [4]:

i) it has a root at level 0,

ii) each child of the root is a (k-1)-dimensional MAT, on
(k-1) attributes $A_2, A_3, \ldots, A_k$, for the subset of the records
that have the same $A_1$ value. This value is the value of
the root of the corresponding (k-1)-dimensional MAT,
and

iii) the child nodes of the root are in ascending order of
their values. This set of child nodes is called the filial set. □


From the above definition we note that there is a root
node at level 0, and it does not have any value. Every node of
level i, i = 1, 2, ..., k, corresponds to a value of the attribute $A_i$.
Fig. 1(b) shows the MAT data structure for the sorted set of
records of Fig. 1(a). The attributes $A_1, A_2, \ldots, A_k$ form the
hierarchy of levels 1 through k. The nodes of every filial
set are ordered according to their values. Another important
property is that each record or point is represented by a unique
path from the root to the corresponding terminal node. The
total number of nodes in the MAT is at most kN for a given
data set containing N records or points.


The MAT is linearized and stored as a directory. We
make use of the breadth-first linearization, in which the nodes
are stored in the order they are encountered when the MAT is
traversed in a breadth-first manner. The directory is an array
of M (≤ kN) directory elements, and each directory element
has the following fields:


directory-element = record
    value: 1..M;
    first-child: 1..M;
    last-child: 1..M;
end;


where the fields corresponding to a node T, at a given level
and numbered n, are defined as follows:

value:       the value of the node T;
first-child: node number of the first child of the child set of T;
last-child:  node number of the last child of the child set of T;


The idea of breadth-first linearization is illustrated in
Fig. 2 and the corresponding directory is shown in Fig. 3. The
first-child fields of leaf nodes contain the pointers to the
corresponding records. The time complexity of constructing
this directory using a uniprocessor system is O(N log N)
(including the time required to sort the records on all the
attributes) [5]. Here, we are concerned only with answering the
range query, assuming that the directory is available in
memory. This directory is utilized in the design of parallel
range search algorithms in the next section.
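
The breadth-first linearization described in this section can be illustrated with a short Python sketch. It is only one possible construction, not the method of [5]: the function name `build_directory`, the 0-based node numbering, the storage of the valueless root at index 0, and the use of parallel lists for the three fields are all assumptions made for the illustration.

```python
from collections import deque

def build_directory(records):
    """Build a breadth-first MAT directory from a list of k-attribute tuples.

    Returns (sorted records, value, first_child, last_child). Index 0 is the
    valueless root (an assumption of this sketch). For leaf nodes, first_child
    and last_child hold positions in the sorted record list (record pointers).
    """
    recs = sorted(records)
    k = len(recs[0])
    value, first_child, last_child = [None], [None], [None]
    # Queue entries: (node number, level of node, record range [lo, hi) below it).
    queue = deque([(0, 0, 0, len(recs))])
    while queue:
        node, level, lo, hi = queue.popleft()
        if level == k:                       # leaf: point at the record(s)
            first_child[node], last_child[node] = lo, hi - 1
            continue
        first_child[node] = len(value)
        i = lo
        while i < hi:                        # one child per distinct attribute value
            j = i
            while j < hi and recs[j][level] == recs[i][level]:
                j += 1
            value.append(recs[i][level])
            first_child.append(None)
            last_child.append(None)
            queue.append((len(value) - 1, level + 1, i, j))
            i = j
        last_child[node] = len(value) - 1
    return recs, value, first_child, last_child

recs, value, first_child, last_child = build_directory([(1, 3), (2, 2), (1, 2)])
print(value)        # node values in breadth-first order
print(first_child)
print(last_child)
```

Because children are numbered when their parent leaves the FIFO queue, all nodes of one level receive consecutive numbers, which is the property the parallel algorithms of the next section rely on.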

2. PARALLEL RANGE SEARCH ALGORITHM 


A range query is given by $Q = \bigcap_{i=1}^{k} q_i$, where $q_i$ specifies
the range $[l_i, h_i]$ at level i. In geometric terms, a range
query specifies a rectilinearly oriented hyper-rectangle in
k-dimensional space. Answering a range query calls for the
retrieval of the points enclosed by the specified hyper-rectangle,
and these points form a 'sub' MAT, called the query
MAT (QMAT), on the original MAT. The nodes of the
QMAT are called the qualified nodes of the MAT for the query
Q. Answering a range query involves identifying the nodes of
the QMAT on the MAT. Let R be the number of records
contained in the hyper-rectangle. The number of nodes of the
QMAT is no more than kR.


An algorithm for range query proceeds by descending
the MAT level by level, starting from the first level. At
each level i, the child sets of the qualified nodes of the previous
level are searched for containment in the range $[l_i, h_i]$. The
algorithm is as follows:


algorithm RANGE-SEARCH(level, qnodes);
begin
    tempset := ∅;
    for each n ∈ qnodes do
        add to tempset all the child nodes of n that lie in
        the range [l_level, h_level];
    if level < k
        then RANGE-SEARCH(level+1, tempset)
        else return the pointers given by the elements of tempset;
end


FIG. 2. Breadth-first linearization


In the algorithm RANGE-SEARCH, at a level i, all the
nodes that satisfy the partial query $\bigcap_{j=1}^{i} q_j$ are collected as
tempset. In the next level i+1, the child nodes of these nodes
are tested for inclusion in the range $[l_{i+1}, h_{i+1}]$. At the final
level, the pointers to the information about the records that
satisfy the given query are retrieved.
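
A sequential rendering of RANGE-SEARCH in Python may make the level-by-level descent concrete. This is a sketch under stated assumptions: the directory is held in three parallel lists, nodes are numbered from 0 with the valueless root at index 0, the recursion is unrolled into a loop, and the small directory below is a hypothetical example for the records (1,2), (1,3), (2,2).

```python
def range_search(value, first_child, last_child, low, high, k):
    """Level-by-level RANGE-SEARCH over a breadth-first directory.

    low[i-1] and high[i-1] give the inclusive range for level i.
    Returns the record pointers stored in the qualified leaf nodes.
    """
    qnodes = [0]                            # the root qualifies trivially
    for level in range(1, k + 1):
        tempset = []
        for n in qnodes:
            for c in range(first_child[n], last_child[n] + 1):
                if low[level - 1] <= value[c] <= high[level - 1]:
                    tempset.append(c)
        qnodes = tempset
    # At the final level, first_child holds the record pointer of each leaf.
    return [first_child[n] for n in qnodes]

# Hypothetical directory for the sorted records (1,2), (1,3), (2,2).
value       = [None, 1, 2, 2, 3, 2]
first_child = [1, 3, 5, 0, 1, 2]
last_child  = [2, 4, 5, 0, 1, 2]

# Query: first attribute in [1, 2], second attribute in [2, 2].
print(range_search(value, first_child, last_child, [1, 2], [2, 2], k=2))
```

For this query the qualified leaves are those of records 0 and 2, that is (1,2) and (2,2).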


The MAT data structure has some built-in parallelism
with respect to answering a range query. The search for
the qualified nodes can be carried out simultaneously on the
sub-MATs of the same level. However, there seems to be a
stringent sequentiality in the way the attributes are processed
one after the other. In geometric terms, the processing
of each attribute reduces the dimensionality of the
search space by one. From the above discussion, we conclude
that Ω(k) is a lower bound on the number of steps involved in
answering a range query with the MAT-based approach.


The model of computation is an array of
processor elements $PE_j$, j = 1, 2, ..., p, operating in Single
Instruction Multiple Data (SIMD) mode. More
specifically, each processor element executes the same algorithm
in synchronism with all the other processor elements.
We use a single shared memory in which the breadth-first MAT
directory and other variables are stored. Simultaneous read
operations are supported and, as will be shown later, there are
no conflicts in writing into the memory. The range
query is represented by low-limit[i] and high-limit[i],
i = 1, 2, ..., k, which give the lower and upper limits of the range
for the attribute $A_i$.


First, consider the processor array $PE_i$, i = 1, 2, ..., N,
where N is the number of records in the input set. The array
base[i], i = 1, 2, ..., k+1, is stored in the memory, where any
node of level i lies between the entries indexed by base[i]
and base[i+1]-1 in the directory. We use the array enable[i],
i = 1, 2, ..., N, to selectively enable and disable the processor
elements of the processor array. The algorithm consists of k
steps, and in step i, i = 1, 2, ..., k, level i is processed to obtain
the qualified nodes of level i. Each enabled processing
element acts on a child node of a qualified node of the previous
level, and checks whether the child node satisfies the range
constraint of the current level. The processor elements
corresponding to the qualified nodes of the current level are
enabled in the next step. For the final level, the points that
satisfy the range query are obtained. The algorithm executed
by the processor element $PE_j$, j = 1, 2, ..., N, for step i,
i = 1, 2, ..., k, is as follows:


algorithm PE_j
begin
1.  if enable[j]
2.  then
    begin
3.      enable[j] = false;
4.      n = base[i] + j;
5.      if low-limit[i] ≤ value[n] ≤ high-limit[i]
6.      then
        begin
7.          for l = first-child[n] to last-child[n] do
8.              enable[l - base[i+1]] = true;
        end;
    end;
end;


In the above algorithm the qualified records have to be
returned at the final level by suitably modifying lines 7
and 8. This parallel algorithm implements the algorithm
RANGE-SEARCH, and it is easily seen that it correctly
answers the range query. Since the algorithm is executed in k
synchronous steps, it is evident that the time complexity of
this algorithm is O(k).


In applications involving large amounts of data, the value
of N could be very large, and the assumption of an array of N
processors is not very pragmatic. We now present an algorithm
that utilizes an array $PE_i$, i = 1, 2, ..., p, p ≤ N, of processor
elements. The basic idea of 'processing the MAT nodes level
by level' is still followed; at any level the processor array is
invoked the required number of times along the breadth of the MAT.
In this case the directory is augmented with two more fields,
enable and level. The former is used to selectively enable and
disable the processor elements, and the latter gives the level
of the node. The global variable current-level gives the number
of the level that is currently being processed, and it is incremented
after each level is processed. The array index[i], i = 1, 2, ..., p,
gives the current index (into the directory) to be used by the
processor element $PE_i$. Initially, the enable field is made true
for all the nodes of level 1, and the entries of the array
index corresponding to the first-level nodes are filled in. For
the final level, the qualified records are retrieved. The algorithm
executed by the processor element $PE_j$, j = 1, 2, ..., p, is as
follows:

algorithm PE_j
begin
1.  l = index[j];
2.  if ((current-level = level[l]) and (enable[l]))
3.  then
    begin
4.      if (low-limit[current-level] ≤ value[l]
                ≤ high-limit[current-level])
5.      then
        begin
6.          for m = first-child[l] to last-child[l] do
7.              enable[m] = true;
        end;
8.      index[j] = l + p;
9.      enable[l] = false;
    end;
end;
The if statement in line 2 makes sure that all processors
of the SIMD array process only the nodes of the current level.
After the first p nodes of a level are processed, the next p
nodes are processed by the use of the array index as in line 8.
Lines 6 and 7 are to be modified to retrieve the qualified
records at the final level. It is straightforward to see that the
range query is processed correctly.
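
The behaviour of the p-processor algorithm can be explored with a small sequential simulation in Python. It is a sketch, not the authors' implementation, and it rests on several assumptions: the p processor elements are simulated one after another inside each synchronous step, nodes are numbered from 0 with the valueless root at index 0, the index array is reset to the first node of each level when current-level is incremented, and a processor advances its index whenever its node lies on the current level, whether or not the node is enabled. The directory below is a hypothetical example for the records (1,2), (1,3), (2,2).

```python
def simd_range_search(value, first_child, last_child, level, low, high, k, p):
    """Simulate p lock-step PEs; return (record pointers, array invocations)."""
    M = len(value)
    enable = [lvl == 1 for lvl in level]        # all level-1 nodes start enabled
    result, invocations = [], 0
    for current in range(1, k + 1):
        start = level.index(current)            # first directory entry of this level
        index = [start + j for j in range(p)]   # PE j starts at the j-th node
        while any(i < M and level[i] == current for i in index):
            invocations += 1
            for j in range(p):                  # one synchronous SIMD step
                l = index[j]
                if l < M and level[l] == current:
                    in_range = low[current - 1] <= value[l] <= high[current - 1]
                    if enable[l] and in_range:
                        if current < k:
                            for m in range(first_child[l], last_child[l] + 1):
                                enable[m] = True
                        else:                   # final level: retrieve the record
                            result.append(first_child[l])
                    enable[l] = False
                    index[j] = l + p
    return result, invocations

value       = [None, 1, 2, 2, 3, 2]
first_child = [1, 3, 5, 0, 1, 2]
last_child  = [2, 4, 5, 0, 1, 2]
level       = [0, 1, 1, 2, 2, 2]

print(simd_range_search(value, first_child, last_child, level,
                        [1, 2], [2, 2], k=2, p=2))
```

Writes to enable only touch nodes of the next level and each node has a single parent, so the sequential simulation of a step gives the same result as a truly synchronous one. The invocation count returned is the sum over levels of the number of nodes of the level divided by p, rounded up.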


THEOREM: The time complexity of the parallel range
search algorithm on an SIMD system of p processors is
O(kN/p).
PROOF: For the execution of the algorithm, at any level i,
the maximum number of times the processor array is invoked
is given by

$$\lceil (\text{number of nodes of level } i)/p \rceil .$$

The total number of times the processor array is invoked is

$$\sum_{i=1}^{k} \lceil (\text{number of nodes of level } i)/p \rceil
\le \Big( \sum_{i=1}^{k} (\text{number of nodes of level } i) \Big)/p + k = O(kN/p + k).$$

Since p ≤ N, the additive term k is itself O(kN/p).
Hence, the time complexity of this algorithm is O(kN/p). □


We note that for the case p = N this algorithm has the
same complexity as the earlier one. However, the earlier algorithm
is superior, as it uses less memory space. We also note that the
maximum number of times the processor array is invoked is given by
k(N/p + 1). This upper bound corresponds to the worst-case
structure of the MAT. In the average case, the number of times
that the processor array is invoked will be less than this
bound. The authors are presently working on this aspect.


3. CONCLUSIONS 


We have developed an O(kN/p) algorithm for the range
search problem using the linearized MAT data structure and
an SIMD computing system. Since the attributes are processed
one after the other, the MAT-based parallel algorithm has a
lower bound complexity of Ω(k) on a processor array
containing no more than N processor elements.
Abstract. A fundamental measure of processing power
in a database management system is the performance of the 
sort utility it provides. When sorting a large data file on a 
serial computer, performance is limited by factors involving 
processor speed, memory capacity, and I/O bandwidth. In this 
paper, we investigate the feasibility and efficiency of a parallel 
sort-merge algorithm through implementation on the JASMIN 
prototype, a backend multiprocessor built around a fast packet 
bus. We describe the design and implementation of a parallel 
sort utility. We then present and analyze the results of meas- 
urements corresponding to a range of file sizes and processor 
configurations. Our results show that, using current off-the-shelf
technology coupled with a streamlined distributed operating
system, three and five microprocessor configurations provide
a very cost-effective sort of large files. The three processor
configuration sorts a 100 megabyte file in one hour, which
compares well with commercial sort packages available on 
high-performance mainframes. 


1. Introduction 


Sorting of large data files is one of the most important 
and frequently performed operations in database management. 
Output files generated by a report are usually sorted with 
respect to an attribute or a combination of attributes that are 
of interest to the users of the report. In mass-storage devices, 
files are often maintained sorted with respect to a key attri- 
bute in order to facilitate subsequent searching and process- 
ing. Database management systems also pre-sort files in order 
to eliminate duplicate records [5, 14] or to process complex join 
and aggregate queries [14]. Thus, one of the fundamental 
measures of processing power in a transaction management 
system is the performance of the sort utility it provides [1]. 


In most systems, sort performance is not satisfactory. 
For large files, a sort may require hours of processing time 
during which it saturates the CPU and I/O devices. As a 
consequence, even in high-performance transaction manage- 
ment systems, a large sort will often be deferred to after-hours 
processing in order not to interfere with the execution of short 
interactive transactions. 


What limits the performance of a file sorting operation, 
and how can this performance be enhanced? A theoretical 
machine with a 100 megabyte memory and a 50 nanosecond 
Compare instruction could sort one million 100 byte records in 
1.5 minutes, including 30 seconds to read the file from a disk 
with a 3 million-byte per second transfer rate and 30 seconds 
to write it back to the same disk (a) [1]. In practice, file sizes
exceed main memory capacity by several orders of magnitude, 
comparisons and record moves are slower, and the effective I/O 
bandwidth is well below the maximum bandwidth of the disk 
channel. In addition, general purpose operating systems 
impose processor overhead and limit the utilization of other 
resources. The designer of a sort utility must strive to minim- 
ize record moves in memory, overlap I/O latency with compu- 
tation, and reduce disk seek times. Even then, the efficiency 
of file sorting on a serial computer with conventional 
mass-storage remains severely limited. Parallel computation 
combined with high I/O transfer rates is a natural approach to 
overcoming these limits. Indeed, a number of recent database 
machine designs use parallel sorting as a fundamental build- 
ing block for query processing [8, 9, 15, 16]. 


In this paper, we investigate the feasibility of parallel file 
sorting on a backend multiprocessor machine. We describe the 
design of a sort utility that exploits parallel computation and 
high I/O bandwidth, and its implementation on the JASMIN 
multiprocessor. We consider two schemes which differ in the 
starting location of the unsorted data (Figure 1), and may lead 
to significant differences in performance. The first scheme 
corresponds to a Backend Sort, where a file initially located on 
the host is downloaded to a backend multiprocessor for sorting. 
The file is distributed to the backend processors, which sort it 
and then return the sorted file to the host. The second 
scheme, a Distributed Sort, assumes that the source file is ini- 
tially distributed across a number of disks attached to backend 
processors. The processors sort the file in parallel, then send it 
to the host or write it to a backend disk. Basically, the parallel
sorting algorithm used in both schemes is the same, but the
design goal of overlapping computation with the distribution of
the data in one case (Backend Sort) is replaced with overlapping
computation with disk access in the other (Distributed
Sort).


The remainder of this paper is organized as follows. In 
Section 2, we describe the hardware and software architecture 
of the JASMIN system. In Section 3, we describe the parallel 
external sort algorithm, and the implementation choices that 
we have made — in particular the process structure and the 
buffering scheme. In Section 4, we present the performance of 
our algorithm as implemented on JASMIN. We analyze the 
execution time, processor utilization, and network traffic as 
functions of the file size and the multiprocessor configuration. 
Finally, Section 5 contains our conclusions. 


(a) With an efficient algorithm, sorting n records in main memory
requires O(n log n) record comparisons and exchanges. For n = 1,000,000,
a typical proportionality constant of 1.5, and 20 instructions for a
comparison and an exchange, this amounts to 600M instructions.
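
The estimate in this footnote, and the 1.5 minute figure it supports, can be checked with a few lines of Python. The numbers are those quoted in the text; two points are assumptions of this sketch: the logarithm is taken to base 2, and every instruction is taken to cost the 50 nanoseconds quoted for a Compare instruction.

```python
import math

n = 1_000_000             # records
record_bytes = 100        # bytes per record
disk_rate = 3_000_000     # bytes per second
instr_time = 50e-9        # seconds per instruction (assumed uniform)
c = 1.5                   # proportionality constant from the footnote
instr_per_step = 20       # instructions per comparison and exchange

steps = c * n * math.log2(n)                 # comparisons and exchanges
instructions = steps * instr_per_step
cpu_seconds = instructions * instr_time
io_seconds = n * record_bytes / disk_rate    # one pass over the file
total_minutes = (cpu_seconds + 2 * io_seconds) / 60

print(f"{instructions / 1e6:.0f}M instructions")
print(f"{cpu_seconds:.0f} s CPU, {io_seconds:.0f} s per disk pass")
print(f"{total_minutes:.1f} minutes in total")
```

Under these assumptions the computation and each of the two disk passes take roughly half a minute, which agrees with the rounded figures in the text.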


Figure 1: Backend and Distributed Sorting (panels: Backend, Distributed; legend: sort phase, merge phase)


2. A Backend Multiprocessor 


Our parallel external sort algorithm was implemented on 
a prototype AT&T Bell Laboratories multiprocessor [2]. This 
multiprocessor may be configured with up to 12 microcomput- 
ers communicating through a fast packet bus. It also has an 
umbilical connection to a DEC VAX host running UNIX (b) System V.
The backend processors run the JASMIN distributed
operating system kernel, which is under development at Bell 
Communications Research. JASMIN is designed to efficiently 
support distributed applications. In this section, we briefly 
describe the hardware characteristics of the multiprocessor 
and process management and communication in JASMIN. 
The reader is referred to [12] for a complete description of the 
operating system and to [7] for a description of a distributed 
database management system which runs on JASMIN. 


2.1. The S/Net 


The multiprocessor used in JASMIN is built around the 
S/Net, a fast (80 Mbit/s) packet bus developed at AT&T Bell 
Laboratories. The multiprocessor consists of a set of 
Multibus-based microcomputers, each with an S/Net Processor 
Interface Board. The S/Net itself consists of a set of Buffer 
Interface Boards, tied together on a single backplane, each 
connected to a Processor Interface Board. A designated Buffer 
Interface Board is connected to the VAX host. The VAX 
downloads the microcomputers via the S/Net and provides 
some support, such as file access and debug output, for the exe- 
cution environment. The microcomputers consist of a 10 MHz 
Motorola 68000 CPU and 1 Megabyte of main memory. Some 
of the microcomputers also have an SMD disk controller and a 
Fujitsu Eagle local disk. 


While the total bandwidth of the S/Net is 80 Mbit/s, no
single processor can achieve that speed. The maximum kernel
level data transfer rate that a single processor can attain on
the S/Net is approximately 300 Kbyte/sec, or 3% of the
network's total capacity. Operating system overhead further
reduces the effective data transfer rate between processes to
100 Kbyte/sec.


(b) UNIX is a trademark of AT&T Bell Laboratories.
2.2. The JASMIN Operating System


The JASMIN distributed operating system is modeled 
after DEMOS [3]. The basic object implemented by the system 
is an interprocess communication capability called a path, over 
which small, fixed length messages can be sent. Holding a 
path enables a process to send a message to the creator of the 
path. Additionally, bulk data transfers between processes 
may be accomplished by attaching a buffer to a path in the 
creator's address space and using kernel primitives to move
data into or out of the buffer. The system allows any process 
to create paths and to pass them to other processes via mes- 
sages. 


A set of processes which share text and data address 
space, but have separate execution stacks, are said to form a 
load module. Since memory is shared by these processes, a 
load module cannot span processors. While processes in the 
same load module share text and data space, their execution 
stacks inhabit different memory segments, they hold different 
sets of paths, and they receive messages from separate queues. 
In other words, paths and messages are exchanged between 
processes, not load modules. The load module notion is a con- 
venience to permit cooperating processes to share memory. 


JASMIN process scheduling is particularly simple. 
When created, each process is assigned a static priority level. 
Within each priority level, scheduling is strictly
non-preemptive. A running process will execute until it per- 
forms a blocking system call, or until a higher priority process 
becomes ready-to-run. This guarantees mutual exclusion 
among a set of processes operating on shared data structures 
within a load module, as long as these processes are run at the 
same priority level. 


3. A Parallel File Sorting Algorithm 


Large files cannot usually be sorted in main memory, and 
so file sorting requires an external sort algorithm. Most exter- 
nal sort algorithms are based on iterative merging. They par- 
tition the file into subfiles that separately fit in memory, inter- 
nally sort the individual subfiles, and then iteratively merge 
them into a single sorted file. During each iteration, every 
record is read and written once to mass-storage. These serial 
algorithms can be parallelized in a number of ways [4, 6, 10]. 


3.1. The Algorithm 


We have implemented a parallel N-way merge-sort algo- 
rithm adapted from [4]. The algorithm assumes a logical tree 
connection between the processors (Figure 2). Each of the leaf 
processors has a disk attached to it, and thus can perform an 
external sort, independently of the host and the other proces- 
sors in the backend. The parallel algorithm has two phases. 
We will refer to the first phase as the sort phase, and to the 
second as the merge phase. During the sort phase, each leaf 
processor creates fixed-length sorted subfiles from its partition 
of the file; during the merge phase, these subfiles are merged 
in parallel. Parallelism in merging is exploited both through 
pipelining merge steps between levels of the tree, and through 
concurrent merging performed by processors on the same level 
of the tree. The two phases of the algorithm are described in 
more detail below. 


During the sort phase, fixed-length sorted subfiles are 
created by the leaves as in a serial sort. Partitions of the file 
are sorted in memory using Quicksort, and then written to 
disk. Available memory is divided between a subfile and an 
array of pointers to records in that subfile. During Quicksort,
data records are not moved in memory; only pointers are
changed.


Figure 2: A Parallel Merge-Sort
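
The idea of sorting an array of pointers rather than the records themselves can be sketched in Python as follows. The function name `sort_by_pointers` and the sample records are hypothetical, list indices stand in for pointers, and Python's built-in sort is used in place of Quicksort.

```python
def sort_by_pointers(records, key):
    """Return record positions in key order; the records are never moved."""
    pointers = list(range(len(records)))
    pointers.sort(key=lambda i: key(records[i]))
    return pointers

records = [("pear", 3), ("apple", 7), ("fig", 1)]
order = sort_by_pointers(records, key=lambda r: r[0])

print(order)                          # positions of the records in sorted order
print([records[i] for i in order])    # records visited through the pointers
```

Only the small pointer array is permuted during the sort; each record is copied once, when it is finally written out in sorted order.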


After the sort phase completes, each leaf merges its 
subfiles, producing a single sorted partition of the file. As it is 
produced, the partition is transferred to the leaf's parent in the
merge tree. The parent processor merges the partitions it 
receives from its children and sends the result one level up in 
the tree. The complete sorted file is produced at the root of the 
merge tree. In our implementation, merge streams are 
transferred in blocks containing many records. While pipelin- 
ing of successive merge phases could be at the record granu- 
larity, communication overhead makes such a design prohibi- 
tively inefficient. 
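
The two phases and the block-wise pipelining described above can be mimicked on a single machine with Python generators. This is a sketch only: the names `leaf`, `merge_node`, `blocks` and `unblock` are hypothetical, in-memory lists stand in for disks and for the S/Net, and `heapq.merge` from the standard library plays the role of the N-way merge.

```python
import heapq
from itertools import islice

def blocks(stream, block_size):
    """Group a sorted record stream into blocks of at most block_size records."""
    it = iter(stream)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block

def unblock(block_stream):
    """Flatten a stream of blocks back into a stream of records."""
    for block in block_stream:
        yield from block

def leaf(partition, run_length, block_size):
    """Sort phase: fixed-length sorted runs. Merge phase: merge them, emit blocks."""
    runs = [sorted(partition[i:i + run_length])
            for i in range(0, len(partition), run_length)]
    return blocks(heapq.merge(*runs), block_size)

def merge_node(children, block_size):
    """Non-leaf node: merge the children's block streams and pass blocks upward."""
    merged = heapq.merge(*(unblock(c) for c in children))
    return blocks(merged, block_size)

data = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0]
leaves = [leaf(data[:5], run_length=2, block_size=3),
          leaf(data[5:], run_length=2, block_size=3)]
root = merge_node(leaves, block_size=3)
print(list(unblock(root)))
```

Because every stage is a lazy generator, a parent starts consuming a child's blocks before the child has finished merging, which is the pipelining between tree levels that the text describes.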


It is interesting to note that, given a fixed topology for
the merge tree, most of the steps in our sort algorithm are
linear in terms of the number of comparisons performed. The
partitioning of n records into n/b fixed-size sorted blocks of b
records each takes O(n log(b)) comparisons. Similarly, merging
d sorted streams of n/d records takes O(n log(d)) comparisons.
Thus, at non-leaf nodes, where the degree of the
merge is fixed by the topology of the tree, the merge is linear.
At the leaves, the degree of the merge is equal to the number
of temporary files, which is in turn proportional to the size of
the input. The merge phase at the leaves is thus the only point
where the number of comparisons is non-linear. Because
comparisons are a small portion of the total work done in sorting a
large file, even the uniprocessor configuration of our sort only
shows significant non-linear behavior for file sizes over 25
megabytes (see Section 4.2).
3.2. Implementation on JASMIN


In order to concentrate on designing the parallel part of 
the sort utility, and have a good baseline for evaluating our 
implementation, we chose to start with a robust serial sort 
program. We considered the Linderman Sort, a program 
recently developed at Bell Laboratories [13]. The program has 
been extensively tuned, and achieves very good performance 
on UNIX. The algorithm implemented in Linderman's program
is a polymerge sort [11], where initial runs are generated
with Quicksort. Each sorted run is stored as a temporary
file. Then the files are merged in one or more N-way
merge passes. We started by adapting this code for sorting
the subfiles in the sort phase of our parallel algorithm, and for 
merging subfiles on one processor. Then, we turned our atten- 
tion to design issues that are critical to the performance of the 
parallel sort program: 


(1) the process configuration and synchronization protocol,
(2) the memory configuration (including buffer allocation), 
(3) the layout of files on the local disks. 


3.2.1. Process Configuration On each backend pro- 
cessor, processes were configured so that delays due to data 
transfer and processor synchronization would be minimized. 
In particular, data transfers (processor-to-disk, or processor-to- 
processor) had to be overlapped with computation whenever 
possible. To avoid copying data within a processor, we adopted 
a structure consisting of a single JASMIN load module per 
processor. Multiple processes within each load module share 
access to a single buffer pool, but communicate and synchron- 
ize using messages. 


Data transfers are organized as streams of record blocks. 
Each stream consists of a series of fixed size blocks containing 
a number of records. There are three types of streams, each 
with its own block size. Input streams carry unsorted record 
blocks from the root to the leaves. Disk streams carry sorted 
temporary files between a leaf and its disk. Merge streams 
carry sorted record blocks between processors in the merge 
tree. 


During the sort phase, the records to be sorted are sent 
as input streams from the root to the leaf processors. In order 
to overlap communication latency, leaf processors interact with 
multiple communications processes in the root. In the leaf 
processors, an input spooler process reads the input stream 
from the root, and builds sort buffers, which consist of data
records and an array of pointers to these records (Figure 3). 


Figure 3: Sort Phase Leaf Process Structure (labels: input stream, leaf processor)


Once a sort buffer is built, it is passed to the main 
sort/merge process. This process performs a Quicksort on the 
pointer array, and passes the buffer to the file spooler process.
The file spooler moves records for the first time from the input 
blocks, in which they were received, to disk blocks which are 
then written to disk. Each sorted buffer forms one temporary
file on the disk. The use of two sort buffers per processor 
ensures a high degree of overlap between input, sorting and 
writing to disk. 


During the merge phase, the sort buffers are unused and 
so are returned to free memory for use as disk or merge block 
buffers. The sort/merge process is central to the merge phase 
on both leaf and non-leaf processors (Figure 4). In the leaves, 
it merges sorted disk streams from the temporary files, one 
stream per file. In a non-leaf node, merge streams from the 
node’s children in the tree are merged. In the former case, 
disk streams are read through the file spooler, which performs 
double buffering (disk read-ahead). In the latter case, a spe- 
cial network spooler process synchronizes data transfer from 
child processes and implements double buffering of merge 
streams. On all non-root processors, the output spooler process 
implements double buffering of merge streams being sent to 
the parent in the merge tree. 


3.2.2. Memory Configuration Having only one megabyte
of main memory attached to each processor restricts the
maximum size of data buffers. After loading the operating
system, utility tasks, the text segment for the sorting program,
a stack segment for each sorting process, and miscellaneous
sort program variables, about 576K remains for buffers. This
led us to a configuration of two 256K sort buffers, resulting in
temporary files somewhat smaller than that (the buffer must
hold not only the records, but also an array of pointers to
them). In contrast, a mainframe sort routine will typically use
a 1 megabyte sort buffer.


Figure 4: Merge Phase Process Structure (labels: merge streams from children, non-leaf processor)


During the merge phase, each leaf processor must read a
potentially large number of disk streams, and merge them to
produce a single merge stream. Each stream is double
buffered, so the total memory used when merging n files will
be


$$M = 2n B_{disk} + 2 B_{merge}$$


where the $B_x$ are block sizes. Now, the memory used as sort
buffers during the sort phase can be reused as disk and merge
blocks during the merge phase. Thus, all 576K are available
for allocation. If we are to sort a 100M file using two leaf
nodes, then each leaf must handle 50M, or 200 temporary files.
Rearranging our equation, we see that


$$B_{disk} = (M - 2 B_{merge})/2n < 576K/400 \approx 1.4K$$


So, choosing $B_{disk} = 1K$, we get $B_{merge} = 88K$.


The constraint on the size of a merge block does not arise 
at the leaves, but at non-leaf nodes. If a non-leaf node merges 
m merge streams to form a single sorted merge stream, then it 
requires 2m +2 merge blocks for buffering. Since non-leaf 
nodes use no memory for disk blocks, if we want to handle 
values of m up to 8, we get 


$$B_{merge} = M/(2m + 2) \le 576K/18 = 32K$$


Thus, the demands of the problem we posed (sorting a
100Mbyte file in configurations ranging from two to eight leaf
processors using varying tree structures) lead to a rather
narrow choice of buffer sizes. We used the closest powers of
two:


$$(B_{disk}, B_{merge}) = (1K, 32K)$$
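The buffer arithmetic above can be checked with a short sketch. The function names are illustrative only, and "K" is taken to mean 1024 bytes, which is an assumption; the figures (576K of buffer memory, 200 temporary files per leaf, a fan-in of up to 8 at non-leaf nodes) are those given in the text.

```python
K = 1024  # bytes; assumed meaning of the text's "K"


def max_disk_block(memory, n_files, merge_block=0):
    """Upper bound on B_disk from M = 2*n*B_disk + 2*B_merge."""
    return (memory - 2 * merge_block) / (2 * n_files)


def leaf_merge_block(memory, n_files, disk_block):
    """B_merge left over at a leaf once the disk blocks are allocated."""
    return (memory - 2 * n_files * disk_block) // 2


def nonleaf_merge_block(memory, fan_in):
    """Largest B_merge at a non-leaf node merging `fan_in` streams."""
    return memory // (2 * fan_in + 2)


M = 576 * K
n = 200  # temporary files per leaf for a 100M file on two leaves

bound = max_disk_block(M, n)             # about 1.44K, so 1K is chosen
leaf_b = leaf_merge_block(M, n, 1 * K)   # merge block a leaf could afford
node_b = nonleaf_merge_block(M, 8)       # merge block a non-leaf can afford
final_merge_block = min(leaf_b, node_b)  # the non-leaf constraint binds
```

Under these assumptions the leaf could afford an 88K merge block, but the non-leaf limit of 32K is the smaller of the two and therefore determines the choice.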


3.2.3 Disk File Layout The layout of files on disk is 
central to minimizing access time. We experimented with two 
approaches: contiguous files and interleaved files. In
contiguous file layout, the logical disk blocks of a file lie in
consecutive physical disk addresses. This means they will be
physically contiguous unless they cross a cylinder boundary. In
interleaved file layout, the first logical blocks of all files lie in
one consecutive set of disk addresses, followed by the second
blocks of all files, etc. (Figure 5).


Contiguous file layout minimizes seeking when reading 
or writing an entire sequential file. It maximizes seeking 
when multiple sequential files are to be accessed concurrently. 
Figure 5: Contiguous vs Interleaved File Layout


Interleaved layout achieves the opposite: maximal seeking 
when accessing a single file sequentially; minimal seeking 
when accessing many files concurrently. Intermediate 
schemes which interleave files in segments larger than one 
logical block are possible, but we did not implement any. In 
our sort algorithm, temporary files are written one at a time 
during the sort phase, and then all are read concurrently dur- 
ing the merge phase. Thus, contiguous layout performs best 
during the sort phase, while interleaved layout wins in the 
merge phase. 
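The trade-off between the two layouts can be made concrete with a toy model. The sketch below is an illustration under simplifying assumptions: all files have the same number of blocks, a "seek" is counted whenever the next block read is not at the next physical address, and the merge phase is idealized as a strict round-robin over the files (a real merge consumes blocks in a data-dependent order). All names are hypothetical.

```python
def contiguous_addr(file_idx, block_idx, blocks_per_file):
    """Physical address when each file occupies consecutive blocks."""
    return file_idx * blocks_per_file + block_idx


def interleaved_addr(file_idx, block_idx, num_files):
    """Physical address when block i of every file is stored together."""
    return block_idx * num_files + file_idx


def non_sequential_moves(addresses):
    """Count accesses that do not land on the next physical address."""
    return sum(1 for a, b in zip(addresses, addresses[1:]) if b != a + 1)


NUM_FILES, BLOCKS = 4, 3
# Sort phase: files are written one at a time.
sort_order = [(f, b) for f in range(NUM_FILES) for b in range(BLOCKS)]
# Merge phase (idealized): one block from each file in turn.
merge_order = [(f, b) for b in range(BLOCKS) for f in range(NUM_FILES)]


def moves(order, layout):
    if layout == "contiguous":
        addrs = [contiguous_addr(f, b, BLOCKS) for f, b in order]
    else:
        addrs = [interleaved_addr(f, b, NUM_FILES) for f, b in order]
    return non_sequential_moves(addrs)


summary = {
    (phase, layout): moves(order, layout)
    for phase, order in (("sort", sort_order), ("merge", merge_order))
    for layout in ("contiguous", "interleaved")
}
```

In this model each layout is free of non-sequential moves in exactly one phase, which mirrors the observation that contiguous layout suits the sort phase and interleaved layout suits the merge phase.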


3.3 Choosing Parameter Values 


We experimented with various ways to partition the 
576K of memory that were available for buffer configuration 
in each processor. The parameter to which the sort time was 
most sensitive was merge block size. Larger merge blocks 
decreased the overhead during the merge phase, and reduced 
the number of interprocessor synchronization points. With 
double buffering, the maximum size of a merge block that we 
could choose was 32K (see Section 3.2). To a lesser extent, 
increasing the input block size also increased sort speed and 
we settled on an input block size of 8K. Larger input blocks 
caused wasteful fragmentation of sort buffers. 


The large amount of idle time in the leaves during the 
merge phase suggests that double buffering of disk blocks may 
not be advantageous. It may be better to consolidate the two 
buffers into one large buffer to reduce the number of disk 
accesses. Additional enhancement could also come from 
improving the scheduling of network data transfers. In the 
current implementation, the network spooler in a parent node 
transfers merge blocks from the node’s children in the order in 
which they are requested. However, at the same time, double 
buffering may cause network traffic that blocks the transfer of 
data which is needed for merging to proceed. Children are ser- 
viced in approximately round-robin order, giving no priority to 
a transfer which is holding up the merge.


We compared the performance of our sort with the two 
alternative file layouts: contiguous and interleaved. Using the 
interleaved scheme provided an overall improvement of 5% in 
elapsed time. 


4. Measurements 


In this section, we present and analyze our measure- 
ments for a range of backend configurations and file sizes. In 
order to establish a performance baseline, we start with our 
uniprocessor sort (JS1) on JASMIN, and compare it to the 
Linderman Sort on a fast UNIX machine (Section 4.2). We 
then analyze in detail the performance of the parallel Backend 
Sort, for different multiprocessor configurations (Section 4.3). 
Finally, we estimate elapsed times for the Distributed Sort 
(Section 4.4). 


4.1. Parameters Varied 
In our experiments, we varied the following parameters: 
(1) Number of processors in the backend: 


We configured the backend as a single processor, as a two-level 
tree with 3 and 5 processors (2 and 4 leaves), and as a 
three-level binary tree with 7 processors (4 leaves, 2 internal 
nodes). In subsequent tables, these configurations are labeled 
as JS1, JS3, JS5, and JS7, respectively. In all the tree 
configurations, each leaf processor had a local disk. 


(2) Structure and size of the data file: 
We varied the size of our data file from 1 to 100 megabytes.
The data was randomly generated as a stream of 100 byte 
records, with a 10-byte ascii-numeric sort key. The record for- 
mat included a header consisting of a 2-byte record-length field 
and a 2-byte key-length field, which were interpreted by the 
sort program. Thus, our implementation can also sort variable 
length records. In addition to the data records, we used two 
types of control records, end of block and end of file, in order to 
form streams of records. These control records are four bytes 
long. 
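As an illustration of the record format just described, the sketch below packs and parses records with a 2-byte record-length field and a 2-byte key-length field. Several details are assumptions not stated in the text: the fields are taken to be big-endian, the record length is taken to include the 4-byte header, and the key is taken to follow the header directly. Control records are omitted because their encoding is not given. All names are hypothetical.

```python
import struct

# Assumed header layout: big-endian record length, then key length.
HEADER = struct.Struct(">HH")


def pack_record(key, payload):
    """Build one record: header, key, then the rest of the data."""
    total = HEADER.size + len(key) + len(payload)
    return HEADER.pack(total, len(key)) + key + payload


def iter_records(buf):
    """Yield (key, whole_record) pairs from a buffer of packed records."""
    offset = 0
    while offset < len(buf):
        rec_len, key_len = HEADER.unpack_from(buf, offset)
        if rec_len < HEADER.size + key_len:
            raise ValueError("corrupt record header")
        key_start = offset + HEADER.size
        yield buf[key_start:key_start + key_len], buf[offset:offset + rec_len]
        offset += rec_len


def sort_buffer(buf):
    """Sort the records of a buffer by key (bytewise comparison)."""
    records = sorted(iter_records(buf), key=lambda pair: pair[0])
    return b"".join(rec for _, rec in records)


keys = [b"0000000042", b"0000000007", b"0000000100"]
data = b"".join(pack_record(k, b"x" * 86) for k in keys)  # 100-byte records
sorted_keys = [k for k, _ in iter_records(sort_buffer(data))]
```

Because the length fields are read per record, the same loop handles variable-length records; fixed-width ASCII-numeric keys compare correctly as raw bytes.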


We also varied the memory configuration and the file lay- 
out scheme. However, all the measurements presented below 
correspond to experiments with double buffering and inter- 
leaved file layout, since these appeared to be more efficient. 


4.2. Uniprocessor Efficiency 


In order to establish a baseline performance measure for 
our implementation, we compared a single-processor
configuration to the UNIX sort [13] on a faster machine. Fig- 
ure 6 shows the times recorded for sorting files of 1 to 50 
megabytes on a JASMIN 68000-based microcomputer and on a 
CCI Power-6 computer running UNIX 4.2 BSD. In both cases 
the disk was a Fujitsu Eagle. The JASMIN figures constitute 
a baseline that we will use to evaluate the parallel speedup 
achievable in multiprocessor configurations. We will use the 
UNIX-CCI numbers as a baseline for comparing the perfor- 
mance of a fast, general-purpose serial machine with that of a 
backend multiprocessor. 


The 68000-based microcomputer is rated at 0.6 MIPS, 
while the CCI processor is rated at 5 to 6 MIPS. Considering 
the difference in processor capabilities, the comparison greatly 
favors the JASMIN sort. The UNIX-CCI sort is 1.6 times fas- 
ter for small files, but only 1.2 times faster for large files. 
From Figure 6, we observe that, up to 35M, the rate of the 
JASMIN sort is almost constant at 18 Kbyte/sec. On the other 
hand, the rate of the CCI-UNIX sort degrades rapidly for file 
sizes above 20M. This degradation is at least partly due to the 
limit of 20 on the number of open files in a UNIX process, 
which necessitates multiple N-way merge passes in the merge 
phase. The comparison illustrates the limits imposed by a 
general-purpose operating system (UNIX) compared to JAS- 
MIN. The JASMIN kernel essentially gives an application the 
full power of the underlying computer. In this application, we 
observed a processor utilization rate of 90%. 
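The effect of a limit on open files can be sketched by counting merge passes. In the example below, the figure of 200 runs is the number of temporary files quoted in Section 3.2.2 for 50M on one leaf; the fan-in of 16 is a hypothetical value for a process restricted to 20 open files, since the actual run size and merge order of the UNIX sort are not given in the text.

```python
def merge_passes(num_runs, fan_in):
    """Number of passes needed to merge `num_runs` sorted runs
    when at most `fan_in` runs can be merged at once."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // fan_in)  # ceiling division
        passes += 1
    return passes


unrestricted = merge_passes(200, 200)  # all runs opened at once
restricted = merge_passes(200, 16)     # hypothetical 16-way limit
```

With these numbers the unrestricted merge needs a single pass, while the restricted one must read and rewrite the data an extra time.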
Figure 6: Serial Sort Times in Seconds on JASMIN and UNIX


4.3. Backend Sort 


In Table 1, we show the total execution time of the Back- 
end Sort for 4 processor configurations, and for file sizes of 5, 
12.5, 25, 50 and 100 megabytes. The times shown correspond 
to the root’s elapsed time, which was only slightly higher than 
the leaves’ or internal nodes’. Figure 7 compares the sorting 
speed achieved in these experiments to the UNIX-CCI sort. 


The most striking observation from this table is how fast 
the 100M sort is, for both JS3 and JS5. The backend
multiprocessor sorts 100M in 1 hour with 3 processors, and in 52
minutes with 5 processors. This performance is comparable to 
that of highly-tuned commercial sort packages such as the 
IBM SYNC-SORT on a high-performance mainframe (e.g. an 
IBM 3081, rated at approximately 7 MIPS). It is thus clear 
that our parallel Backend Sort provides a very cost-effective
alternative to serial sorting of large files on a mainframe.


Further analysis of the data in Table 1 leads us to the 
following observations on parallel speedup: 


(1) The 3 processor configuration is faster than the 1 processor
configuration by a factor of 1.5 for small files to 1.6
for a 50M file.

(2) Using two additional leaves (i.e., the 5 processor
configuration) further improved the sort time by up to
16%.

(3) Using a two-level merge tree with 7 processors does not
improve performance. In fact, JS7 shows slightly higher
overall times than JS5.
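The speedup factors quoted in these observations can be recomputed from the times in Table 1. The sketch below transcribes the table as read from the source (the JS1 entry for 100M is unavailable) and is only a check of the arithmetic; the dictionary and function names are illustrative.

```python
# Total sort times in seconds, as read from Table 1.
TABLE_1 = {
    "5M":    {"JS1": 269,  "JS3": 178,  "JS5": 178,  "JS7": 173},
    "12.5M": {"JS1": 682,  "JS3": 439,  "JS5": 419,  "JS7": 427},
    "25M":   {"JS1": 1384, "JS3": 884,  "JS5": 836,  "JS7": 869},
    "50M":   {"JS1": 2930, "JS3": 1782, "JS5": 1584, "JS7": 1682},
    "100M":  {"JS3": 3625, "JS5": 3131, "JS7": 3305},
}


def speedup(size, base, other):
    """Ratio of elapsed times: how many times faster `other` is than `base`."""
    return TABLE_1[size][base] / TABLE_1[size][other]


small_file = speedup("5M", "JS1", "JS3")     # roughly 1.5
large_file = speedup("50M", "JS1", "JS3")    # roughly 1.6
five_vs_three = speedup("100M", "JS3", "JS5")
seven_vs_five = speedup("50M", "JS5", "JS7")  # below 1: JS7 is slower
```

Read this way, the "up to 16%" figure corresponds to the ratio of the JS3 and JS5 times for the 100M file.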


| File Size |  JS1  |  JS3  |  JS5  |  JS7  |
|   5M      |   269 |   178 |   178 |   173 |
|  12.5M    |   682 |   439 |   419 |   427 |
|  25M      |  1384 |   884 |   836 |   869 |
|  50M      |  2930 |  1782 |  1584 |  1682 |
| 100M      | - (*) |  3625 |  3131 |  3305 |


Table 1: Total Sort Times In Seconds
1, 3, 5 and 7 Processor Configurations

(*) This number is unavailable due to processor memory limitations.


Figure 7: Parallel and Uniprocessor Sort Speeds
The added processing power and I/O bandwidth in JS3,
with 3 processors and 2 disks, proved very effective in
reducing the elapsed time of the sort. However, as more
processors and disks were added, the additional performance
enhancement was limited. These observations suggest that 
the parallel sort is limited by network data transfer rather 
than by computation or disk activity. We tested this 
hypothesis by instrumenting the sort program to record idle 
and elapsed times for both the sort and merge phases. Table 2 
shows these measurements for a 25 megabyte sort with 3 pro- 
cessors, and a 50 megabyte sort with 5 processors. 


The times in Table 2 are for the root, and one of the 
leaves. (Since all leaves had similar elapsed and idle times, 
we only present our measurements for one.) Discrepancies 
between the elapsed times shown in Table 1 and Table 2 for 
the same configurations are due to distortions introduced by 
measuring idle time. The additional data in Table 2 supports 
our hypothesis in the following ways: 


(1) The sort phase of the 50M sort lasts almost twice as long 
as the sort phase of the 25M sort, although the leaves do 
exactly the same work in both. The additional time 
shows up as idle time at the leaves, while they wait for 
data from the root. In both configurations, the root 
shows little idle time during the sort phase. I/O delays 
(shown as Disk Busy time) are almost null during the 
sort phase. 


(2) The merge phase of the 50M sort lasts almost exactly 
twice as long as that of the 25M sort, and substantial idle 
time accumulates at both the leaves and the root. Unlike 
the sort phase, where the root processor is saturated,
both root and leaves appear to be waiting on the net- 
work. Disk waiting during the merge phase accounted 
for less than 10% of the idle time, in both configurations. 
Furthermore, it did not decrease with the use of two 
additional disks (JS5 versus JS3). 


(3) The non-idle time at the leaves during the sort and
merge phases is approximately the same on both
configurations: sort phase 320 seconds, merge phase 340
seconds. The non-idle time at the root varied linearly
with the amount of data in both phases. The sort phase
might be sped up by starting with a distributed file
(see Section 4.4), or by increasing the power of the root
processor. However, it is clear that a higher network
transfer rate is necessary in order to speed up the merge
phase.


|                              | 3 processors, 25M | 5 processors, 50M |
|                              |  leaf   |  root   |  leaf   |  root   |
| Sort Phase     elapsed time  |   638   |   639   |  1004   |  1005   |
|                disk busy     |     2   |     0   |     0   |     0   |
|                idle time     |   320   |   162   |   680   |    26   |
| Merge Phase    elapsed time  |   937   |   938   |  1871   |  1873   |
|                disk busy     |    53   |    -    |    55   |    -    |
|                idle time     |   598   |   629   |  1534   |  1274   |
| Total          elapsed time  |  1575   |  1577   |  2875   |  2878   |
| (Both Phases)  disk busy     |         |         |    55   |         |
|                idle time     |         |         |  2217   |  1300   |


Table 2: Sort, Merge and Idle Time In Seconds
3 and 5 Processor Backend Sort
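The non-idle times cited in observation (3) follow from Table 2 by subtracting idle time from elapsed time. The sketch below performs that subtraction on the per-phase rows as read from the source; the pairing of the first two columns with the 3 processor, 25M run and the last two with the 5 processor, 50M run is inferred from the surrounding text, and the names are illustrative.

```python
# (elapsed, idle) in seconds per phase, as read from Table 2.
TABLE_2 = {
    ("JS3-25M", "leaf"): {"sort": (638, 320),  "merge": (937, 598)},
    ("JS3-25M", "root"): {"sort": (639, 162),  "merge": (938, 629)},
    ("JS5-50M", "leaf"): {"sort": (1004, 680), "merge": (1871, 1534)},
    ("JS5-50M", "root"): {"sort": (1005, 26),  "merge": (1873, 1274)},
}


def busy(run, node, phase):
    """Non-idle time: elapsed time minus idle time."""
    elapsed, idle = TABLE_2[(run, node)][phase]
    return elapsed - idle


def idle_fraction(run, node, phase):
    """Share of the phase that the node spent idle."""
    elapsed, idle = TABLE_2[(run, node)][phase]
    return idle / elapsed


leaf_sort = [busy(r, "leaf", "sort") for r in ("JS3-25M", "JS5-50M")]
leaf_merge = [busy(r, "leaf", "merge") for r in ("JS3-25M", "JS5-50M")]
root_sort_ratio = busy("JS5-50M", "root", "sort") / busy("JS3-25M", "root", "sort")
root_merge_ratio = busy("JS5-50M", "root", "merge") / busy("JS3-25M", "root", "merge")
```

The leaf values stay near 320 and 340 seconds in both runs, while the root's non-idle time roughly doubles when the data doubles, consistent with the text.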


We conclude that for the Backend Sort, the limitation of 
a low network transfer rate means that the use of more than 4 
processors as leaves can speed the sort phase only marginally, 
and the extra processor power is not used efficiently. Simi- 
larly, use of additional processors as interior merge nodes in 
the 7 processor configuration only reduces non-idle time at the 
root, which does not improve the overall merge speed. A
network which delivered data at a faster rate would be able to
increase the utilization of the leaves and root during the
merge phase, and so could effectively utilize more processors.


4.4. Distributed Sort


The above discussion of Table 2 indicates that the distri- 
bution of the file from the root to the leaf processors created a 
bottleneck. In a Distributed Sort, where the file is initially 
distributed across the local disks at the leaves, the sort phase 
would be much faster. In fact, since, during this phase, the 
computation is fully partitioned and every leaf processor has 
its own disk, the sort phase time at the leaves is simply the 
uniprocessor sort phase time for one partition. Thus, a good 
estimate of the sort phase time in the Distributed case is the 
sort phase time of one partition in the JS1 configuration. 
Merge phase time in the Distributed Sort is identical to that 
in the Backend case. By combining these estimates of the sort 
and merge phase times, we obtained the total times for the 
Distributed Sort. The results obtained are shown in Table 3. 


|     | File Size | Sort | Merge | Total |
| JS3 |   10M     |  147 |   158 |   305 |
|     |   50M     |  738 |   828 |  1566 |
| JS5 |   10M     |   74 |   157 |   231 |
|     |   50M     |  370 |   696 |  1066 |


Table 3: Distributed Sort Estimate
Sort, Merge and Total Times in Seconds


While in the Backend Sort, the bottleneck at the root 
made the use of more than 3 processors inefficient, we now 
observe a substantial speedup with 5 processors. For a file of 
50 megabytes, the elapsed time is 1566 seconds for a distri- 
buted 3 processor sort and 1066 seconds for a 5 processor sort. 
This is a speedup of 1.5 compared to 1.1 for the Backend Sort. 
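The two speedup figures can be reproduced from the tables. The sketch below sums the estimated sort and merge times of Table 3 and compares the resulting ratio with the Backend Sort ratio computed from the 50M times in Table 1; it is an arithmetic check only, with illustrative names.

```python
# (sort, merge) estimates in seconds, as read from Table 3.
TABLE_3 = {
    ("JS3", "10M"): (147, 158),
    ("JS3", "50M"): (738, 828),
    ("JS5", "10M"): (74, 157),
    ("JS5", "50M"): (370, 696),
}

# The estimated total is the sum of the two phase estimates.
totals = {key: sort_t + merge_t for key, (sort_t, merge_t) in TABLE_3.items()}

# Speedup of 5 processors over 3 processors for a 50M file.
distributed_speedup = totals[("JS3", "50M")] / totals[("JS5", "50M")]
backend_speedup = 1782 / 1584  # JS3 and JS5 times for 50M from Table 1
```

The first ratio rounds to the 1.5 quoted in the text and the second to 1.1.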


5. Conclusions 


We have described our experience with developing and 
evaluating a parallel sort utility. Our goal was to explore the 
feasibility of fast file sorting algorithms under limited parallel- 
ism, limited inter-processor communication bandwidth and the 
constraints of conventional I/O devices. Our testbed was a 
multiprocessor that can be configured with up to 12 micropro- 
cessors, communicating through a fast packet-switched bus 
and running a distributed operating system kernel. Multiple 
disk drives, each attached to one microprocessor, provided high 
I/O bandwidth. 


We implemented a parallel external merge-sort algo- 
rithm, and measured its performance for a range of file sizes 
and processor configurations. After detailed analysis of our 
measurements, and scaling of our results, we reach a number 
of conclusions: 


(1) Using current off-the-shelf technology, 3 and 5 processor 
configurations provide a very cost-effective solution to 
sorting a large file. The 3 processor configuration sorts a
100 megabyte file in 3625 seconds (1 hour), which
compares well with commercial sort packages available
on high-performance mainframes. The 5 processor sort is 
16% faster for large files. This level of performance, at a 
low hardware cost, results from a design which makes 
effective use of a streamlined, distributed operating
system to exploit a limited amount of parallelism in
computation and I/O.


(2) Latency in network data transfer prevents the effective
use of higher levels of parallelism in the backend sort
model, where the data file is not distributed. Interestingly,
the 10 Mbyte/sec physical bandwidth of our network
was not the limiting factor: only 300 Kbyte/sec of
that is available to any one processor, and operating
system overhead reduces that to approximately 100
Kbyte/sec for the actual process-to-process data transfer
rate. At this rate, network latency allows even a
microprocessor (rated at 0.6 MIPS) to sort and merge
data as quickly as it can be moved between machines.
Adding processors to the backend configuration simply
increased idle time.


(3) We observed that distributed storage of files provides a 
substantial benefit in sorting. For instance, with the 5- 
processor configuration, a 50-megabyte distributed file 
could be sorted in 17 minutes, compared to 26 minutes 
needed for a non-distributed file. Our distributed sort 
model makes efficient use of additional leaf processors to 
accelerate the sort phase, which the backend model is 
unable to do. 


This study demonstrates that, with current microprocessor and 
communication technology, parallel sorting of large files is a 
viable, cost-effective alternative to highly tuned sort utilities 
on mainframes. Our analysis indicates that its efficiency can 
be expected to increase dramatically due to improvements in 
local area network technology in the next few years. 
Abstract: We present an efficient time stamp based
distributed scheme for synchronization of communi- 
cating sequential processes. In the communication 
model assumed, processes name ports in their com- 
munication commands and both the input and output 
commands can appear in the guards of an alternative 
command, a more general framework than that allowed by
Hoare’s CSP. We further assume synchronized com- 
munication between processes. An implementation of 
the scheme is described. The scheme can become part 
of the kernel for a programming language for distri- 
buted computation. We provide a discussion on fault 
tolerance of the algorithm against node failures. We 
prove the correctness of the scheme by establishing 
that the scheme is deadlock free and further that if 
two processes are willing to communicate and don’t 
synchronize with any other, they do synchronize with 
each other. A specification in interval logic of the 
scheme and the verification of it, including strong fair- 
ness, is given elsewhere. 


1. Introduction 


A distributed system is an interconnection of a 
network of computing elements with each of the ele- 
ments having its own local store. The local storage of 
one computing element is not shared by any other. A 
distributed program consists of a system of processes 
with each of the processes executing on some comput- 
ing element of the distributed system. The processes 
of a distributed program cooperate so as to achieve a 
common purpose. Because of the absence of shared 
storage, the processes achieve cooperation by the 
exchange of messages. Thus, interprocess communica- 
tion using message passing constitutes a basic para- 
digm for the computations evoked by a distributed 
program. 

A model of distributed computation known as 
Communicating Sequential Processes (CSP) has been
proposed by Hoare [8]. A programming notation for
the expression of distributed algorithms, constituting 
a distributed program, has also been suggested in the 
same work. The model consists of a set of sequential 
processes, without nesting, with each of the processes 
confining access to its own data and effecting coopera- 
tion between one another by the use of input and out- 
put commands for message passing. Processes are 
explicitly named in the input and output command. 
Communication between the processes occurs when 
the output command of a process matches the input 
command of another process. Such a communication 
model is termed as synchronous or unbuffered. An 
alternative command based on Dijkstra’s guarded 
command construct [6] permits an arbitrary selection 
of a command, for execution, from among the commands
whose guards have been successful. A guard in
the command can be a boolean expression or a 
boolean expression followed by an input command or 
an input command itself. A boolean guard is said to 
be successful if it evaluates to true. An input com- 
mand is successful if a matching output command 
from the process named in the input command can be 
found. A guarded input command is said to be suc- 
cessful if the boolean guard preceding the input com- 
mand evaluates to true and the input command is 
successful. In an implementation scheme, an arbitrary
selection may be made from among the commands whose
guards succeed earliest. This strategy
assures that a process may synchronize, while in
an alternative command, with any of the processes
named by it that are ready to synchronize.

However, the introduction of output commands 
in the guards of an alternative command can simplify
the specification of distributed algorithms using the
CSP notation [7, 8]. Buckley and Silberschatz [4] suggest
a distributed implementation scheme based on
assignment of static priorities between processes. They 
offer four distinct criteria to be met by any implemen- 
tation scheme. They are as follows: 


1) The number of system processes involved in estab- 
lishing a handshake must be minimal so as to reduce 
processor interrupts. 

2) The amount of system information that each of the 
processes needs in order to make a decision about syn- 
chronization must be minimal. 

3) There must be a bound on the number of messages 
exchanged so as to guarantee that two processes wil- 
ling to communicate do so within a certain finite time. 
4) The number of control signals exchanged between 
processes must be minimal. 


The implementations suggested in [3, 7, 14, 15] violate 
one or more of these four criteria. 

Time stamps have been used for the termination
detection of CSP programs [2]. The algorithm
given in this paper is based on the use of time stamps 
and is superior to that given in [12] both in its simpli- 
city and its performance. 

The paper is organized as follows. In section 2, 
a communication model is presented. In section 3, the 
time stamp based distributed implementation scheme 
(ANa scheme) is presented. In section 4, an imple- 
mentation of the algorithm in a programming nota- 
tion is given. In section 5, various properties of live- 
ness, deadlock freedom, weak fairness and the perfor- 
mance parameters of the scheme are established. In 
section 6, a discussion is provided on a) the degree of 


concurrency, b) the comparative performance of the 
algorithm and c) the resilience of the algorithm against
node failures. Section 7 concludes the paper with closing
remarks.


2. The Communication Model 


We assume a port directed communication 
model. In this model, each process is associated with a 
finite collection of input ports and output ports 
through which the messages are sent and received. A 
process can use an output port only to send messages 
and an input port only to receive messages. The input 
port of one process may be connected to the output 
port of another process. Such interconnection of ports 
using channels may be specified separately. We 
require that each port of a process must be connected 
to one and only one other port and the communica- 
tion between processes is synchronous. Each process, 
while in an alternative command, names a collection 
of ports on which it is willing to communicate. The 
port based communication paradigm is in contrast to 
explicit naming of processes in CSP. 

To facilitate communication between processes, 
each process is associated with a port manager. The 
process, upon entering an alternative command with 
communication commands in the guards, transmits to 
its associated port manager the set of ports (called the 
candidate communicant set) on which it is willing to 
communicate. The details of transmission of argu- 
ments are ignored for reasons of brevity and to con- 
centrate on the implementation scheme establishing 
synchronization. 


3. The Algorithm (ANa Scheme) 


Whenever a process enters an alternative com- 
mand, the process requests its port manager to pro- 
vide the alternative command service by transmitting 
to the port manager the set of ports named in the 
alternative command of the process, i.e., the candi- 
date communicant set. If an input command (output 
command) appears as a statement in the process, then 
the process utilizes the alternative command service 
made available by its port manager to achieve a 
matching output command (input command) on the 
designated port. Once a process triggers the alterna- 
tive command service of its port manager, the process 
waits until the port manager establishes the 
handshake. Each of the port managers runs the same 
distributed synchronization algorithm with which it 
obtains a synchronization partner. Since we assumed 
a fully distributed system without shared memory, the 
port managers have to find a synchronization partner 
by exchanging control signals with other port 
managers of the ports linked to the candidate com- 
municant set. We now describe, informally, the dis- 
tributed scheme given in figure 1. 

A port manager uses a variable Turn for each 
of the ports to mark the port to indicate whether it is
its own turn to initiate communication on that port or 
if it is the turn of the port manager of the port linked 
to that port to initiate communication. Initially, 
when the distributed program is set up, the port 
manager of each of the processes marks for each port 
the variable Turn as MyTurn, meaning that it is 
MyTurn to contact a partner on that port. A port 
manager uses a variable Committed for each of the 
ports to mark the port to indicate that a Commit sig- 
nal has been received on that port and that a reply 
has not been sent. A port manager acquires a time 
stamp for each of the alternative commands it services 
(the generated time stamps are monotonically increas- 
ing). 

Assume now that some of the processes in the 
distributed program have requested the initiation of 
the alternative command service by their respective 
port managers. Consider a process P executing an 
alternative command. The port manager of P picks 
arbitrarily one of the candidate communicant ports, p, 
whose state is MyTurn and sends the control signal 
Commit(p,Tid), where Tid is the time stamp acquired 
for this alternative command service by the port 
manager. The port manager then waits for a 
response. During this period of waiting for a response, 
two things can happen. A response may arrive on the 
port p and has to be dealt with. In addition, during 
the waiting period, some other port manager(s) may 
send Commit signal(s) in the hope of finding a syn- 
chronizing partner in P. 
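The bookkeeping described so far (a Turn and a Committed variable per port, a fresh time stamp per alternative command, and an arbitrary choice among the candidate ports marked MyTurn) might be sketched as follows. This is an illustration of the informal description, not the code of figure 1: the class and method names are hypothetical, a simple counter stands in for the port manager's clock, and signals are represented as tuples rather than sent over a network.

```python
import itertools
import random

MY_TURN, ITS_TURN = "MyTurn", "ItsTurn"


class PortManagerSketch:
    def __init__(self, ports, seed=None):
        # Initially every port is marked MyTurn and not committed.
        self.turn = {p: MY_TURN for p in ports}
        self.committed = {p: False for p in ports}
        self._clock = itertools.count(1)  # monotonically increasing stamps
        self._rng = random.Random(seed)
        self.tid = None
        self.cp = set()

    def begin_alternative(self, candidate_ports):
        """Start servicing an alternative command: new stamp, new CP."""
        self.tid = next(self._clock)
        self.cp = set(candidate_ports)

    def select(self, exclude=()):
        """Pick an arbitrary candidate port whose state is MyTurn."""
        eligible = sorted(
            p for p in self.cp
            if self.turn[p] == MY_TURN and p not in exclude
        )
        return self._rng.choice(eligible) if eligible else None

    def commit_signal(self, port):
        """The control signal Commit(p, Tid) for the chosen port."""
        return ("Commit", port, self.tid)


pm = PortManagerSketch(["a", "b", "c"], seed=0)
pm.begin_alternative(["a", "b"])
pm.turn["a"] = ITS_TURN  # suppose port a earlier answered No
chosen = pm.select()
signal = pm.commit_signal(chosen)
```

The `exclude` argument is a convenience for trying a different port after a No response; how the port manager reacts to incoming signals while waiting is a separate matter.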

If the response to the Commit signal sent is itself
a Commit signal, then the port manager regards
that as a willingness of another port manager to
synchronize and concludes that a synchronization partner has
been found. It synchronizes and then sets
Committed(p) to false and Turn(p) to MyTurn.

If the response to the Commit signal sent has 
been a No signal, then the port manager regards that 
the port manager of the correspondent of the port p is 
at the moment unwilling to synchronize and marks 
that fact in the variable Turn(p) as ItsTurn, denoting 
that it is the turn of the correspondent’s port manager 
to seek synchronization on that port. 

Suppose that during the period of waiting for a 
response to a Commit signal sent on the port p, Com- 
mit signals arrive on other ports. Then what should 
the port manager do? 

If the port on which a Commit signal has been 
received is not in the communicant set, then, natur- 
ally, the port manager of p cannot synchronize on 
that port, because it is not a suitable partner. And 
hence the port manager sends the control signal No as 
a response and sets Turn(p) to MyTurn. 

If the port on which a Commit signal has been
received is in the communicant set, then the port
manager knows that, if it were not waiting for a
response, it could have synchronized with that port
manager. It may withhold sending a response.


However, the port manager cannot hold back a 
response to all of the Commit signals it has received 
on its candidate communicant set during its waiting 
period until it itself has received a response to its 
Commit signal. This is because there is a potential for 
the creation of a cycle of port managers each of which 
is waiting for a response, but each of which delays 
sending a response to every other Commit signal 
received on the candidate communicant set in that 
period. Thus there will be deadlock. We seek to 
break this deadlock by requiring that each port 
manager send the time stamp that it has acquired 
along with the Commit signal. Thus, during the 
period of waiting for a response to a transmitted 
Commit signal, the port manager of p sends a No sig- 
nal on those candidate communicant ports on which a 
Commit signal with a time stamp less than that of its 
own time stamp is received. In addition, the port 
state variable Turn is set to MyTurn, denoting that 
the port manager of p must attempt communication 
on that port at any future instant. On the other 
hand, if the time stamp received on the Commit sig- 
nal is larger than its own time stamp, then the port 
manager of p delays a response to that Commit signal 
and marks the variable Committed for that port as 
true and the port state variable Turn as MyTurn. 

If a port manager has failed to synchronize on
the chosen port p, then it seeks to send a Commit
signal on a port, arbitrarily selected, on which the
sending of a response to a Commit signal is pending.
Synchronization then occurs on that port, and Committed is
set to false. The port manager sends the control signal
No on the rest of the ports on which a response is
delayed, and for each of those ports marks the variable Turn as
MyTurn and Committed as false, so that
future invocations of the alternative command may
use that information.

If there are no ports on which a response to a 
Commit signal is delayed, then the port manager 
selects a port q other than p and sends a 
Commit(q,Tid) signal. The port manager waits for a 
response and takes actions during the period of wait- 
ing in the same way as described above. 

If it happens that the port manager receives a 
response of No on each of its candidate communicant 
ports to a transmitted Commit signal, then the port 
manager waits. It synchronizes with any port in its 
candidate communicant set on which it receives a 
Commit signal the earliest thereafter. If more than 
one Commit signal is received, then it selects one of 
them to synchronize and sends a No signal as a 
response to the others.

If the port manager is not servicing an alterna- 
tive command, then the port manager sends a No sig- 
nal as a response to every Commit signal received on 
any port and marks the Turn of that port as ItsTurn 
for future use of that information. 
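The rules for answering an incoming Commit signal can be collected into one decision function. The sketch below follows the informal description in this section and is not the code of figure 1; the names are hypothetical, the return value "Sync" is only a marker for "a partner has been found", and a returned `None` means the reply is withheld. Equal time stamps are treated like larger ones, which is an assumption, and the later choice among ports with a withheld reply is left out.

```python
from dataclasses import dataclass, field

MY_TURN, ITS_TURN = "MyTurn", "ItsTurn"


@dataclass
class ManagerState:
    turn: dict                     # Turn(p) for every port
    committed: dict                # Committed(p) for every port
    candidate_ports: set = field(default_factory=set)  # CP
    tid: int = 0                   # stamp of the current alternative command
    servicing: bool = False        # is an alternative command in progress?
    awaiting_on: object = None     # port carrying our unanswered Commit


def on_commit(state, port, their_tid):
    """Decide the reply to a Commit signal received on `port`."""
    if not state.servicing:
        state.turn[port] = ITS_TURN
        return "No"
    if port not in state.candidate_ports:
        state.turn[port] = MY_TURN
        return "No"
    if state.awaiting_on == port:
        # A Commit in response to our own Commit: partner found.
        state.committed[port] = False
        state.turn[port] = MY_TURN
        state.awaiting_on = None
        return "Sync"
    if state.awaiting_on is not None and their_tid < state.tid:
        # Reject offers carrying a smaller stamp while we are waiting.
        state.turn[port] = MY_TURN
        return "No"
    # Otherwise withhold the reply and remember the offer.
    state.committed[port] = True
    state.turn[port] = MY_TURN
    return None


ports = ["a", "b", "c"]
state = ManagerState(turn={p: MY_TURN for p in ports},
                     committed={p: False for p in ports})
idle_reply = on_commit(state, "a", 5)  # no alternative command in progress
```

Rejecting only those offers with a smaller stamp while waiting is the ordering rule the text relies on to keep a cycle of waiting port managers from forming.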


4. The Implementation of the Algorithm 
An informal programming notation has been
employed for the description of the implementation 
scheme. An explanation of some relevant notations is 
given below. 


; denotes the sequential composition of statements. 
*(S] denotes that the statement S is repeated forever. 
S1||S2 denotes a parallel composition of the state- 
ments Sl and 82. 

Committed(p) denotes that a Commit signal has 
been received on port p and a reply has not been sent. 
CP denotes the set of candidate communicant ports. 
receive(...) denotes that the event of the reception of 
a message has occurred on some port and the recep- 
tion is asynchronous. 

Select(p) denotes the procedure of the port manager 
which selects a port for a communication attempt. 
Selected(p) denotes that a Commit signal has been 
sent on port p and either a reply has not yet been 
received or the transport of the message has not yet 
occurred. 

send(...) is the primitive employed by the port 
manager to transmit a signal on some port and the 
transmission is asynchronous. 

synchronized denotes that the port manager has 
achieved the transport of the message. 

Tid denotes the current transaction identifier. 

Time is the clock function of the port manager.
TM(p) denotes the activity relevant to the transfer of 
the message datum. 

Turn(p) denotes which of the predicates, MyTurn or 
ItsTurn is satisfied on the port p. 

wait(p) is a command providing for the suspension of 
the task until the predicate p is true. 


The algorithm consists of two loops and is 
given in figure 1. 

The first loop waits for the reception of a Com- 
mit signal and upon the reception of a Commit signal 
certain actions are taken. If the port on which the 
signal was received is not a port in CP, then the offer 
to communicate is rejected with a No signal and Turn 
is set to MyTurn for that port. The case where the 
port is in CP is more complicated. We require the 
port manager which sends a Commit signal to wait for 
a response. This requirement could allow a cycle of
waiting port managers to form; therefore, we prevent
the formation of a cycle by a requirement that a port 
manager while awaiting a response must reject some 
offers to communicate that are received during the 
waiting time. The rejection of some offers is depen- 
dent on the time stamp received with the offer and 
the current state of the port manager. If the time 
stamp received with the signal is less than the time 
stamp obtained by the port manager in the current 
alternative command and the port manager is await- 
ing a response to a Commit signal of its own then a 
No signal is sent in response. Otherwise Committed is 
set to true for that port. 


∀p Turn(p) := MyTurn;
∀p Committed(p) := false;
∀p Selected(p) := false;
CP := ∅;
BEGIN
  ∀p
  *[ wait( receive( Commit(p,u) ));
     Turn(p) := MyTurn;
     if p ∉ CP then
        send( No(p,Time));
     else if u > Tid ∨ ∀r∈CP ¬Selected(r) then
        Committed(p) := true endif
     else if u < Tid ∧ ∃r∈CP Selected(r) ∧ r ≠ p then
        send( No(p,Tid)) endif
     endif ]
  || *[ wait(CP ≠ ∅);
     Tid := Time;
     synchronized := false;
     repeat
        select(p);
        Selected(p) := true;
        send( Commit(p,Tid));
        if Committed(p) then
           TM(p); synchronized := true;
           Committed(p) := false;
           Turn(p) := MyTurn
        else
           wait( receive( No(p,u)) or
                 receive( Commit(p,u)));
           if receive( Commit(p,u)) then
              TM(p); synchronized := true;
              Turn(p) := MyTurn;
              Committed(p) := false
           else Turn(p) := ItsTurn
           endif
        endif
        Selected(p) := false;
     until synchronized;
     ∀p∈CP
        if Committed(p) then
           send( No(p,Tid));
           Committed(p) := false;
           Turn(p) := MyTurn
        endif;
     ]
END


Fig 1: Algorithm executed by a port manager. 
The second loop executes only when a process
is in an alternative command. Upon entering an 
alternative command Tid is set to a unique value 
obtained from a local clock which is maintained in 
loose synchronization with the local clocks of the 
other port managers in the system[10]. The selection 
of ports on which to attempt synchronization consists 
of two phases. The first phase involves active 
attempts by a port manager to synchronize with a 
candidate communicant via any port on which it is 
the port manager’s turn to initiate communication 
(Turn = MyTurn) or on any port on which it has 
received an invitation to synchronize (Committed(p) 
= true). Selection of a port for a communication 
attempt is accomplished in the procedure select(p). 
(It is intended that priority be given to the port on 
which a Commit signal was least recently sent and on 
which Turn = MyTurn.) Next, the signal 
Commit(p,Tid) is transmitted on the port p and a 
period of waiting for a response (wait(receive(No(p,u)) 
or receive(Commit(p,u)))) follows. If the predicate 
Committed is satisfied on the port p which was 
selected, then no response is required; therefore the
transport of the message (TM(p)) occurs. Otherwise, 
if the response is Commit(p,u), then the transport of 
the message occurs and if the response is No(p,u), 
then some other port will be selected for another com- 
munication attempt. The second phase is entered 
when for each candidate communicant port it is the 
turn of some other port manager to initiate communi-
cation (Turn(p) = ItsTurn). The port manager awaits
an invitation to synchronize and will synchronize with 
the first invitation and reject all other invitations 
received during this phase. In the second phase, the 
select operation must wait for the reception of a Com- 
mit signal on a port in CP. 
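
The response rule of the first loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `PortManager`, the method `on_commit`, and the representation of ports as strings and time stamps as integers are all hypothetical. The sketch follows the prose description above: a Commit on a port outside CP is refused, a Commit with a smaller time stamp is refused while the manager awaits a reply to its own Commit on another port, and otherwise the reply is delayed by marking the port as Committed.

```python
from dataclasses import dataclass, field


@dataclass
class PortManager:
    """Hypothetical model of the state consulted when a Commit arrives."""
    tid: int = 0                                   # Tid of the current alternative command
    cp: set = field(default_factory=set)           # candidate communicant ports (CP)
    selected: set = field(default_factory=set)     # ports with our own Commit outstanding
    committed: set = field(default_factory=set)    # ports whose Commit awaits our reply
    turn: dict = field(default_factory=dict)       # Turn(p) for each port

    def on_commit(self, p, u, now):
        """Handle Commit(p, u); return ('No', stamp), or None if the reply is delayed."""
        self.turn[p] = "MyTurn"
        if p not in self.cp:
            # Offer on a port that is not a candidate communicant: reject.
            return ("No", now)
        awaiting_other = any(r != p for r in self.selected)
        if u < self.tid and awaiting_other:
            # Older offer received while waiting on another port: reject it
            # so that no cycle of waiting port managers can persist.
            return ("No", self.tid)
        # Otherwise remember the offer and answer later.
        self.committed.add(p)
        return None
```

For example, a manager with Tid 10 that has a Commit outstanding on port "a" rejects Commit("b", 5) but delays its reply to Commit("b", 20).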


5. Some Properties of the ANa Scheme 


The specification and verification of the scheme in [1] 
establishes the results formally using interval based 
temporal logic[13]. 


Theorem 1: 


If the predicate function ItsTurn is true on all of the
candidate communicant ports of a process which is in
an alternative command, then, upon the reception of a 
Commit signal on a candidate communicant port, the 
port manager of the process will exit the alternative 
command eventually. 


Proof: The hypothesis corresponds to case 5 of the 
scheme. Since case 5 leads to case 4, and case 4 is 
specific to the port manager servicing the alternative 
command, the port manager will exit the alternative 
command service eventually. 


Theorem 2: Deadlock freedom of the implementation. 


If a port manager is committed on some port p, then,
eventually, it makes progress on that commitment.


Proof: 


case 1: Assume that there does not exist a cycle of 
commitments between port managers. Then, a port 
manager receives as a response either a Commit signal 
or a No signal to a Commit(p,id) signal transmitted 
by it on some candidate communicant port p. The 
two cases are dictated by case 2a and case 2b of the 
scheme. Thus the port manager makes progress on its 
commitment in these two cases. 

case 2: Assume that there exists a cycle of port
managers $P_1, P_2, \ldots, P_n$ with transaction identifiers
$t_1, t_2, \ldots, t_n$ such that, for $1 \le i \le n$, $P_i$ has sent a Commit
signal to $P_{i+1 \bmod n}$ and is awaiting a response. Since
the transaction identifiers obtained by each of the
port managers in the cycle are totally ordered and
responses to Commit signals may be delayed to
processes with greater transaction identifiers, there
exists a port manager, say $P_k$, with the minimum
transaction identifier in the cycle which has sent a
Commit signal to a port manager $P_{k+1 \bmod n}$. Thus it
may be assumed that the transaction identifiers are
ordered as follows: $t_1 > t_2 > \cdots > t_k < t_{k+1} > \cdots t_n$
and $t_n > t_1$. The response in $P_{k+1 \bmod n}$ is dictated by
case 3c of the scheme. Hence, a No signal is transmit-
ted on a port to $P_k$ and the cycle is broken.
Thereafter, progress in commitment is achieved as in
case 1.
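
The cycle-breaking argument of case 2 can be checked on small examples with the following Python sketch. It is an illustration only: the function name `rejected_senders` and the list encoding of the cycle are assumptions, and the rule applied is the one stated for the first loop (a waiting port manager answers No to a Commit whose time stamp is less than its own).

```python
def rejected_senders(tids):
    """Given the distinct transaction identifiers of a cycle in which manager i
    has sent Commit to manager (i + 1) mod n and every manager is waiting,
    return the indices of the senders whose Commit is answered with No."""
    n = len(tids)
    return [i for i in range(n) if tids[i] < tids[(i + 1) % n]]


# The manager holding the minimum identifier always receives a No,
# because its time stamp is smaller than that of its successor.
cycle = [7, 5, 3, 8, 9]
broken_at = rejected_senders(cycle)
```

Because the identifiers are distinct, the sender with the minimum identifier is always in the returned list, so at least one edge of the cycle is removed.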


Theorem 3:


If an alternative command names M ports as candidate 
communicants, then the signal Commit(p,id) need be
sent on some candidate communicant port p at most 
once, by the port manager, prior to synchronization 
with any one of the candidate communicants. 


Proof: By step 1 of the scheme a Commit signal may 
only be transmitted on a candidate communicant port 
p satisfying the predicate Turn(p) = MyTurn. If on 
such a port a Commit signal is transmitted, as in
case 1 of the scheme, and a Commit signal is received
on the port p as a response to the Commit signal, 
then synchronization occurs and the theorem is 
satisfied. If, however, a No signal is received as a 
response to the Commit signal then case 2b of the 
scheme specifies that Turn(p) = ItsTurn is satisfied 
on that port and the scheme requires that no further 
signals be transmitted on p until Turn(p) = MyTurn 
is satisfied. The predicate Turn(p) = MyTurn is 
satisfied on p under the condition of the reception of a 
Commit signal on p as in case 2a, case 3a, case 3c or 
case 6 of the scheme. If on p a Commit signal is
received, then, after a possible delay, either synchroni-
zation occurs following the transmission of a Commit
signal on p as in case 4a or case 5 of the scheme or, if
synchronization occurs on some other candidate com-
municant port then case 4b specifies that a No signal
is transmitted on that port. In all of the cases, only 
one Commit signal is sent and the theorem is satisfied. 


Lemma 1: 


If $Cl_i$ and $Cl_j$ are, respectively, the communicant
lists of two processes $P_i$ and $P_j$, both of which are in
alternative commands with transaction identifiers $id_i$
and $id_j$, respectively, and neither of which has syn-
chronized, and further, the ports $p_i$ and $p_j$ are linked,
where the port $p_i \in Cl_i$ and the port $p_j \in Cl_j$, then
ItsTurn($p_j$) $\rightarrow$ MyTurn($p_i$) $\wedge$ $id_j < id_i$.


Proof: By case 2b of the scheme the candidate com-
municant port $p_j$ satisfies Turn($p_j$) = ItsTurn iff the
port manager of the process $P_j$ has received a No sig-
nal as a response to a Commit signal on port $p_j$. If
the port $p_j$ is linked to the port $p_i$ and the port $p_i$ is a
candidate communicant port of the process $P_i$, then
the port manager of the process $P_i$, by case 2a, must
satisfy Turn($p_i$) = MyTurn on the port $p_i$. Thus the
lemma is established.


Theorem 4: Synchronization Condition 


If two processes $P_i$ and $P_j$, each of which is in an
alternative command, each name some pairs of ports,
$p_i$ and $p_j$, linking $P_i$ and $P_j$ as candidate communi-
cants, and further if neither of them synchronizes with
any other candidate communicant, then they synchron-
ize with each other on any pair of the named pairs of
linked ports $p_i$ and $p_j$.


Proof: From Lemma 1, the linked pairs of ports $p_i$ and
$p_j$ of the processes $P_i$ and $P_j$ satisfy
ItsTurn($p_j$) $\rightarrow$ MyTurn($p_i$) $\wedge$ $id_j < id_i$, and since it is
assumed that there is only a finite number of candi-
date communicant ports and since, by Theorem 2, the
implementation is deadlock free, the port manager of
at least one of the processes $P_i$ or $P_j$, say process $P_i$,
by case 1, must eventually transmit a Commit signal
on one of the linked ports, say $p_i$, to which the port
manager of the process $P_j$, by case 4a, must respond
with the transmission of a Commit signal on the
linked port $p_j$. Thus the theorem is established.


Theorem 5: 


When a process is in an alternative command with M 
number of candidate communicant ports, the port 
manager of the process need send at most M Commit 
signals on the candidate communicant ports in order
to synchronize.


Proof: From Theorem 3, a Commit signal need be
sent at most once on a candidate communicant port
prior to synchronization. Since there are M candidate
communicant ports, at most M Commit signals need be
sent.


Theorem 6: 


When a process is in an alternative command with M
number of candidate communicant ports, the port 
manager of the process need send no more than 2M-1 
Commit or No signals on the candidate communicant 
ports. 


Proof: If, by case 1, the port manager of the process
$P_i$, with M number of candidate communicant ports,
has sent a Commit signal on some candidate com-
municant port $p_i$ and is awaiting a reply to the Com-
mit signal, it may, while awaiting the reply, receive at
most M - 1 Commit signals on the other candidate
communicant ports $p_j$. The port manager of the pro-
cess $P_i$ must, by case 2b, send at most M - 1 No sig-
nals on the ports $p_j$ on which it received the Commit
signals. (If, after having sent the No signals, the port
manager continues to await the reply to the Commit
signal sent on the port $p_i$ and receives a second Com-
mit signal on any of the other candidate communicant
ports $p_j$, then the port manager will delay its response to
that Commit signal since, by Theorem 3, the second
Commit signal received on the port $p_j$ represents a
new alternative command transaction and contains a
transaction id greater than the transaction id obtained
by the port manager of the process $P_i$.) If, thereafter,
the port manager of the process $P_i$ receives a No sig-
nal on the candidate communicant port $p_i$, it may, by
case 1 or case 4a of the scheme, send Commit signals,
in turn, on each of the M - 1 candidate communicant
ports $p_j$ on which No signals have been sent. For
each of the Commit signals transmitted by the port
manager of the process $P_i$, a No signal may be received
as a response. In this eventuality case 5 is applicable.
That is, Commit signals may be received on each of
the M candidate communicant ports of the process $P_i$.
The port manager then transmits a Commit signal on
one of the candidate communicant ports of the pro-
cess $P_i$. Upon synchronization, and prior to the termi-
nation of the alternative command, case 4b requires a
No signal be sent, by the port manager of the process
$P_i$, on those candidate communicant ports on which a
Commit signal is pending. Since there can be at most
M - 1 such ports, at most M - 1 No signals are sent
after synchronization. Thus, when a process is in an
alternative command with M number of candidate
communicant ports, no more than 2M - 2 No signals
and 1 Commit signal are transmitted by the port
manager of that process.


Theorem 7: 


When a process is in an alternative command with M 
number of candidate communicant ports, the port
manager of the process need expect no more than 6M - 
2 signals on the M candidate communicant ports. 


Proof: From Theorems 5 and 6, the port manager
need send at most 2M - 1 + M signals. Each of the
signals sent is either in response to a Commit signal
received or is a Commit signal soliciting a communica-
tion. Thus the total number of signals that a port
manager need expect on the candidate communicant
ports is no more than 6M - 2.
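
The bounds stated in Theorems 5, 6 and 7 can be tabulated for a given M with a short Python sketch. The function name `ana_signal_bounds` and the dictionary keys are hypothetical labels chosen for this illustration; the formulas are copied from the theorem statements and from the sum used in the proof of Theorem 7.

```python
def ana_signal_bounds(m):
    """Upper bounds for a process in an alternative command with m
    candidate communicant ports, as stated in Theorems 5 to 7."""
    if m < 1:
        raise ValueError("an alternative command names at least one port")
    commits = m                    # Theorem 5: at most M Commit signals
    commit_or_no = 2 * m - 1       # Theorem 6: at most 2M - 1 Commit or No signals
    sent = commit_or_no + commits  # sum used in the proof of Theorem 7: 2M - 1 + M
    expected = 6 * m - 2           # Theorem 7: at most 6M - 2 signals expected
    return {
        "commits": commits,
        "commit_or_no": commit_or_no,
        "sent": sent,
        "expected": expected,
    }


bounds = ana_signal_bounds(4)
```

Note that, arithmetically, the bound 6M - 2 of Theorem 7 is exactly twice the sum 2M - 1 + M = 3M - 1 that appears in its proof.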


6. Discussion 


a) Degree of concurrency obtained by the algorithm: 
The degree of concurrency achieved by the 
algorithm is something hard to quantify. However, a 
comparison of the level of concurrency achieved by 
the implementation scheme with respect to that 
achieved by the implementations suggested in the 
literature can be made by identifying relevant compu-
tational steps in each of the algorithms. For such
purposes of comparison, the implementation scheme
suggested by Buckley and Silberschatz[4], henceforth
referred to as the BSi scheme, is considered. For this
comparison, the BSi scheme is mapped onto the
communication model of this paper. The port
manager of the process in the ANa (BSi) scheme 
awaits a response for a Commit (QUERY) signal 
transmitted by it. If during the period of waiting, the 
port manager receives a Commit (QUERY) signal on 
another candidate communicant port, it either delays
a response or transmits a No (BUSY) signal. (If a
Commit (QUERY) signal is received on a noncommun-
icant port then a No (NO) signal is sent.) Thus both
schemes are bound to achieve the same degree
of concurrency, or loss of concurrency thereof, in their
abilities to synchronize. Unlike the ANa scheme, the
port manager of a process, in the BSi scheme, 
transmits a RES signal on all its candidate communi- 
cant ports on which it has transmitted a BUSY signal 
in response to QUERY signals received by it. After 
the transmission of the RES signal, the port manager 
awaits RETRY signals on those ports. Of all the 
RETRY signals received by the port manager, the 
port manager selects one to synchronize with, by 
sending a COMMIT signal, and rejects all others by 
sending a NO signal. In the ANa scheme, on the 
other hand, if the port manager of a process has 
received a No signal in response to a Commit signal 
on any of its candidate communicant ports, then the 
port manager expects a Commit signal to arrive on 
that port. When on all of the ports the port manager 
has received No signals in response to the Commit signals
sent, the port manager awaits Commit signals on 
those ports. Then, of all Commit signals received by 
the port manager, the port manager selects one to 
synchronize with, by sending a Commit signal, and 
rejects all others by sending a No signal. Thus the
degree of concurrency in the selection of a port on
which to synchronize is the same in both of the
schemes.


b) Comparative Performance of the Algorithm 

The performance of the algorithm is measured 
in terms of the number of signals which must be 
exchanged to achieve synchronization. In the BSi 
scheme, a process sends out at most 2M unsolicited
signals per execution of an alternative command,
where M is the number of candidate communicant 
ports. The unsolicited signals in the BSi scheme 
correspond to the Commit signals in the ANa scheme. 
In the ANa scheme the number of unsolicited signals 
needed is halved. Theorem 5 shows that in the ANa 
scheme at most M Commit signals need be sent out. 
Natarajan distinguishes between input ports and out- 
put ports. On the assumption that input and output 
ports are equally likely to occur in an alternative com- 
mand, Natarajan’s scheme is also able to halve the 
number of unsolicited signals required to achieve syn- 
chronization. Natarajan’s scheme requires the port 
manager to initiate communication on output ports 
while on input ports the scheme requires the port 
manager to have received a signal prior to attempting 
synchronization. Thus, Natarajan’s scheme sends at 
most N unsolicited signals on output ports, where N is 
the number of output ports. If there are N input 
ports and on those ports a Ready signal had been 
received and a Reject signal sent and a Not Ready 
signal has not yet been received, then the port 
manager may send the equivalent of an unsolicited 
signal on those ports, thus achieving the M number of 
unsolicited signals, where M = 2N. In the ANa 
scheme the total number of signals generated in the 
system due to the initiation of an alternative com- 
mand by a process is 3M, where M is the number of 
candidate communicant ports. In the BSi scheme the 
total is 3M + Q, where Q is the number of candidate 
communicants with priority number less than that of
the process. 
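
The comparison above can be made concrete with a small Python sketch. The function name `compare_schemes` and its return layout are assumptions introduced for illustration; the counts are the ones quoted in this section (2M versus M unsolicited signals, and 3M + Q versus 3M signals in total).

```python
def compare_schemes(m, q):
    """Signal counts quoted in the text for an alternative command with m
    candidate communicant ports, of which q communicants have a priority
    number less than that of the process (relevant to the BSi scheme only)."""
    if not 0 <= q <= m:
        raise ValueError("q counts a subset of the m candidate communicants")
    return {
        "BSi": {"unsolicited": 2 * m, "total": 3 * m + q},
        "ANa": {"unsolicited": m, "total": 3 * m},
    }


counts = compare_schemes(m=6, q=2)
saved = counts["BSi"]["total"] - counts["ANa"]["total"]
```

With M = 6 and Q = 2, the quoted formulas give 12 versus 6 unsolicited signals and 20 versus 18 signals in total.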


c) Resilience against node failures:

In a distributed system nodes tend to fail. For 
the distributed program to achieve its goals in spite of 
node failures, the ANa scheme has to recover from 
node failures. A simple scheme for effecting such a 
recovery in the presence of node failures can be 
imposed on the ANa scheme. It is assumed that, at 
each node of the distributed system, there exist
mechanisms for the detection of the failure and the
restart of the node. In addition, it is assumed that 
when a node fails, the states of the port managers at 
that node are lost. The communication system is
assumed to be reliable: the messages and sig-
nals sent are delivered in the order of transmission
and, further, when a message is transmitted, it is
either delivered to the destination port or an excep-
tion is signaled to indicate the failure of the
destination node.

Whenever the node kernel is restarted after the 
failure of the node, the kernel signals the restart to 
each of the port managers at that node. Upon the 
receipt of a restart signal from the node kernel, the 
port manager at the previously failed node transmits a 
Restart signal on all of its ports. The port manager 
then waits for responses to its Restart signal in order 
to reset its local clock. A port manager which receives
a Restart signal on a port satisfies the predicate func- 
tion MyTurn on that port and transmits on that port 
the signal, Time(p,t), containing the time, t, indicated 
on its local clock. The port manager receiving the sig- 
nal Time(p,t), sets its local clock to the maximum of 
the Time signals received. 
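
The clock reset performed by a restarted port manager can be sketched in Python as follows. This is a hedged illustration: the function name `reset_clock`, the dictionary mapping each port to the time t carried by its Time(p,t) signal, and the fallback used when no reply has arrived are all assumptions not specified in the text.

```python
def reset_clock(time_replies, fallback=0):
    """Return the new local clock value of a restarted port manager.

    time_replies maps each port p to the time t received in Time(p, t).
    The clock is set to the maximum of the received times; fallback is a
    hypothetical value used only when no Time signal has been received."""
    if not time_replies:
        return fallback
    return max(time_replies.values())


new_clock = reset_clock({"p1": 1040, "p2": 1102, "p3": 987})
```

Taking the maximum means that time stamps issued after the restart are not smaller than any time reported by a neighbouring port manager.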

If the communication system detects a failure
of the destination node of a signal or a message, then
the port manager which transmitted the signal or 
message receives from the communication system a 
failure signal on the port on which the message or sig- 
nal was sent. The port manager then satisfies the 
predicate function ItsTurn on that port. 


7. Conclusions 


We have presented a simple and efficient time 
stamp based synchronization scheme for communicat- 
ing processes, utilizing output commands as guards in 
the alternative command. The scheme was developed 
out of an interest in the utility of temporal interval 
logic for the specification and synthesis of parallel 
algorithms. A temporal interval logic based 
specification of the scheme is presented in [1]. The 
scheme is port based, utilizes the exchange of control 
signals to select a suitable communication partner and 
relies on time stamps to introduce the asymmetry 
required to avoid deadlock. The selection of a com- 
munication partner requires a process to send at least 
one and at most M signals indicating its willingness to 
establish communication, where M is the number of 
ports on which the process is willing to communicate. 
During the time that a process is attempting to find a 
communication partner, and prior to the determina- 
tion of its communication partner, at most M - 1 sig- 
nals will be sent rejecting pending offers to establish 
communication. These signals are required to prevent 
deadlock in the implementation. After the selection 
of a communication partner and prior to the termina- 
tion of the alternative command, at most M - 1 sig- 
nals are required to reject pending offers to establish 
communication. These are required to ensure the pro-
gress of other processes in their own selection algo-
rithm. The various properties of deadlock freedom of
the implementation and performance of the scheme 
have been established. A critique of the scheme has 
been presented in the light of some other schemes 
appearing in the literature. It has been shown that 
the scheme is efficient in the number of signals 
transmitted for achieving synchronization. The
degree of concurrency achieved by the scheme is also
comparable to that achieved by other schemes.
The time stamps may be implemented through the
use of logical clocks in each process which are kept in
loose synchronization using the algorithm due to Lam-
port[10]. The storage requirements of the scheme are
small: a local clock and, for each port, two bits to indi-
cate the state of the port.

It is hoped that the implementation scheme will be
embedded in the runtime system of a compiler which
extends OCCAM[9] to include output commands in
the guards. Propositional temporal logic has been
shown to be suitable for the synthesis of synchroniza-
tion skeletons for communicating processes[5, 11] and
we are exploring the suitability of temporal interval
logic for the specification and synthesis of parallel
algorithms.


Intra-transaction concurrency in distributed databases and
protocols which use transaction aborts to preserve
consistency: a performance study.*

ABSTRACT: Concern for the performance of concurrency control
protocols in distributed database systems raises an important
question: does performance improve by permitting intra-
transaction concurrency? This paper answers this question for
protocols which use abortion of transactions to preserve con-
sistency. Variants of the protocol proposed by Reed [REE78] are
simulated and used as representatives of those protocols which
abort transactions. The results strongly indicate that as the
workload increases, intra-transaction concurrency leads to
poorer performance. Further, they suggest that all entities should
be assigned an order and, whenever the access set of a transaction is
known a priori, the transaction should access entities in this
assigned order with possible skips.

1. Introduction. 


To improve performance of distributed database systems, we 
may like to execute each transaction concurrently at multiple 
sites. Further, if such Intra-transaction concurrency at multiple 
sites does improve the performance then we would like to 
increase this improvement by increasing the number of sites: we 
could stretch this to the extreme by having each entity as a site 
(i.e. by associating a processor with each entity). In this paper we
look into the efficacy of permitting intra-transaction con-
currency, particularly in systems which abort transactions to
preserve consistency. Some of the protocols which use transac-
tion aborts for preserving consistency are the Optimistic protocol
[KUN81], Reed’s protocol [REE78], the Two-phase locking protocol¹*
[ESW76], and the protocol presented in [BAY80]. An increase in
the workload²* will lead to an increase in the transaction aborts for
each of these protocols, which in turn will increase the workload.
For sustained workloads higher than a certain level (which is 
different for each protocol) this cyclic process will lead to a non-
linear drop in performance. The effect of different system parameters
on such a drop is not well understood. The extent and the rate of
this drop will vary, but for each protocol, the performance will
be unacceptable for workloads higher than a certain workload
level. We investigate the effect of intra-transaction concurrency
on this level by conducting simulation studies. We note that it
is desirable to have this workload level as high as possible even if
it results in a poorer performance for the lower workloads, since 
poor performance is more affordable at lower workloads than at 
higher workloads. For the purpose of this study we chose vari- 
ants of the protocol proposed by Reed [REE78] which permit 
varying degrees of intra-transaction concurrency. 


We found that as the workload increases the performance (in 
terms of the response time and the throughput) deteriorates at a 
faster rate for higher levels of Intra-transaction concurrency. In 
section 2 we summarize the variants of Reed’s protocol. In
section 3 we briefly describe the simulated System Model. In sec- 
tion 4 we analyze the results and in section 6 we summarize the 
conclusions. 


2. A summary of Reed’s protocol.

First we briefly describe a variant of Reed’s [REE78] protocol, to
be termed RP. Then we shall explain how we can incorporate
different levels of intra-transaction concurrency in RP.


* This work was partially supported by National Science Foundation under 
Grant MCS-9214613. 

** This work was done while the Author was at U.T.Austin. 

1* In context of Two-phase locking protocol, we use the term ‘abort’ to mean 
releasing of the locks granted to a transaction during the locking phase, when a 
deadlock is detected. 

2* In this paper we do not define workload quantitatively and refer to a 
change in the workload in terms of a change in the parameter values. Typical 
Multiple versions and tokens are maintained at each entity and
are logically ordered in the order of the logical precedence among 
their creator transactions. A token is a possible version written 
by a transaction which is still executing. A token may be either 
converted to a version or deleted, depending upon whether the 
creator transaction completes or aborts. Each version and token 
has four fields: first, value of the version; second, the ‘start’ field, 
which is the timestamp of the transaction which created the ver- 
sion or the token; third, the ‘end’ field, which is the timestamp of 
the transaction which last read the version (in the case of tokens, 
this is same as the start field); fourth, a boolean ‘commit’ which 
is ‘true’ for a version and ‘false’ for a token. In addition each 
token has a pointer to a record of the status of its creator tran- 
saction called possibility of the creator transaction. 


When a transaction T is initiated, the initiating site assigns it a 
globally unique timestamp TS and creates its possibility P. 
Transactions logically execute in the order of their timestamps. 
P indicates whether T is active, or it has successfully completed 
or aborted. When T is active, associated with P is a list of
messages which are sent when T either completes or aborts, so
that the tokens created by T may be either converted to versions 
or deleted and the transactions waiting to read these tokens may 
be activated. When all the tokens created by T have been 
notified about completion or abortion of T then P is deleted. 


When T wants to read an entity, it must select a version or a 
token with start field of the highest value less than TS. If a ver- 
sion is selected, then T reads the version and the value of the 
end field of the version is set to TS if it is less than TS. How-
ever, if a token is selected, then a message is associated with the
token to activate T when either the token is converted to a ver-
sion or is deleted, and T waits.³* When a message is received
indicating that the creator of a token has completed, the token
becomes a version and any waiting transaction is activated to
read it and to update the end field if required. When a message
is received indicating that the creator of a token has aborted 
then the token is deleted and each waiting transaction is 
activated to restart the read operation. 
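
The read rule of RP can be sketched in Python as follows. This is a minimal single-site illustration, not the simulated implementation: the names `Record`, `select_for` and `read`, and the use of integers for timestamps, are assumptions. A record with `commit=True` stands for a version and one with `commit=False` for a token, and the possibility pointer and the waiting messages are omitted.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A version (commit=True) or a token (commit=False) of one entity."""
    value: object
    start: int      # timestamp of the creator transaction
    end: int        # timestamp of the last reader (equals start for a token)
    commit: bool


def select_for(records, ts):
    """Select the record with the highest start field less than ts, if any."""
    older = [r for r in records if r.start < ts]
    return max(older, key=lambda r: r.start) if older else None


def read(records, ts):
    """Apply the read rule for a transaction with timestamp ts."""
    chosen = select_for(records, ts)
    if chosen is None:
        return ("none", None)        # no older record exists (case not covered by the text)
    if chosen.commit:
        if chosen.end < ts:
            chosen.end = ts          # record that ts has read this version
        return ("value", chosen.value)
    return ("wait", chosen)          # a token: the reader must wait for its creator
```

A reader with timestamp 7 facing versions created at 1 and 5 and a token created at 8 reads the version created at 5 and raises its end field to 7; a reader with timestamp 9 must wait on the token.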


When T wants to write an entity, it first selects a version (or 
token) just as for a read operation. The new token to be written 
will be placed after this selected version (token). The end field of
the selected version is then compared with TS. If it is not less
than TS then T aborts,⁴* and the associated tokens and P are
deleted. Otherwise, the new token can be written without violat-
ing consistency. If T wants to read the entity before writing
then it does a read operation on the selected version or token. If
the reading of such a token fails then T restarts the write operation;
otherwise the write operation continues as follows. A new token
is created with TS as the start field and is appropriately placed 
among other versions and tokens of the entity. A message is 
placed in the list associated with P to notify the site of this 
entity whenever T completes or aborts. 
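
The write rule of RP can be sketched in the same style. Again this is an illustration under assumptions: `Record` and `try_write` are hypothetical names, the optional read before writing and the notification list of the possibility P are omitted, and a write that finds no older record is simply allowed.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A version (commit=True) or a token (commit=False) of one entity."""
    value: object
    start: int
    end: int
    commit: bool


def try_write(records, ts, value):
    """Apply the write rule for a transaction with timestamp ts.

    Returns False if the transaction must abort, True if a token was created."""
    older = [r for r in records if r.start < ts]
    selected = max(older, key=lambda r: r.start) if older else None
    if selected is not None and selected.end >= ts:
        # A transaction with a timestamp not less than ts has already read
        # the selected version, so the write would violate consistency.
        return False
    # Create the token with ts as both start and end field, and keep the
    # records ordered by the timestamps of their creators.
    records.append(Record(value, ts, ts, False))
    records.sort(key=lambda r: r.start)
    return True
```

For instance, if the only version was created at time 1 and last read at time 6, a writer with timestamp 4 aborts while a writer with timestamp 9 creates a token.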


When T has accessed all the required entities then a message is 
sent to T’s initiation site. When this message is received, mes- 
sages are sent to entities at which T created any version and T is 
committed. Old versions are deleted as soon as it is established 
that no transaction will need to read them in order to maintain 


reasons for an increase in the workload are, a reduction in inter-arrival time
between transactions, an increase in the average number of entities accessed by 
transactions, an increase in the extent of data sharing etc. 


3* In [REE78], the proposed procedure is a little different.
4* A transaction with timestamp higher than TS has read the selected version 
after which the new token will be placed. 


consistency. This simulates a system with the least number of 
versions, without resulting in abortion of read transactions. This 
procedure is different, and deletes old versions sooner, than the
procedure proposed in [REE78]. 


[REE78] does not require or suggest any particular order in which 
a transaction may access entities. We considered three options 
RP1, RP2, and RP3, as shown in Figure-1. These options permit
different levels of intra-transaction concurrency. In RP1, a tran-
saction executes sequentially (i.e. no intra-transaction con-
currency) and it accesses entities in an order pre-assigned to all
entities.⁵* This option requires that each transaction be able to
incrementally declare its access subsets from the successive parts
of the pre-ordered entities. In RP2, at each site, transactions
access entities just as in option 1, and they may concurrently
execute at all the sites to be accessed. This option permits par-
tial intra-transaction concurrency, and requires that each tran-
saction be able to incrementally declare its access subsets from
the successive parts of the pre-ordered entities on each site. In
RP3, a transaction concurrently attempts to access all the 
required entities. This option permits the highest possible intra- 
transaction concurrency. 


The nature of a transaction may impose some constraints on the
order in which entities should be accessed. A transaction may
not be able to determine whether to read/write an entity e_i, or
may not be able to determine the value to be written, until it
reads some entity e_j after e_i in the pre-assigned order. In all
simulated options such writes are done after e_j has been read.
However, whenever possible such reads are done in anticipation
(i.e. before reading e_i).


3. The System Model. 


In the process of describing the system model we shall define
parameters and give, in parentheses, the values assigned to each
parameter in the simulation runs. The simulated system has S
(10) sites. There are a total of M (1000) entities in the database
system, and each site has M/S (100) entities. There is
no data replication. We presume a network topology in which
the communication delays between any two sites have the same
mean and the same distribution. Communication delay between
any two sites either is zero or follows an exponential distribution
with a mean DEL (100 time units) over the entire range of
communication traffic generated in the system. The communication
network of the system never fails. Transactions are of two types:
read transactions and write transactions. Read transactions only
read. Write transactions write all of the entities to be accessed;
and, before writing an entity, they read it. A write operation
may have a constraint such that before writing e_i, the N-th (5)
entity to be accessed after e_i must be read. Each write
transaction may have L (1) such constraints. R/W (3, 15) is the ratio of
the number of read transactions to the number of write transactions
entering the system.
Inter-arrival time between transactions follows an exponential
distribution with a mean of ARRL (1, 10, 20, 100 time units).
Each transaction accesses SN (3) sites and an equal
number of entities at each site. Transaction size, or number of
entities accessed by a transaction, follows a geometric
distribution with mean transaction size TZ (3, 6, 15, 100). Each read,
write, or comparison takes 1 time unit.


When a new transaction is initiated, it is assigned a transaction
type and an initiation site. Then sites to be accessed by the
transaction are selected. The first site is selected with uniform
probability from the 1st through the (S-SN)th sites. Let the kth site be
the last selected site when SM sites have been selected. The next
site is chosen from the (k+1)th to the (S-SN-SM)th sites, such that
the distance between two consecutively selected sites (footnote 6*)
follows a geometric distribution with mean Dis-s (5.0, 1.66, 1.0). This
process is repeated until all SN sites to be accessed have been
selected or until n sites are yet to be selected and only n more
sites remain to choose from, in which case all the remaining n
sites are selected. The process for selection of entities is similar to the
process for selection of sites except that the mean distance
between two consecutively selected entities is Dis-e (5.0, 1.66,
1.0).
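
The site selection procedure can be sketched roughly as follows. This is only an illustration under stated assumptions, not the simulator used in the study: the function names `geometric` and `select_sites` are hypothetical, the geometric distribution is taken on {1, 2, ...} so that a mean of 1.0 always gives distance 1, and the upper limit on each jump is simplified to "leave enough sites for the selections still to be made".

```python
import random

def geometric(mean, rng):
    """Sample a distance from a geometric distribution on {1, 2, ...}
    with the given mean (success probability p = 1/mean)."""
    p = 1.0 / mean
    k = 1
    while rng.random() >= p:
        k += 1
    return k

def select_sites(S, SN, mean_distance, rng):
    """Pick SN distinct sites out of 1..S in increasing order (assumes SN < S)."""
    sites = [rng.randint(1, S - SN)]      # first site: uniform over 1..(S-SN)
    while len(sites) < SN:
        remaining = SN - len(sites)       # sites still to be selected
        last = sites[-1]
        if S - last <= remaining:
            # only as many sites remain as are still needed: select them all
            sites.extend(range(last + 1, last + 1 + remaining))
            break
        # assumption: cap the jump so that later selections still fit
        nxt = min(last + geometric(mean_distance, rng), S - remaining + 1)
        sites.append(nxt)
    return sites

# Example with the parameter values quoted in the text: S=10, SN=3, Dis-s=5.0
example = select_sites(10, 3, 5.0, random.Random(0))
```

With a mean distance of 1.0 the sketch always selects consecutive sites, which corresponds to the lowest dispersal setting.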


4. Results and the analysis.


The simulator was validated by checking various events and the
results for extreme cases. The results were collected for 2000 (in
a few cases 1000) completed transactions. Data collection began
after completion of the first 500 transactions. Since RP3 is a special case
of RP2 (in which each entity is a different site) we planned to 
run simulations for RP3 only if RP2 performed better than RP1 
for an interesting range of parameter values. As the results will 
show we did not have to run simulations for RP3. The average 
response times reported here refer to the response times of com- 
pleted transactions. Hence, under heavy workloads (when many 
transactions have not completed), the reported average response 
times are less than the actual average response times. Each such 
occurrence has been marked with an asterisk (*) alongside the 
average response times. 


[AHU84] reports results for a wide range of parameters and con- 
cludes that it is sufficient to study results for two extremes of 
dispersal of accessed sites and entities, which are termed high 
and low. A combination of Dis-s=5.0 and Dis-e=5.0 represents 
a high dispersal and combined effect of Dis-s=1.0 and Dis-e=1.0 
represents low dispersal. Comparison of the performance of RP1 
and RP2 based on the results reported in [AHU84] can be
typically summarized by the selected results presented in Table-1.
These results correspond to DEL=100, TZ=15, SN=3, R/W=3,
for low and high dispersal.


We observe that at low workloads (i.e. for ARRL=100), the 
average response times are lower for RP2 compared to RP1. 
With respect to other parameters also, RP2 performs marginally 
better than RP1. The primary reason for this is the lower com- 
munication delay for RP2 during execution of a transaction. 


As the workload increases, however, the response time of transac- 
tions and the rate of abortion of write transactions increase (as 
shown by an increase in the number of times a write transaction 
had to be submitted.) The aborted transactions result in the 
abortion of other transactions, and the abortion rate increases in 
a cyclic process. We are interested in finding out if this process 
occurs at substantially different workload levels for RP1 and 
RP2. In the following discussion, we shall cite a reduction in 
ARRL as the reason for an increase in the workload. However, 
the observations made here are true for an increase in the work- 
load due to other reasons (e.g., an increase in TZ or dispersal.) 


We observe that for high ARRL (e.g., 100) the abortion rate of 
transactions is marginally lower for RP2 than for RP1. With a 
reduction in ARRL, however, the abortion rate increases much 
more sharply for RP2 than for RP1 and for RP2 it quickly 
becomes extremely high. In Figure-2 we have plotted the aver- 
age number of times a write transaction had to be submitted for 
both RP1 and RP2 against ARRL, for low and high dispersals. 
These plots support the above observations and they show the 
magnitude by which the abortion rate and the rate of its increase 
are higher for RP2 compared to RP1. The reason for this 
behavior is that above a certain workload level, when a
transaction aborts, the increase in the probability of aborting other
transactions is higher for RP2 than for RP1. At first it is due to the
higher average number of entities accessed by aborted
transactions (footnote 7*) and at yet higher workloads it is due to the
higher abortion rate in RP2.


5* An alternative to RP1 which does not permit any intra-transaction con- 
currency could be that each transaction executes sequentially and accesses enti- 
ties in a random order. Few simulation runs strongly confirmed that this option 
does not perform as well as RP1 (due to a transaction visiting a site more than 
once and due to higher abortion rate, both due to transactions accessing entities 
in a random order.) So we eliminated this option from further detailed study. 


6* Distance between two neighboring sites (entities) in the order of sites (enti- 
ties) is considered to be one. 


7* At low workloads, on average an aborted transaction will have accessed 
more number entities in RP2 than in RP1. For RP1, when a transaction is 
aborted it will have read entities before (but not after) the entity at which it 
could not write. For RP2, however, when a transaction is aborted it might have 
accessed the entities both before and after the entity which it could not write. 


At low workloads, the higher number of entities accessed by aborted
transactions in RP2 leads to a higher abortion rate, for the
following reason. The value of the end field of the version read by
an aborted transaction T_i will be equal to (or more than) the
timestamp of T_i. This may lead to abortion of another
transaction with a timestamp lower than the timestamp of T_i (while
logically, T_i did not read the version, since it was aborted).
Hence, for RP2 the cyclic process leading to higher abortion rates 
occurs at lower workloads and at a faster rate. For example, let 
us consider an increase in the dispersal of accessed sites and enti- 
ties from low to high, for ARRL=10. For RP2 the abortion rate 
increased 7.8 times (from 22.28 to 179.0) as compared to a 
corresponding increase of 2.7 times (from 4.47 to 12.04) for RP1. 


As the workload increases, the cyclic process leads to more
frequent abortion of transactions in the earlier stages of execution,
and this happens at a higher rate for RP2. At high workloads this
abortion of transactions at earlier stages can be confirmed for
RP2 by the results in Table-1. Consider the example of an
increase in the dispersal of accessed sites and entities from low to
high for ARRL=10. The average number of entities accessed by
an aborted transaction was reduced by approximately a factor of
1.95.


We conclude that at low workloads average number of entities 
accessed by an aborted transaction is higher for RP2 compared 
to RP1. This leads to the cyclic increase in the abortion rate at 
much lower workloads for RP2. This conclusion can be extrapo- 
lated to compare RP1 with RP3. If a transaction accesses TS 
entities, then RP3 can be viewed as RP2 with SN=TS. For 
SN=TS the analyses given above for comparing RP1 and RP2 
are likely to be all the more valid. Hence, it is advisable to
choose RP1 over RP2 and RP3, and it is advisable not to
permit intra-transaction concurrency. This means accepting a
marginally higher cost at lower workloads (when the higher cost
is affordable) in exchange for the savings at higher workloads. It is a
wiser choice, since the need for a well performing protocol is
more critical at higher workloads. Just as in RP, in other
protocols which permit transaction aborts, an increase in the workload
increases the abortion rate. Also the factors which increased this 
rate for RP when intra-transaction concurrency is permitted also 
occur in these protocols. Hence by the same reasoning, for these 
protocols also it is advisable to not permit intra-transaction con- 
currency. 


5. Conclusions. 


We conclude that permitting intra-transaction concurrency is not 
an effective way to improve performance, if transactions are 
aborted to preserve consistency. Also whenever possible transac- 
tions should access entities in some pre-assigned order. This 
improves the overall system performance at high workloads. If 
such ordering cannot be implemented over the entire database
then it may be done over a part.


COMPARISON OF SELECTED RESULTS FOR RP1 & RP2


‘ 
ee Ce ee ee 
EXTENT OF DISPERSAL OF ENTITES 28 igi an 
AND SITES ACCESSED 8Y TRANSACTIONS [~ iy 


AVERAGE RESPONSE TIME fart] eee | 600 | 030 | os | ate | axe 
OF WRITE TRANSACTIONS [ara] eoo | a2 | seo | 07 | 200 | 200 


AVERAGE RESPONSE TIME fap] tsae*| 1466 | 2112) 732 | 802 | 477 | 
OF READ TRANSACTIONS |RP2| soos*| ses} 1654°) 436 | 334 | 326 | 
THROUGHPUT OF WRITE TRANSACTIONS: 
AS PERCENTAGE OF WRITE TRANSACTIONS ra] sas | r20 | 77s | 72 | oo | 
STC [wal an feo | oe ww | ow 
PERCENTAGE OF TOKENS Rpt] 188 | 289 | 213 | 536 | 71.9 | 886 | 
CONVERTED TO VERSIONS Rp2| 12 | 48 | 11 | 62.7 | 840 | 89.0 | 


seas en fstert tetas ee 
PER ENTITY 
LRP 


AVERAGE NUMBER OF FIELDS 
PER ENTITY 


AVERAGE NUMBER OF TIMES A WRITE 
TRANSACTION HAD TO BE SUBMITTED 
FOR SUCCESSFUL COMPLETION 


AVERAGE TIME A TRANSACTION 
WAITS FOR READING TOKENS 


[Api] 626 | 346 | see | 158 | 36 | 4 | 
Rez] 201 | 166 | 207 [7 | ef 7 


PARAMETERS: DEL = 100, TZ = 15, SN = 3, R/W = 3


AN ASTERISK (*) INDICATES THAT THE REPORTED AVERAGE RESPONSE
TIME IS SMALLER THAN THE ACTUAL AVERAGE RESPONSE TIME, DUE
TO UNCOMPLETED TRANSACTIONS.


TABLE-1 


Figure-1: Options RP1, RP2 and RP3 for the order in which a
transaction accesses the required entities. E_i,j denotes the j-th
entity on the i-th site.


ABSTRACT -- Task allocation problems for a class
of distributed systems are studied. The subtasks
comprising the computational task should satisfy
certain precedence constraints and the minimization
of the system response time is the criterion used to
determine the optimal allocation strategy. This task
allocation problem is NP-complete. An optimal
branch and bound algorithm as well as two
approximate algorithms based on the greedy and
local search approaches are developed for its
solution. Finally, a comparison of the algorithms is
presented.


INTRODUCTION 


An important design problem of distributed 
systems is the allocation of the subtasks comprising 
the computational task to the individual processors. 
Task allocation for distributed systems has been 
studied in [1-6]. These studies employed the 
maximization of the system throughput or the 
minimization of the interprocessor communication 
cost as the criterion to determine the optimal task 
allocation. 


The requirement of fast real-time response, 
however, necessitates the use of a different
performance criterion. The optimal task allocation 
must now be determined by minimizing the system 
response time. This criterion was used by Stone 
[7] to study the allocation problem for a task with 
no parallelism and a nonhomogeneous system of 
processors. 


The allocation problem for a task that
exhibits parallelism is considered in this study. This 
task is modeled as a directed acyclic graph (DAG), 
whose nodes denote the subtasks and whose arcs 
indicate precedence constraints that must be 
satisfied during the execution of the subtasks. The 
optimization problem considered corresponds to the 
partitioning of the nodes of the graph in such a 
way as to minimize the system response time. 


If the execution of two subtasks must satisfy
a precedence constraint and these subtasks are
allocated on different processors, the constraint can
be satisfied via interprocessor communication.
However, communication between processors may
incur significant overhead. In that case, a system
not allowing for interprocessor communication may
provide faster response times despite the fact that
it introduces the need for redundancy of subtask
executions. The system without interprocessor
communication is also interesting for another
reason. Communication links between processors
may be among the least reliable parts of the
system and communications between processors
may collapse due to hardware or software faults or
to externally inflicted damage. Therefore, in many
real-time applications with high reliability
requirements (e.g. critical control and navigation 
computers), the multiple processor system should be 
able to operate in an optimal way without 
interprocessor communication links. 


Optimal and approximate algorithms are 
developed for the solution of the task allocation 
problem without interprocessor communication and 
their performance is evaluated. 


PROBLEM DEFINITION 


The computational task considered consists of
subtasks whose execution must satisfy certain
precedence constraints. Such a task can be
modeled as a directed acyclic graph (DAG) (see Fig.
1). The nodes of this DAG correspond to the
subtasks and the arcs indicate precedence
relationships for the execution of the subtasks. Arcs
emanating from node i and ending at nodes j1, j2,
..., jk, for example, indicate that subtask i must wait
for data from the execution of all subtasks j1, j2,
..., jk before it can be processed.


The problem is now to schedule the subtasks
on m identical processors subject to the
precedence constraints indicated by the DAG. It is
assumed that there is no interprocessor
communication and consequently there is no data
flow between subtasks allocated on different
processors.


Since no interprocessor communication is
allowed, if a node is allocated to a processor then
all of its immediate descendants should also be
allocated to the same processor. For the example of
Fig. 1, if node 5 is allocated to the i-th processor,
then nodes 4, 2, and 3 must also be allocated to
the i-th processor because node 5 requires data
from nodes 2, 3, and 4. A root of a DAG
corresponds to an independent process which
includes all nodes that can be reached from that
root. All the nodes of an independent process must
be allocated to the same processor.


Clearly, the problem of task allocation
corresponds to the allocation of independent
processes derived from a DAG. For example, the
DAG shown in Fig. 1 has three independent
processes each corresponding to a root and defined
by a subgraph of the DAG (or subdag) as follows.


Process 1 (p1) : Subdag 1 containing nodes 8, 6, 4
and 1.


Process 2 (p2) : Subdag 2 containing nodes 9, 6, 4
and 1.


Process 3 (p3) : Subdag 3 containing nodes 10, 7, 6,
5, 4, 3, 2 and 1.
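

The derivation of independent processes from the roots of a DAG can be illustrated with a short sketch. The arc set `NEEDS` below is hypothetical: it was chosen only so that the reachable node sets agree with the three subdags listed above and with the statement that node 5 requires data from nodes 2, 3 and 4. The function names are likewise illustrative.

```python
# Hypothetical arc set: node -> nodes whose data it needs.
# It is NOT taken from Fig. 1; it merely reproduces the listed subdags.
NEEDS = {
    8: [6], 9: [6], 10: [7, 6],
    7: [5], 6: [4], 5: [2, 3, 4],
    4: [1], 3: [], 2: [], 1: [],
}

def roots(needs):
    """Roots are the nodes that no other node depends on."""
    used = {d for deps in needs.values() for d in deps}
    return sorted(n for n in needs if n not in used)

def subdag(needs, root):
    """All nodes reachable from root, i.e. one independent process."""
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(needs[n])
    return seen

# One independent process per root
processes = {r: subdag(NEEDS, r) for r in roots(NEEDS)}
```

Because each process is the full set of nodes reachable from its root, placing a whole process on one processor automatically satisfies the rule that a node and all of its descendants share a processor.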


Figure 1: DAG of a computational task. The
numbers inside the nodes represent the
weights (execution times) associated
with each subtask.


The combinatorial problem is now introduced
when the number of independent processes in a
DAG is greater than the number of processors.
Consider the DAG of Fig. 1 and a two-processor
system. There exist three possible root partitionings
of the DAG of Fig. 1 leading in general to six
possible allocations of the independent processes to
the two processors. If the processors are identical,
it is obvious that, for example, the allocation (p1,
p2) to processor 1 and p3 to processor 2 is
equivalent to the allocation p3 to processor 1 and
(p1, p2) to processor 2. Thus, the number of
possible allocations is reduced to three and these
allocations are shown in Table 1.


A specific weight equal to the execution time
E_i of the subtask is associated with each node. Each
independent process associated with a root
partitioning has a total weight that is equal to the
sum of the weights of the nodes it contains.
Obviously, the process with the maximum weight
determines the system response time.


In order to optimize the system response
time, the following optimization problem must be
solved: "Minimize the system response time over all
possible process allocations".


Table 1 shows all the possible allocations
and the corresponding system response times for
the example of Fig. 1. Note that while a subdag is
allocated to only one processor, a subtask may be
allocated to more than one processor if it belongs
to more than one subdag. Thus redundancy is
introduced whose amount can be significantly
affected by the choice of process allocation. In our
example, nodes 1, 4 and 6 are executed twice.
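

The enumeration of allocations and their response times can be sketched as below. The node weights of Fig. 1 are not reproduced in the text, so the sketch assumes unit execution times for every node (`WEIGHT` is hypothetical), and the resulting numbers are illustrative rather than the values of Table 1. The helper names are also assumptions.

```python
from itertools import combinations

SUBDAGS = {
    "p1": {8, 6, 4, 1},
    "p2": {9, 6, 4, 1},
    "p3": {10, 7, 6, 5, 4, 3, 2, 1},
}
WEIGHT = {n: 1 for n in range(1, 11)}   # hypothetical unit execution times

def processing_time(procs):
    """Time of one processor: each shared subtask is executed once per processor."""
    nodes = set().union(*(SUBDAGS[p] for p in procs))
    return sum(WEIGHT[n] for n in nodes)

def two_processor_allocations(names):
    """Yield every split into two non-empty groups once (identical processors)."""
    names = sorted(names)
    first, rest = names[0], names[1:]
    for r in range(len(rest)):            # number of processes joining `first`
        for group in combinations(rest, r):
            a = (first,) + group
            b = tuple(n for n in rest if n not in group)
            yield a, b

# Response time of an allocation = largest processing time of any processor
response = {
    (a, b): max(processing_time(a), processing_time(b))
    for a, b in two_processor_allocations(SUBDAGS)
}
best = min(response, key=response.get)
```

Fixing the first process on processor 1 is what removes the mirror-image duplicates, leaving three allocations instead of six. Under the unit-weight assumption the best of them places p1 and p2 together and p3 alone.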


An investigation [8] of the computational
complexity of the task allocation problem described
above has shown that it is NP-complete [9].


TABLE 1 


Independent Processes and System Response Times 
for a two processor system. 


Three independent processes derived from Fig. 1 
p1 has nodes 8, 6, 4 and 1
p2 has nodes 9, 6, 4 and 1 
p3 has nodes 10, 7, 6, 5, 4, 3, 2 and 1 
Three possible allocations on two processors 


TASK ALLOCATION 


Processor 1 


Processor 2 


Response Time 


ALGORITHM DESIGN AND ANALYSIS 


Three types of algorithms have been
developed for the solution of our task allocation
problem: an optimal Branch-and-Bound algorithm
and suboptimal algorithms based on the Greedy and
Local Search approaches. The notation used in the
presentation of algorithms is defined in Table 2.


Optimal Algorithm 


This algorithm is implemented using the
branch-and-bound method. It finds the optimal
solution for an instance but its run-time cannot be
bounded by a polynomial of the input size of the
instance. However, it finds the optimal solution in a
tolerable time for instances of small input size.


The solution space is represented by a tree 
and a feasible point is represented by a node in the 
tree. Here and in the following discussion, the term 
node applies to the tree of the solution space and not 
to a DAG. For each node in the tree, a child is 
generated for every possible allocation of the next 
subdag. Thus, the number of branches from a node 
is equal to the number of possible allocations of 
the next subdag. A node in the tree is called a 
solution node if all subdags have been allocated. 
Otherwise, it is called a branch node. 


The following procedure is used to "kill"
nodes (i.e. not to generate children of this node) and
thus reduce the computation time required for the
search. An initial upper bound UB of the optimal
solution is chosen arbitrarily. As the algorithm
progresses, it is updated to the finish time of a
solution node. A lower bound LB(X) of the optimal
solution is computed for every node X and
compared to the upper bound. If either one of the
following two conditions holds

(a) LB(X) > UB, or

(b) LB(X) = UB and a solution exists

then the node is killed. If a branch node is killed,
then all nodes derived from it will have lower
bounds that are greater than the UB. Thus, there is
no need to search for such nodes and the
computation time is greatly reduced.


The lower bound LB(X) of a node is computed
as follows for our problem. Let LBSPT(X) be the
lower bound of the sum of processing times from
node X after a task is completely allocated.
LBSPT(X) is equal to the sum of the current
processing times plus the sum of execution times
of unallocated subtasks. Then


$$LB(X) = \max \{ \lceil LBSPT(X)/M \rceil,\; PT_i,\; i = 1, 2, \ldots, M \}$$
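

The lower bound computation can be illustrated with a short sketch. It assumes that a partial allocation is given as one set of subtasks per processor and that "unallocated subtasks" means subtasks not yet placed on any processor; the function name `lower_bound` and the sample execution times are hypothetical.

```python
import math

def lower_bound(processor_sets, exec_time):
    """LB(X) for a partial allocation X.

    processor_sets: list of M sets, the subtasks already placed on each processor
    exec_time:      dict subtask -> execution time, covering all N subtasks
    """
    M = len(processor_sets)
    # current processing time PT_i of every processor
    pt = [sum(exec_time[s] for s in P) for P in processor_sets]
    allocated = set().union(*processor_sets)
    unallocated = sum(t for s, t in exec_time.items() if s not in allocated)
    lbspt = sum(pt) + unallocated          # LBSPT(X)
    return max(math.ceil(lbspt / M), max(pt))

# Hypothetical execution times for four subtasks
E = {1: 2, 2: 3, 3: 4, 4: 1}
lb_balanced = lower_bound([{1, 2}, {1}], E)      # the ceiling term dominates
lb_skewed = lower_bound([{1, 2, 3}, set()], E)   # the largest PT_i dominates
```

A node whose bound exceeds the current upper bound UB would then be killed, as described in the procedure above.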


TABLE 2


Notation


N      : Number of subtasks in a task
M      : Number of processors
N_g    : Number of independent subdags in a task
S_i    : The set of subtasks of subdag i
E_i    : Execution time of subtask i
T(S)   : Sum of execution times of subtasks in set S
P_i    : Set of subtasks allocated to processor i
PT_i   : Processing time for processor i, T(P_i)
FT     : Finish time of task, max PT_i, i=1,2,...,M


Greedy Algorithms 


When a task is completely allocated, the sum
of processing times SPT is given by


$$SPT = \sum_{i=1}^{M} PT_i = \sum_{j=1}^{N} E_j + TRT \qquad (1)$$


where TRT is the total repeated execution time
caused by redundancy, i.e. the allocation of some
subtasks to more than one processor. The
following inequality is derived for the finish time
FT


$$FT \ge \lceil SPT/M \rceil \qquad (2)$$


Redundancy is minimized when TRT is made
as small as possible. Also, better load balancing
among the processors is achieved when FT is close
to $\lceil SPT/M \rceil$. Note, however, that FT is not
necessarily minimized when both TRT and
(FT - $\lceil SPT/M \rceil$) are at their minimum.
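

Equations (1) and (2) can be checked numerically on the example of Fig. 1. The sketch below assumes unit execution times (the actual weights of Fig. 1 are not reproduced here) and the allocation that places p1 and p2 on one processor and p3 on the other; the function name is hypothetical.

```python
import math

def spt_and_trt(processor_sets, exec_time):
    """Return (PT list, SPT, TRT) for a complete allocation.
    TRT follows from equation (1): TRT = SPT - sum of all E_j."""
    pt = [sum(exec_time[s] for s in P) for P in processor_sets]
    spt = sum(pt)
    trt = spt - sum(exec_time.values())
    return pt, spt, trt

E = {n: 1 for n in range(1, 11)}             # hypothetical unit execution times
allocation = [
    {8, 6, 4, 1} | {9, 6, 4, 1},             # processor 1: p1 and p2
    {10, 7, 6, 5, 4, 3, 2, 1},               # processor 2: p3
]
pt, spt, trt = spt_and_trt(allocation, E)
ft = max(pt)                                 # finish time FT
bound = math.ceil(spt / len(allocation))     # right-hand side of (2)
```

Under these assumptions TRT equals 3, which corresponds to nodes 1, 4 and 6 each being executed on both processors.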


Three approximate algorithms (GR_1, GR_2
and GR_3) based on the greedy approach have been
developed [8]. The greedy algorithms proceed stage
by stage making the best decision at each stage
[10]. In our case, a stage is the allocation of a
subdag. GR_1 attempts to minimize the maximum
processing time at each stage by balancing the
computational load of processors. Algorithm GR_2
attempts both to balance the computational load of
processors and to reduce the repeated execution
time due to redundant execution of subtasks.
Algorithm GR_3 is a refinement of GR_2 [8].


The general version of the greedy approach 
for our task allocation problem is shown below. 


Greedy Approximate Algorithms


Input: N, N_g, S_i, i=1,2,...,N_g, E_j, j=1,2,...,N, M


Output: P_i, i=1,2,...,M, FT

Method:

(1) Call the function SORT(subdags) to sort the
subdags in order of nonincreasing execution
times.


(2) Remove the first subdag S_k from the sorted
sequence.


(3) j ← SELECT(P_i, i=1,2,...,M, S_k).
The function SELECT chooses a processor j for
the subdag S_k according to the rules discussed
in this section.


(4) Allocate subdag S_k to processor j and update
P_j to P_j + S_k.


(5) Repeat steps (2) through (4) until all subdags are
allocated.


Algorithms GR_1, GR_2 and GR_3 employ
different SELECT functions. Only the SELECT
function for algorithm GR_1 is presented below and
detailed descriptions of the corresponding SELECT
functions for GR_2 and GR_3 may be found in
[8].

SELECT_1:


Input: P_i, i = 1, 2, ..., M, S_k


Output: j

Method:


(1) Compute T(P_i + S_k) for all used processors and
one unused processor.


(2) Select processor j whose T(P_j + S_k) is the
minimum among those computed in step (1).


(3) Return.
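

A minimal sketch of the greedy scheme with SELECT_1 follows. It is an illustration rather than the implementation of [8]: it assumes that P_j + S_k is a set union (a subtask shared by two subdags on the same processor runs once), that ties are broken in favour of the lowest processor index, and that the execution times are the hypothetical unit weights used for illustration. All function names are assumptions.

```python
def total_time(subtasks, exec_time):
    """T(S): sum of execution times of the subtasks in S."""
    return sum(exec_time[s] for s in subtasks)

def select_1(processors, subdag, exec_time):
    """SELECT_1: processor index minimizing T(P_i + S_k) over all used
    processors and one unused processor."""
    candidates = [i for i, P in enumerate(processors) if P]
    unused = [i for i, P in enumerate(processors) if not P]
    if unused:
        candidates.append(unused[0])
    return min(candidates,
               key=lambda i: total_time(processors[i] | subdag, exec_time))

def greedy_gr1(subdags, exec_time, M):
    """Allocate subdags one at a time, largest execution time first."""
    processors = [set() for _ in range(M)]
    order = sorted(subdags, key=lambda S: total_time(S, exec_time), reverse=True)
    for S in order:
        j = select_1(processors, S, exec_time)
        processors[j] |= S                    # P_j <- P_j + S_k
    ft = max(total_time(P, exec_time) for P in processors)
    return processors, ft

# Subdags of the Fig. 1 example with hypothetical unit execution times
E = {n: 1 for n in range(1, 11)}
p1, p2, p3 = {8, 6, 4, 1}, {9, 6, 4, 1}, {10, 7, 6, 5, 4, 3, 2, 1}
allocation, finish = greedy_gr1([p1, p2, p3], E, 2)
```

Considering only one unused processor in step (1) avoids evaluating identical empty processors more than once.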


Local Search Algorithms 


Local search is based on a trial and error
optimization method [11]. Starting from some
feasible solution x, a neighborhood NBH(x) of x is
searched for a better solution. This neighborhood
NBH(x) consists of points that are "close" in some
sense to the point x. The general local search
algorithm is shown below.


LOCAL SEARCH ALGORITHM

Input: The starting solution t.

Output: The local optimal solution x.

Method:

(1) x ← t

(2) Repeat IMPROVE(x) until IMPROVE(x) is false.

(3) Output x since "no further improvement is
possible".


IMPROVE(x) searches for a better solution in
a neighborhood of x and is defined by:


IMPROVE(x) = TRUE and x ← y, if there exists
             a y in NBH(x) with c(y) < c(x);
             FALSE, if a better solution
             does not exist in NBH(x).


The function c(x) above is the cost function
for the optimization problem. If an improved
solution exists, we adopt it and repeat the
neighborhood search from the new solution. The
search stops when a local optimum is reached.
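

The general scheme above can be written as a small generic routine. The sketch assumes that the neighborhood and the cost function are supplied as callables (`neighborhood` and `cost` are hypothetical parameter names) and adopts the first improving neighbor it finds. The toy problem at the end is purely illustrative and unrelated to task allocation.

```python
def local_search(t, neighborhood, cost):
    """Generic local search: start from t and move to an improving
    neighbor until none exists in NBH(x)."""
    x = t
    improved = True
    while improved:                  # repeat IMPROVE(x) until it is false
        improved = False
        for y in neighborhood(x):
            if cost(y) < cost(x):    # a better solution exists in NBH(x)
                x = y                # x <- y
                improved = True
                break
    return x                         # local optimum

# Toy example: minimize (x - 3)^2 over the integers, neighbors are x-1 and x+1
result = local_search(10, lambda x: [x - 1, x + 1], lambda x: (x - 3) ** 2)
```

The routine returns a local optimum only; whether it is also globally optimal depends on the neighborhood chosen.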

This approach is now applied to our
particular task allocation problem whose cost
function is the finish time FT. IMPROVE(x) searches
a neighborhood of x for a solution yielding smaller
FT than that of x. Two schemes are used to
reduce the finish time of a task and they are similar
to those discussed in the previous section.


(1) Balance the computational load of processors by 
exchanging subdags among them. This yields 
processing times that are close to one another, 
thus reducing the finish time FT. 


(2) Reduce the repeated execution time of shared 
subtasks by combining subdags that have 
repeated subtasks. 


These exchange and combination schemes 
are applied to a pair of processors P1 and P2 at a 
time and a skeleton code of the functions that 
implement them follows. 


EXCHANGE1 (P1, P2, TEST): BOOLEAN


(1) Try exchanging one subdag of processor P1 with
one subdag of processor P2 or moving one
subdag from processor P1 to processor P2.


(2) If the resulting FT is less than TEST, then return
TRUE.


(3) If no further exchange is possible, then return
FALSE. Else go back to step (1).


EXCHANGE2 (P1, P2, TEST): BOOLEAN


(1) Exchange two subdags of processor P1 with one
subdag of processor P2 or one subdag of
processor P1 with two subdags of processor P2.


(2) If the resulting FT is less than TEST, then
return TRUE.


(3) If no further exchange is possible, then return
FALSE. Else go back to step (1).


COMBINE (P1, P2, TEST): BOOLEAN


(1) Find all subtasks shared by P1 and P2.


(2) If there are no shared subtasks, then return
FALSE.


(3) For a shared subtask, move all subdags
containing it from P1 to P2. All subdags of P2
that do not contain the shared subtask are
considered for allocation to either P1 or P2
according to a GR_2 scheme.


(4) If the resulting FT is less than TEST, then return
TRUE.


(5) If there are no more shared subtasks, then return
FALSE. Else go back to step (3).


These three functions form the basis for the
two local search algorithms developed. The first
algorithm LS_1 starts with the initial guess t and
attempts to find a better solution according to the
following procedure.


(1) Select the processor with the longest processing
time and name it P1.

(2) Find the processor with the shortest processing
time and name it P2.

(3) Try to exchange subdags between P1 and P2 in
order to obtain a shorter finish time FT.

(4) If this search (performed by EXCHANGE1) is not
successful, pick the processor with the next
shortest processing time, name it P2 and repeat
step (3) for P1 and the new P2. Repeat this
process by always selecting as P2 the processor
with the shortest processing time among those
that have not been tested yet.

(5) If such an exchange of subdags leads to a
shorter finish time, a new solution has been
found and an attempt to further improve this
solution is made by repeating steps (1) through
(5).

(6) If all processors have been tested without
improving the solution, a local optimum has
been reached.


At this point the algorithm repeats steps (1)
through (6) above in an attempt to improve the local
optimal solution by combining the subdags
containing shared subtasks. The search is now
performed by calls to the COMBINE subroutine
described above and is carried out if (a) this is the
first pass or (b) a previous search using
EXCHANGE1 was successful in improving the
solution. If the new search using COMBINE is
successful, the algorithm attempts still another
search using EXCHANGE1 and the entire procedure is
repeated until no further improvement of the
solution is possible.


The second local search algorithm LS_2
enlarges the neighborhood of a point searched for a
better solution by using the subroutine EXCHANGE2
instead of EXCHANGE1. LS_2 is likely to obtain a
better solution at the expense of longer run-times.
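

The exchange phase of LS_1 (steps (1) through (6), using EXCHANGE1) can be sketched as follows. This is a simplified illustration, not the authors' code: the COMBINE phase and EXCHANGE2 are omitted, an allocation is assumed to be a list holding the list of subdags on each processor, the first improving move or swap is accepted, and the unit execution times are hypothetical.

```python
def proc_time(subdag_list, exec_time):
    """Processing time of one processor: shared subtasks run once per processor."""
    nodes = set().union(*subdag_list)
    return sum(exec_time[n] for n in nodes)

def finish_time(alloc, exec_time):
    return max(proc_time(p, exec_time) for p in alloc)

def exchange1(alloc, a, b, test, exec_time):
    """Move one subdag from a to b, or swap one subdag of a with one of b.
    Applies the first candidate whose FT is below `test` and returns True."""
    A, B = alloc[a], alloc[b]
    candidates = []
    for i in range(len(A)):
        rest_a = A[:i] + A[i + 1:]
        candidates.append((rest_a, B + [A[i]]))                      # move
        for j in range(len(B)):
            candidates.append((rest_a + [B[j]],
                               B[:j] + B[j + 1:] + [A[i]]))          # swap
    for new_a, new_b in candidates:
        trial = list(alloc)
        trial[a], trial[b] = new_a, new_b
        if finish_time(trial, exec_time) < test:
            alloc[a], alloc[b] = new_a, new_b
            return True
    return False

def ls1_exchange_phase(alloc, exec_time):
    improved = True
    while improved:
        improved = False
        ft = finish_time(alloc, exec_time)
        times = [proc_time(p, exec_time) for p in alloc]
        longest = max(range(len(alloc)), key=lambda i: times[i])     # P1
        others = sorted((i for i in range(len(alloc)) if i != longest),
                        key=lambda i: times[i])                      # P2 candidates
        for shortest in others:
            if exchange1(alloc, longest, shortest, ft, exec_time):
                improved = True
                break
    return alloc

E = {n: 1 for n in range(1, 11)}              # hypothetical unit execution times
p1, p2, p3 = {8, 6, 4, 1}, {9, 6, 4, 1}, {10, 7, 6, 5, 4, 3, 2, 1}
start = [[p1, p2, p3], []]                    # deliberately poor starting solution
final = ls1_exchange_phase(start, E)
```

Each accepted exchange strictly lowers FT, so the loop terminates. Replacing `exchange1` with a routine that trades two subdags for one would give the larger neighborhood that LS_2 searches.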


COMPARISON OF ALGORITHMS 


The greedy and local search algorithms
presented in the previous section were evaluated
and compared. Problem instances (i.e. dags
representing tasks) were randomly generated [8].
For a given problem instance, the error ERR(A) of
algorithm A was defined as:


ERR(A) = [SOL(A) - EXACT] / EXACT


where SOL(A) is the solution of an instance by
algorithm A and EXACT is the exact solution
obtained by the branch and bound algorithm.
Computation times necessary for computing the
exact solution were tolerable for small instances.
Thus, average results obtained via the approximate
algorithms could be compared to the exact solution
for small randomly generated instances.
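

The error measure and its summary statistics can be computed as in the sketch below. The finish times in `runs` are made-up numbers for illustration only, not results from the tables, and the function names are hypothetical.

```python
def err(sol, exact):
    """ERR(A) = [SOL(A) - EXACT] / EXACT for one problem instance."""
    return (sol - exact) / exact

def summarize(pairs):
    """Average and maximum error over a list of (SOL(A), EXACT) pairs."""
    errors = [err(sol, exact) for sol, exact in pairs]
    return sum(errors) / len(errors), max(errors)

# Hypothetical (approximate, exact) finish times for three instances
runs = [(9, 8), (8, 8), (10, 8)]
average_error, maximum_error = summarize(runs)
```

Since the approximate finish time is never below the exact one, the error is non-negative and equals zero exactly when the approximate algorithm finds an optimal allocation.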


Greedy Algorithms 


Table 3 presents the average error, maximum
error and average computation time for the
algorithms GR_1, GR_2 and GR_3. The results
presented are for 50 randomly generated instances
of 10 subtasks and a varying number of processors (M
= 2, 3 and 4). Table 4 presents again the average
and maximum errors as well as the average
computation time for the greedy and the
branch-and-bound algorithms. A fixed number of processors (M
= 2) was used for these runs while the number of
subtasks varied (N = 10, 20 and 30).


The average and maximum errors decrease
as we go from the GR_1 to the GR_3 algorithms
while the computation time increases. This small
increase in computation time is expected because
the greedy algorithms become more complex as we
go from GR_1 to GR_3.


TABLE 3 

Results for Algorithms GR_1, GR_2 and GR_3 

Quantities: AVERAGE ERROR, MAXIMUM ERROR, 
AVERAGE COMPUTATION TIME (msec) 

Rows: Branch-and-Bound, GR_1, GR_2, GR_3 


TABLE 4 

Results for Algorithms GR_1, GR_2 and GR_3 

Quantities: AVERAGE ERROR, MAXIMUM ERROR, 
AVERAGE COMPUTATION TIME (msec) 

Rows: Branch-and-Bound, GR_1, GR_2, GR_3 


Local Search Algorithms 


Randomly generated starting solutions for the 
local search algorithms were tested first. In 
addition, solutions generated from the greedy 
algorithms were used to start the local search. 


Table 5 presents the results for one randomly 
generated instance and 20 randomly generated 
starting solutions. Two processors were considered 
and the number of subtasks varied (N = 10, 20 and 
30) for these experiments. 


The results show that when the optimal 
solutions were not obtained, the maximum errors 
were very small. As expected, LS_2 is more 
likely to provide a better solution, but it 
requires considerably more computation time. In 
some instances, LS_2 may run even more slowly 
than the branch and bound algorithm. 


TABLE 5 

Results for Algorithms LS_1 and LS_2 

LS_1 
Number of optimal solutions found 
Minimum error 
Maximum error 
Average run-time 

LS_2 
Number of optimal solutions found 
Minimum error 
Maximum error 
Average run-time 

Branch-and-Bound 
Optimal Solution 
Run-time 


Table 6 shows the results obtained with 
LS_1 and LS_2 for 50 randomly generated 
instances and one randomly generated starting 
solution. Two processors and N = 10, 20 and 30 
subtasks were considered. 


The starting solution was also generated by 
using one of the greedy approximate algorithms 
GR_1, GR_2 or GR_3. A local search algorithm, 
LS_1 or LS_2, was then applied to improve the 
starting solution. The average and maximum errors 
for the results obtained by these combinations are 
presented in Tables 7 through 9 together with the 
average computation times. At the same time the 
results for these combinations are compared to 
those obtained from the greedy algorithms alone. 
The presented results show that by applying local 
search to a solution obtained by the greedy 
approximate algorithms a significant reduction in the 
average error can be achieved. A fixed number of 
processors (M = 2) and a variable number of 
subtasks (N = 10, 20 and 30) were used for these 
runs. 


TABLE 6 

Results for Algorithms LS_1 and LS_2 

                      N = 10    N = 20    N = 30 

AVERAGE ERROR 
  LS_1                0.0016    0.0005    0.0021 
  LS_2                0.0002    0.0000    0.0012 

MAXIMUM ERROR 
  LS_1                0.0246    0.0036    0.0179 
  LS_2                0.0052    0.0000    0.0199 

AVERAGE RUN-TIME 
  LS_1                  14.6      47.0     153.4 
  LS_2                  43.4     363.6    1047.2 

MAXIMUM RUN-TIME 
  LS_1                  30.0      80.0     270.0 
  LS_2                 120.0     620.0    2700.0 


TABLE 7 

Average Error for GR_1, GR_2, GR_3 and 
for LS_1 and LS_2 with Starting Solution 
Obtained from GR_1, GR_2 and GR_3 

GR_1 
LS_1 (GR_1) 
LS_2 (GR_1) 

GR_2 
LS_1 (GR_2) 
LS_2 (GR_2) 

GR_3 
LS_1 (GR_3) 
LS_2 (GR_3) 


Table 9 shows that the time required for the 
branch and bound method has a large standard 
deviation. Finally, a comparison of Table 6 to 
Tables 7 through 9 shows that the choice of the 
starting solution does not greatly affect the results 
of the local search algorithms. The use of the 
solution obtained from a greedy algorithm as an 
initial guess for the local search does not lead to 
significant or even consistent improvement over the 
case of a randomly generated initial guess. 


TABLE 8 

Maximum Error for GR_1, GR_2, GR_3 and 
for LS_1 and LS_2 with Starting Solution 
Obtained from GR_1, GR_2 and GR_3 

GR_1 
LS_1 (GR_1) 
LS_2 (GR_1) 

GR_2 
LS_1 (GR_2) 
LS_2 (GR_2) 

GR_3 
LS_1 (GR_3) 
LS_2 (GR_3) 


TABLE 9 

Average Run-Time for GR_1, GR_2, GR_3 and 
for LS_1 and LS_2 with Starting Solution 
Obtained from GR_1, GR_2 and GR_3 

ALGORITHM 

GR_1 
LS_1 (GR_1) 
LS_2 (GR_1) 

GR_2 
LS_1 (GR_2) 
LS_2 (GR_2) 

GR_3 
LS_1 (GR_3) 
LS_2 (GR_3) 

Branch-and-Bound 
Avg. Run-Time 
Max. Run-Time 


CONCLUSIONS 


An optimal branch and bound algorithm was 
developed first for solving the task allocation 
problem. Its computation time in the worst case, 
however, increases faster than 


o ( MNs 7 mi) 
as shown in [8]. Approximate algorithms were then 
developed in order to obtain solutions close to the 
optimal one in polynomial time. The greedy 
algorithms require 


O(M N^2) 


computational time and can reach a solution with 
average error below 5%. GR_3 is the best among 
them. 


The local search algorithms can reach a 
solution with average error below 0.5% at the 
expense of longer computational times. Algorithm 
LS_2 needs on the average much more computation 
time than LS_1, sometimes even more than the 
branch and bound algorithm. Taking into account 
the computation time needed for LS_2, LS_1 is 
obviously a much better choice. 


ABSTRACT -- The main aim of this paper is to study 
allocation of processors to parallel programs executing on a 
multiprocessor system, and the resulting speedups. We first 
consider a general program represented as a sequence of 
steps consisting of parallel operations, then one represented 
as a task graph whose nodes are do across loops and whose 
edges represent precedence constraints, and finally a single 
do across loop. General bounds on program speedup are dis- 
cussed and measurements of code parallelism for the LIN- 
PACK numerical package are presented. 
1. INTRODUCTION 


Recently there has been an almost unanimous agree- 
ment in the computing community that the near future 
supercomputers will be based on the shared memory mul- 
tiprocessor architecture. But there are divided opinions when 
the question comes to the number of processors needed for an 
efficient and cost effective, yet very fast multiprocessor. A 
number of pessimistic and optimistic reports have come out 
on this topic. One side uses Amdahl’s law and "intuition" to 
argue against large systems [13], [1]. The other side cites 
simulations and real examples to support the belief that 
highly parallel systems with large numbers of processors are 
practical, and could be efficiently utilized to give substantial 
speedups [5], [8], [11], [3], [7]. This question will probably be 
answered in a definite way only by experience, after many 
multiprocessor supercomputers with varying numbers of pro- 
cessors will have been built. 


Presently the computing community has adopted a con- 
servative approach to this issue, and that is reflected in the 
first multiprocessors (e.g., CRAY-X-MP, CRAY-2, Alliant 
FX/8) that have appeared on the market; each of them 
comes with a small number of processors. Of course, this is 
due mostly to our inexperience in using efficiently a large 
number of processors. Only recently the applications com- 
munity has been systematically involved in research on 
parallel algorithms, languages and software. 


The investment in traditional (serial) software is so 
enormous that it will be many years before parallel software 
dominates the market. It is then natural to ask: ‘‘ How can 
we efficiently run existing software on multiprocessor sys- 
tems?’’ The answer to this question is well-known: by using 
powerful restructuring compilers. One such powerful restructurer 
is the Parafrase compiler developed over the last fifteen 
years at the University of Illinois ([10], [17]). Besides restruc- 
turing Fortran programs into a form suitable for execution 
on a range of high performance machines, Parafrase also sup- 
plies various statistics useful for performance analysis and 
system design decisions. Such statistics include program 
speedup, efficiency, number of useful processors, 
serial/parallel timing estimates, etc. 


In this paper we start with a parallel program that is 
the result of restructuring a serial program for execution on 
a multiprocessor. We discuss different types of parallelism 
that can be observed in such a program, the kinds of over- 
head involved in parallel execution on multiprocessor sys- 
tems, and the effect of task size and scheduling overhead on 
the speedup. We then address the problem of allocating 
available processors to different parts of the program and 
estimating the possible speedup. Program Task Graphs have 
been chosen to provide concrete representations of parallel 
programs. These are directed graphs where a node 
represents a do across loop and an edge represents a pre- 
cedence constraint. (A do across loop is a loop whose itera- 
tions may be partially overlapped [5].) 


We have restructured LINPACK using Parafrase and 
computed the fraction of parallel code for all subroutines. 
Our results contradict the original Amdahl conjecture that 
most programs have at least 10% serial code, and hence can 
achieve a maximum speedup of at most 10. The experiments 
strongly support the view that there is enough inherent 
parallelism in real programs so that large numbers of proces- 
sors can be efficiently utilized. 


In Section 2, we present the basic concepts and 
definitions that are used throughout the paper. Section 3 
discusses program restructuring, partitioning, critical task 
size, and gives a brief outline of the Parafrase compiler used 
for our measurements. Section 4 presents general bounds on 
program speedup and the measurements of parallelism in 
LINPACK subroutines. Section 5 presents three models of 
program execution on multiprocessor machines and a heuris- 
tic algorithm for allocating processors to parallel programs. 
In Section 6, we consider analytically our general parallel 
loop model and derive the speedup formula for multiproces- 
sors. Finally, Section 7 gives some conclusions. 


2. BASIC CONCEPTS 


A basic sequential machine is a single CPU computer 
that can carry out operations serially, taking one unit of 
time for each. A p-unit Multiple Execution Scalar (MES) 
machine is composed of p identical basic sequential 
machines, and each processor is driven by its own control 
unit. Because of its flexibility, we will consider only the MES 
machine in this paper. It is variously referred to as a mul- 
tiprocessor, a parallel machine with p-processors, or simply a 
p-processor machine. 


An assignment statement is a statement of the form 
$x = E$, where $x$ is a variable and $E$ an expression. A do 
across or DOACR loop [5] with delay $d$ has the form 

     L: DOACR I = 1,N 
           B 
        END 

where $d$, $N$ are integer constants, $I$ an integer variable with 
the range $\{1, 2, \ldots, N\}$, $B$ a sequence of assignment 
statements and DOACR loops, and it is understood that the 
iterations of $L$ can be partially overlapped as long as there is 
a delay of at least $d$ units of time from the start of iteration 
$i$ to the start of iteration $i + 1$ $(i = 1, 2, \ldots, N - 1)$. 
For $L$, the index variable is $I$, the number of iterations is $N$, 
and the loop-body is $B$. 


Consider now the two extreme cases of overlapping. If 
$d = 0$, there is complete overlapping, i.e., all the iterations 
of $L$ can be executed simultaneously. In this case the do 
across loop is called a do all loop, and we write DOALL in 
place of DOACR. If $d = b$, where $b$ is the execution time 
(assumed to be independent of $I$) of the loop-body $B$, then 
there is no overlapping, i.e., the iterations of $L$ must be 
executed serially, one after another. In this case the do across 
loop is a standard serial loop, and we write DOSER for 
DOACR. A BAS or a block of assignment statements is a 
special kind of serial loop, namely a loop with a single itera- 
tion. (A BAS may also be regarded as a special case of a do 
all loop.) 
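The effect of the delay on loop execution time can be sketched as follows. This is a minimal illustration under assumptions that go beyond the definition above: at least as many processors as iterations are available, there is no scheduling or synchronization overhead, and the body time (called `b` here) is the same for every iteration. The function name `doacr_time` is hypothetical.

```python
def doacr_time(n, b, d):
    """Earliest completion time of a DOACR loop with n iterations.

    b is the execution time of the loop body and d the delay between the
    starts of consecutive iterations (0 <= d <= b). Iteration i starts at
    time (i - 1) * d and finishes b time units later, so the last of the
    n iterations finishes at (n - 1) * d + b.
    """
    if n < 1:
        raise ValueError("need at least one iteration")
    if not 0 <= d <= b:
        raise ValueError("delay must satisfy 0 <= d <= b")
    return (n - 1) * d + b
```

With `d = 0` the result is `b` (the DOALL case), and with `d = b` it is `n * b` (the DOSER case), matching the two extremes described above.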


A program is a sequence of steps where each step consists 
of one or more operations that can be executed simultaneously. 
A program is serial if each step has exactly one 
operation; otherwise it is parallel. Two programs are 
semantically equivalent if they always generate the same 
output on the same input. Parallel programs are con- 
veniently represented in terms of do across loops (Section 3). 


Let $\mathrm{PROG}_1$, $\mathrm{PROG}_2$ be two equivalent programs and 
let their execution times on a p-processor machine be 
$T_p(\mathrm{PROG}_1)$ and $T_p(\mathrm{PROG}_2)$ respectively. Then the 
speedup obtained (on this machine) by executing $\mathrm{PROG}_2$ 
instead of $\mathrm{PROG}_1$ is denoted by $S_p(\mathrm{PROG}_1, \mathrm{PROG}_2)$ and is 
defined by 

\[ S_p(\mathrm{PROG}_1, \mathrm{PROG}_2) = \frac{T_p(\mathrm{PROG}_1)}{T_p(\mathrm{PROG}_2)}. \]

An immediate consequence of this definition is the following 
lemma. 

Lemma 2.1. If $\mathrm{PROG}_1, \mathrm{PROG}_2, \ldots, \mathrm{PROG}_n$ is a sequence 
of programs any two of which are equivalent, then 

\[ S_p(\mathrm{PROG}_1, \mathrm{PROG}_n) = \prod_{i=1}^{n-1} S_p(\mathrm{PROG}_i, \mathrm{PROG}_{i+1}). \]
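The lemma follows by telescoping. Written out, using only the definition of speedup as a ratio of execution times on the same p-processor machine, the step is:

```latex
\[
\prod_{i=1}^{n-1} S_p(\mathrm{PROG}_i, \mathrm{PROG}_{i+1})
  = \prod_{i=1}^{n-1} \frac{T_p(\mathrm{PROG}_i)}{T_p(\mathrm{PROG}_{i+1})}
  = \frac{T_p(\mathrm{PROG}_1)}{T_p(\mathrm{PROG}_n)}
  = S_p(\mathrm{PROG}_1, \mathrm{PROG}_n),
\]
% every intermediate execution time appears once in a numerator and once
% in a denominator, so only the first and the last survive.
```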


We usually write $S_p$ for the speedup when the two programs 
involved are understood. Of special interest to us is 
the case where $\mathrm{PROG}_1$ is a serial program and $\mathrm{PROG}_2$ is an 
equivalent parallel program obtained by restructuring 
$\mathrm{PROG}_1$. We assume that the execution time of a program is 
determined solely by the time taken to perform its operations, 
and that the total number of operations in a program 
is never affected by any restructuring. These assumptions 
are not very far from the truth; they help to keep the formulas 
simple, and yet let us derive important conclusions. If $T_1$ 
is the number of operations in the serial program $\mathrm{PROG}_1$, 
then $T_1$ is also the execution time of $\mathrm{PROG}_1$ on the basic 
sequential machine, or on any parallel machine (i.e., 
$T_1 = T_p(\mathrm{PROG}_1)$). The equivalent parallel program 
$\mathrm{PROG}_2$ also has $T_1$ operations, but now these operations are 
arranged in fewer than $T_1$ steps. We call $T_1$ the serial 
execution time of $\mathrm{PROG}_2$ and it can be obtained simply by 
counting the operations in $\mathrm{PROG}_2$. The execution time 
$T_p = T_p(\mathrm{PROG}_2)$ of $\mathrm{PROG}_2$ on the p-processor machine 
will depend on the structure of the program, the magnitude 
of $p$, and the way the $p$ processors are allocated to different 
parts of $\mathrm{PROG}_2$. To distinguish it from $T_1$, $T_p$ is referred to 
as the parallel execution time of $\mathrm{PROG}_2$. The speedup of a 
program is then the ratio of its serial execution time to its 
parallel execution time. 


For a given program, we have the unlimited processor 
case when p is large enough so that we can always allocate 
as many processors as we please. Otherwise, we have the 
limited processor case. These two cases will be often dis- 
cussed separately. 

3. RESTRUCTURING, PROGRAM PARTITIONING 
AND CRITICAL TASK SIZE 


In our program model we assume that parallelism is 
explicitly specified in the form of tasks (disjoint code seg- 
ments) which are parallel loops (DOALL or DOACR). This 
can be done for any Fortran program written in a serial form 
by employing restructuring compilers. For this paper we used 
Parafrase, a restructuring compiler developed at the Univer- 
sity of Illinois, to transform programs into parallel form and 
compute some experimental values presented in the following 
section. Although a detailed presentation of the capabilities 
of Parafrase is beyond the scope of this paper, we briefly 
describe its basic functions for the sake of completeness. 


Parafrase transforms serial Fortran programs into a 
parallel form suitable for efficient execution on a variety of 
parallel architectures including MES. The compiler consists 
of a front-end which applies a series of machine independent 
transformations, and a back-end which is a set of machine 
dependent optimizations. For each compiled Fortran pro- 
gram, Parafrase builds a graph called the data dependence 
graph whose nodes represent program statements and edges 
represent data and control dependencies among statements 
[2]. Each branch of an IF or GOTO statement is assigned a 
branching probability by the user, or automatically by 
Parafrase [11]. We can therefore view any program as a 
sequence of assignment statements, where each statement 
has an accumulated weight associated with it. All loops in a 
program are automatically normalized, i.e., loop indexes 
assume values in [1, N] for some integer N. As in the case of 
branching statements, unknown loop upper bounds are either 
defined by the user, or automatically by the compiler (using 
a default value). During parallel execution of the restruc- 
tured program, data and control dependencies must be 
observed to assure that program semantics is preserved. The 
data dependence graph of the program is used by most 
transformations as a guide for this reason. 


One of the most important transformations in 
Parafrase is the recognition and marking of DOACR loops 
(including DOALL and DOSER loops as special cases of 
DOACR). If we consider a block of assignment statements 
(BAS) as a serial loop with a single iteration, a restructured 
program can be viewed as a series of outermost DOACR 
loops with each such loop being arbitrarily complex. This 
defines a "natural" partition of a restructured program into a 
series of code segments or tasks. Dependencies may exist 
between any pair of segments in the program. We can thus 
define the Program Task Graph as a directed graph 
$G(V, E)$, where the nodes in $V$ are the outermost loops $L_i$ 
in the program, and there is an edge from a node $L_i$ to a 
node $L_j$ iff loop $L_j$ depends on loop $L_i$. Since backward 
dependencies are not allowed, $G(V, E)$ is acyclic. 


In a restructured program we may observe two types of 
parallelism: horizontal and vertical. Horizontal parallelism 
results by executing a DOACR loop on two or more proces- 
sors, or equivalently, by simultaneously executing different 
iterations of the same loop. Vertical parallelism in turn, is 
the result of the simultaneous execution of two or more 
different loops (tasks). Two or more loops can execute 
simultaneously only if there exist no control or data dependencies 
between any two of the loops. In the general case 
the program task graph exposes both types of parallelism. 
When we execute such a task graph on an MES machine, we 
must decide how to allocate the available processors to the 
program tasks so that the program speedup is maximized. 


A serious problem arises when, while executing a restructured 
program on an MES machine, we attempt to 
minimize the overheads of communication, synchronization 
and scheduling. This is a non-trivial optimization problem, 
and attempts to minimize such overheads usually result in 
reducing the degree of program parallelism. Most instances 
of this optimization problem have been proven to be NP- 
Complete [6]. Most heuristic algorithms attempt to minimize 
the communication cost by merging nodes of the graph 
together to avoid the overhead involved in communicating 
data from one processor to another. This however often 
reduces the degree of available vertical parallelism. 


As an example of node merging, consider two loops $L_1$ 
and $L_2$ in our restructured program model, with data 
dependencies going from $L_1$ to $L_2$. The dependencies restrict the 
two loops to execute in this order since data computed in $L_1$ 
are used by $L_2$. In this case only horizontal parallelism inside 
each loop can be exploited. If we do not coordinate the 
execution of $L_1$ and $L_2$, then data computed inside $L_1$ will have 
to be stored in a shared memory upon completion of $L_1$, and 
then fetched from that memory to the processors executing 
$L_2$. If on the other hand we consider the two loops as a single 
task, then we can bind iterations of $L_1$ and corresponding 
iterations of $L_2$ to specific processors. In this manner data 
computed by a particular iteration of $L_1$ and used by the 
corresponding iteration of $L_2$ need only be stored in fast 
registers of the processor, thus avoiding the overhead of the 
redundant store and fetch. For relatively small loops the 
savings by such "task merging" can be very significant. 


Task merging can also be used to decrease the scheduling 
overhead that is involved when we distribute different program 
nodes across different processors. This scheduling overhead 
is in addition to the synchronization overhead and may 
become disastrous, especially for very small tasks. For the 
CRAY X-MP, for example, the overhead involved with 
scheduling two parallel tasks can be several msecs. This 
overhead imposes a minimum size on parallel tasks, below 
which the speedup becomes rather a slowdown (i.e., 
$S_p < 1$). We call this the critical task size. 

If during the execution of a program we schedule a set 
of parallel tasks, the parallel execution time is augmented by 
$O_p$, where $O_p$ is the scheduling overhead. The maximum 
expected speedup therefore is given by 

\[ S_p = \frac{T_1}{T_1/p + O_p}. \]

In order to have a speedup of at least 1, we must have 
$T_1 \ge T_1/p + O_p$, i.e., $T_1 \ge p\,O_p/(p - 1)$, which gives 
the critical task size as a function of the overhead and the 
number of processors. More generally, the minimum program 
size $T_{min}$ required to obtain a given speedup $S$ on $p$ processors 
should satisfy: 

\[ \frac{T_{min}}{T_{min}/p + O_p} \ge S, \quad \text{i.e.,} \quad T_{min} \ge \frac{S\,p\,O_p}{p - S}. \]
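The two relations above are easy to evaluate numerically. The following Python sketch is only an illustration; the function names are hypothetical, and the task size and the overhead are assumed to be expressed in the same time unit.

```python
def max_speedup(t1, p, overhead):
    """Maximum expected speedup T_1 / (T_1 / p + O_p)."""
    return t1 / (t1 / p + overhead)


def min_task_size(s, p, overhead):
    """Smallest T_1 that can reach speedup s on p processors.

    Solves T / (T / p + O_p) = s for T, which gives s * p * O_p / (p - s).
    With s = 1 this is the critical task size p * O_p / (p - 1).
    """
    if not 0 < s < p:
        raise ValueError("the target speedup must satisfy 0 < s < p")
    return s * p * overhead / (p - s)
```

For instance, with 4 processors and an overhead of 3 time units, `min_task_size(1, 4, 3.0)` evaluates to 4.0, and a task of exactly that size gives `max_speedup(4.0, 4, 3.0) == 1.0`, i.e. no gain from parallel execution.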


Program partitioning for minimizing data communica- 
tion and scheduling overheads is a complicated optimization 
problem [15], [16]. 

4. GENERAL BOUNDS ON SPEEDUP 


In this section we consider an arbitrary parallel pro- 
gram, and think of it simply as a sequence of steps where 
each step consists of a set of operations that can execute in 
parallel. The total number of operations (and hence the 
serial execution time) is denoted by $T_1$. Let $p_0$ denote the 
maximum number of operations in any step. 


Suppose first we are using a p-processor system with 
$p \ge p_0$. (This is the unlimited processor case.) Let $\phi_i T_1$ 
denote the number of operations that belong to steps containing 
exactly $i$ operations $(i = 1, 2, \ldots, p_0)$. Then $\phi_i$ is 
the fraction of the program that can utilize exactly $i$ processors, 
and we have $\sum_{i=1}^{p_0} \phi_i = 1$. We call 

\[ f = \sum_{i=2}^{p_0} \phi_i \]

the parallel part of the program or the fraction of parallel code, 
and $1 - f = \phi_1$ the serial part of the program or the fraction 
of serial code. (At least $p - p_0$ processors will always 
remain unused.) 


Consider now a limited processor situation with a 
p-processor machine where $p < p_0$. The steps with more than 
$p$ operations have to be folded over and replaced with a 
larger number of steps with $p$ operations. (For simplicity, 
we are assuming that each new step has exactly $p$ operations, 
although one of them may actually have fewer than 
$p$.) Let $f_i = f_i(p)$ denote the fraction of the modified 
program that can utilize exactly $i$ processors 
$(i = 1, 2, \ldots, p)$. Then we have 

\[ f_i = \phi_i \quad (i = 1, 2, \ldots, p - 1), \quad \text{and} \quad f_p = \sum_{i=p}^{p_0} \phi_i. \]

As long as $p \ge 2$, the parallel part $f$ is given by $\sum_{i=2}^{p} f_i$ and 
the serial part $1 - f$ by $f_1$. 


An arbitrary p-processor machine is assumed in the fol- 
lowing. The first two results are well-known [3], [12]. 


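Theorem 4.1. 

\[ \frac{1}{S_p} = \sum_{i=1}^{p} \frac{f_i}{i}. \]

Proof: When executing on a p-processor machine, the part 
of the program that uses exactly $i$ processors consists of 
$f_i T_1$ operations $(i = 1, 2, \ldots, p)$. Hence, the number of steps where 
$i$ processors are active is $f_i T_1 / i$. The total number of steps 
is then given by 

\[ T_p = \sum_{i=1}^{p} \frac{f_i T_1}{i}. \]

Since $T_1$ is the serial and $T_p$ the parallel execution time of 
the program, we get 

\[ \frac{1}{S_p} = \frac{T_p}{T_1} = \sum_{i=1}^{p} \frac{f_i}{i}. \qquad \Box \]

Corollary 4.2. $1 \le S_p \le p$. 

Proof: We have $f_i \ge f_i / i \ge f_i / p$ $(i = 1, 2, \ldots, p)$. 
Hence 

\[ \sum_{i=1}^{p} f_i \ge \sum_{i=1}^{p} \frac{f_i}{i} \ge \sum_{i=1}^{p} \frac{f_i}{p}, \]

so that $1 \ge 1/S_p \ge 1/p$, 
i.e., $1 \le S_p \le p$. $\Box$ 

Corollary 4.3. $S_p \le \dfrac{1}{1 - f}$. 


Corollary 4.4. The speedup $S_p$, the number of processors 
$p$ and the fraction of parallel code $f$ satisfy (for $p > 1$) 

\[ S_p \le \frac{p}{f + (1 - f)p}, \qquad (4.1) \]

\[ f \ge \frac{S_p - 1}{S_p} \cdot \frac{p}{p - 1}, \qquad (4.2) \]

and 

\[ p \ge \frac{f S_p}{1 - (1 - f) S_p}. \qquad (4.3) \]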


Proof: These three inequalities are equivalent; from any one 
the other two can be derived easily. Note that 

\[ \sum_{i=1}^{p} \frac{f_i}{i} = f_1 + \sum_{i=2}^{p} \frac{f_i}{i} \ge f_1 + \sum_{i=2}^{p} \frac{f_i}{p} = 1 - f + \frac{f}{p}, \]

since $f = \sum_{i=2}^{p} f_i$ and $f_1 = 1 - f$. Then by Theorem 4.1, 

\[ \frac{1}{S_p} \ge 1 - f + \frac{f}{p}, \quad \text{so that} \quad S_p \le \frac{p}{f + (1 - f)p}. \qquad \Box \]


Sub. Name    f       Sub. Name    f       Sub. Name    f 

             0.9997               0.9862               0.9257 
             0.9988               0.9853               0.9189 
             0.9975               0.9807               0.9164 
             0.9974               0.9806               0.8561 
             0.9961               0.9773               0.8314 
             0.9961               0.9773 
             0.9961               0.9767 
             0.9954               0.9762 
             0.9950               0.9753 
             0.9905               0.9753 
             0.9900               0.9751 
                                  0.9746 
                                  0.9746 
                                  0.9745 
                                  0.9664 
                                  0.9629 
                                  0.9350 

Table 1. Values of f for LINPACK subroutines. 

SQRSL3 
SGEFA 


Now, we have a program that can use a maximum 
number of $p_0$ processors. If the fraction $\phi_1$ of serial code in 
it is very small, we can choose $p$ (> 1) processors such that 

\[ f_p = \sum_{i=p}^{p_0} \phi_i \approx 1. \]

Then, since 

\[ \frac{1}{S_p} = \sum_{i=1}^{p-1} \frac{f_i}{i} + \frac{f_p}{p}, \]

we get 

\[ \left| \frac{1}{S_p} - \frac{1}{p} \right| = \left| \sum_{i=1}^{p-1} \frac{f_i}{i} - \frac{1}{p} \sum_{i=1}^{p-1} f_i \right| \le \sum_{i=1}^{p-1} f_i \approx 0, \]

or equivalently, $S_p \approx p$. Thus, if for some 
$p > 1$, $\sum_{i=p}^{p_0} \phi_i \approx 1$, then the program runs very efficiently 
on a p-processor system giving an almost linear speedup. In 
this case, given the coefficients $\phi_i$ for the particular program, 
we can always determine the maximum number of processors 
that would get a linear speedup. 
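The folding step and Theorem 4.1 can be combined into a small calculation. The sketch below is an illustration only: the names `fold` and `speedup` are hypothetical, the list `phi` is assumed to hold the fractions for 1, 2, ..., p_0 processors in order, and the idealized model of this section (no overhead, every folded step completely full) is taken for granted.

```python
def fold(phi, p):
    """Fractions f_1..f_p of the folded program on p processors.

    phi[i - 1] is the fraction of operations in steps with exactly i
    operations. When p >= p0 = len(phi) nothing is folded and the extra
    processors are simply never used.
    """
    if p < 1:
        raise ValueError("need at least one processor")
    if p >= len(phi):
        return list(phi)
    return list(phi[:p - 1]) + [sum(phi[p - 1:])]


def speedup(phi, p):
    """Speedup from Theorem 4.1: 1 / S_p = sum over i of f_i / i."""
    f = fold(phi, p)
    return 1 / sum(fi / i for i, fi in enumerate(f, start=1))
```

As a usage example, a program with 10% serial code and 90% of its operations in 4-wide steps (`phi = [0.1, 0.0, 0.0, 0.9]`) reaches a speedup of 40/13 on four processors, which is exactly the bound p / (f + (1 - f)p) with f = 0.9, and gains nothing from a fifth processor.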


Based on Corollary 4.3, Amdahl and some other 
researchers thereafter questioned the usefulness of very large 
MES systems, since, according to their argument, the major- 
ity of programs have an average of more than 10% serial 
code and therefore their speedup on any MES machine is 
bounded above by 10. 


We conducted some experiments to measure the frac- 
tion of parallel code f in LINPACK, a widely used numeri- 
cal package for solving systems of linear equations. Knowing 
the serial execution time $T_1$, the parallel execution time $T_p$ 
and the number of processors p that were used during the 
execution of a subroutine, we can easily compute a lower 
bound for f from (4.2). All the above parameters are sup- 
plied by Parafrase. On the other hand, if the value of f for 
a particular subroutine is known and we want to achieve a 
specific speedup S for this subroutine, then (4.3) gives us a 
lower bound on the number of processors that we must use. 
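The two uses of Corollary 4.4 described above can be sketched in Python as follows. The function names are hypothetical, and the formulas are the rearrangements of the bound $1/S_p \ge 1 - f + f/p$; this is an illustration, not the procedure used inside Parafrase.

```python
def parallel_fraction_lower_bound(s_p, p):
    """Lower bound on f from a measured speedup s_p on p > 1 processors.

    Rearranges 1 / S_p >= 1 - f + f / p into
    f >= (S_p - 1) / S_p * p / (p - 1).
    """
    if p <= 1 or not 1 <= s_p <= p:
        raise ValueError("need p > 1 and 1 <= s_p <= p")
    return (s_p - 1) / s_p * p / (p - 1)


def processors_lower_bound(f, s):
    """Lower bound on p needed for speedup s with parallel fraction f.

    Rearranges the same inequality into p >= f * s / (1 - (1 - f) * s).
    Returns None when s is not below 1 / (1 - f), because no number of
    processors can then deliver that speedup.
    """
    denom = 1 - (1 - f) * s
    if denom <= 0:
        return None
    return f * s / denom
```

For example, a measured speedup of 40/13 on four processors implies f of at least 0.9, and with f = 0.9 a target speedup of 5 needs at least 9 processors, while a target of 10 or more is out of reach.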


The sorted (minimum) values of f are shown in Table 
1. The measurements were done on LINPACK subroutines 
after they had been restructured by Parafrase. From Table 
1 we observe that the majority of subroutines have a very 
high fraction of parallel code. For the first 37 subroutines 
(out of 49), the average fraction of parallel code was 
f > 0.9784. Almost 76% of the subroutines have 
f > 0.9 and only 18% have f < 058. 


Considering that LINPACK is a typical numerical 
package not very amenable to restructuring, the results of 
Table 1 are very encouraging. EISPACK for example 
(another numerical package), should be expected to have a 
much higher value of f than LINPACK [11]. Since most 
numerical packages are more amenable to restructuring than 
LINPACK, Amdahl’s pessimism should not be taken very 
seriously when designing large multiprocessor systems. His 
claim for the non-effectiveness of systems with large numbers 
of processors is mostly based on programs that exhibit 
f < 0.9. As mentioned in Section 3, the real performance 
threat for large MES systems lies in scheduling and interpro- 
cessor communication overheads. 


Secondly, we should consider all possible operating 
modes of a multiprocessor. There is no question that there 
exist numerical programs that could fully exploit hundreds or 
thousands of processors. For programs that utilize only a few 
processors, MES systems can be operated in a multiprogram- 
ming mode to keep system utilization high. The question 
then breaks down to whether we can have sites with enough 
users (workload) to keep system utilization at acceptable lev- 
els. The answer to this question is rather obvious and it is 
more of a managerial and planning issue. 


[Figure 1 panels: PROGRAM TASK GRAPH; LAYERED TASK GRAPH] 

Figure 1: An example of a program task graph 
and its corresponding layered graph. 


5. SPEEDUP AND PROCESSOR 
ALLOCATION FOR TASK GRAPHS 


We consider here an arbitrary parallel program 
represented by a task graph G = G(V, E). Recall from Sec- 
tion 3 that this graph is defined on a restructured program 
with nodes representing outermost DOACR loops, and edges 
representing data and control dependencies among loops. 
Let there be $n$ nodes in $V$: $v_1, v_2, \ldots, v_n$. These nodes can 
be partitioned into disjoint layers $V_1, V_2, \ldots, V_k$ such 
that (1) all nodes in a given layer can execute in parallel, and 
(2) the nodes in a layer $V_{i+1}$ can start executing as soon as 
all the nodes in layer $V_i$ have finished 
$(i = 1, 2, \ldots, k - 1)$. To construct this layered graph of 
$G$, we use a modified Breadth First Search scheme for labeling 
the nodes of the graph. Initially, the first node of the 
graph (corresponding to the lexically first loop of the program) 
is labeled 1 and queued in a FIFO queue $Q$. At each 
following step, the node $v_j$ at the front of $Q$ is removed, and 
if $i$ is its label, all nodes adjacent to $v_j$ are labeled $i + 1$ 
and are queued in $Q$. Note that a node may be relabeled 
several times but its final label is the largest assigned to it. 
When $Q$ becomes empty, the labeling process terminates and 
we get the layered graph by grouping all nodes with label $i$ 
into layer $V_i$. An example of a program task graph and its 
corresponding layered task graph is shown in Figure 1. We 
consider below three execution models of a layered task 
graph on a p-processor MES machine. The most general and 
the two extreme cases are discussed. 
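The labeling scheme described above can be sketched in Python. This is one possible reading of the description rather than the authors' implementation: the graph is assumed to be acyclic and given as a successor map, every node is assumed to be reachable from the first node, and a node is re-queued only when its label actually increases. The name `layer_task_graph` is hypothetical.

```python
from collections import deque


def layer_task_graph(succ, first):
    """Group the nodes of an acyclic task graph into layers.

    succ maps each node to the list of nodes that depend on it, and first
    is the node of the lexically first loop. Each node ends up with the
    largest label assigned to it, which is one plus the length of the
    longest path from first.
    """
    label = {first: 1}
    queue = deque([first])
    while queue:
        v = queue.popleft()
        i = label[v]
        for w in succ.get(v, ()):
            # Keep only the largest label seen so far for each node.
            if label.get(w, 0) < i + 1:
                label[w] = i + 1
                queue.append(w)
    layers = {}
    for v, i in label.items():
        layers.setdefault(i, []).append(v)
    return [sorted(layers[i]) for i in sorted(layers)]
```

For a diamond-shaped graph, `layer_task_graph({1: [2, 3], 2: [4], 3: [4]}, 1)` returns `[[1], [2, 3], [4]]`; adding an edge from node 2 to node 3 pushes node 3 and node 4 one layer down.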


As usual $T_1$ denotes the total number of operations in 
the whole program. For the node $v_j$, let $g_j T_1$ denote the 
number of operations in the node and $T_{pj}$ its parallel execution 
time $(j = 1, 2, \ldots, n)$. Then $\sum_{j=1}^{n} g_j = 1$. The absolute 
speedup $S^A_{pj}$ of $v_j$ is the speedup obtained by considering 
the node separately as a program and is given by 

\[ S^A_{pj} = g_j T_1 / T_{pj}. \]


Case 1. (Horizontal parallelism). Let $k = n$ and each 
layer $V_i$ consist of a single node $v_i$ $(i = 1, 2, \ldots, n)$. To 
get the maximum speedup for the whole program on $p$ processors, 
we need to get the maximum speedup for each node. 
Detailed formulas are given below. 


The relative speedup $S^R_{pi}$ of $v_i$ is the speedup of the 
whole program when only $v_i$ is executed in parallel and all 
other nodes are executed serially. Thus 

\[ S^R_{pi} = \frac{T_1}{T_1 - g_i T_1 + T_{pi}}. \]


Lemma 5.1. The absolute and relative speedups of a node 
$v_i$ are connected by the equation 

\[ \frac{g_i}{S^A_{pi}} = \frac{1}{S^R_{pi}} - 1 + g_i. \]

Proof: It follows directly from the above definitions. $\Box$ 


Theorem 5.2. The speedup $S_p$ of the whole program (when 
all nodes are executed in parallel) is related to the absolute 
and relative speedups of the individual nodes by the following 
equations: 

\[ \frac{1}{S_p} = \sum_{i=1}^{n} \frac{g_i}{S^A_{pi}} = \sum_{i=1}^{n} \frac{1}{S^R_{pi}} - n + 1. \]


Corollary 5.3. If all $n$ nodes give the same absolute 
speedup $S^A$, then $S_p = S^A$. If all $n$ nodes give the same 
relative speedup $S^R$, then 

\[ S^R \le \frac{np}{np - p + 1} < \frac{n}{n - 1}. \]


Proof: The first assertion follows immediately from the 
above theorem, since $\sum_{i=1}^{n} g_i = 1$. For the second, we see 
that when the relative speedups are all equal 

\[ \frac{1}{S_p} = \frac{n}{S^R} - n + 1. \]

Since $S_p \le p$, this implies 

\[ \frac{n}{S^R} - n + 1 \ge \frac{1}{p}, \quad \text{i.e.,} \quad S^R \le \frac{n}{n - 1 + \frac{1}{p}} = \frac{np}{np - p + 1} < \frac{n}{n - 1}. \qquad \Box \]

Each node of the graph can be an arbitrarily complex nested 
loop containing DOSER, DOALL, and DOACR loops. The 
problem of optimal processor allocation to such nodes has 
been solved optimally in [14]. 
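The relations between the whole-program speedup and the per-node absolute and relative speedups can be checked numerically. The sketch below assumes the Case 1 setting (the nodes run one after another, so the parallel execution time of the program is the sum of the node times); the function name and the sample numbers in the usage note are hypothetical.

```python
def program_speedups(g, t_par, t1):
    """Whole-program speedup computed three equivalent ways.

    g[i] is the fraction of the T_1 operations that belong to node i and
    t_par[i] is the parallel execution time of node i. Returns the speedup
    (a) directly as T_1 / sum of node times, (b) from the absolute node
    speedups, and (c) from the relative node speedups.
    """
    n = len(g)
    s_abs = [gi * t1 / tp for gi, tp in zip(g, t_par)]
    s_rel = [t1 / (t1 - gi * t1 + tp) for gi, tp in zip(g, t_par)]
    direct = t1 / sum(t_par)
    via_abs = 1 / sum(gi / sa for gi, sa in zip(g, s_abs))
    via_rel = 1 / (sum(1 / sr for sr in s_rel) - n + 1)
    return direct, via_abs, via_rel
```

With `g = [0.25, 0.75]`, `t_par = [5.0, 15.0]` and `t1 = 100.0`, both nodes have absolute speedup 5, and all three returned values equal 5, in line with the first assertion of Corollary 5.3.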


Case 2. (Vertical parallelism). Let the task graph be flat,
i.e. let there be a single layer V consisting of n nodes. This
is the case when no dependencies exist between any pair of
nodes. Here we may exploit vertical parallelism by executing
all program nodes simultaneously. We consider the extreme
case where each node requests exactly one processor. Since
each node is allocated one processor, if $n \le p$ the execution
time is dominated by the largest task. In the general
case bin-packing can be used to evenly distribute the n
nodes into p bins. The Multifit heuristic algorithm, for
example, can be used in this case [4]. Then, if $b_i$, $1 \le i \le p$,
denotes the largest bin, we have the following theorem.

Theorem 5.4. The total speedup resulting from the parallel
execution of an n-node flat graph on p processors, where
each node is allocated one processor, is given by

$$S_p = \frac{1}{\sum_{v_j \in b_i} g_j}.$$

Proof: The proof follows directly from the definition of
speedup and the assumptions stated above. □

Corollary 5.5. If $n \le p$ then

$$S_p = \frac{1}{\max(g_1, g_2, \ldots, g_n)}.$$

Proof: This follows from the previous theorem since each
bin contains one node. □
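A minimal sketch of the flat-graph case follows. It does not implement Multifit; as a stand-in it uses the simpler greedy longest-processing-time rule (largest node first, into the currently lightest bin), which is an assumption made here only to keep the example short. The speedup is then taken as the reciprocal of the largest bin load, with node sizes expressed as fractions $g_j$ of the serial time. All names and sample values are hypothetical.

```python
import heapq


def lpt_bins(fractions, p):
    """Greedy packing: place each node (largest first) into the lightest bin."""
    heap = [(0.0, b) for b in range(p)]  # (current load, bin index)
    heapq.heapify(heap)
    bins = [[] for _ in range(p)]
    for g in sorted(fractions, reverse=True):
        load, b = heapq.heappop(heap)
        bins[b].append(g)
        heapq.heappush(heap, (load + g, b))
    return bins


def flat_speedup(fractions, p):
    """Speedup of a flat graph: 1 / (load of the largest bin)."""
    bins = lpt_bins(fractions, p)
    return 1.0 / max(sum(b) for b in bins)


# Hypothetical node fractions g_j of the serial time (they sum to 1).
nodes = [0.4, 0.3, 0.2, 0.1]
print(flat_speedup(nodes, 2))  # two bins
print(flat_speedup(nodes, 4))  # n <= p: one node per bin
```

When the number of nodes does not exceed p, every bin holds at most one node and the result reduces to $1/\max_j g_j$.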


Case 3. (Horizontal and vertical parallelism). In the most
general case we have a program that exposes both types of
parallelism, horizontal and vertical. In other words, the task
graph consists of k (> 1) disjoint layers $V_1, V_2, \ldots, V_k$
with at least one layer containing two or more nodes (Figure
1). If $|V_i|$ is the cardinality of the i-th layer, we assume
that $|V_i| \le p$, $(i = 1, 2, \ldots, k)$. (In the case of
$|V_i| > p$ we fold and fuse nodes such that $|V_i| \le p$
[15].) Our aim in this case is to exploit horizontal and vertical
parallelism in the best possible way. Maximizing speedup
is equivalent to minimizing parallel execution time. For each
node $v_j$ of the task graph we define $r_j$ to be the maximum
number of processors that the node could use. When
$r_j = 1$, $(j = 1, 2, \ldots, n)$, our problem is reduced to the
classical multiprocessor scheduling problem, which has been
proved NP-Complete [6]. Our general problem can be
reduced to the latter one by decomposing each node $v_j$ into
$r_j$ independent sub-nodes of equal size. This trivially proves
that our problem is also NP-Complete. Heuristic solutions
are therefore the only acceptable approach to solving the
problem suboptimally in polynomial time.


Below we present a simple linear-time heuristic algorithm
for allocating processors to general task graphs. We
call this the Proportional Allocation heuristic since it allocates
to each node a number of processors which is proportional to
the size of the node (this term was borrowed from [16] where
a similar idea was independently proposed). The idea behind
proportional allocation is to allocate processors to the task
graph on a layer-by-layer basis, so that the load in each
layer is evenly distributed across the available processors,
resulting in a suboptimal execution time. Variations of the
proportional allocation heuristic and experimental results are
given elsewhere [15].


For each layer $V_i$, $(i = 1, 2, \ldots, k)$ of G we carry
out the following steps. (The notation $x \leftarrow a$ used below
indicates the assignment of an expression a to variable x.)

Step 1. Each node $v_j \in V_i$ is allocated one processor. If
$|V_i| = q_i$, then the number of remaining processors is
$p_R = p - q_i$. The tasks in $V_i$ are arranged in order of
decreasing size.

Step 2. The remaining $p_R$ processors are allocated to the
nodes of $V_i$ with $r_j > 1$ so that each node receives a
number of processors proportional to its size. For a node $v_j$
in $V_i$ with $r_j > 1$, the serial execution time is
$t_j = g_j T_1$. Let $\tau_i = \sum_{v_j \in V_i} (g_j T_1)$ denote the total
execution time of all nodes $v_j \in V_i$ with $r_j > 1$. Then, for all
such nodes perform:

$$p_j \leftarrow p_R \, \frac{t_j}{\tau_i} \qquad (5.1)$$
$$p_j \leftarrow \min(r_j - 1, \, p_j) \qquad (5.2)$$
$$p_R \leftarrow p_R - p_j \qquad (5.3)$$

where $p_j$ is the number of processors allocated to node $v_j$.
Steps (5.1), (5.2), and (5.3) are repeated until all processors
are allocated ($p_R = 0$), or all nodes in $V_i$ are processed. It
should be noted that if at the end $p_R > 0$, then
$p_j + 1 = r_j$, $(j = 1, 2, \ldots, q_i)$. A formal description
of the proportional allocation heuristic is given in Figure 5.
A simple example of the application of this algorithm to a 
single layer with DOALL loops, is shown in Figures 2 and 3. 
The number of processors allocated to each loop by our 
algorithm is shown in the table of Figure 2. Figure 3(a) 
shows the processor/time diagram when loops are executed 
one by one on an unlimited (in this case) number of proces- 
sors, with a total execution time of 22 units. Figure 3(b) 
shows the processor/time diagram for the allocation per- 
formed by our algorithm. Processors were allocated so that 
both horizontal and vertical parallelism are utilized; 16 units 
is the total execution time in this case. The total program 
speedup on p processors that results from the application of 
the above heuristic is given by the following theorem. 
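The two steps of the heuristic for a single layer can be sketched as follows. This is one possible reading rather than the authors' formal description (which is in their Figure 5): the floor rounding in step (5.1), and the choice to renormalise the total time over the nodes not yet processed so that the shares stay proportional as $p_R$ shrinks, are both assumptions made here. The function name, argument names, and sample data are hypothetical, and node times are assumed to be positive integers.

```python
def proportional_allocation(times, max_procs, p):
    """Allocate p processors to the nodes of one layer.

    times[j]     -- serial execution time t_j of node j (positive integer)
    max_procs[j] -- r_j, the most processors node j can use
    Returns the number of processors given to each node.
    """
    q = len(times)
    if q > p:
        raise ValueError("a layer may hold at most p nodes")
    alloc = [1] * q                # Step 1: one processor per node
    remaining = p - q              # p_R
    # Step 2: nodes with r_j > 1, in order of decreasing size.
    order = sorted((j for j in range(q) if max_procs[j] > 1),
                   key=lambda j: times[j], reverse=True)
    tau = sum(times[j] for j in order)  # time of nodes still to be served
    for j in order:
        if remaining == 0:
            break
        share = remaining * times[j] // tau    # (5.1), floor is an assumption
        extra = min(max_procs[j] - 1, share)   # (5.2)
        alloc[j] += extra
        remaining -= extra                     # (5.3)
        tau -= times[j]                        # assumption: renormalise
    return alloc


# Hypothetical layer of four loops on p = 14 processors.
print(proportional_allocation([40, 30, 20, 10], [16, 16, 16, 16], 14))
print(proportional_allocation([40, 30, 20, 10], [2, 1, 16, 16], 14))
```

In the second call the first node is capped by its $r_j$ and the second cannot use more than one processor, so the surplus flows to the remaining nodes.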


Theorem 5.6. The total program speedup that results from
the parallel (vertical and horizontal) execution of a k-layer
task graph on p processors is given by

$$S_p = \frac{1}{\sum_{i=1}^{k} \lambda_i}, \qquad \text{where} \quad \lambda_i = \max_{v_j \in V_i} \left\{ \frac{g_j}{S_j^A} \right\} \qquad (5.4)$$

(where $S_j^A$ is the absolute speedup of node $v_j$ for the number
of processors ($p_j$) allocated to it by the proportional allocation
heuristic.)

Figure 2. A simple program with DOALLs and
the processor allocation profile. (The program consists of five
DOALL loops with index ranges 1 to 7, 1 to 14, 1 to 5, 1 to 20,
and 1 to 24; a table lists the number of processors allocated
to each loop.)


Note that Theorems 5.2 and 5.4 are special cases of
Theorem 5.6. If the graph is reduced to a flat graph, then
k = 1 and Corollary 5.5 holds. On the other hand, if each
layer contains one node (linearized) then $q_i = |V_i| = 1$ in (5.4),
$(i = 1, 2, \ldots, k)$, and thus Theorem 5.2 holds true.


6. SPEEDUP AND PROCESSOR 
ALLOCATION FOR DOACR LOOPS 


In this section we focus on a single node of the task
flow graph representing the given program, i.e. a DOACR
loop. We extend and generalize the do across model to
allow for idle processor time caused by do across delays.
In [5] it is assumed that no processor may become idle unless
it completes all iterations assigned to it. This is true only in
certain special cases. As we shall see later by means of an
example, each processor may have to idle between successive
iterations. The following theorem generalizes the do across
model and accounts for idle processor time.

Theorem 6.1. Consider a DOACR loop with N iterations
and delay d, and let b denote the execution time of the
loop-body. Then p processors can be allocated to the iterations
of the loop in such a way that the speedup is given by
$S_p = Nb / T_p$, where

$$T_p = (\lceil N/p \rceil - 1) \max(b, pd) + d((N - 1) \bmod p) + b \qquad (6.1)$$


Figure 3. (a): The processor/time diagram for the program of Figure 2 (Case 1).
(b): The processor/time diagram after the application of the algorithm (Case 3).

Figure 4. An example of the application
of Theorem 6.1 for p = 3. (A DOACR loop with index I = 1, 8 and a
loop body of size 10, executed on processors Proc1, Proc2, and Proc3.)

Proof: Let us number the iterations 1, 2, ..., N in their
natural order and the processors 1, 2, ..., p in any order. The
processors are allocated to the iterations as follows. Assume
first that $N > p$. Iteration 1 goes to processor 1, iteration
2 goes to processor 2, ..., iteration p goes to processor p.
Then iteration (p + 1) goes to processor 1, etc., and this
scheme is repeated as many times as necessary until all the
iterations are taken care of. We can now think of the iterations
arranged in a $\lceil N/p \rceil \times p$ matrix, where the columns
represent processors. If $N \le p$, we employ the same
scheme, but now we end up with a $1 \times N$ matrix instead.
Let $t_k$ denote the starting time of iteration
k, $(k = 1, 2, \ldots, N)$, and assume that $t_1 = 0$. (We measure
time from the moment iteration 1 starts executing.) Let us find an
expression for $t_N$. First, assume that $N > p$. For any
iteration j in the first row, the time $t_j$ is easily found:

$$t_j = (j - 1)d \qquad (j = 1, 2, \ldots, p)$$

Now iteration (p + 1) must wait until its processor (i.e.,
processor 1) has finished executing iteration 1, and d units of
time have elapsed since $t_p$, the starting time of iteration p
on processor p. Hence

$$t_{p+1} = \max(b, \, t_p + d) = \max(b, pd).$$

The process is now clear. If we move right horizontally (in
the matrix of iterations), each step amounts to a time delay
of d units. And if we move down vertically each step adds a
delay of $\max(b, pd)$ units. Thus the time $t_k$ for an iteration
that lies on row i and column j will be given by

$$t_k = (i - 1) \max(b, pd) + (j - 1)d.$$

For the last iteration, we have $i = \lceil N/p \rceil$ and

$$j = \begin{cases} N \bmod p & \text{if } N \bmod p > 0 \\ p & \text{otherwise} \end{cases}$$

Since $j - 1$ can be written as $(N - 1) \bmod p$, we get

$$t_N = (\lceil N/p \rceil - 1) \max(b, pd) + d((N - 1) \bmod p).$$

Now let $N \le p$. It is easily seen that

$$t_N = (N - 1)d = (\lceil N/p \rceil - 1) \max(b, pd) + d((N - 1) \bmod p).$$

Finally, since the parallel execution time $T_p$ is given by
$T_p = t_N + b$, the proof of the theorem is complete. □
Figure 4 shows an example of the application of Theorem
6.1. The DOACR loop of Figure 4 has 8 iterations, a delay
d = 4, and a loop-body size of 10. The total parallel execution
time on p = 3 processors is 38 units, as predicted
by Theorem 6.1. We can maximize the speedup of an arbitrarily
nested DOACR loop which executes on p processors
by using an optimal processor allocation algorithm described
in [14].
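The example can be reproduced with a short Python sketch. It evaluates the closed form $T_p = (\lceil N/p \rceil - 1)\max(b, pd) + d((N-1) \bmod p) + b$ and, as a cross-check, computes the start times directly from the two constraints used in the proof: an iteration starts no earlier than d units after the previous iteration started, and no earlier than the time its own processor finishes its previous iteration. The function names are illustrative and not part of the paper.

```python
def doacr_time(N, p, d, b):
    """Closed-form parallel time of a DOACR loop (formula 6.1)."""
    rows = -(-N // p)  # ceil(N / p) using integer arithmetic
    return (rows - 1) * max(b, p * d) + d * ((N - 1) % p) + b


def doacr_simulate(N, p, d, b):
    """Parallel time obtained by computing every start time explicitly.

    Iterations are dealt to processors cyclically; iteration k (0-based)
    runs on the same processor as iteration k - p.
    """
    start = [0] * N
    for k in range(1, N):
        ready = start[k - 1] + d                      # do across delay
        if k >= p:
            ready = max(ready, start[k - p] + b)      # processor must be free
        start[k] = ready
    return start[-1] + b


# Example from the text: N = 8 iterations, p = 3, d = 4, b = 10.
t = doacr_time(8, 3, 4, 10)
print(t, doacr_simulate(8, 3, 4, 10))  # both give 38
print(8 * 10 / t)                      # speedup S_p = N*b / T_p
```

The direct computation agrees with the closed form both when $pd \ge b$ (the delay dominates) and when $b > pd$ (processors never idle between iterations).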


Corollary 6.2. Consider a sequence of m perfectly nested
DOALL loops numbered 1, 2, ..., m from the outermost loop
to the innermost. Let $S_i$ denote the speedup of the construct
on a $p_i$-processor machine when only the ith loop executes
in parallel and all other loops serially,
$(i = 1, 2, \ldots, m)$. Then the speedup $S_p$ on a p-processor
machine, where $p = \prod_{i=1}^{m} p_i$ and $p_i$ processors are allocated
to the ith loop, is given by $S_p$


7. CONCLUSIONS 


We have studied the problem of allocating processors to
parallel programs executing on a multiprocessor system, and
measuring the speedups. Our experiments with the software
package LINPACK show that there is a high degree of
parallelism available in ordinary programs, and that after
proper restructuring such programs would be able to utilize
fairly large multiprocessor systems. As a convenient unit of
parallel program we have taken a DOACR loop, which generalizes
serial and DOALL loops. The speedup formula for
DOACR loops has been generalized, and several other results
have been derived that increase our understanding of how
the speedup of a program depends on various factors. Since the
general problem of processor allocation to programs
represented as task graphs of DOACR loops is NP-Complete,
we have given heuristic methods that run in linear time and
seem to work well in practical situations.
SOFTWARE BASED MUTUAL EXCLUSION IN A MULTIPROCESSOR


Bernard T. 
Abstract -- This paper is directed to the
solution of a mutual exclusion problem involving a 
multiprocessor system. The multiprocessor 
contains six asynchronous microprocessors 
operating in a tightly coupled network. The 
system software has a master/slave organization. 
One processor is system executive or master while 
the remaining processors function as slaves and 
receive their tasks from the master. The task, or 
process, of master was previously performed on a 
full time basis by only one of the processors in 
the system. In the interests of efficiency and 
fault tolerance, the task of master can now float 
from processor to processor whenever needed. This 
paper presents a fast, software based method that 
allows only one processor at a time to assume the 
master task. 


Introduction 


A multiprocessor system called the MAPP
(Modular Adaptive Parallel Processor) was built
by Goodyear Aerospace for AFWAL at
Wright Patterson AFB. The MAPP contains six
asynchronous microprocessors operating in a 
tightly coupled network. The system software has 
a master/slave organization. One processor is 
system executive or master while the remaining 
processors function as slaves and receive their 
tasks from the master. The task, or process, of 
master was previously performed on a full time 
basis by only one of the processors in the system. 
It was desired, in the interests of efficiency and 
fault tolerance, that the task of master be 
allowed to float from processor to processor 
whenever needed. 

The primary problem associated with floating 
the master task was how to prevent two or more 
processors from acquiring the master task 
simultaneously. In accomplishing this, the 
following goals also had to be met: 

1.) No system hardware modifications were allowed, 

2.) The selected method had to be compatible with 
the current system software, 

3.) The selected method had to be fault tolerant,

4.) The selected method had to be fast,

5.) Memory requirements had to be kept to a
minimum,

6.) Future expansion of the system had to be 
allowed. 


Overview of the MAPP System 


The MAPP is a homogeneous multiprocessor that 
has been primarily used for real-time signal 
correlation. The MAPP system architecture is 
shown in Figure 1. Each of the six system 
processors contains a general purpose 32-bit CPU. 
Each processor is capable of executing between 2 
and 3 MIPS. Associated with each processor is a 
private local memory. This memory can be used for
algorithm storage and data storage. The
processors function asynchronously, running on
independent internal clocks.

Figure 1 - MAPP System Architecture


Contained within the system are three global 
memories. The processors can each access these 
general purpose memories through a crossbar type 
port structure. Processor contention for access 
to global memory could cause a speed disparity 
among the processors if accesses are serviced in a 
completely random fashion. For the MAPP, speed 
disparity will not occur due to the discipline 
imposed by the global memory controller. The 
global memory controller prevents one processor 
from monopolizing global memory at any time. In 
order to access global memory, each processor must 
interface with the global memory controller. This 
controller contains a priority encoder that takes 
a "snapshot" of all the processors' requests for
access to global memory. If a processor fails to 
have its request honored in one snapshot it is 
guaranteed to have its request serviced in the 
next. There is no way for a processor to 
circumvent the action of the global memory 
controller. 

At least 500 times a second, the MAPP system 
causes each processor to perform a self test. If 
a processor fails its own self test or if a 
processor detects a failure in another processor, 
a system-wide interactive test is performed. If a 
faulty processor is identified and confirmed 
during the system-wide test, it will be disabled
automatically and the system will continue 
operation without it. This fault checking 
software, in addition to looking for actual 


hardware errors in each processor, requires that
each processor respond within a certain period of
time. This timed response checks a processor's
speed of operation and causes a processor to be
flagged as faulty if it is operating below the
speed of the other processors in the system. This
self testing ensures that each processor is
operating at nearly the same speed.


The Mutual Exclusion Method 


The MAPP has several attributes that are 
useful when trying to float the master task. 
These attributes are both hardware and software 
based. First, each processor in the system is 
identical and runs at the same clock speed. The 
clocks are asynchronous, however. Second, the 
system periodically runs fault checking software 
that insures that each processor is operating 
normally. Third, in the MAPP system, the global 
memory controller insures that a processor cannot 
be arbitrarily delayed because of competition for 
memory accesses. 

The devised mutual exclusion algorithm called 
Software Based Mutual Exclusion (SBME) uses a 
technique that can be categorized under the 
heading of busy waiting. Busy waiting is a 
general operation that requires a processor to 
loop several times while checking to see if 
another processor has acquired the master task. 
The SBME method uses a single location in global 
memory called MasterID as an indicator flag for 
the master task. If the master task is being 
performed by a processor, that processor’s ID 
number will be present at location MasterID. If 
no one is currently master, MasterID will contain 
a value of zero. The SBME algorithm is shown in
Figure 2 for the i-th processor of the system.

To acquire the master task, a processor reads 
location MasterID in global memory. If it is 
zero, the processor writes its ID number to 
MasterID immediately. Allowance is made for
the fact that another processor contending for the
master task can write over the ID number in
MasterID. The devised method requires that a 
processor loop and read MasterID several times 
until the danger of its ID being overwritten has 
passed. If the processor discovers a value in 
location MasterID different from its ID number, it 
drops out of contention. Any of the other 
processors that read MasterID after the first ID 
is written will avoid writing to it. 

After a processor finishes the master task it 
clears MasterID in global memory. This allows 
another processor to assume the master task as 
soon as required. If a processor is ever stopped 
while operating outside the master task another 
processor would not be prevented from reserving 
the master task for itself. 

The number of MasterID reads required for 
master task acquisition was determined, based on 
the number of contending processors. The variable 
m, in the SBME algorithm, is used to define the 
amount of time that a processor must wait before 
checking MasterID to see if it is master. The 
variable n determines the number of times that a 
processor must check MasterID before declaring 
itself as master. 
processor i:

loop
      counter = 0
      read MasterID                ; Uninterruptible
      if MasterID > 0 exit         ; Section
      write ID to MasterID         ; of Code

      loop
            wait m cycles              ; See if anyone
            read MasterID              ; is trying to
            if MasterID .ne. ID exit   ; be Master
            increment counter
      until counter = n;

out:  perform master task
      write 0 to MasterID          ; Allow another
                                   ; to be Master

exit: do something else

endloop


Figure 2 - SBME Algorithm
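The acquisition logic can be exercised with a small deterministic simulation in Python. This is only a sketch under strong assumptions: each contender is modeled as a generator that yields before every memory access or wait cycle, and a round-robin scheduler gives every contender one step per round, standing in for processors of equal speed behind a fair memory controller. It does not model real timing, interrupts, or the relationship between m and n. The names `contender` and `run_round_robin` and the default parameter values are hypothetical.

```python
def contender(pid, memory, n_checks, m_wait, masters):
    """One processor trying to acquire the master task."""
    yield                                  # wait for a memory slot
    if memory["MasterID"] != 0:
        return                             # someone already holds or claims it
    yield
    memory["MasterID"] = pid               # claim the master task
    for _ in range(n_checks):
        for _ in range(m_wait):
            yield                          # wait m cycles
        yield
        if memory["MasterID"] != pid:
            return                         # overwritten: drop out
    masters.append(pid)                    # survived n checks: now master


def run_round_robin(num_procs, n_checks=3, m_wait=1):
    """Step all contenders in lockstep until each one has finished."""
    memory = {"MasterID": 0}
    masters = []
    procs = [contender(pid, memory, n_checks, m_wait, masters)
             for pid in range(1, num_procs + 1)]
    while procs:
        still_running = []
        for proc in procs:
            try:
                next(proc)
                still_running.append(proc)
            except StopIteration:
                pass
        procs = still_running
    return masters, memory["MasterID"]


masters, final_id = run_round_robin(6)
print(masters, final_id)  # exactly one winner, whose ID remains in MasterID
```

In this lockstep model every contender sees MasterID equal to zero, all of them write, and the last writer's ID survives; the others detect the mismatch on their first check and drop out.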


A complex relationship exists between m and n 
to insure mutual exclusion. The relationship 
involves the memory speed of the global memory 
being accessed and the maximum number of 
processors that are represented in a global memory 
controller snapshot. A complete explanation of 
this relationship is beyond the scope of this 
short paper. Some discrete relationships between 
m and n are shown in Figure 3. The curves are 
shown for various delays of m and also for the 
total number of processors in the system. As m or 
the amount of delay between reads increases, the 
total number of reads, n, required to insure 
mutual exclusion decreases. 

A trade off occurs when deciding upon the 
number of checks vs. delay periods that should be 
performed. The greater the number of checks, i.e. 
the more frequent the checking, the sooner a 
processor discovers that it is not master. 
However, more frequent checking also decreases the 
available bandwidth of global memory. Longer 
delays between checks would reduce memory 
bandwidth requirements but would tie up a 
processor longer when it was unsuccessful in 
acquiring the master task. 


Conclusion 


Classical and contemporary mutual exclusion
methods were reviewed as alternatives to the
adopted method. Most were found to be unsuitable
in that they imposed additional hardware
requirements or were incompatible with the existing
memory controller. One contending method,
Dijkstra's multiprocessor version of Dekker's
method, was suitable, but required longer
acquisition time for present and expanded versions
of the system.

The SBME method requires approximately 8.5
microseconds from the time a processor initially
checks to see if anyone is master until the time
it acquires the master task itself. As an
alternative, the method patterned after Dijkstra
took almost 49 microseconds. The primary reason
for the time difference is that Dijkstra's method
in the best case required 15 accesses to global
memory via read/write commands. The worst case for
Dijkstra's method could almost double the number
of memory accesses required. The SBME method in
the worst case requires only 6 read/write
instructions. This difference is significant on
the MAPP system, where read/write commands to
global memory take twice as much time to execute
as any other instruction.

Figure 3 - Worst Case Operation (7 or more
Processors). (Three panels, for m = 0.4, 0.75, and 1.2
microseconds, each plotting the number of delay intervals
of m, from 0 to 3, against the number of checks n, from 1 to 4.)

The time required to acquire the master task 
in the MAPP system can be carefully controlled 
because of several factors: 

1) The processors execute instructions at similar
rates,
2) Each processor executes the same SBME code,
3) Instruction execution time is frequently
tested,
4) The critical section of the SBME code, i.e.,
      read MasterID
      if MasterID > 0 return
      write ID to MasterID
is kept to a minimum and cannot be interrupted
because all interrupts are disabled here.


Factors 3 and 4 as listed above require
additional attention. The periodic system tests
serve two purposes: they ensure that the
processors' operating speeds are within limits, and
that if a processor fails as master it can be
disabled and a new processor can become master.
When executing the mutual exclusion algorithm, a
processor must not be interrupted between when it
checks if someone is master and when it writes to
MasterID. If a delay occurs here, such as might
result from a processor servicing an interrupt,
one processor could assume the master task when it
was not really available. An easy solution to
this problem is to temporarily disable the
interrupts while the critical section of code is
being executed. If done properly, any interrupts
that might have occurred during this time will
still be pending and can then be serviced.

The SBME method always produces a "winner" 
whenever two or more processors vie for control of 
the master task. A processor's value may be
overwritten by another processor during the first part
of the resolving routine. However, a valid ID 
number always remains in MasterID during 
conflicts. This ID belongs to the processor that 
ultimately acquires the task. 
Abstract


In this paper we present a parallel execution scheme for
exploiting AND-parallelism of logic programs. This scheme follows
the generator-consumer approach of the AND/OR process
model. Using problem-specific knowledge, literals of a clause are
linearly ordered at compile time. This ordering and run-time binding
conditions are then used to dynamically select generators and
consumers for different variables at run time. The scheme can
exploit all the AND-parallelism available in the framework of the
generator-consumer approach. Compared with other execution
schemes based on the same approach, our scheme is simpler, and
potentially more efficient.
1. Introduction 


Exploiting AND-parallelism in the execution of logic programs
is difficult when literals of a conjunct share variables. One
possible solution is to evaluate every subgoal of a conjunctive goal
in parallel, and then perform a join operation on the solutions of all
the subgoals. This approach would uncover all the AND-parallelism
in a logic program. But, for most practical problems,
this approach requires excessive computation and is suitable only
if unlimited resources are available. Therefore, various approaches
for exploiting AND-parallelism in a more limited form have been
proposed [2] [5] [7] [10] [11]. In the generator-consumer approach
introduced by Conery and Kibler in the context of the AND/OR process
model [4] [5], if more than one literal in a clause body share a
variable V, then one of these literals is designated as the generator
of V and the remaining literals are designated as the consumers of
V. The execution of any consumer of V is not started until the execution
of the generator of V has finished. Conery and Kibler use
an ordering algorithm to identify the generator of each uninstantiated
variable. Each time a non-ground binding is generated, the
ordering algorithm is applied again to identify a new set of generators.
Note that the run-time overhead of setting up the AND-parallelism
in their approach could be very high, as the ordering
algorithm may have to be repeatedly executed while interpreting a
logic program. In order to avoid the problem of excess run-time
overhead, several schemes have been presented [1] [6]. PLM of the
Aquarius project at UC Berkeley, proposed by Chang et al. [1],
makes use of static data-dependency analysis. The execution
graphs and backtracking paths are all derived at compile time,
based on the worst case activation of each predicate. The restricted
AND-parallelism (RAP), proposed by DeGroot [6], creates one
execution graph expression for each clause at compile time. It then
dynamically generates parallel execution sequences based on these
execution graph expressions. However, since PLM makes use of
compile-time worst case analysis and DeGroot's RAP utilizes limited
execution graph expressions, these two schemes, unlike
Conery and Kibler's scheme, cannot always exploit AND-parallelism
to the fullest extent allowable in the generator-consumer
approach. This paper presents an execution scheme that
achieves the same degree of AND-parallelism as is exploited in
Conery and Kibler's execution model, but with less run-time overhead.


This work was supported by Army Research Office grant #DAAG29-84-K-0060
to the Artificial Intelligence Laboratory at the University of Texas at Austin.
2. Execution Scheme


In this execution scheme, we are dealing with pure Horn
clause logic programs. On top of that, we allow variable annotations.
The user may also suggest a partial order on the execution of
literals in a clause. Unlike Concurrent Prolog [11], the variable
annotations for shared variables are not absolutely necessary; i.e.,
it is possible that a variable is shared by many literals in a conjunct
and none of its occurrences is annotated. The suggested partial ordering
is also for execution efficiency only; the ordering does
not affect the semantics of the logic program.


Similar to the AND/OR process model proposed by Conery 
and Kibler, our scheme also includes three algorithms: ordering, 
forward execution, and backward execution. In Conery and 
Kibler’s approach, the ordering algorithm creates the data- 
dependency graph for a clause when the head of the clause unifies 
with a goal literal. The graph is used to select generators and con- 
sumers for different variables until some literal in the body pro- 
duces a non-ground binding for a shared variable, at which time a 
new data-dependency graph has to be generated. In our scheme, 
the ordering algorithm is run at compile time. Using problem- 
specific knowledge (such as variable annotations, programmer- 
suggested ordering, etc.), it merely specifies a linear ordering for 
literals of each clause [9]. Our forward execution algorithm uses 
this linear ordering and run-time binding conditions to choose gen- 
erators for variables shared by more than one literal, and the data- 
dependency graph is generated implicitly as the clause is executed. 
A simple but intelligent backward execution algorithm has also 
been developed for deciding which literals to re-solve in case some 
literal fails [8] [9]. Due to limited space, we focus only on the for- 
ward execution algorithm in this paper. 


2.1. Forward Execution Algorithm 


Suppose that the literals in the body of a clause have been
ordered, and $P_i$ represents the i-th literal in the linear sequence. In
addition, $P_0$ stands for the head literal. During the execution, there
exists a token for each variable in the binding environment. Conceptually,
if a literal holds the token of an uninstantiated variable
V, it simply means that the literal is currently designated as the
generator of V.

When a clause is called and the clause has a non-null body, an
AND process, corresponding to the head literal $P_0$, is created.
Tokens for different variables in the clause (including free variables
in the clause body) are given to $P_0$.¹ After $P_0$ has successfully
unified the head literal with the given goal, it creates several
descendant OR processes, one for each literal in the body, and then
executes Procedure $S_0$ to pass tokens to literals in the body. A
literal $P_i$ becomes executable when it receives tokens for all of its
variables in the current binding environment. If $P_i$ shares variables
with any literal appearing later in the linear sequence, then after its
execution, $P_i$ also invokes Procedure $S_0$ to pass tokens.


1 For brevity, we will often use "$P_i$" to mean "the process corresponding to
$P_i$".


Procedure $S_0$

$P_i$ does one of the following for each variable V for which it has
produced binding T.

i) Variable V is bound to a ground term T.
V is substituted by T in all the literals. No token has to be passed.

ii) Variable V is bound to a non-ground term T and T is dependent
(i.e., there is another variable V1 in $P_i$ which is also bound to
a non-ground term T1 where T and T1 share variables²).
V and V1 are substituted by T and T1 in all the literals. For each
new variable in T which is not shared, a new token is created and
passed to $P_m$, where $m = \min\{k \mid P_k \text{ had } V \text{ and } k > i\}$. Similarly, for
each new variable in T1 which is not shared, a new token is created
and passed to $P_n$ where $n = \min\{k \mid P_k \text{ had } V1 \text{ and } k > i\}$. Tokens
for new variables shared by T and T1 are created and passed to $P_j$
where $j = \min\{k \mid P_k \text{ had } V \text{ or } V1 \text{ and } k > i\}$.

iii) Variable V is bound to a non-ground term T and T is independent
(i.e., any variable in T is not shared by any other non-ground
bindings produced for other variables of $P_i$).
V is substituted by T in all the literals. The token for V is then
replaced by several new tokens, one for each new variable in T.
These tokens are passed to $P_j$ where $j = \min\{k \mid P_k \text{ had } V \text{ and } k > i\}$.

Note that bindings of different variables can become dependent only if
these variables are in the same literal. Therefore, the dependence checking
for bindings can be done by each literal independently.


Depending on various binding conditions, different parallel
execution sequences can be explored by this algorithm. An example
is shown in Figure 1, where an arc with label X means the
token of variable X has been sent along the direction of the arc. Furthermore,
a literal which has a dotted variable V means that the
literal holds the token of V. Given the same linear sequence of the
literals (Figure 1a), this example shows two possible sequences of
parallel execution. In the first sequence, the clause is called with
bindings X/X, Y/Y, Z/Z. The token for X is passed to literal p2, the token
for Y is passed to p1, and the token for Z is passed to p3. The binding
of a free variable (such as W in this example) is always non-ground
and independent in the beginning, therefore the token for W is passed
to p1. At this time both literals p1 and p3 have received tokens for
all of their variables in the current binding environment, and can be
executed in parallel (Figure 1b). Suppose that p1 generates bindings
$y_0$/Y, $w_0$/W, and p3 generates binding $z_0$/Z. Since all new
bindings are ground, no token has to be passed around. Both
literals p4 and p5 are executable, as there are no variables in their
current binding environments. Literal p2 is also executable, since
it has received the token for X from p0 previously. Therefore p2, p4,
and p5 can all be solved in parallel (Figure 1c). The data-dependency
graph corresponding to the first execution sequence is
shown in Figure 1d.


2 For simplicity, we are only considering the case in which two variables are
bound to dependent terms. Extension to the case in which more than two
variables are bound to dependent terms is obvious.


In the second sequence, the clause is called with bindings
V/X, V/Y, and z0/Z. Since the bindings of variables X and Y are
dependent (they share the same variable V), the tokens for X and Y
are replaced by a single token for V, and the new token is passed to
p1 (the first literal in the linear sequence which had either X or Y).
The token for free variable W is passed to p1 again. This time
literals p1 and p3 can be executed in parallel (Figure 1e). After the
execution of p1, bindings S/W and v0/V are generated, and p2 and p5
can start executing in parallel (Figure 1f). After p2 generates
binding s0/S, p4 becomes executable (Figure 1g). The
data-dependency graph corresponding to the second execution sequence
is shown in Figure 1h.
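The two execution sequences can be reproduced with a small round-based simulation. This is a hedged sketch rather than the paper's algorithm as implemented: it assumes that every clause variable is represented by the set of unbound variables in its current term, that all ready literals run together in one round, and that the bindings each literal produces are supplied as data. The names `run_clause`, `terms` and `outputs` are hypothetical.

```python
def run_clause(body, terms, outputs):
    """Simulate token passing and return the rounds of parallel execution.

    body:    list of (literal name, [clause variables]) in linear order
    terms:   clause variable -> set of unbound variables in its binding
             (an empty set means the binding is ground)
    outputs: literal name -> {unbound variable: [new unbound variables]}
             (an empty list means the produced binding is ground)
    """
    terms = {v: set(t) for v, t in terms.items()}  # private copy

    def unbound(k):
        # Unbound variables currently visible to literal k.
        result = set()
        for v in body[k][1]:
            result |= terms[v]
        return result

    def first_holder(u, start):
        # First literal at or after `start` whose variables contain u.
        for k in range(start, len(body)):
            if u in unbound(k):
                return k
        return None

    holder = {}  # one token per unbound variable
    for t in terms.values():
        for u in t:
            if u not in holder:
                holder[u] = first_holder(u, 0)

    done, rounds = set(), []
    while len(done) < len(body):
        # A literal is executable when it holds the tokens of all its
        # unbound variables.
        ready = [k for k in range(len(body))
                 if k not in done and all(holder[u] == k for u in unbound(k))]
        if not ready:
            break
        rounds.append([body[k][0] for k in ready])
        for k in ready:
            for u, new_vars in outputs.get(body[k][0], {}).items():
                for v in terms:  # substitute the binding in all literals
                    if u in terms[v]:
                        terms[v] = (terms[v] - {u}) | set(new_vars)
                for w in new_vars:  # pass new tokens to a later literal
                    holder[w] = first_holder(w, k + 1)
            done.add(k)
    return rounds


body = [("p1", ["Y", "W"]), ("p2", ["X", "W"]), ("p3", ["Z"]),
        ("p4", ["W", "Z"]), ("p5", ["Y"])]

# First sequence: X, Y, Z unbound; p1 and p3 produce ground bindings.
first = run_clause(body,
                   {"X": {"X"}, "Y": {"Y"}, "Z": {"Z"}, "W": {"W"}},
                   {"p1": {"Y": [], "W": []}, "p3": {"Z": []}})

# Second sequence: V/X, V/Y, z0/Z; p1 binds V to ground and W to S.
second = run_clause(body,
                    {"X": {"V"}, "Y": {"V"}, "Z": set(), "W": {"W"}},
                    {"p1": {"V": [], "W": ["S"]}, "p2": {"S": []}})
```

Under these assumptions `first` groups the literals as p1 and p3, then p2, p4 and p5, and `second` groups them as p1 and p3, then p2 and p5, then p4, matching the steps described for Figures 1b to 1c and 1e to 1g.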


3. Comparison With Other Approaches 


3.1. Compile-time algorithm 


At compile time, our scheme only applies the ordering algo- 
rithm once. Given problem-specific information represented as 
partial orders between literals, it is straightforward to apply the 
ordering algorithm to come up with a linear sequence of literals for 
each clause. Note that good linear sequences of literals are 
beneficial to other schemes as well. Conceptually, we are not 
doing extra work. 


In PLM, proposed by Chang et al. [1], the bindings of variables
in a clause are classified into three types, where NGI means
the binding is non-ground and independent, NGD means the binding
is non-ground and dependent, and G means the binding is
ground. Among these three types, NGD is worse than NGI, and
NGI is worse than G. Programmers are asked to declare the initial
activation mode of each predicate, where the activation mode of a
predicate P is a possible binding condition of variables in P when
P is activated. Using an iterative procedure, the compiler then
derives the worst case activation of each predicate.³ The
data-dependency graph of each clause is then generated based on the
worst case activation of its head literal. In order to provide a
reasonable initial activation mode for each predicate, programmers
must be very much aware of what kinds of binding conditions can
happen at run time. Moreover, the iterative procedure of doing
data-dependency analysis tends to be time-consuming.
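The ordering of the three binding types, and the idea of a worst case activation as the per-argument worst of several possible activation modes, can be illustrated with a few lines of Python. This shows only the combination step, not PLM's iterative derivation; the names `RANK` and `worst_case` are assumptions made for this sketch.

```python
# G is the best binding type and NGD the worst.
RANK = {"G": 0, "NGI": 1, "NGD": 2}


def worst_case(modes):
    """Combine activation modes argument by argument, keeping the worst type."""
    return tuple(max(column, key=RANK.__getitem__) for column in zip(*modes))


# Two hypothetical activation modes of a three-argument predicate.
combined = worst_case([("G", "NGI", "G"), ("NGD", "G", "G")])
```

Here `combined` is `("NGD", "NGI", "G")`: each argument takes the worst type that appears in any of the given modes.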


Degroot’s approach [6] is even more complicated. The compiler
is responsible for translating each clause into an execution
graph expression. However, in order to achieve all the
AND-parallelism allowable in the generator-consumer approach, the
execution graph expressions would have to be very complex and may
have substantial run-time overhead due to redundant checks.
On the other hand, using simpler expressions may reduce the
degree of AND-parallelism. Constructing a compiler which can
produce reasonable execution graph expressions appears to be a
nontrivial task.


3.2. Forward execution algorithm 


Unlike other approaches, our scheme does not try to determine whether
two literals are independent and can be solved in parallel. Instead, our
scheme just lets each literal decide whether it is executable. The
parallelism comes out automatically when more than one
literal is ready to be executed.


The same clause

p0(X,Y,Z) :- p1(Y,W), p2(X,W), p3(Z), p4(W,Z), p5(Y).

used as an example for the forward execution algorithm will be
used again in the following to show advantages of our forward
execution algorithm. We assume that the worst activation mode of p0
is


p0(NGD,NGD,NGD)


and the derived activation modes and exit modes⁴ (based on the
worst activation mode of p0) of p1 to p5 are


activation mode    exit mode
p1(NGI,NGI)        p1(G,NGI)
p2(NGI,NGI)        p2(NGI,G)
p3(NGI)            p3(G)
p4(G,G)            p4(G,G)
p5(G)              p5(G)


3 The worst case activation, as defined in [1], refers to the binding condition
where each variable has the worst type of its possible bindings.

4 The derived exit mode of a predicate P is the binding condition of variables
in P after the execution of P, based on a given activation mode. The derived
activation mode of P is the activation mode derived from the activation mode of the
head literal and the derived exit modes of previously solved literals.

Note that if the activation mode of p3 is p3(G), its exit mode is 
also p3(G). The following discussion is based on our assumption 
of these activation modes and exit modes. 


Suppose that at run time, when p0 is called, the bindings of
variables X, Y, and Z are V/X, V/Y, and z0/Z. Our forward execution
algorithm will achieve the dynamic data-dependency graph shown
in Figure 1h.


PLM of Chang et al. uses only one static data-dependency
graph, which is generated at compile time based on the worst case
activation. In our scheme, we take run-time binding conditions
into account, and hence our scheme is able to exploit parallelism in
cases where PLM would fail to do so. Figure 2 shows the static
data-dependency graph used by PLM. Comparing it with Figure 1h,
we can see that when the activation mode of p0 is "better" than the
worst case, PLM fails to exploit possible parallelism.


Degroot’s approach can explore different data-dependency
graphs dynamically. However, due to the predetermined execution
sequence of substatements, it may not be able to start processing a
subgoal as soon as the subgoal is ready to be executed. Suppose
the following execution graph expression is produced by the
compiler:


(GPAR(Y,Z)
   (SEQ p1(Y,W)
      (IF IPAR(Z,X)
         (GPAR(W) p2(X,W) p4(W,Z))
         (SEQ p2(X,W) p4(W,Z))))
   (IPAR(Z,Y) p3(Z) p5(Y)))


If at run time p0 is activated with the bindings V/X, V/Y and z0/Z,
then based on our assumption of the activation mode and exit mode of
each literal, the data-dependency graph of Figure 3 will be
achieved by Degroot’s approach. As we can see, the execution of
p3(Z) is postponed even though it is executable from the very beginning.
The execution of p5(Y) is also postponed due to the predetermined
execution sequence of substatements. One of the drawbacks of
these situations is that if the executable literals happen to fail,
Degroot’s approach will work longer in the failure branch of the
solution tree.⁵ Note that sometimes type checking could be redundant
in Degroot’s approach. For example, if GPAR(Y,Z) is true,
then IPAR(Z,Y) is surely true. However, since IPAR(Z,Y) has to
be there to cover the case when GPAR(Y,Z) is false, the redundant
type checking is unavoidable. Overall it appears that the run-time
overhead in Degroot’s approach is not much less than the overhead
in our approach, even though Degroot’s approach exploits only
a limited amount of AND-parallelism.
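The relation between the two run-time checks can be made concrete with a short Python sketch. It assumes that each argument is represented by the set of unbound variables in its current binding, that GPAR tests whether all arguments are ground, and that IPAR tests whether the arguments share no unbound variable; the function names `gpar` and `ipar` are simply lower-case stand-ins chosen here, not an existing API.

```python
from itertools import combinations


def gpar(*terms):
    """True when every term is ground (contains no unbound variable)."""
    return all(not t for t in terms)


def ipar(*terms):
    """True when no two terms share an unbound variable."""
    return all(a.isdisjoint(b) for a, b in combinations(terms, 2))


# Run-time bindings V/X, V/Y, z0/Z from the example.
X, Y, Z = {"V"}, {"V"}, set()

ground_yz = gpar(Y, Z)        # Y is bound to the unbound variable V
independent_zy = ipar(Z, Y)   # Z is ground, so it shares nothing with Y
independent_xy = ipar(X, Y)   # X and Y share V
```

For these bindings GPAR(Y,Z) fails while IPAR(Z,Y) succeeds, which is the case the second check exists to cover. Whenever all terms are ground the sets are empty and therefore disjoint, so a successful GPAR test makes the corresponding IPAR test redundant.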


It is important to point out that, for any given clause, it is
possible to create execution graph expressions in Degroot’s
approach which would be able to explore all the AND-parallelism
allowable in the generator-consumer approach. Unfortunately,
these expressions are too complicated, and would involve too
many redundant run-time checks.


5 The approach of Chang et al. also has the same drawback.


Our scheme, like that of Conery and Kibler, can exploit
AND-parallelism to the fullest
extent allowable in the generator-consumer approach, as independent
literals can always be executed in parallel. Moreover, if all
the bindings produced by the generators are ground, then both
schemes would achieve the same data-dependency graph for each
clause. When generators can produce non-ground bindings, their
scheme must execute the ordering algorithm every time a non-ground
binding is generated, hence the data-dependency graphs
achieved by the two schemes could be different. Although repeated
execution of the ordering algorithm may reorder the literals, it does
not guarantee that a better data-dependency graph would be
achieved. As we can see in Figure 4, during the execution of this
particular example, the ordering algorithm must be executed three
times (after p0 is unified with the goal and after the execution of
both p1 and p2) by their scheme, but the resulting data-dependency
graph is never better. When the heuristic rules used by their
scheme are fallible, the new graph could be even worse. Furthermore,
since the token passing scheme is simple, the run-time overhead
of our algorithm in general is less than the overhead of
Conery and Kibler’s AND-parallel execution scheme.


In some cases, dynamically changing the linear sequence of
literals could lead to better performance. However, when the
choice of alternative linear sequences depends only on the
activation mode of the head literal, techniques such as the control
alternatives used in IC-Prolog [3] would allow our scheme to use a
different optimal sequence for each possible activation. The only case in
which dynamic reordering of literals can be helpful is when the
linear sequence must be reordered with respect to the bindings
produced by a generator in the clause body.


Our scheme also has the potential for distributed implementation,
as the token passing scheme can be carried out by the
corresponding process of each literal independently. In Conery
and Kibler’s approach, the AND process must be responsible for
coordinating the execution of conjunctive goals, and hence could
become a communication bottleneck.


4. Concluding Remarks 


We have presented a scheme for achieving limited
AND-parallelism in logic programs. Compared with Conery and Kibler’s
approach, our scheme has less run-time overhead, and still
achieves the same degree of AND-parallelism. Chang et al. [1]
and Degroot [6] have also proposed schemes which reduce the
run-time overhead present in Conery and Kibler’s scheme.
However, these schemes tend to limit AND-parallelism, and
require more complicated compile-time analysis than our scheme.
An additional feature of our execution scheme is that the forward
(and backward) execution algorithms can be implemented in a
distributed manner. This is significant, as the centralized version of
these algorithms can cause communication bottlenecks. We are in
the process of designing message structures and communication
schemes for the distributed version of these algorithms.


ABSTRACT


In this paper, a parallel execution model based on the
Dependency Relationship Graph (DRG) is proposed.
Unlike other approaches, the execution ordering
of AND subgoals is determined by the role of the
shared variable. The DRG is an execution ordering graph
of AND subgoals and guarantees near maximal AND
parallelism. The necessary information for constructing the
DRG is generated automatically at compile time and used
at run time. By this automatic control technique, the logic
is completely separated from the control, and the original
flexibility and power of logic programs are maintained
without burdening the programmer with control.


1. Introduction 


The logic language is a concurrent language in which
parallelism is embedded naturally. This parallelism
originates from nondeterminism, which is the most
powerful characteristic of the logic language. However, it
cannot be fully exploited under the typical von Neumann
architecture because of the complicated control mechanism
needed for the nondeterministic behaviour of the logic
program. Such restriction makes the current logic languages
less powerful and less flexible than the original language
based on Horn clause logic.


There are various approaches to the parallel processing
of logic programs which support the natural parallelism
efficiently. In particular, AND parallelism and OR parallelism have
been the key issues. Compared with OR parallelism,
AND parallelism requires more sophisticated control.


To make AND parallelism easier to achieve, the mode concept
and the variable annotation method have been widely
used [1], [2], [4], [12]. Since these methods require the
programmer to control program execution, the logic
language loses one of its original purposes. In other
words, the logic is incompletely separated from the control.

When there is no control information from the programmer,
heuristic algorithms have been adopted
[1], [2], [4], [12]. These heuristics, however, may decrease
the AND parallelism by the eager ordering of AND
subgoals. The problems of the heuristic algorithms will be
described in section 2.

In summary, these conventional methods have the following
disadvantages.

i) The programmer’s burden of control when mode or
variable annotations are used.

ii) Inflexibility of the program caused by the unidirectional use of
an argument as input or output.

iii) Loss of completeness.

iv) Reduction of the potential AND parallelism by the
eager ordering of AND subgoals when heuristics are
used.

2. Some considerations of AND parallelism 


AND parallelism can be considered as a synchronization
strategy for the shared variables. Without
control information from the programmer, the conventional
method of achieving AND parallelism is based on
heuristic algorithms, such as the connection rule [4].
Careful consideration of the role of the shared
variable can extract more AND parallelism than the
heuristic search algorithms. For convenience, we will use
the C-Prolog syntax except for the cut symbol.

2.1 Problems and solutions of AND parallelism


Given a clause such as

f(X) :- p(X),q(X).

there are two problems when p(X) and
q(X) are executed in parallel. One is the binding conflict,
in which the two subgoals attempt to instantiate (bind) the shared
variable X to two different values [5]. The other problem arises when
one subgoal is deterministically dependent on the other
subgoal, like a coroutine, so that an execution ordering of the AND
subgoals is required [4], [11].


First, we discuss solutions to the binding
conflict and propose a solution called Lazy Ordering. To
solve the binding conflict, the join technique can be
used. If a program has a deterministic search space, the
join technique exhibits the greatest parallelism as
well as higher efficiency [5]. Although it assures
maximal parallelism, this technique results in an explosion
of the search space and is known as an inefficient solution
[4]. That is, the nondeterministic character of
logic programs makes the join technique impractical.
Many heuristic search algorithms have been used to
avoid the explosion of the search space and to solve the
binding conflict. In logic programs, a search heuristic
is a selection criterion for AND subgoals (literals).

If a search algorithm can expand nodes that generate
fewer descendants, then it might save itself needless work
by traversing fewer unsuccessful branches [14]. This
heuristic reduces the search space and solves the binding
conflict. However, if it is adopted as the selection strategy
in every execution of AND subgoals, it pays off only
where parallel execution of the subgoals would cause a search
explosion; elsewhere it decreases the AND parallelism. If
we can find such explosion points and apply the heuristic
only at these points, the AND parallelism is increased
and the effort spent on the heuristic algorithms is
reduced. That is, the use of the heuristic algorithm in
every execution step of AND subgoals is an eager ordering
which reduces the possibility of AND parallelism.


As another search heuristic, selection by the number of
uninstantiated variables has been used. This heuristic assumes
that AND subgoals with fewer uninstantiated variables
will generate fewer branches. Based on this heuristic,
Conery suggests the connection rule, which is the main
ordering algorithm for AND subgoals [4]. Although the
connection rule can find some AND parallelism, it can be
regarded as a kind of eager ordering, and some potential
AND parallelism remains unexploited.
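The selection heuristic itself is easy to state in code. The sketch below orders subgoals by their number of uninstantiated variables; it illustrates only that heuristic and is not Conery's connection rule. The function name `order_by_uninstantiated` and the sample subgoals (borrowed from the paper's Fig. 2.1 discussion, with X instantiated) are assumptions made for illustration.

```python
def order_by_uninstantiated(subgoals, instantiated):
    """Order subgoals so that those with fewer uninstantiated variables come first.

    subgoals:     list of (name, [argument variables])
    instantiated: set of variables that already have a value
    """
    def count(subgoal):
        _, args = subgoal
        return len(set(args) - instantiated)

    # sorted() is stable, so ties keep their textual (left-to-right) order.
    return [name for name, _ in sorted(subgoals, key=count)]


body = [("s", ["X", "U"]), ("t", ["U", "Z"]), ("q", ["Z", "Y"]), ("r", ["X", "Z"])]
order = order_by_uninstantiated(body, {"X"})
```

With X instantiated, s and r each have one uninstantiated variable and are selected before t and q. Applying such an ordering at every step is what the text calls eager ordering.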


The reduced AND parallelism of the two heuristics
originates from applying these techniques at every
execution step (eager ordering). To extract more AND
parallelism, we should turn our attention to finding the
points of search explosion. For the example program in Fig.
2.1(a), if all AND subgoals are executed in parallel, they
reach clauses that are made up only of ground unit
clauses, e.g. s(X,U), t(U,Z), q(Z,Y), r(X,Z) in (b). In this
situation, s(X,U) and r(X,Z) have an instantiated
argument (X=a) but t(U,Z) and q(Z,Y) have no instantiated
value. The explosion of the search space takes place in the
execution of t(U,Z) and q(Z,Y), and these clauses are
called the explosion points. Since these explosion points
cause the search explosion, their further execution is
delayed until some solutions are transferred. The principle
of Lazy ordering is to proceed with the execution as far as
possible.
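The identification of explosion points in this example can be sketched as follows. The sketch assumes, following the description above, that a subgoal is an explosion point when none of its arguments is instantiated; the function name `explosion_points` and the data layout are hypothetical.

```python
def explosion_points(subgoals, instantiated):
    """Return the subgoals that have no instantiated argument."""
    return [name for name, args in subgoals if not (set(args) & instantiated)]


# Subgoals s(X,U), t(U,Z), q(Z,Y), r(X,Z) with X = a.
body = [("s", ["X", "U"]), ("t", ["U", "Z"]), ("q", ["Z", "Y"]), ("r", ["X", "Z"])]
points = explosion_points(body, {"X"})
```

Here `points` contains t and q, the two subgoals whose execution is delayed, while s and r can proceed immediately.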
The second problem between two AND subgoals arises
when one cannot be executed until another subgoal
generates a solution for the shared variables. If the execution
ordering is changed, an infinite loop or other meaningless
operations will occur. If there is no control information from
the programmer, the above two heuristics cannot order
these AND subgoals properly, as the following procedure
illustrates:


append3(A,B,D,E) :- 
append(A,B,C),append(C,D,E). 
append([A|B],C,[A|D]) :-
append(B,C,D). 
append([],A,A). 


Consider the goal

:- append3(A,[2],D,[1,2]).

The first AND subgoal of append3 is
append(A,[2],C) and the second AND subgoal is
append(C,D,[1,2]). According to Conery’s algorithm, the
left subgoal append(A,[2],C) is executed first and results in an
infinite loop. If we know beforehand that the first append
subgoal forms an infinite loop, the execution ordering
should be right to left. This automatic ordering technique
was suggested by Naish and is used in the Prolog environment to
check for infinite loops [11]. This technique is also useful in
a parallel environment. We will extend this technique and
use it as a solution for the second problem.


2.2 Relationships between AND subgoals 


To extract more AND parallelism than the heuristics, the
dependency relationship between AND subgoals should
be redefined according to the role of the shared variables
in each subgoal. By defining the relationships between AND
subgoals, we can classify the problems of AND parallelism
and adopt the proper strategies, e.g. Lazy ordering
or the loop detection technique, which solve them and
guarantee more AND parallelism.


When any two subgoals share a variable, four cases can
be considered. The first case arises when the instantiated
value of the shared variable of one subgoal plays the role of
pruning the search tree of the other subgoal immediately
(Fig. 2.2a). The second case is similar to the first case but
the pruning of the search tree is delayed for some time
(Fig. 2.2b). The third case arises when the shared variable
is used as a data channel from one subgoal to the
other subgoal, that is, the relationship of the producer and
the consumer (Fig. 2.2c). In this case the consumer will
perform meaningless operations, such as an infinite loop, without
the producer’s solutions for the shared variable. Finally,
system predicates and evaluable predicates may not be
executed until the shared variables are sufficiently instantiated
(Fig. 2.2d). According to these observations, we can
derive the following definitions.


Definition 1) If there is no shared variable between two
subgoals or if the shared variables are already instantiated at
execution time, the relationship between the two subgoals is
defined as an independent relationship. Let I(p,q) denote the
independent relationship between two subgoals p and q.


Definition 2) If two subgoals share variables and are in
one of the following three conditions, then the relationship
between them is defined as a deterministic dependent
relationship (tightly coupled relationship).


i) When the instantiated value that one subgoal gives to the shared
variables reduces the search space of the other subgoal
immediately.

ii) When the execution of the subgoal is impossible without the
instantiated value of the shared variable, as for a
system predicate or an evaluable predicate.

iii) In a recursive clause, when one subgoal cannot be
executed until the other subgoal generates some solutions for the
shared variable, because an infinite loop is created by
constructing a variable continuously or by repeated
recursive calls without any change in the status of the arguments.

To represent the deterministic dependent relationship
between subgoals p and q, Dx(p,q) will be used in the first
case and Sx(p,q) in the second and third cases, where x = {x1,
x2, ..., xn} and xi is a shared variable of p and q.


Definition 3) If there are shared variables between two
subgoals p and q and the relationship is not the deterministic
dependent relationship, the relationship between the two
subgoals p and q is defined as nondeterministic dependent
(loosely coupled relationship) and
is represented by Nx(p,q), where x = {x1, x2, ..., xn}
and xi is a shared variable of p and q.

3. A parallel execution model based on Dependency Relationship Graph (DRG)

3.1 Automatic generation of the control information

To find the dependency relationship between AND
subgoals without control information from the programmer, the
control information should be generated automatically.
From this control information we can derive the necessary
condition for the valid execution of a clause at compile
time and the DRG at run time.


The control information is generated at compile time by
the analysis of variable matching patterns. There are four
patterns when an argument in a goal clause matches with
a corresponding argument in a candidate OR clause. The
matching patterns are listed in Table 3.1, and the abbreviated
forms will be used.


3.2 The necessary condition of a procedure 


From the point of view of operational semantics, a logic program is
a finite set of procedures and each procedure has a set of
OR alternative Horn clauses. To invoke a procedure, we
can consider the following necessary condition of the procedure.


i) Necessary condition of a procedure


The necessary condition of a procedure is the union of
the sets, where each element is the necessary condition of
an OR alternative clause. The empty set means that at
least one argument should be instantiated to invoke the
procedure.

When a procedure is called, this condition plays the role of
a guard for deriving a valid solution. If the condition of a
procedure is violated, it is reasonable to delay the execution
of this procedure until some solutions are generated
by other procedures whose necessary conditions
are satisfied.


ii) Necessary condition of a clause


The necessary condition of a clause is a set of tuples and
each tuple contains argument numbers of the clause.


For a system or evaluable predicate, the necessary
condition is a set containing a single tuple, and the tuple is
composed of the argument numbers which should be
instantiated for valid execution.


For a self-recursive clause, the necessary condition of
the recursive clause in the body is determined first and the
necessary condition of the head clause is obtained from
those of the body clauses. The recursive clause may cause an
infinite loop because some arguments are repeatedly
constructed. In this case, we should discriminate the real
constructive argument from the pseudo constructive
argument. The real constructive argument appears in a
clause in which there is no generator clause for the variables
of the recursive clause in the body except the head clause, so
the execution always falls into an infinite loop without the
instantiated value of the ’mg’ argument. On the other hand,
the pseudo constructive argument means that the infinite
loop can be avoided, because a generator for the variables
of the recursive clause in the body exists and the status of
the recursive call is changed in every recursion.

For a ground unit predicate (fact), there is no necessary
condition. In other cases, like sort(X,Y) in the
following example, the necessary condition is obtained by
Algorithm 1.

These necessary conditions play a role similar to that of
the mode declaration or the variable annotation, but provide
a more powerful and general methodology. Their
usage will be described in sections 3.3 and
3.4.


3.3 Static Data Dependency Graph(SDDG) 


This section defines the Static Data Dependency 
Graph(SDDG) and an algorithm to obtain the necessary 
condition of a clause. The necessary condition of a 
clause can be made only after those of body literals are 
determined. 


Definition 3.1) For a given clause A :- B1,B2,B3,...,Bn.
the Static Data Dependency Graph (SDDG) is a graph over
the literal set (={A,B1,B2,...,Bn}) with arcs labeled by
sets of variables in the clause. An arc(p,q) is labeled by V
iff p <> q and V is the set of shared variables between
p, q.
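Definition 3.1 can be illustrated by building the arc labels directly from the argument variables of each literal. This is a minimal sketch: the function name `build_sddg` is an assumption, literal names are assumed to be unique within the clause, and arcs whose set of shared variables is empty are simply left out, which the definition does not state explicitly. The sample clause is clause 1) of the paper's Example 1.

```python
from itertools import combinations


def build_sddg(literals):
    """Return {(p, q): shared variables} for every pair of distinct literals.

    literals: list of (name, [variables]); the first entry is the head.
    Pairs with no shared variable are omitted (an assumption of this sketch).
    """
    arcs = {}
    for (p, p_vars), (q, q_vars) in combinations(literals, 2):
        shared = set(p_vars) & set(q_vars)
        if shared:
            arcs[(p, q)] = shared
    return arcs


# sort(X,Y) :- perm(X,Y), ord(Y).
clause = [("sort", ["X", "Y"]), ("perm", ["X", "Y"]), ("ord", ["Y"])]
sddg = build_sddg(clause)
```

For this clause the head shares X and Y with perm, and Y with ord, while perm and ord share Y, so the graph has three labeled arcs.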


Definition 3.2) The entry node in a SDDG is defined as
follows: for a given SDDG, a node Bj is an entry node of
Bi iff the dependency relationship between Bi and
A(head) is Sv(Bi,Bj) or iff the necessary condition is an
empty set and the arc(A,Bi) is Nv(Bi,Bj) or Dv(Bi,Bj) for
the non-empty set V.


Given a SDDG, an entry node should have some shared
variables with the head node, but a node adjacent to the
head node is not always an entry node. If the input
from the head node cannot satisfy the necessary condition
of an adjacent node in the SDDG, this node cannot be an
entry node.


Algorithm 1 : the necessary condition generation algorithm


input :
   a SDDG
   the control information of body clauses
   the necessary conditions of body clauses
output :
   N - the necessary condition of the clause

[step 1] Determine the dependency relationship: For each
literal of the SDDG, determine the dependency relationship
with adjacent literals. Assume that U is the necessary
condition of Bj and V is the set of shared variables between
Bi and Bj.

If a subset u of U is also a subset of V, then the dependency
relationship between Bi and Bj is Su(Bi,Bj). If a subset
v of V plays the role of argument pruning for Bj,
the dependency relationship is Dv(Bi,Bj). The argument
pruning can be detected by the control information ’gm’
in the position of the shared variable. In other cases, the
relationship is Nw(Bi,Bj), where w = V - (u union v).

[step 2] Make the necessary condition set:


s0) Make a test set S from the necessary conditions of the
entry points of the head node A. S = {t | t is a tuple} and
each element of a tuple is an argument number of A.


s1) Select an element t from S. Delete t from S. Search for the
entry nodes of A which are satisfied by t. If S is empty,
terminate this algorithm.


s2) For newly determined entry nodes, search for their entry
nodes.


s3) Repeat s2 until there are no more entry nodes.


s4) If there is no isolated node, t is inserted in the necessary
condition N. An isolated node is an
unsatisfiable node, because there is no arc which satisfies
the necessary condition of the node.


s5) goto s1

Example 1)

1) sort(X,Y) :-
      perm(X,Y), ord(Y).
2) perm([],[]).
3) perm(Z,[X|Y]) :-
      delete(X,Z,Z’), perm(Z’,Y).
4) delete(X,[X|Y],Y).
5) delete(X,[Y|Z],[Y|U]) :-
      delete(X,Z,U).
6) ord([]).
7) ord([X]).
8) ord([X,Y|Z]) :-
      X <= Y, ord([Y|Z]).

The necessary condition of delete(X,Z,U) is {<2>,<3>}
and that of delete(X,[Y|Z],[Y|U]) is also {<2>,<3>}. The
necessary condition of ord([Y|Z]) and ord([X,Y|Z]) is
{<1>}. However, perm(Z’,Y) in 3) has an empty set
because there is no real constructive term. The necessary
condition of perm(Z,[X|Y]) and that of sort(X,Y)
is {<1>,<2>}, which can be obtained by Algorithm 1.
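How such necessary conditions might be combined and tested at call time can be sketched in Python. The sketch rests on assumptions not spelled out in the text: a tuple is taken to be satisfied when all of its argument numbers are instantiated, a condition is taken to be satisfied when at least one of its tuples is, and the empty set follows the statement in section 3.2 that at least one argument should be instantiated. The names `procedure_condition` and `satisfied` are hypothetical.

```python
def procedure_condition(clause_conditions):
    """Union of the necessary conditions of the OR alternative clauses."""
    result = set()
    for condition in clause_conditions:
        result |= condition
    return result


def satisfied(condition, instantiated):
    """Check a call against a necessary condition.

    condition:    set of tuples of argument numbers, e.g. {(2,), (3,)}
    instantiated: set of argument numbers that are instantiated in the call
    """
    if not condition:
        # Assumed reading of the empty set: at least one instantiated argument.
        return len(instantiated) >= 1
    return any(set(t) <= set(instantiated) for t in condition)


# Conditions quoted in Example 1.
delete_cond = procedure_condition([{(2,), (3,)}, {(2,), (3,)}])
sort_cond = {(1,), (2,)}
ord_cond = {(1,)}

sort_ok = satisfied(sort_cond, {1})   # sort called with its first argument bound
ord_ok = satisfied(ord_cond, set())   # ord called with an unbound argument
```

Under this reading, a call to sort with either argument instantiated passes the guard, while a call to ord with an unbound argument would be delayed.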
3.4 Dependency Relationship Graph(DRG) 


This section describes the abstract run time algorithm to 
construct the DRG of AND literals(subgoals). 


Definition 3.3) For a given SDDG of A :- B1,B2,...,Bn.
the Dependency Relationship Graph (DRG) is a directed graph
over the literal set (={A,B1,B2,...,Bn}) with arcs labeled
by the dependency relationships between nodes. The
dependency relationship is defined in the previous section.


Algorithm 2 : the DRG construction algorithm

input :
   a SDDG
   t - the tuple of the instantiated argument numbers of the head
   the control information of body clauses
   ni - the necessary conditions of body clauses
   N - the necessary condition of the head clause
   C - the node set of the SDDG
output : a DRG

[step 1] Test whether t satisfies N; if so, search for the entry nodes of
the head node A which are satisfied by t. Otherwise,
select a node which receives input through t and let it be
an entry node. If t is an empty tuple and N is not the empty
set, select the leftmost literal node and let it be an entry
node. Make directed arcs from A to each entry node Bi
labeled with St(A,Bi). From C, delete those entry nodes
that are satisfied by t alone.

[step 2] For the newly determined entry nodes B1, B2, ..., Bn,
search for their entry nodes E1, E2, ..., En in C. Make
directed arcs from each Bi to the corresponding entry node Ei
labeled with Sv(Bi,Ei), Nv(Bi,Ei), or Dv(Bi,Ei), where v is
the set of shared variables between Bi and Ei. From C, delete
those Ei that are satisfied by the variables shared
with Bi alone.

In Fig. 3.1, example programs and the DRGs of goal
statements are illustrated.

3.5 A parallel AND/OR process model based on DRG 


In our model, a goal statement is solved via message
communications between AND processes and OR
processes. An AND process is created for solving a goal
clause and generates a DRG, which is the execution ordering
graph of the body literals. An OR process is created to
solve a body literal, and the solution generated in an OR
process is not only sent to the parent AND process, but also
flows along the DRG directly.


There are five types of messages: start message, wait
message, fail message, solution message and end message.
Compared with Conery’s model, the wait message and
the end message are added, and the redo message and the
cancel message are deleted.


The wait message is created in the following three cases:

i) OR process --> parent AND process : when a nondeterministic
subgoal reaches clauses which need an
instantiated value from an ancestor (lazy ordering).

ii) OR process --> parent AND process : when the variable
solution from the generator violates
the necessary condition. In this case, the reordering problem
is solved automatically by generating the wait message.

iii) parent AND process --> child OR process : when the
OR process must wait for a solution from the preceding OR
processes, it receives a wait message from the
parent AND process instead of a start message.

The end message is generated by an OR process when it
cannot find any further solutions. This message is sent to
the parent AND process and to the succeeding OR
processes. The redo message is eliminated because there
is no backtracking mechanism in our model.

Fig. 3.2 shows the message transfer diagram of an AND
process and Fig. 3.3 shows the message transfer diagram
of an OR process.


3.6 Analysis of the proposed model 


The proposed model is a process model based on
Conery’s AND/OR process model. Compared with
Conery’s model, the AND parallelism is enhanced by
using the DRG and the backtracking mechanism is removed.
Instead of using the redo message, a solution is sent directly
to the parent process or to the succeeding OR process,
and the end message is used to indicate the end of solutions.
Fig. 3.4 shows the message generation in solving
qsort([2,1],S,[]) by Conery’s model and by the proposed
model (program in Fig. 3.1). We obtain the following
results.

In Fig. 3.4(a),

The total # of messages = 34

The total # of processes = 13

The total # of time steps = 38

In Fig. 3.4(b),

The total # of messages = 40

The total # of processes = 13

The total # of time steps = 23

From these results, we find that the parallelism of the
proposed model increases 1.65 (38/23) times because
qsort([1],S,[2|Y]) and qsort([],Y,[]) are executed in
parallel. In Conery’s model, they are executed sequentially
because of the shared variable Y. Since there is only
one solution in solving qsort([1],S,[2|Y]), the redo message
in Conery’s model is not diffused over all procedures,
so the proposed model generates more messages. However,
if there are many solutions and backtracking
takes place, the number of messages in the proposed
model can be less than in Conery’s model. We obtain the
following results in solving sort([2,1],Y), which is the
ordinary sort in Fig. 3.1.

Conery’s model

total # of messages = 83

total # of processes = 29

total # of time steps = 50

the proposed model

total # of messages = 78

total # of processes = 27

total # of time steps = 23

The parallelism increases 2.17 (50/23) times and the
number of generated messages is less than in Conery’s
model. The increased parallelism results from
pipelining solutions between OR processes. The parallelism
of the proposed model will increase in proportion to the
number of solutions. The redo messages for backtracking
increase the total number of messages in Conery’s model.
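
The speedup figures quoted in this section follow directly from the reported time-step counts. The short sketch below recomputes them; the dictionary layout and the function name `speedup` are illustrative choices, not part of the model.

```python
# Time-step counts as reported in the text for the two benchmark goals.
results = {
    "qsort([2,1],S,[])": {"conery": 38, "proposed": 23},
    "sort([2,1],Y)": {"conery": 50, "proposed": 23},
}


def speedup(conery_steps: int, proposed_steps: int) -> float:
    """Ratio of Conery-model time steps to proposed-model time steps."""
    return conery_steps / proposed_steps


ratios = {
    goal: round(speedup(r["conery"], r["proposed"]), 2)
    for goal, r in results.items()
}

for goal, ratio in ratios.items():
    print(f"{goal}: {ratio}x")
```
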
4. Conclusion 


In the procedural approach [6], AND parallelism
is more important than OR parallelism because
OR parallelism can be achieved better by the goal rewriting
approach. If AND parallelism is restricted, the
advantages of the procedural approach are diminished.
However, the nondeterministic behavior of logic programs
makes it difficult to control the program, and
conventional heuristic algorithms cannot achieve full
AND parallelism because an explosion of the search
space takes place.

In this paper, we propose a method to control the nondeterministic
behavior of logic programs and an AND/OR
process model based on this method. Compared with
other approaches, the proposed model achieves near-maximal
AND parallelism by finding the dependency relationships
between AND subgoals at execution time. The
dependency relationships between AND subgoals are
determined by the role of the shared variable. In the proposed
model, no backtracking takes
place and solution pipelining between OR processes is
used. This mechanism also increases the parallelism of
the proposed model. As future work, the run-time
overhead should be reduced by compiling techniques.
DETECTION OF AND-PARALLELISM IN LOGIC PROGRAMMING(a)
Yu-Wen Tung and Dan I. Moldovan 


Department of Electrical Engineering-Systems 
University of Southern California 
Los Angeles, California 90089-0781 


Abstract -- In this paper, we propose an ordering 
algorithm based on the automatic detection of AND- 
parallelism in logic programming at compile time. This 
algorithm consists of three phases: (1) analysis and 
detection of input-output mode, (2) automatic detection 
of data dependencies, and (3) subgoal ordering. Existing 
algorithms for the analysis of AND-parallelism proposed 
by previous researchers rely on mode declaration by the 
programmer. Based on the ordering scheme proposed in 
this paper, parallel compilers for logic programs can be 
conceived. A performance analysis of the ordering 
algorithm is presented through several examples. 


1. Introduction 


In this paper, automatic detection of parallelism in 
logic programming is studied. Definitions of technical 
terms employed in this paper can be found in 
[1,11,12,13,19]. 

There are four kinds of parallelism in Horn clause
logic programming identified by Conery and Kibler [4],
Ito and Masuda [9], Pollard [16] and other researchers.
They are (1) OR-parallelism, (2) AND-parallelism, (3)
stream parallelism, and (4) unification parallelism.
Among these, OR- and AND-parallelism
are more important than the others. OR-parallelism means
executing alternative clauses simultaneously, and AND-parallelism
means executing many subgoals in a clause
simultaneously. OR-parallelism is much more
straightforward than AND-parallelism because there is
no interaction between the processing paths. On the
other hand, when two subgoals in a clause share some
variables, the answers obtained by executing these
subgoals simultaneously may conflict with each other. Thus,
AND-parallelism must be carried out carefully.

For the realization of AND-parallelism, various schemes
have been proposed [2,5,6,7,15,16], and they fall into one
of the following two principles:


1. Whenever there is an output variable shared
by two subgoals, these subgoals are executed
in sequence. This is called the none-shared-variable
(NSV) scheme in [19], and most of
the current work falls into this category,
such as Conery’s scheme [5].

2. Execute all subgoals simultaneously, and
then "reconciliate" possible conflicts via
extra unification. This is called the
reconciliation (REC) scheme in [19], and the
work of Pollard [16] is an example.


(a) This research was supported by NSF Grant ECS-8307258,
JSEP Contract F49620-85-C-0071.


According to current research, the NSV scheme
seems to outweigh the REC scheme for applications with
strong declarative inference, such as knowledge-based
expert systems [19]. Also, it is known that proper
subgoal ordering is important in the NSV scheme
[2,5,6,7,19]. However, most of these works rely on
"mode declaration" to implement ordering for AND-parallelism
[2,5,6,7]. The "mode declaration" is a
programmer-provided control command which tells the
machine the binding transfer direction of a variable in a
clause. Although this approach is efficient, its machine-oriented
nature is not desirable in a declarative language
such as logic programming. In this paper, we propose
a compile-time ordering algorithm without
programmer-specified "modes."


2. Concept of Data Dependencies 


This work is based on the concept of input-output 
mode of each argument in a given subgoal and the data 
dependency relationships among subgoals, as explained 
below: 


2.1 Input and Output 


The concept of input and output in logic
programming was first introduced by Kowalski [11]. It
denotes the direction of the binding transfer during
unification, in a way similar to the input and output
arguments in procedure calling. Here we extend this
concept to binding transfer through substitution and
subgoal computation as well as unification. For
simplicity, let $C^+$ denote the clause head of C, and $C^-$
denote the clause body of C: $C^+ \leftarrow C^-$. Consider the
goal


$\leftarrow A_1, \ldots, A_m, \ldots, A_k$


where the selected atom $A_m$ is unified with some $C_i^+$,
with a most general unifier (mgu) $\theta$. Any variable X in
$C_i$ which is instantiated by $\theta$ is called an input
variable of $C_i$. Such an X is also called an input
variable of atom $B_j$, $B_j \in C_i^-$. If an uninstantiated
variable Y in $C_i^-$ becomes instantiated only when some
subgoal $B_j$ in $C_i^-$ is computed, then Y is called an output
variable of $C_i$ and of $B_j$.

If a term has only input variables, it is called an
input term. Similarly, if it has only output variables, it
is called an output term, and a mixed term or i-o term
if it has both. An example of a mixed term is f(X,b) and
f(a, Y) with a unifier $\theta$ = {X/a, Y/b}.

Although the computation of logic programs can be 
viewed as computing output arguments from given input 
arguments for each subgoal, the input-output mode of 
any given argument is not predetermined in any subgoal 
due to the fact that subgoals are "relations" rather than 
"functions." As a result, when two subgoals share the 
same variable X, they may not be executed at the same 
time if X is an output variable. This is due to a possible 
binding conflict. However, they may be executed 
simultaneously when X is an input variable to both of 
them. We say there exists some data dependency 
relation in the first case, but not in the second one. 
Below, four types of data dependencies in logic 
programming are defined, based on the concept of input 
and output. 


2.2 Data dependencies 


Consider a clause


$p \leftarrow q_1, \ldots, q_k$


let v(p) be the set of variables that occur in p, and $v(q_i)$ be
the set of variables that occur in $q_i$, $1 \le i \le k$.


Definition 2.2.1 Functional dependency:


We say there is a functional dependency between $q_i$
and $q_j$ if there exists a variable $x \notin v(p)$, where x is an
output variable of $q_i$ and an input variable of $q_j$. The
atom $q_i$ is called the producer of x and $q_j$ is called the
consumer of x. The variable x is an intermediate result
generated by $q_i$ and consumed by $q_j$.


Example 2.2.1: 


The following clause expresses the calculation of 
Fibonacci-numbers: 


F(x+2,n) ← F(x+1,n1), F(x,n2), Plus(n1,n2,n)


where F(x,n) is a Fibonacci procedure that computes
result n from a given input x, and Plus(n1,n2,n) is the
addition procedure that adds two numbers n1 and n2,
returning the result n. Clearly, the addition of n1 and
n2 must be performed after F(x+1,n1) and F(x,n2) have
been evaluated. Therefore, Plus(n1,n2,n) is functionally
dependent on F(x+1,n1) and F(x,n2).
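
Once the modes are known, Definition 2.2.1 reduces to a simple search for producer-consumer pairs. The following is a minimal sketch under the assumption that each body atom is described by a hypothetical dictionary mapping its variables to "i" (input) or "o" (output); the function name and data layout are illustrative only.

```python
from itertools import permutations


def functional_dependencies(head_vars, body):
    """Return sorted (producer, consumer, variable) triples for one clause.

    body maps an atom (as a string) to a dict {variable: mode}, where mode
    is "i" or "o". Only variables absent from the head are considered,
    i.e. intermediate results in the sense of Definition 2.2.1.
    """
    deps = []
    for producer, consumer in permutations(body, 2):
        for var, mode in body[producer].items():
            if var in head_vars or mode != "o":
                continue
            if body[consumer].get(var) == "i":
                deps.append((producer, consumer, var))
    return sorted(deps)


# Fibonacci clause of Example 2.2.1, with the modes assumed to be known:
# F(x+2,n) <- F(x+1,n1), F(x,n2), Plus(n1,n2,n)
body = {
    "F(x+1,n1)": {"x": "i", "n1": "o"},
    "F(x,n2)": {"x": "i", "n2": "o"},
    "Plus(n1,n2,n)": {"n1": "i", "n2": "i", "n": "o"},
}
deps = functional_dependencies({"x", "n"}, body)
for producer, consumer, var in deps:
    print(f"{producer} produces {var} for {consumer}")
```

Both F subgoals come out as producers for Plus, which is the partial ordering stated in the example.
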

Definition 2.2.2 Internal dependency:


We say there exists an internal dependency between
$q_i$ and $q_j$ if there is a variable $x \in v(q_i) \cap v(q_j)$ and x is
output in both $q_i$ and $q_j$.


Example 2.2.2: 


Assume a company is hiring people who must have 
engineering degrees plus five years or more experience 
each and with salary requirements no more than 50K a 
year each. The top level program could be written as: 


candidate(x) ← degree(x), engrg(x), exp(x,5),
salary(x,50)


with a goal ← candidate(x). Clearly, x is an output
variable and internal dependencies exist between any
pair of atoms in the program clause body, because they
all share variable x.
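
Internal dependencies can be enumerated in the same style. The sketch below assumes a hypothetical per-atom mode dictionary ("o" for output) and simply reports every pair of body atoms that share an output variable, as in Definition 2.2.2.

```python
from itertools import combinations


def internal_dependencies(body):
    """Pairs of atoms sharing a variable that is output in both of them."""
    deps = []
    for (a, modes_a), (b, modes_b) in combinations(body.items(), 2):
        shared = sorted(
            v for v, m in modes_a.items() if m == "o" and modes_b.get(v) == "o"
        )
        if shared:
            deps.append((a, b, shared))
    return deps


# Body of the candidate clause in Example 2.2.2 under the goal <- candidate(x)
body = {
    "degree(x)": {"x": "o"},
    "engrg(x)": {"x": "o"},
    "exp(x,5)": {"x": "o"},
    "salary(x,50)": {"x": "o"},
}
pairs = internal_dependencies(body)
print(len(pairs), "internally dependent pairs")
```

With four atoms sharing x, all six pairs are reported, in agreement with the example.
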


Next, the concept of the coupling effect is
introduced. This effect makes two variables bearing
different symbol names equal, so that they can be
viewed as one variable.


Definitions of the coupling effect:


1. Two variables are said to be coupled if they
become identical non-ground terms after a
unification. For example, x and y are
coupled if p(x,y) is unified with p(h(w),h(w))
or p(z,z).


2. Two variables are said to be coupled if their
corresponding terms share one or more
common variables after a unification. For
example, x and y are coupled if p(x,y) is
unified with p(z,f(z,w)).


3. Two variables are said to be coupled if they
are unified with a pair of coupled variables.
For example, if z and w are known to be
coupled, then x and y are coupled when
p(x,y) is unified with p(z,w).


We can define coupled term in the same way. The 
phrase "coupled term" was introduced by Chang, 
Despain and DeGroot [2]. 
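
The three definitions of coupling can be checked mechanically with a small unifier. The sketch below is an illustration only: it assumes variables are represented as bare strings and compound terms (including constants) as tuples whose first element is the functor name, it omits the occurs check, and the helper names `walk`, `unify`, `variables` and `coupled` are hypothetical.

```python
def walk(term, subst):
    """Follow variable bindings until an unbound variable or a compound term."""
    while isinstance(term, str) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst):
    """Return an extended substitution, or None if a and b do not unify."""
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if isinstance(a, str):
        return {**subst, a: b}
    if isinstance(b, str):
        return {**subst, b: a}
    if a[0] != b[0] or len(a) != len(b):
        return None
    for x, y in zip(a[1:], b[1:]):
        subst = unify(x, y, subst)
        if subst is None:
            return None
    return subst


def variables(term, subst):
    """Set of unbound variables occurring in term under subst."""
    term = walk(term, subst)
    if isinstance(term, str):
        return {term}
    found = set()
    for arg in term[1:]:
        found |= variables(arg, subst)
    return found


def coupled(x, y, subst):
    """x and y are coupled if their terms share a variable after unification."""
    return bool(variables(x, subst) & variables(y, subst))


goal = ("p", "x", "y")
s1 = unify(goal, ("p", ("h", "w"), ("h", "w")), {})   # definition 1
s2 = unify(goal, ("p", "z", ("f", "z", "w")), {})     # definition 2
s3 = unify(goal, ("p", "z", "w"), {"z": "w"})         # definition 3, z and w already coupled
s4 = unify(goal, ("p", ("a",), ("b",)), {})           # ground bindings: no coupling
print(coupled("x", "y", s1), coupled("x", "y", s2),
      coupled("x", "y", s3), coupled("x", "y", s4))
```

Ground bindings never couple two variables in this sketch, which matches the requirement in definition 1 that the resulting terms be non-ground.
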


Definition 2.2.3 Coupling dependency: 


We say that there exists a coupling dependency
between $q_i$ and $q_j$ if $x \in v(q_i)$, $y \in v(q_j)$, and x and y are
output variables and are coupled.


Coupling dependency is a variation of the internal 
dependency. Coupling dependency cannot be discovered 
only by checking the given clause, hence it is also called 
hidden internal dependency. 


Example 2.2.3: 
Consider the program: 


C1: $p_0(x) \leftarrow p_1(x,x)$
C2: $p_1(y,z) \leftarrow p_2(y), p_3(z)$


All variables are assumed to be output variables. It is
obvious that neither C1 nor C2 has a functional or
internal dependency. However, when $p_0$ is called,
$p_1(x,x)$ is unified with $p_1(y,z)$ and y, z are coupled as a
result. Therefore $p_2(y)$ and $p_3(z)$ have a coupling
dependency. The coupling effect in C2 is introduced by
a higher level goal in C1.


Definition 2.2.4 Run-time dependency: 


We say there exists a run-time dependency between
$q_i$ and $q_j$ if $x \in v(q_i)$, $y \in v(q_j)$, and x and y are output
variables and are coupled at run time by a top level
goal.

Run-time dependency is a coupling dependency
introduced at run time, due to a goal with coupled
terms (e.g. p(x, x)). It cannot be detected directly from
the program, and hence is different from coupling
dependency.


Example 2.2.4:


Consider the program:


$p_1(y,z) \leftarrow p_2(y), p_3(z)$


where the goal is $\leftarrow p_1(x, x)$, and x, y, z are output
variables. In this example, $p_2(y)$ and $p_3(z)$ seem to be
independent but are coupled at run time by x in the
goal $\leftarrow p_1(x, x)$.

This is just a variation of Example 2.2.3 with a
higher level goal $\leftarrow p_1(x, x)$ unspecified at compile time.


Among the four types of data dependencies, 
internal dependency, coupling dependency and run-time 
dependency are similar to each other in the sense that 
all three are defined over output variables. They either 
denote some output variables shared by two subgoals or 
some output variables coupled with each other, and thus 
we also call them output dependencies as opposed to 
functional dependencies, which are defined over both 
input and output variables. 

With the above definitions, it is clear that 
functional dependency represents inherent sequentiality 
and implies a partial ordering of subgoals. This partial 
ordering is "necessary" in the sense of implementation 
and not the logic. On the other hand, output 
dependency leads to many possible orderings, one more 
efficient than another. Together, these dependencies 
describe the limitation of the AND-parallelism and they 
are important in detecting the available parallelism. 


Below, some techniques of detecting such limitation are 
proposed. 


3. Automatic Parallelism Detection Techniques 


In this section we study techniques of designing a 
parallel compiler which detects AND-parallelism 
automatically. This compiler has three phases: in the 
first phase, it analyzes the input-output mode of 
arguments in a given program; in the second phase, it 
finds out the data dependency relations of the subgoals 
in each clause; and in the third phase, it determines an 
efficient execution ordering based on the data 
dependency and other supplemental criteria. This is
described in detail below.


3.1 Automatic input-output mode analysis 


Because the "use" of a relation is not deterministic, 
the binding transfer direction of a variable in the 
relation is not always fixed. However, if some binding 
transfer direction is proven possible (or impossible) in a 
given program, the binding transfer path can be traced 
and the binding transfer directions for other variables 
can be inferred. Several heuristic rules for inferring 
whether or not a given variable is input are proposed 
below. In describing the rules, consider a variable X
and a subgoal q in either the head or the body of a
clause C. By the notation "$X \in q$" we mean "X occurs in q",
and by "q<r> = <input>" we mean the rth position
of q must be an input position.


The rules are:


1. Arithmetic and logic operations rule:
arguments that must be instantiated as
required by PROLOG systems should be
input. For example, both X and Y are
inputs in expressions "X * Y" and "X < Y."

2. Body producer-consumer rule: if $X \notin C^+$,
$X \in C^-$, and X has two or more occurrences in
$C^-$, then at least one position containing X
must be input, and at least one must be
output.

3. Head producer-consumer rule: if $X \in C^+$,
$X \notin C^-$, and X has two or more occurrences in
$C^+$, then at least one position containing X
must be input.

4. Don’t care rule: if a clause D is called by C,
and D has a don’t care symbol ("_" in
PROLOG), then the corresponding position
X in C must be input.

5. Variable consistency rule: if $X \in C^+$ and
$X \in C^-$, and if all occurrences of X in $C^-$ are
inputs, then at least one occurrence of X in
$C^+$ must be input.

6. Position consistency rule: if predicate q in
$C^+$ is also in $C^-$, then C is called a recursive
clause, and the corresponding positions of all
q’s must be consistent, i.e., if position r in
any such q is known to be input then q<r>
must be input.

7. OR-bundle rule: if a set $\{C_i\}$ is an OR-bundle,
i.e., all $C_i^+$ have the same predicate
name and the same arity, and if a position in
any $C_i^+$ is an input, then that position of all
$C_i^+$ must be input.

8. Coupling rule: if variable X occurs in an
atom q (either in $C^+$ or $C^-$) more than once
and both occurrences are output variables, then X is a
coupling variable.


The analysis of input-output modes for a given OR-bundle
based on these rules is shown in Algorithm 1
below:


Input-Output Modes Analysis Algorithm 1:


Input : A set of clauses which is an OR-bundle.

Output : All possible input-output modes of each
argument in the clauses.

Method :


1. Mark any known modes for variables in the
given clauses.

2. Select any variable whose mode is unknown and
apply Rules 1 through 8 to this variable.
Mark <i> for variables that must be input.
If there is at least one input in a set of
positions, state "p<1> or ... or p<r> =
<i>."

3. When no rule is applicable, mark <o> for
those that must be output terms and mark
<i/o> for the rest.


<End of algorithm 1>


Based on this algorithm, the entire logic program
can be analyzed. We may view the logic program as a
directed graph in which each clause is a node and each node has
pointers to the nodes it calls. There is a cycle if a clause
calls itself, i.e., makes a recursive call. A root node is a
clause that is not called by any other clause except the
top level goal, and a terminal node is a unit clause that
does not call any other clauses. A descendent node of
node p is a node pointed to by p, but not p itself.
Starting from the terminal nodes, we can analyze the
clauses from the bottom up until the root is reached. This is
shown in the next algorithm:

Input-Output Modes Analysis Algorithm 2:


Input : A logic program.
Output : All possible input-output modes of each
argument in the program.


Method :


1. Determine the calling sequence of the logic
program. Select a clause that does not have
any descendent clause unchecked.

2. Use Algorithm 1 to analyze the input-output
modes of arguments in that clause.

3. Go to Step 1 if any clause remains
unchecked; exit otherwise.


<End of algorithm 2>
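
The bottom-up selection in Step 1 amounts to a topological ordering of the call graph once self-calls are ignored. The sketch below uses the call graph of the matrix multiplication program of Example 3.1.2 and Python's standard `graphlib` module; the function name `analysis_order` is a hypothetical choice, and mutually recursive predicates, which this sketch does not handle, would raise `graphlib.CycleError`.

```python
from graphlib import TopologicalSorter

# predicate -> predicates it calls (self-calls mark recursive clauses)
calls = {
    "mm": {"transpose", "mmt"},
    "transpose": {"columns", "transpose"},
    "mmt": {"mmc", "mmt"},
    "mmc": {"ip", "mmc"},
    "columns": {"columns"},
    "ip": {"ip"},
}


def analysis_order(call_graph):
    """Order predicates so that every callee is analyzed before its callers."""
    # Drop self-calls so that directly recursive predicates form no cycle.
    acyclic = {p: {q for q in callees if q != p} for p, callees in call_graph.items()}
    # TopologicalSorter treats the mapped set as predecessors, so callees come first.
    return list(TopologicalSorter(acyclic).static_order())


order = analysis_order(calls)
print(order)
```

Any order produced this way starts at "ip" or "columns" and ends at "mm", which is the freedom of choice the algorithm allows.
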


Example 3.1.1 "append":


C1: append([], L, L)
C2: append([x|L1], L2, [x|L3]) ← append(L1, L2, L3)

Analysis:


The calling sequence is trivial:
append (calls itself)


Apply Rule 3 to x in C2:
append<1> or append<3> = <i>
Apply Rule 3 to L in C1:
append<2> or append<3> = <i>
Apply Rule 7:
[case i] append<3> = <i>
[case ii] append<1> = append<2> = <i>
other cases append<1> = append<3> = <i>
or append<2> = append<3> = <i>
are subsumed in [case i].
therefore, append<1,2,3> = <i/o,i/o,i> or <i,i,i/o>
and the analysis is done.
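
The case analysis in this example can be reproduced by brute force: each rule application yields a constraint "at least one of these positions is input", and the surviving cases are the minimal sets of positions satisfying all constraints. The sketch below is an illustration of that reasoning only; the function name and the set-based encoding are assumptions.

```python
from itertools import combinations


def minimal_input_sets(n_positions, constraints):
    """Minimal sets of argument positions that must be input.

    Each constraint is a set of positions of which at least one is input.
    A candidate set is kept only if no proper subset of it also satisfies
    every constraint (larger sets are subsumed by smaller ones).
    """
    positions = range(1, n_positions + 1)
    satisfying = [
        set(combo)
        for size in range(n_positions + 1)
        for combo in combinations(positions, size)
        if all(set(combo) & constraint for constraint in constraints)
    ]
    return [s for s in satisfying if not any(t < s for t in satisfying)]


# Rule 3 on x in C2: append<1> or append<3>; Rule 3 on L in C1: append<2> or append<3>
constraints = [{1, 3}, {2, 3}]
cases = minimal_input_sets(3, constraints)
modes = [tuple("i" if p in case else "i/o" for p in (1, 2, 3)) for case in cases]
print(modes)
```

The two surviving cases correspond to [case i] and [case ii] of the analysis.
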


Example 3.1.2 "matrix multiplication":


The following matrix multiplication program
example is taken from Conery’s thesis [5] (pp.122-123
and p.80):


C1: mm(a, b, c) ← transpose(b, bt), mmt(a, bt, c)

C2: mmt([], _, [])

C3: mmt([a1|an], b, [c1|cn]) ← mmc(a1, b, c1),
mmt(an, b, cn)

C4: mmc(_, [], [])

C5: mmc(a, [b1|bn], [c1|cn]) ← ip(a, b1, c1),
mmc(a, bn, cn)

C6: ip([], [], 0)

C7: ip([a1|an], [b1|bn], c) ← ip(an, bn, x),
c is x + a1*b1

C8: transpose([[]|_], [])

C9: transpose(m, [c1|cn]) ← columns(m, c1, rest),
transpose(rest, cn)

C10: columns([], [], [])

C11: columns([[c11|c1n]|c], [c11|x], [c1n|y])
← columns(c, x, y).
Analysis:


the calling sequence is (every predicate except "mm" also calls itself):


             mm
            /  \
   transpose    mmt
       |         |
    columns     mmc
                 |
                 ip


we may start from either "ip" or "columns", say "ip":
ip (C6 & C7):
Apply Rule 1 to the last atom of C7,
x, a1 and b1 are input variables.
Apply Rule 2, x in "ip" in C7 is output.
Apply Rule 6, since
a1 and b1 in $C7^+$ are input variables, so
an and bn in $C7^-$ are input variables,
ip<1,2,3> = <i,i,o>
Similarly, we obtain the following step by step:
mmc (C4 & C5): since ip<1,2> = <i,i>,
mmc<1,2,3> = <i,i,i/o>


mmt (C2 & C3): since mmc<1,2> = <i,i>,
mmt<1,2,3> = <i,i,i/o>


columns (C10 & C11):
columns<1,2,3> = <i,i/o,i/o> or <i/o,i,i>


transpose (C8 & C9):
transpose<1,2> = <i,i/o>


mm (C1):
mm<1,2,3> = <i,i,i/o>
and the analysis is done.


The above analysis shows that if "mm" is the top
level goal, the first two positions must be input, and the
third can be either input (verify whether c is the product of a
and b) or output (obtain c as the product of a and b).
Also, since variable "bt" is an output from "transpose"
and an input to "mmt," we know the producer-consumer
pair must be <transpose, mmt>. This
cannot be inferred by Conery’s ordering algorithm. In
fact his algorithm will obtain a wrong ordering in this
example (see p.122 in [5]).

By performing the input-output analysis we can
actually obtain all possible usages allowed in the
program. For example, if the "transpose" in Example
3.1.2 is a top level goal, the second position of
"transpose" need not be an output as in the case when
"mm" is the top level goal. In this case, the second position
can also be an input, such that "transpose(b, bt)" means
verify whether bt is the transposition of b. This usage is
allowed by the logic of the program.

Although it is difficult to prove that this algorithm
is guaranteed to find all terms which must be input, it is
clear that if an input term is erroneously taken as
output, the penalty is only reduced parallelism. The
algorithm is nevertheless very precise when applied to many
example programs.


3.2 Determination of data dependencies 


The second phase of parallelism detection is the 
determination of data dependencies. Since all data 
dependencies are defined over input and output, once 
the input-output analysis is done, the detection of data 
dependencies becomes easy. Even the run-time
dependency may be "guessed" at compile time,
although there are certain limitations.

Detection of functional dependency and internal 
dependency is trivial when the input-output modes are 
known, because both dependencies can be readily 
identified by their definitions based on these modes. If 
the mode of a term is "mixed," then both possibilities of 
input and output should be considered. 

In the case of coupling dependency, a recursive 
check of subgoal invoking is required to find the 
"source" of coupling effect. This can be done by adding 
one procedure to the input-output mode analysis 
Algorithm 1: mark <c> on each coupled term found by 
the coupling rule; also mark <c> on each term related 
to known coupled terms found by the variable 
consistency rule and the position consistency rule. 

In the case of run-time dependency, if there are
only a few possible combinations of modes for the top
level goal, then the compiler can work out the several possible
run-time dependencies, which can be verified easily at
run time. Below, we list some criteria useful in
detecting the run-time dependency at compile time. By
a potential output term is meant a term that is either an
output term or a mixed term. The criteria are:

(i). When there is no potential output term, or only
one, in a goal, no run-time dependency can
exist.

(ii). If there are two potential output terms, two
possibilities exist -- one is that they are
coupled at run-time and the other is that
they are not. Both cases can be analyzed at
compile time, each leading to one possible
ordering, to be selected at run-time.


Since the number of possible mode combinations 
grows exponentially with the number of potential output 
terms r, this approach is not justified when r is large. 
However, most real world programs seem to have only a 
few potential output terms in the goal clause, and the 
compile time analysis is usually sufficient. 


3.3 Subgoal ordering techniques 


In the case of functional dependency, a partial
ordering is obtained at the time the producer-consumer
dependency is detected. But in the case of output
dependencies, there is flexibility in determining the
ordering, and which ordering leads to the highest
efficiency is yet to be determined. Therefore we need
some supplemental criteria, namely groundability and
subtree complexity, for this purpose.

Consider an output shared variable X in a clause C
as shown below:


$p(X) \leftarrow q_1(X), q_2(X), \ldots, q_r(X)$

Because of the output dependency due to X, only one
atom can be selected from $q_1(X)$ through $q_r(X)$ for
processing. Assume $q_i(X)$, with the substitution $\theta_i$ =
{X/t}, is the one selected.

If $\theta_i$ is a ground substitution, variable X in the other
$q_j$'s ($1 \le j \le r$, $j \ne i$) is changed from output to input after
$q_i$ is executed, and the data dependencies due to X
between these $q_j$'s are removed. As a result, the $q_j$'s can
then be processed in parallel. Conversely, if $\theta_i$ is a non-ground
substitution, the remaining $q_j$'s are still data
dependent and they cannot be processed in parallel.
Therefore, the key to the "efficiency" is to find a $q_i$
such that it has a ground substitution. If no such $q_i$
exists, a $q_i$ with $\theta_i$ that has more ground substitution
components should be selected prior to another $q_j$ which
has fewer such components. The measure of how close a
substitution is to the ground substitution is informally
called groundability.

Consider the following clause as an example:


grandparent(X,Y) ← parent(X,Z), parent(Z,Y)

with a goal ← grandparent(X,peter). A left-to-right
ordering will find all parent-child pairs first, then test
parent(Z,peter) for validity, rather than trying to
find the parent of the parent of Peter. Obviously, the
better ordering is to do parent(Z,peter) first, as implied
by its better groundability.
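
The effect of groundability on this example can be made concrete by counting how many bindings of Z each ordering hands to the second subgoal. The sketch below uses a small hypothetical parent database (the names other than "peter" are invented for illustration) and two hypothetical functions, one per ordering.

```python
# Hypothetical facts parent(P, C): P is a parent of C.
parent = [
    ("ann", "bob"), ("bob", "peter"), ("carl", "dora"),
    ("dora", "eve"), ("fay", "gus"),
]


def left_to_right(facts, target):
    """Solve parent(X,Z) first, then test parent(Z,target)."""
    candidates = [(x, z) for (x, z) in facts]          # every parent-child pair
    answers = [x for (x, z) in candidates if (z, target) in facts]
    return answers, len(candidates)


def groundable_first(facts, target):
    """Solve parent(Z,target) first, then parent(X,Z) with Z bound."""
    candidates = [z for (z, child) in facts if child == target]
    answers = [x for z in candidates for (x, child) in facts if child == z]
    return answers, len(candidates)


print(left_to_right(parent, "peter"))
print(groundable_first(parent, "peter"))
```

Both orderings return the same answer, but the left-to-right ordering generates one intermediate binding per fact in the database while the groundability-based ordering generates only the bindings that involve "peter".
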

In the case that more than one $q_i$ has the same
groundability, the ordering of these $q_i$'s might still
matter. Assume that both $q_1(x)$ and $q_2(x)$ have ground
substitutions, $q_1$ returns the binding in $k_1$ steps, $q_2$
returns in $k_2$ steps and $k_1 \gg k_2$ (i.e., $q_1$ has a much
larger computation tree than $q_2$). If $q_1$ is selected first,
we can start to process the remaining $q_j$'s in parallel
after $k_1$ steps. Assuming that the longest time taken by
these $q_j$'s is $k_j$, then the total time is $(k_1 + k_j)$. If $q_2$ is
selected first, the total time is then about $\max(k_1, k_j)$,
the $k_2$ steps of $q_2$ being negligible. Suppose $k_j$ is of the
same order as $k_1$; then the time
ratio between $q_1$-first computation and $q_2$-first
computation is about 2 to 1. Therefore the second
measure, regarding how soon a given subgoal can return
the answer, called here subtree complexity, will also be
taken into account in our ordering algorithm. Here,
subtree means the subtree of the AND/OR computation
tree, and complexity means the inference steps required
to solve a subgoal. While there is no easy way to obtain
the exact computation complexity of a subtree prior to
its execution, the complexity may be estimated for
program clauses that have special patterns such as the
following:


(a). Tail-recursion:

p([H|T], ...) ← q(H, ...), p(T, ...).

assume the input list has length n and q is
some terminal node; then the complexity is
O(n) when the complexity of q is less than
that of p.

(b). Head-tail recursion:

p([H|T], ...) ← p(H, ...), p(T, ...).

assume the input list has length n; then the
complexity is between O(log n) and O(n).
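
The timing argument behind subtree complexity can be checked numerically. The sketch below assumes a simplified cost model in which the selected subgoal runs alone and all remaining subgoals then run fully in parallel; the step counts are hypothetical values chosen so that $k_1 \gg k_2$ and $k_j$ is of the same order as $k_1$.

```python
def total_time(first, others):
    """Steps needed when one subgoal runs first and the rest run in parallel."""
    return first + max(others)


# Hypothetical step counts: k1 >> k2, and kj of the same order as k1.
k1, k2, kj = 1000, 10, 1000

q1_first = total_time(k1, [k2, kj])   # k1 + kj
q2_first = total_time(k2, [k1, kj])   # k2 + max(k1, kj)
print(q1_first, q2_first, round(q1_first / q2_first, 2))
```

Under these assumptions, selecting the cheaper subgoal first roughly halves the total time, in line with the 2 to 1 ratio stated in the text.
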


With the help of the groundability and the subtree 
complexity techniques, we can obtain an efficient partial 
ordering of subgoals in the case of output dependency. 
The subgoal ordering of the whole logic program can 
then be determined via these techniques as well as the 
result of phase 2. This is described below: 


q Ordering Algorithm: 


Input : A logic program. 
Output : Partial ordering of the subgoals in each 
clause of the program. 


Method : 
1. Perform input-output mode analysis 
algorithm for the program. 
2. Find functional, internal and coupling 
dependencies. The partial ordering of 


subgoals in functional dependency case is 
determined by producer-consumer order. 

3. For other types of dependencies, determine the groundability of
each atom. Let sg_i and sn_i be the numbers of ground and non-ground
output variables in atom q_i, then:

(a). for those q_i whose sn_i = 0, assume there are k1 such q_i's;
assign priority numbers 1 through k1 to them in order from larger
sg_i to smaller sg_i, then

(b). for those q_i whose sn_i = 1, or sn_i = 2 with these two terms
coupled, assume there are k2 such q_i's; assign priorities (k1+1)
through (k1+k2) to them, then

(c). for those q_i whose sn_i = 2, assume there are k3 such q_i's,
where k3 is small (say 1); assign priorities (k1+k2+1) through
(k1+k2+k3) to them for two cases, one ordering for when the two
variables are never coupled and another for when they are coupled
at run-time, then

(d). for the rest of the q_i's, assume there are k4 such q_i's; we
assume only the no-run-time-coupling case as the default, and assign
priorities (k1+k2+k3+1) through (k1+k2+k3+k4) to them.

4. If equal priority numbers result from the above step, perform the
subtree complexity algorithm: less complexity receives a smaller
priority number (that is, a higher priority).

5. Remove all but the highest-priority groundable output variables
from the atoms in step 3. (This means that if a variable can be
instantiated by some atom, the variable becomes an input after the
execution of that atom and will not count toward dependency.) Put
all removed atoms, as well as atoms that have input terms only, into
a set called the "last set," which is the last to be processed.

6. The compile-time partial ordering is done.

7. If in step 3, k3 ≠ 0, say k3 = 1, assume x and y are the two
output variables; we attach a run-time test statement as below:

(if (x couples y) then the coupled ordering
else the uncoupled ordering)

8. If in step 3, k4 ≠ 0, assume x_1, ..., x_n are the involved output
variables; then we have:

(if no coupling among [x_1, ..., x_n]
then take the default ordering
else complete the ordering at run-time)

<End of algorithm>
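The priority assignment of steps 3 and 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Atom` record, the `coupled` flag, and the numeric `complexity` field (a stand-in for the subtree complexity measure) are hypothetical names introduced here, not part of the algorithm as published.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    name: str
    sg: int                  # number of ground output variables
    sn: int                  # number of non-ground output variables
    coupled: bool = False    # hypothetical flag: the two non-ground terms are coupled
    complexity: int = 0      # hypothetical stand-in for subtree complexity (step 4)


def priority_class(atom):
    """Return 0..3 for classes (a)..(d) of step 3."""
    if atom.sn == 0:
        return 0
    if atom.sn == 1 or (atom.sn == 2 and atom.coupled):
        return 1
    if atom.sn == 2:
        return 2
    return 3


def assign_priorities(atoms):
    """Map atom name -> priority number (1 is the highest priority)."""
    def key(atom):
        cls = priority_class(atom)
        # Within class (a), larger sg comes first; ties fall back to complexity.
        sg_rank = -atom.sg if cls == 0 else 0
        return (cls, sg_rank, atom.complexity)

    ordered = sorted(atoms, key=key)
    return {atom.name: rank + 1 for rank, atom in enumerate(ordered)}
```

If the number of matching database items is used as the complexity value (an assumption made only for this sketch), four atoms with sn = 0 and sg = 1 and sizes 10, 25, 50 and 100 receive priorities 1 through 4 in that order.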


3.4 Performance analysis by example 


Consider a logic program that has one non-unit clause:

candidate(X) ← degree(X), engrg(X), exp(X,5), salary(X,50)


and many unit clauses that serve as a database, which includes 100
items of "degree(X)," 50 items of "engrg(X)," 10 items of "exp(X,5)"
and 25 items of "salary(X,50)," as shown below:

degree(John)    engrg(Bob)     exp(John,5)    salary(John,50)
degree(Bob)     engrg(Mary)    exp(Mary,3)    salary(Mary,35)
degree(Peter)   engrg(Peter)   exp(Peter,5)   salary(Peter,50)
<..100 items>   <..50 items>   <..10 items>   <..25 items>


Assume the goal clause is candidate(X). Now we employ the ordering
technique:

1. X is an output variable; all four subgoals have internal
dependency with each other.

2. There is no functional dependency.

3. All four subgoals have sn_i = 0, sg_i = 1.

4. Subgoals exp(X,5), salary(X,50), engrg(X), degree(X) have
priority 1, 2, 3 and 4 respectively.

5. Subgoals salary(X,50), engrg(X) and degree(X) are put into the
"last set."

6. Set {exp(X,5)} forms the first set, and the rest form the second
set.

According to the ordering, the subgoal exp(X,5) will be selected
first. Ten items are retrieved from the database and are fed into
the other three subgoals as their input binding sets. These three
subgoals can be started as soon as the first answer from exp(X,5) is
available. Assume that there is only one name, "Peter," that matches
the goal, although "Bob" matches both "degree(X)" and "engrg(X)."
For simplicity, assume every unification takes one time step; the
above scheme then takes 12 parallel unification steps using up to
100 processing units in this particular case. If we use the NSV
scheme with an arbitrary ordering, starting from "degree(X)" say, it
may take 102 parallel unification steps using the same number of
processing units, much worse than the scheme proposed in this paper.
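The producer-then-consumers evaluation described above can be illustrated with a small Python sketch. It uses only the database rows printed in the example (not the full 100/50/10/25 items), runs sequentially, and does not attempt to reproduce the parallel step counts; the function name `candidates` is hypothetical.

```python
# Toy database: only the rows shown in the example.
degree = {"John", "Bob", "Peter"}
engrg = {"Bob", "Mary", "Peter"}
exp = {("John", 5), ("Mary", 3), ("Peter", 5)}
salary = {("John", 50), ("Mary", 35), ("Peter", 50)}


def candidates():
    """Evaluate candidate(X) with exp(X,5) as the producer of X."""
    # First set: exp(X,5) generates the bindings for X.
    bindings = [name for (name, years) in sorted(exp) if years == 5]
    # Last set: the remaining subgoals only test each binding.
    return [x for x in bindings
            if (x, 50) in salary and x in engrg and x in degree]
```

On these rows, exp(X,5) produces the bindings John and Peter, and only Peter passes the three consumer subgoals.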


4. Conclusions 


We have analyzed the AND-parallelism in logic programming by means
of input-output and data dependencies, and we have also devised an
automatic input-output mode detection technique, which is then used
to derive a high-efficiency subgoal ordering. Because most of the
detection work can be done at compilation time, the resulting
parallel machine need not be very complicated. Each processing unit
in the parallel machine needs to do only unification and simple
housekeeping, as the run-time tasks are already minimized before a
program is executed. Based on this parallelism analysis, a simple
and efficient VLSI architecture for parallel logic programming was
proposed in [19].


ABSTRACT

Parallel depth-first searches are widely used to solve com- 
binatorial optimization and decision problems in artificial intelli- 
gence and operations research. These problems are represented 
by OR-trees and AND/OR-trees. The performance of parallel 
depth-first searches may be difficult to predict due to the non- 
determinism and anomalies of parallelism. In this paper we have 
derived the performance bounds of parallel depth-first searches 
with respect to optimization problems represented as OR-trees 
and have verified these bounds by simulations. These bounds 
provide the theoretical foundation to determine the number of 
processors to assure a near-linear speedup. The conditions to 
cope with parallel-to-parallel anomalies are also investigated. 
For decision problems represented by AND/OR-trees, such as 
evaluating logic programs, we have studied an ordered depth- 
first search that rearranges nodes in each level of the AND/OR 
tree to minimize the expected search cost. 


1. INTRODUCTION 

Combinatorial-search problems can be classified into two 
types. The first type is decision problems that decide whether at 
least one solution exists and satisfies a given set of con- 
straints [21]. Theorem-proving, expert systems, and evaluating 
a logic program belong to this class. The second type is combina- 
torial extremum-search or optimization problems that are 
characterized by an objective function to be minimized or max- 
imized and a set of constraints to be satisfied. Practical problems 
such as finding the shortest path, planning, finding the shortest 
tour of a traveling salesman, job-shop scheduling, packing a 
knapsack, vertex cover, and integer programming belong to this
class.

The non-terminal nodes in a search tree (or graph) can be
classified as AND-nodes and OR-nodes. An AND-node represents a
problem (or subproblem) that is solved only if all its descendant
nodes have been solved, while an OR-node represents a problem (or
subproblem) that is solved if any of its immediate descendants is
solved. Based on these two kinds of nodes, a combinatorial search
can be classified into an AND-tree, OR-tree, and AND/OR-tree
search [25]. Note that a general dataflow graph contains AND-nodes
and OR-nodes that relate the descendant nodes, as well as other
nodes that relate the ascendant nodes.
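The AND-node and OR-node semantics can be made concrete with a short recursive evaluator. This is a minimal sketch; the tuple encoding of nodes is an assumption chosen for brevity.

```python
def solved(node):
    """Return True if the (sub)problem rooted at `node` is solved.

    A node is a tuple: ("leaf", bool), ("and", [children]) or ("or", [children]).
    """
    kind, payload = node
    if kind == "leaf":
        return payload
    if kind == "and":
        # Solved only if every descendant is solved.
        return all(solved(child) for child in payload)
    if kind == "or":
        # Solved as soon as one descendant is solved.
        return any(solved(child) for child in payload)
    raise ValueError(f"unknown node kind: {kind}")
```

Because `all` and `any` stop at the first decisive child, the order in which children are listed determines how much of the tree is actually visited.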

In this paper we will concentrate on evaluating problems 
that arise in nondeterministic computations, namely, those prob- 
lems that are represented as OR-trees or AND/OR-trees. As an 
AND-tree represents deterministic computations and all nodes in 
it must be evaluated, it will not be discussed here [14]. Due to 
space limitation, we will only present results on the depth-first 
search strategy. Results on other strategies with respect to OR- 
trees can be found elsewhere [15, 16]. 

An OR-tree is a state-space tree in which all non-terminal nodes
are OR-nodes, while an AND/OR-tree is a problem-reduction
representation that consists of AND-nodes and OR-nodes. Many
AND/OR-tree search procedures, such as AO*, SSS*, and dynamic
programming, can be formulated as a general branch-and-bound (B&B)
procedure [8,19], which is a well-known OR-tree search method.
Likewise, evaluating a logic program can be represented as an
OR-tree or AND/OR-tree search [7].

Both combinatorial OR-tree and AND/OR-tree search procedures can
be characterized by four constituents: a branching rule, a
selection rule, an elimination rule, and a termination condition.
The first two rules are used to decompose problems into simpler
subproblems and to appropriately order the search. The last two
rules are used to eliminate unnecessary subproblems. Appropriately
ordering the search and restricting the region searched are the key
ideas behind any search algorithm.

The rules to guide the search and to prune unnecessary 
searches may differ for optimization and decision problems. In 
optimization problems, a lower bound of the objective value for 
each nonterminal node can be used to guide the search and to 
prune nodes that cannot lead to a better solution. Dominance
tests, such as α-β pruning, can also be adopted as elimination
rules. In decision problems, it was found that the ratio of the
success probability of a subproblem to the estimated overhead of
evaluating the subproblem is useful to guide the search [21, 1, 12].
The elimination rules are more restricted in decision problems,
such as evaluating a logic program. Pruning a subproblem with a
smaller success probability or a larger search cost may remove a
possible (and possibly a unique) solution. In this case AND-pruning
or OR-pruning rules can be applied only when a terminal node is
found to be true or false [12].

There are three basic selection strategies, namely, depth- 
first, breadth-first, and best-first searches. A generalized heuris- 
tic function can be used to unify these three kinds of search stra- 
tegies and resolve ambiguities in the heuristic function [4, 10]. 
To resolve the ambiguity on the selection of subproblems, dis- 
tinct heuristic values must be defined for the nodes to allow ties 
to be broken. A path number can be used to define an unambiguous
heuristic function. The path number of a node in a tree is a
sequence of (h+1) integers representing the path from the root to
this node, where h is the maximum number of levels of the
tree [10, 15]. For example, the path numbers of nodes A, B, C, and
D in Figure 1c are 0000, 0100, 0200, and 0300, respectively. Note
that nodes having equal path numbers never coexist in the search
process. For a depth-first search, the generalized heuristic
function is defined as

$$h(P_i) = (\text{path number}, \text{level number}) \tag{1}$$
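A minimal Python sketch of the path-number heuristic follows. The encoding is an assumption: a node is identified by the tuple of 1-based child positions taken below the root, the root contributes a leading 0, and the level number is taken to be the node's depth. The function names are hypothetical.

```python
def path_number(child_indices, h):
    """Pad the root-to-node path to h+1 integers."""
    path = [0] + list(child_indices)
    if len(path) > h + 1:
        raise ValueError("node lies deeper than h levels")
    return tuple(path + [0] * (h + 1 - len(path)))


def heuristic(child_indices, h):
    """Generalized depth-first heuristic value: (path number, level number)."""
    return (path_number(child_indices, h), len(child_indices))


def select(active, k, h):
    """Pick the k active nodes with the smallest heuristic values."""
    return sorted(active, key=lambda node: heuristic(node, h))[:k]
```

With h = 3, the root and its first three children map to (0,0,0,0), (0,1,0,0), (0,2,0,0) and (0,3,0,0), which matches the 0000, 0100, 0200, 0300 pattern quoted for Figure 1c. Selecting the smallest values favors the leftmost, deepest active nodes, which is the depth-first order.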


Although a best-first search expands fewer nodes than a 
depth-first search, it requires a secondary memory to maintain 
the large number of active nodes, hence the total time, including 
the time spent on data transfers between the main and secondary 
memories, to solve a problem may be longer than that required 
by the depth-first search. Simulations have shown that the best 
OR-tree search strategy depends on the accuracy of the 
problem-dependent lower-bound function [24]. Very inaccurate 
lower bounds are not useful to guide the search, while very
accurate lower bounds will prune most unnecessary expansions. In
both cases the number of subproblems expanded by depth-first and
best-first searches will not differ greatly, and a depth-first
search is better as it requires less memory space.

Extensive studies have been conducted on OR- 
parallelism [2, 12, 15, 18], but very few studies have been done 
on analyzing the speedups and efficiency of OR-parallelism. Due 
to the nondeterminism, combinatorial OR-tree and AND/OR-tree 
searches are quite different from conventional deterministic 
numerical computations. Simulation results have revealed that 
using more processors in parallel depth-first searches might 
degrade the performance, even when the communication over- 
head is ignored [10]. The prediction of performance and 
methods to cope with anomalous behavior are important prob- 
lems to be studied in designing multiprocessors for parallel 
depth-first searches and will be addressed in this paper. 

To take advantage of the search efficiency of best-first 
searches while avoiding their memory overhead, an informed 
depth-first search can be used [20]. In this strategy best-first 
search is performed locally and depth-first search globally. A 
special case is one in which all sibling nodes are ordered accord- 
ing to heuristic values of the siblings (a more accurate definition 
will be given in Section 3). We will show that this ordered 
depth-first strategy is very effective to evaluate logic programs 
represented as AND/OR trees. 


2. PARALLEL DEPTH-FIRST OR-TREE SEARCHES 

To predict the number of processors needed to assure a 
near-linear speedup in a parallel depth-first search, we will 
derive the bounds on computational efficiency. The results in 
this section indicate the relationship among the number of itera- 
tions required in a parallel depth-first search, the number of pro- 
cessors used, and the complexity of the problem to be solved. 


2.1. Model of Efficiency Analysis 

In analyzing the performance bounds, a synchronous model is
assumed, that is, all processors must finish the current iteration
before proceeding to the next iteration. The resulting performance
forms a lower bound on that of asynchronous models.

The parallel computational model used here consists of a 
set of processors connected to a shared memory. In each itera- 
tion, multiple subproblems are selected and decomposed. The 
newly generated subproblems are tested for feasibility, elim- 
inated by (exact or approximate) lower-bound tests and domi- 
nance tests, and inserted into the active list(s) if not eliminated. 
In this model eliminations are performed after branching instead 
of after selection as in Ibaraki’s algorithm [5] to reduce the 
memory space required. 

We have proved that, for best-first searches, the perfor- 
mance is not largely affected by whether the active subproblems 
are kept in a single shared list or multiple lists [23,15]. How- 
ever, for depth-first searches, the performance will be problem- 
dependent when multiple lists are used. In this paper the perfor- 
mance bounds are derived under the assumption that one list is 
used and that the nodes with the smallest heuristic values are 
selected in each iteration. 

Since subproblems are decomposed synchronously and the bulk of the
overhead is on branching operations, the number of iterations,
which is the number of times that subproblems are decomposed in
each processor, is an adequate measure in both the serial and
parallel models. The speedup between using $k_1$ and $k_2$,
$k_2 > k_1$, processors is thus measured by the ratio of the number
of iterations when $k_1$ processors are used to that when $k_2$
processors are used. Once the optimal solution is found, the time
to drain the remaining subproblems from the list(s) is not
accounted for, since this overhead is negligible compared to that
of branching operations.

The results proved in this section show the performance bounds of
parallel depth-first OR-tree searches for solving optimization
problems. The proofs of these theorems require the following
definitions on essential nodes. A node expanded in a serial
depth-first search is called an essential node; otherwise it is
called a non-essential node. The speedup of a parallel depth-first
search depends on the number of essential nodes selected in each
iteration. An iteration is said to be perfect if the number of
essential nodes selected is equal to the number of processors;
otherwise it is said to be imperfect. The incumbent at any given
time in the search process is the best feasible solution obtained
at that time. The incumbent is continuously updated until an
optimal solution is found. We denote $T_b(k,\epsilon)$ and
$T_d(k,\epsilon)$ as the number of iterations required to find a
single optimal (or suboptimal) solution using $k$, $k \ge 1$,
processors in a best-first and depth-first search, respectively,
where $\epsilon$ is an allowance function specifying the allowable
deviation of a suboptimal value from the exact optimal value. When
an approximate solution is sought, i.e., $\epsilon > 0$, during the
search of an OR-tree, an active node $P_i$ is terminated if

$$g(P_i) \ge \frac{z}{1+\epsilon}, \qquad \epsilon \ge 0,\ z \ge 0 \tag{2}$$

where $g(P_i)$ is the lower bound of $P_i$ and $z$ is an incumbent
obtained at that time.


2.2. Parallel Depth-First Searches

The following theorem shows that the performance of parallel
depth-first searches depends on the problem complexity and the
number of distinct incumbents found during the search process.


Theorem 1: For a parallel depth-first OR-tree search with $k$
processors, $\epsilon = 0$, and a generalized heuristic function of
$h(P_i)$ = (path number, level number),

$$\frac{T_b'(1,0) - 1}{k} + 1 \;\le\; T_d(k,0) \;\le\; \frac{T_d(1,0)}{k} + \frac{k-1}{k}\left[(c+1)h - c\right] \tag{3}$$

where $h$ is the height of the OR-tree, $c$ is the number of
distinct incumbents obtained during the serial depth-first search,
and $T_b'(1,0)$ is the number of essential nodes in a serial
best-first search with lower bounds less than the optimal-solution
value.
Proof: The sequence of iterations obtained during a serial
depth-first search can be divided into $(c+1)$ subsequences
according to the $c$ distinct monotonically decreasing incumbents
obtained. Let the $c$ feasible solutions and their corresponding
parents be denoted by $F_1, \ldots, F_c$ and $P_1, \ldots, P_c$.
Further, assume that $F_1, \ldots, F_c$ are obtained in the
$i_1$'th, ..., $i_c$'th iterations, respectively. Hence iterations
from 1 to $i_1$ belong to the first subsequence, and iterations
from $i_j + 1$ to $i_{j+1}$ belong to the $(j+1)$'th subsequence.

We now consider the $j$'th, $1 \le j \le c$, subsequence. Let
$\ell_{\min}(x)$ be the level with the minimum level number in
which some active essential nodes, whose heuristic values are
between $h(P_{j-1})$ and $h(P_j)$, reside in the $x$'th iteration.
For levels less than $\ell_{\min}(x)$, all active nodes whose
heuristic values are between $h(P_{j-1})$ and $h(P_j)$ are
non-essential. We show that Iteration $x$ is imperfect only if all
essential nodes in $\ell_{\min}(x)$ whose heuristic values are
between $h(P_{j-1})$ and $h(P_j)$ are selected for expansion.
Suppose that Iteration $x$ is imperfect; the selected non-essential
node must have a heuristic value larger than $h(P_j)$, because
otherwise this node would have to be eliminated by the feasible
solution $F_{j-1}$ ($F_0$ is the initial feasible solution obtained
by a heuristic method). Thus after Iteration $x$ is carried out,
$\ell_{\min}(x)$ must be increased by at least one. Consequently,
after at most $h$ imperfect iterations, $F_j$ must be found.

During the last subsequence of iterations, since the optimal
solution has been generated, an iteration is imperfect only if
fewer than $k$ nodes are selected in it. In other words, an
imperfect iteration implies that all currently active nodes are
selected and expanded, and only descendants of these nodes can be
active in the next iteration. Hence no active node remains after at
most $h$ imperfect iterations in the last subsequence. The previous
analysis shows that at most $(c+1)h$ imperfect iterations can
appear in a parallel depth-first search. Since at least one node in
each iteration in the parallel case belongs to $\Phi^1$, the set of
nodes expanded in the serial depth-first search [10, 15], the upper
bound of $T_d(k,0)$ can be derived as

$$T_d(k,0) \le \frac{T_d(1,0) - (c+1)h}{k} + (c+1)h$$

In the above discussion, the expansion of the root is counted in
each of the $(c+1)$ subsequences. Since the root is only expanded
once, the above upper bound should be compensated by the additional
number of times that the root is expanded (Eq. (3)).

The lower bound on $T_d(k,0)$ can be proved easily because all
essential nodes in a serial best-first search with lower bounds
less than the optimal solution must be expanded in the parallel
depth-first search. □


* Dominance tests will not be discussed in this paper due to space limitation.


For problems such as integer programming and 0-1 knapsack
problems, all feasible solutions are located in the bottommost
level of the OR-tree. In this case the following corollary shows
that all essential nodes of a serial depth-first search must be
expanded in a parallel depth-first search, and a tighter lower
bound is obtained.


Corollary 1: In searching an OR-tree using a parallel depth-first
search and a heuristic function of (path number, level number), if
$\epsilon = 0$ and all feasible solutions are in Level $h$, then

$$\frac{T_d(1,\epsilon) - 1}{k} + 1 \le T_d(k,\epsilon) \tag{4}$$

where $h$ is the maximum number of levels of the OR-tree.
Proof: The proof is omitted due to space limitation [13, 10].


The bounds in Theorem 1 are tight in the sense that we can
construct examples that achieve the lower and upper bounds on
computational time. These degenerate cases occur rarely. Although
$c$, the number of distinct incumbents, is unknown until the
solution is found, $c$ is usually small and can be estimated when
integral solutions are sought. It has been observed that $c$ is
less than 10 for vertex-cover problems with fewer than 100
vertices. For most integer programming problems, $c = 1$. In these
cases the range on $T_d(k,0)$ is tight, and a near-linear speedup
can be achieved in a large range of $k$.

Let $w$ be $T_d(1,0)/h$; $w$ can be viewed as the "average width"
of an OR-tree that consists only of essential nodes. Eq. (3) can be
rewritten as

$$\frac{T_d(1,0)}{T_d(k,0)} \ge \frac{k\,w}{w + (c+1)(k-1)} \tag{5}$$

From Eq. (5), it is easy to see that if $w \gg k$ and $c$ is small,
then the speedup is close to $k$; whereas if $w \ll k$, then the
lower-bound speedup is close to $w/(c+1)$.
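The bounds of Eq. (3) and Eq. (5) are simple enough to evaluate directly. The sketch below is illustrative only: the function names are hypothetical, and the rounding (floor for the lower bound, ceiling for the upper bound) is an assumption, chosen because it reproduces, for example, the k = 2 and k = 16 rows of Case 1 in Table 1.

```python
def imperfect_limit(c, h):
    """Maximum number of imperfect iterations, (c+1)h - c, as used in Eq. (3)."""
    return (c + 1) * h - c


def lower_bound(t_b_prime, k):
    """Left side of Eq. (3), rounded down: (T_b'(1,0) - 1)/k + 1."""
    return (t_b_prime - 1) // k + 1


def upper_bound(t_d_serial, k, c, h):
    """Right side of Eq. (3), rounded up: T_d(1,0)/k + ((k-1)/k)[(c+1)h - c]."""
    numerator = t_d_serial + (k - 1) * imperfect_limit(c, h)
    return -(-numerator // k)  # integer ceiling


def speedup_lower_bound(w, k, c):
    """Right side of Eq. (5): k*w / (w + (c+1)(k-1))."""
    return k * w / (w + (c + 1) * (k - 1))
```

For Case 1 (serial iterations 70790, c = 22, h = 35), `lower_bound(70790, 2)` and `upper_bound(70790, 2, 22, 35)` give 35395 and 35787, bracketing the 35630 iterations observed with two processors.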

In Table 1, the theoretical bounds derived above are compared with
the simulation results of parallel depth-first searches to solve
two 35-object knapsack problems. In generating the knapsack
problems, $w(i)$, the weights, were chosen randomly between 0 and
100 with a uniform distribution, and the profits were set to be
$p(i) = (w(i) + 10)$. This assignment is intended to increase the
complexity of the randomly generated problems. The results
demonstrate that the bounds on parallel depth-first searches are
tight, hence their performance can be predicted quite accurately.
Table 1 also shows that the speedup depends strongly on $w$. In
Case 1, $w \approx 2023$, and a near-linear speedup of $0.88k$ is
achieved with 256 processors. In Case 2, $w \approx 188$, and a
speedup of $0.29k$ is obtained with 256 processors. Note that when
the number of processors is large, the number of essential nodes in
each imperfect iteration of the parallel depth-first search is
usually larger than one. In contrast to the upper bound in Eq. (3),
which was derived with the assumption of one essential node in each
imperfect iteration, $T_d(k,0)$ may be much smaller than the upper
bound. Simulations have also revealed that for a number of OR-tree
search problems, $T_d(k,0)$ may be very close to $T_b(k,0)$.


| Proc. | Lower bound | Iterat. | Upper bound | Speedup |

Case 1
|     1 |       70790 |   70790 |       70790 |   1.000 |
|     2 |       35395 |   35630 |       35787 |   1.987 |
|     4 |       17698 |   18044 |       18285 |   3.923 |
|     8 |        8849 |    8884 |        9534 |   7.968 |
|    16 |        4425 |    4460 |        5159 |  15.872 |
|    32 |        2213 |    2247 |        2971 |  31.504 |
|    64 |        1107 |    1143 |        1877 |  61.934 |
|   128 |         554 |     592 |        1330 | 119.578 |

Case 2
|     1 |        6566 |    6582 |        6582 |   1.000 |
|     2 |        3283 |    3488 |        3513 |   1.887 |
|     4 |        1642 |    1940 |        1978 |   3.393 |
|     8 |         821 |    1161 |        1211 |         |


Table 1. Comparisons between theoretical bounds and simulation
results on parallel depth-first searches for knapsack
problems with 35 objects. ($T_b'(1,0) = T_b(1,0)$. During
depth-first searches, $c = 22$ in Case 1 and $c = 12$ in Case 2.)

Analogous to the proof of Theorem 1, the upper bound on
$T_d(k,\epsilon)$, $\epsilon > 0$, can be derived. To find the
lower bound on $T_d(k,\epsilon)$, let $f_0$ be the optimal-solution
value and $\mathrm{MINT}_b(\epsilon)$ be the minimum number of
nodes to be expanded in the approximate best-first search.
$\mathrm{MINT}_b(\epsilon)$ represents the number of nodes whose
lower bounds are less than $f_0/(1+\epsilon)$, since these nodes
must be expanded in the best case. $\mathrm{MINT}_b(\epsilon)$ may
be estimated from the distribution on the number of subproblems
with respect to lower bounds. From the above analysis, we get

$$\frac{\mathrm{MINT}_b(\epsilon) - 1}{k} + 1 \;\le\; T_d(k,\epsilon) \;\le\; \frac{T_d(1,\epsilon)}{k} + \frac{k-1}{k}\left[(c+1)h - c\right] \tag{6}$$


2.3. Coping With General Parallel-to-Parallel Anomalies

Some results on coping with serial-to-parallel anomalies have been
published elsewhere [10, 11, 15]. We now present results on coping
with parallel-to-parallel anomalies of depth-first OR-tree searches
based on the performance bounds derived in the last section. When
comparing the efficiency between using $k_1$ and $k_2$ processors,
$1 \le k_1 < k_2$, a $k_2/k_1$-fold speedup (ratio of the number of
iterations in the two cases in our model) is expected. However,
simulations have shown that the speedup can be (a) less than one
(called a detrimental anomaly) [6, 17, 9]; or (b) greater than
$k_2/k_1$ (called an acceleration anomaly) [6, 9]; or (c) between
one and $k_2/k_1$ (called a deceleration anomaly) [6, 22, 17, 9].
So far, all known results on parallel OR-tree searches showed a
near-linear speedup for only a small number of processors.

Anomalies are studied with respect to the assumption that 
all idle processors are used to expand active subproblems. In 
fact, detrimental anomalies cannot happen if some processors can 
be kept idle in the presence of active subproblems. The number 
of processors to be kept idle is problem dependent and is very 
difficult to find without first solving the problem. 


Some anomalies on parallel depth-first OR-tree searches are
illustrated here. A single list of subproblems is assumed. The
behavior of using multiple lists is analogous to that of a
centralized list. An example of an acceleration anomaly with an
approximate depth-first or best-first search is shown in Figure 
la. When three processors are used, the optimal solution is 
found in the second iteration, and P, and Pz are eliminated. If 
two processors are used, subtrees T, and Ts, have to be expanded. 
T(2,0.1)/T(3,0.1) will be much larger than 3/2 if T, and T; are 
very large. Figure 1b illustrates a detrimental anomaly under
an approximate best-first or depth-first search with ε = 0.1.
When two processors are used, f(Ps), the optimal solution, is 
found in the fourth iteration. Assuming that the lower bounds 
of nodes in T3 are between 8.2 and 9, all nodes in T; will be 
eliminated by lower-bound tests with Pg since [9/(1+€)]<8.2. 
When three processors are used, P3 is expanded in the third itera- 
tion. Ps, Pg, and P, are generated and will be selected in the next 
iteration. If T_3 is large, T(2,ε) < T(3,ε) will occur. A
detrimental anomaly may occur even when lower-bound tests are
inactive and is illustrated in Figure 1c. A similar example can be
derived for acceleration anomalies.
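The approximate lower-bound test that drives the Figure 1b example can be written in a few lines. This is a sketch under one assumption: a node is pruned when its lower bound is at least the incumbent divided by (1 + ε); the function name is hypothetical.

```python
def is_terminated(lower_bound, incumbent, eps):
    """Approximate lower-bound test: prune when lower_bound >= incumbent / (1 + eps)."""
    if eps < 0 or incumbent < 0:
        raise ValueError("eps and incumbent must be non-negative")
    return lower_bound >= incumbent / (1.0 + eps)
```

With an incumbent of 9 and ε = 0.1, the threshold is 9/1.1, roughly 8.18, so every node with a lower bound between 8.2 and 9 is pruned. With ε = 0 the same nodes survive unless their lower bound reaches 9.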

In the last section, we have derived the performance bounds with
respect to depth-first OR-tree searches. From these results, we can
develop the relative efficiency between using $k_1$ and $k_2$,
$1 \le k_1 < k_2$, processors. First, we derive a sufficient
condition to assure the monotonic increase in computational
efficiency with respect to the number of processors. To simplify
the sufficient condition, the following bounds on $T_d(k,0)$ are
used.

$$\frac{T_b'(1,0)}{k} \le T_d(k,0) \le \frac{T_d(1,0)}{k} + (c+1)h$$

Corollary 2:** Let $r_d' = T_b'(1,0)/T_d(1,0) \le 1$. In a parallel
depth-first search that satisfies the assumptions of Theorem 1,
$T_d(k_2,0) < T_d(k_1,0)$ when

$$\frac{T_d(1,0)}{h} > \frac{(c+1)k_1 k_2}{r_d' k_2 - k_1} \quad\text{and}\quad r_d' > \frac{k_1}{k_2}, \qquad 1 \le k_1 < k_2 \tag{7}$$

where $c$ is the number of distinct incumbents obtained during the
serial depth-first search.
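The sufficient condition of Corollary 2 can be checked numerically. The sketch below assumes the strict-inequality reading of Eq. (7); the function name and argument names are hypothetical. A return value of False means only that the sufficient condition is not met, not that a detrimental anomaly must occur.

```python
def no_detrimental_anomaly(t_d_serial, t_b_prime, h, c, k1, k2):
    """Sufficient condition of Eq. (7) for T_d(k2,0) < T_d(k1,0)."""
    if not 1 <= k1 < k2:
        raise ValueError("require 1 <= k1 < k2")
    r = t_b_prime / t_d_serial          # r_d' in the text
    denom = r * k2 - k1
    if denom <= 0:                      # r_d' > k1/k2 is violated
        return False
    return t_d_serial / h > (c + 1) * k1 * k2 / denom
```

Using the Case 1 figures of Table 1 (70790 serial iterations, h = 35, c = 22, and $r_d' = 1$), the condition holds when going from 32 to 64 processors but not from 64 to 128, since (c+1)·128 = 2944 exceeds the average width of about 2023.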


From Corollary 2, we can conclude that the existence of
parallel-to-parallel detrimental anomalies in depth-first searches
depends on $T_b'(1,0)$, $r_d'$, and $c$. If $r_d' \approx 1$, $c$
is small, and $T_b'(1,0)$ is very large, then Eq. (7) will be
satisfied. Our simulation results reveal that for some problems,
such as the 0-1 knapsack and vertex-cover problems, $T_d(1,0)$ is
close to $T_b'(1,0)$, hence $r_d' \approx 1$. Moreover, if the
feasible-solution values must be integers, then $c$ is often small.
For this kind of problem, detrimental anomalies can be prevented
for parallel depth-first searches when $T_b'(1,0)$ is large and
$k_2$ is relatively small. However, the range of parallel
processing within which no detrimental anomalies occur for
depth-first searches is smaller than that for best-first
searches [13].

From Theorem 1, we can also derive a necessary condition for
acceleration anomalies with respect to $k_1$ and $k_2$ processors.


Corollary 3:** In a parallel depth-first search that satisfies the
assumptions of Theorem 1, $T_d(k_1,0)/T_d(k_2,0) > k_2/k_1$ only if

$$T_d(1,0) - T_b'(1,0) > k_2 - 1 - (k_1 - 1)\left[(c+1)h - c\right], \qquad 1 \le k_1 < k_2 \tag{8}$$

If all solutions are located at the bottommost level of the
OR-tree, then the corresponding necessary condition is simplified
by Corollary 1 as

$$(c+1)h - c > \frac{k_2 - 1}{k_1 - 1}, \qquad 1 < k_1 < k_2 \tag{9}$$

Obviously, the necessary condition in Eq. (8) is readily satisfied,
and $T_d(k_1,0)/T_d(k_2,0)$ may be much greater than $k_2/k_1$.
Usually, if $k_1$ and $k_2$ are close to each other and $h$ is
large, then acceleration anomalies may occur quite often.


** The proof is omitted due to space limitation and can be found
elsewhere [13, 16].


[Figure 1: tree diagrams omitted.]
(a) Acceleration anomalies with lower-bound tests:
$T_b(2,0.1)/T_b(3,0.1) > 3/2$; $T_d(2,0.1)/T_d(3,0.1) > 3/2$.
(b) Detrimental anomalies with approximate lower-bound tests:
$T_b(3,0.1) > T_b(2,0.1)$; $T_d(3,0.1) > T_d(2,0.1)$.
(c) Detrimental anomalies without lower-bound tests in a
depth-first or best-first search, T(4,0)=5, T(5,0)=6.
(Number inside node is the evaluation order using four processors;
number outside node is the evaluation order using five processors.)

Figure 1. Examples of anomalies.

When a suboptimal solution is sought, the following corollary
shows the required sufficient conditions.


Corollary 4: In parallel depth-first searches that satisfy the
assumptions of Theorem 1 with the exception that $\epsilon > 0$,
$T_d(k_2,\epsilon) < T_d(k_1,\epsilon)$ when

$$\frac{T_d(1,\epsilon)}{h} > \frac{(c+1)k_1 k_2}{r_d k_2 - k_1} \quad\text{and}\quad r_d > \frac{k_1}{k_2}, \qquad 1 \le k_1 < k_2 \tag{10}$$

where $r_d = \mathrm{MINT}_b(\epsilon)/T_d(1,\epsilon)$.
$T_d(k_1,\epsilon)/T_d(k_2,\epsilon) > k_2/k_1$ only if

$$T_d(1,\epsilon) - \mathrm{MINT}_b(\epsilon) > k_2 - 1 - (k_1 - 1)\left[(c+1)h - c\right] \tag{11}$$

If all feasible solutions are located at the bottommost level of
the OR-tree, the necessary condition to allow acceleration
anomalies is the same as that stated in Eq. (9). Further, a weaker
sufficient condition to eliminate detrimental anomalies can be
derived from Corollary 1.

$$\frac{T_d(1,\epsilon)}{h} > \frac{(c+1)k_1 k_2}{k_2 - k_1} \tag{12}$$

3. ORDERED DEPTH-FIRST SEARCH FOR EVALUATING 
LOGIC PROGRAMS 

In our previous paper [12] we developed an optimal search strategy
to evaluate logic programs modeled as AND/OR-trees using the
heuristic information $p(x)$, the success probability of a subgoal
(or clause) $x$, and $c(x)$, the estimated overhead of evaluating
the subgoal (or clause). The heuristic information to guide the
search is defined as follows.

$$\Phi_o(x) = \frac{p(x)}{c(x)} \quad (x \text{ is a descendant of an OR-node}) \tag{13}$$

$$\Phi_a(x) = \frac{1 - p(x)}{c(x)} \quad (x \text{ is a descendant of an AND-node}) \tag{14}$$

The logic program is first transformed from the AND/OR-tree
representation into a two-level AND/OR-tree. The root of the
transformed tree is an OR-node and represents the selection of
clauses, and its descendants are AND-nodes and represent different
solution trees in the logic program. The descendants of the OR-node
are ordered according to decreasing values of $\Phi_o$, and the
descendants of the AND-nodes are ordered according to decreasing
values of $\Phi_a$.
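The ordering rule of Eqs. (13) and (14) can be tried out on a small example. The sketch below assumes independent sibling nodes that are evaluated left to right, stopping at the first success under an OR-node and at the first failure under an AND-node; the `(name, p, c)` triples and all function names are hypothetical.

```python
from itertools import permutations


def phi_or(p, c):
    """Eq. (13): success probability per unit cost."""
    return p / c


def phi_and(p, c):
    """Eq. (14): failure probability per unit cost."""
    return (1.0 - p) / c


def order_children(children, parent_kind):
    """Sort (name, p, c) triples by decreasing heuristic value."""
    phi = phi_or if parent_kind == "or" else phi_and
    return sorted(children, key=lambda ch: phi(ch[1], ch[2]), reverse=True)


def expected_cost(children, parent_kind):
    """Expected cost of scanning independent children left to right."""
    total, p_continue = 0.0, 1.0
    for _, p, c in children:
        total += p_continue * c
        # The scan continues past a child that fails (OR) or succeeds (AND).
        p_continue *= (1.0 - p) if parent_kind == "or" else p
    return total


def brute_force_best(children, parent_kind):
    """Smallest expected cost over every possible ordering (tiny inputs only)."""
    return min(expected_cost(list(perm), parent_kind)
               for perm in permutations(children))
```

For three children with (p, c) of (0.2, 4), (0.9, 3) and (0.5, 1), the heuristic ordering attains the brute-force minimum under both node types, which is consistent with the optimality claim for independent siblings.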

Although the above strategy minimizes the expected search time,
there are two implementation problems. First, the transformed
AND/OR-tree significantly expands the number of nodes in the
original AND/OR-tree. In fact, the number of potential solution
trees is a hyper-exponential function of the height of the tree. To
apply the above search strategy on the original AND/OR-tree, a
global list is required to maintain the order of all possible
solution trees, and the storage overhead is prohibitively
large [12]. Second, if two solution trees $T_1$ and $T_2$ have
nearly equal $\Phi_o$ or $\Phi_a$, then exchanging the search order
of $T_1$ and $T_2$ may not significantly improve the expected
search overhead. As an example, suppose that the success
probabilities and the estimated overheads of all solution trees
rooted at a non-terminal node are uniformly distributed between
0.01 and 0.99 and between 1 and 10 units of cost, respectively, and
that there are a million possible solution trees from this node.
Suppose further that two solution trees can be viewed as having
nearly equal $\Phi_o$ or $\Phi_a$ if their difference is less than
0.001. Then, approximately every thousand solution trees have
nearly equal $\Phi_o$ or $\Phi_a$. Obviously, it is unnecessary to
store the exact order of all solution trees.

In this section we will address two problems. First, given 
an ordered depth-first search strategy and assuming that all 
sibling nodes in the AND/OR-tree are independent, what is the 
order to search the nodes in each level of the AND/OR-tree to 
minimize the expected search time? Second, for a logic program 
with shared variables and clauses, how should the subgoals and 
clauses be ordered to minimize the average search cost of a 
depth-first search? 


3.1 Assumptions 

In a logic program, if there are n clauses whose heads match
(sub-)goal A, then they can be ordered according to the given
heuristic values. Likewise, if there are m subgoals in the body of
a clause B, B :- B_1, ..., B_m, then the m subgoals can also be
ordered.

The assumptions made in the search strategy are described 
here. 
(1) For a given representation of the AND/OR-tree, a depth-first
search is used. When nodes in each level are ordered according
to the heuristic values, the search is called an ordered
depth-first search.

(2) A producer-consumer model is used to bind values to variables.
A variable is a producer if it has not been bound to any value;
otherwise, it is a consumer. For each variable not defined in the
head, only its leftmost occurrence can be the producer, as a
depth-first search is used. All other occurrences of this variable
in this clause are consumers. For example, in the clause
A(x,y) :- B(x,z), C(z,y), D(x,y), variable z in subgoal B must be a
producer, while variable z in subgoal C is a consumer. Depending
on whether a variable defined in the head is a producer or a
consumer, the variable in the corresponding subgoal will be a
producer or a consumer. For example, if x is a producer in A, then
x in B is a producer, while x in D is a consumer. We use a
subscript '+' to indicate that the mode of a variable is a producer
and a '-' to indicate that its mode is a consumer. As an example,
A(x_+,y_-) :- B(x_+,z_+), C(z_-,y_-), D(x_-,y_-). When a variable in a
subgoal is a consumer, it is necessary to verify in this
subgoal whether the subgoal is TRUE or FALSE for such a
binding of value. In contrast, when a variable in a subgoal
is a producer, it is necessary to find a binding of value to the
variable such that this subgoal is TRUE.

(3) The probability of a subgoal to return TRUE and the average
minimum overhead to determine whether a subgoal is
TRUE or FALSE are independent of the bound values.

(4) The overhead to test whether a subgoal in a clause is TRUE
or FALSE for a given binding of values to variables or to
generate a binding of values to variables is assumed to be
independent of other subgoals in this clause, provided that
the modes of its variables are unchanged. Likewise, the
overhead to verify the head of a clause is independent of
other clauses with the same head when the modes of its
variables are unchanged. These assumptions are valid when
results in one subgoal or clause are passed to other subgoals
or clauses through the binding of values to variables.

(5) The probability that a subgoal in a clause is TRUE for a
given binding of values to variables is assumed to be
independent of other subgoals in this clause. Similarly, the
probability that the head of a clause is TRUE is independent
of other clauses with the same head. These assumptions are
not valid in general logic programs because subgoals have
shared clauses and variables, but are made here to simplify
the model.
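The producer-consumer rule in assumption (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_modes` and the data layout (a clause body as a list of `(name, args)` pairs) are hypothetical and do not come from the paper.

```python
def assign_modes(head_vars, head_modes, body):
    """Derive the mode of every variable occurrence in a clause body.

    head_modes maps each head variable to '+' (producer) or '-' (consumer).
    body is a list of (subgoal_name, [variables]) in left-to-right order.
    """
    # Head variables in consumer mode arrive already bound.
    bound = {v for v in head_vars if head_modes[v] == '-'}
    result = []
    for name, args in body:
        modes = []
        for v in args:
            # The leftmost unbound occurrence produces; later ones consume.
            modes.append('-' if v in bound else '+')
            bound.add(v)
        result.append((name, list(zip(args, modes))))
    return result


# The clause A(x,y) :- B(x,z), C(z,y), D(x,y) with x a producer, y a consumer.
body = [('B', ['x', 'z']), ('C', ['z', 'y']), ('D', ['x', 'y'])]
modes = assign_modes(['x', 'y'], {'x': '+', 'y': '-'}, body)
```

With these inputs the sketch reproduces the annotated clause given in assumption (2).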


3.2. Optimal Ordering of Depth-First Searches in AND/OR-Trees

In this section we discuss a special case in the optimal ord- 
ering of depth-first searches for AND/OR-trees, assuming that 
the success probabilities and expected overheads of all nodes are 
independent of each other, and that a node, once evaluated, will 
not be evaluated again. This special case exists in a logic pro- 
gram when it does not have any logic variables and shared 
clauses. If a node in the AND/OR-tree has n descendent nodes,
then there are n! possible evaluation orders
for a depth-first search. Our objective is to select the optimal 
order of descendents for each node in the AND/OR-tree such 
that the average overhead to verify the root to be TRUE or 
FALSE is minimized. 

Various heuristic functions can be used to arrange the order 
of descendent nodes. Examples include the success probability, 
the lower bound on cost, and the number of immediate descendents.
The following theorem shows that Φ_a and Φ_o (for AND-nodes
and OR-nodes, respectively) are the heuristic functions to
order the search such that the expected search cost is minimized.




Lemma 1: Suppose that node K is an OR-node (resp. AND-node)
with n (resp. m) immediate descendent AND-nodes (resp. OR-nodes)
ordered as K_1, ..., K_n (resp. K_1, ..., K_m), and K_i is searched
before K_{i+1} in a depth-first search. Let p_i and c_i be the success
probability and search cost of node K_i, and q_i = (1 - p_i). If all p_i's
and c_i's are independent of each other and p_i/c_i < p_{i+1}/c_{i+1} (resp.
q_i/c_i < q_{i+1}/c_{i+1}), 1 ≤ i < n, then the expected search cost can be
reduced when K_{i+1} is searched before K_i.

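Proof: Let C and C' be the expected costs of searching the descendents
of node K in the order K_1, ..., K_n and that in the order with
K_i and K_{i+1} interchanged. Assume that node K is an OR-node, so
that K_k is searched only if K_1, ..., K_{k-1} have all failed.
Then

$$C = \sum_{k=1}^{n} \Bigl[\prod_{j=1}^{k-1} q_j\Bigr] c_k \qquad (15)$$

and

$$C' = \sum_{k=1}^{i-1} \Bigl[\prod_{j=1}^{k-1} q_j\Bigr] c_k + \Bigl[\prod_{j=1}^{i-1} q_j\Bigr] (c_{i+1} + q_{i+1} c_i) + \sum_{k=i+2}^{n} \Bigl[\prod_{j=1}^{k-1} q_j\Bigr] c_k. \qquad (16)$$

Subtracting Eq. (16) from (15) yields

$$C - C' = \Bigl[\prod_{k=1}^{i-1} q_k\Bigr] (p_{i+1} c_i - p_i c_{i+1}) > 0.$$

The proof when node K is an AND-node is analogous, with the roles
of p and q interchanged. □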
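The interchange argument can be checked numerically. The sketch below is an illustration, not the authors' code: the helper names and the four (p, c) pairs are made-up assumptions. It evaluates Eq. (15) for an OR-node and the analogous sum for an AND-node, and compares the ratio ordering against an exhaustive search over all permutations.

```python
from itertools import permutations


def or_cost(nodes):
    """Eq. (15): child k is searched only if all earlier children failed."""
    total, reach = 0.0, 1.0
    for p, c in nodes:
        total += reach * c
        reach *= 1.0 - p
    return total


def and_cost(nodes):
    """AND-node analogue: child k is searched only if all earlier ones succeeded."""
    total, reach = 0.0, 1.0
    for p, c in nodes:
        total += reach * c
        reach *= p
    return total


# Hypothetical (success probability, cost) pairs for four sibling nodes.
children = [(0.2, 4.0), (0.9, 3.0), (0.5, 1.0), (0.7, 8.0)]

# Ratio orderings suggested by Lemma 1.
or_order = sorted(children, key=lambda n: n[0] / n[1], reverse=True)
and_order = sorted(children, key=lambda n: (1.0 - n[0]) / n[1], reverse=True)

# Exhaustive minima over all 4! orders, for comparison.
best_or = min(or_cost(order) for order in permutations(children))
best_and = min(and_cost(order) for order in permutations(children))
```

For these values the ratio orderings should attain the exhaustive minima, which is what repeated adjacent interchanges predict.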


Some special cases of this ordering strategy have been 
observed by Simon and others [21, 3, 1]. 


Theorem 2: Assume that a depth-first search is used to search an
AND/OR-tree, that the probabilities of success and search costs
of all sibling nodes are independent of each other, and that a
node, once evaluated, will not be evaluated again. The ordered
sequence in which all OR-nodes, x_i's, are ordered by decreasing
p(x_i)/c(x_i) and all AND-nodes, y_i's, are ordered by decreasing
q(y_i)/c(y_i) will minimize the expected search cost over all possible
ordered sequences, where p(x), q(x), and c(x) are the success
and failure probabilities and average search cost for node x.
Proof: Without loss of generality, assume that the root (in
Level 0) is an OR-node and that each OR-node (resp. AND-node)
has n (resp. m) immediate descendent AND-nodes (resp. OR-nodes).
For the n AND-nodes (resp. m OR-nodes), there are n!
(resp. m!) possible ordered sequences, s_1, ..., s_{n!} (resp. s_1, ..., s_{m!}).
Let c^i_{j,AND} (resp. c^i_{j,OR}) be the minimum expected cost of the j-th
AND-node (resp. OR-node) in sequence s_i over all possible
ordered sequences of descendents of this node. Let c_{r,OR} be the
minimum expected cost of a depth-first search of the root over
all possible ordered depth-first searches of the given AND/OR-tree.
Since the expected search cost of a node is the cost of
searching the subtree rooted at this node to return TRUE or
FALSE, it is independent of the search order of other sibling
nodes. Hence, if all nodes in the k-th level have been ordered
optimally, then this optimal order remains unchanged when
determining the optimal order in levels smaller than k. That is,
the principle of optimality is satisfied. The minimum expected
cost of the root r can be found from a dynamic programming
formulation.

$$c_{r,OR} = \min_{s_i \in \{s_1,\dots,s_{n!}\}} \sum_{j=1}^{n} \Bigl[\prod_{k=1}^{j-1} q^i_k\Bigr] c^i_{j,AND}, \quad \text{where} \qquad (17)$$

$$c^i_{j,AND} = \min_{s_u \in \{s_1,\dots,s_{m!}\}} \sum_{v=1}^{m} \Bigl[\prod_{w=1}^{v-1} p^u_w\Bigr] c^u_{v,OR}, \qquad (18)$$

where p^i_k and q^i_k are, respectively, the success and failure
probabilities of the k-th node in the i-th ordered sequence s_i. c^u_{v,OR} can
be evaluated in a similar fashion as in Eq. (17). Eq's (17) and
(18) can be solved by a bottom-up evaluation.

For any nonterminal OR-node (resp. AND-node) K, since
all its immediate descendents K_1, ..., K_n (resp. K_1, ..., K_m) are
independent of each other, then from Lemma 1 and applying
adjacent pairwise interchanges, the optimal search order should
satisfy p(K_i)/c(K_i) ≥ p(K_{i+1})/c(K_{i+1}) (resp. q(K_i)/c(K_i) ≥
q(K_{i+1})/c(K_{i+1})). □
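The bottom-up evaluation of Eq's (17) and (18) can be sketched as a recursive pass that orders the children of each node by the appropriate ratio and then returns the node's own success probability and expected cost. The tree encoding (`('leaf', p, c)` and `(kind, [children])`), the function names, and the numeric values are assumptions made for this example only. A brute-force mode that tries every permutation at every node is included as a cross-check.

```python
from itertools import permutations


def chain_cost(children, is_or):
    """Expected cost of visiting (p, c) pairs left to right with early exit."""
    total, reach = 0.0, 1.0
    for p, c in children:
        total += reach * c
        reach *= (1.0 - p) if is_or else p
    return total


def success(children, is_or):
    """OR-node succeeds if any child does; AND-node needs all children."""
    prod = 1.0
    for p, _ in children:
        prod *= (1.0 - p) if is_or else p
    return 1.0 - prod if is_or else prod


def evaluate(node, brute_force=False):
    """Return (success probability, minimum expected cost) of a node."""
    if node[0] == 'leaf':
        return node[1], node[2]
    is_or = node[0] == 'or'
    kids = [evaluate(child, brute_force) for child in node[1]]
    if brute_force:
        best = min(permutations(kids), key=lambda order: chain_cost(order, is_or))
    elif is_or:
        best = sorted(kids, key=lambda pc: pc[0] / pc[1], reverse=True)
    else:
        best = sorted(kids, key=lambda pc: (1.0 - pc[0]) / pc[1], reverse=True)
    return success(best, is_or), chain_cost(best, is_or)


# A small hypothetical AND/OR-tree with an OR-node at the root.
tree = ('or', [
    ('and', [('leaf', 0.9, 2.0), ('leaf', 0.4, 1.0), ('leaf', 0.7, 5.0)]),
    ('and', [('leaf', 0.5, 3.0), ('leaf', 0.8, 1.0)]),
    ('leaf', 0.3, 4.0),
])
p_ratio, c_ratio = evaluate(tree)
p_brute, c_brute = evaluate(tree, brute_force=True)
```

Because each subtree is summarized by a single (p, c) pair before its parent is ordered, the sketch relies on the same principle of optimality used in the proof.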


The above ordering strategy only holds when all nodes are 
independent. In general, a logic program has shared variables 
and shared clauses. Hence, the subgoals and clauses have depen- 
dent search costs and success probabilities. Moreover, a subgoal 
may be searched more than once because a given binding of 
values to variables may succeed with this subgoal but fail with 
other subgoals. In the next section, we will discuss a heuristic 
method to find an efficient search order. 


3.3. Ordered Depth-First Search of Logic Programs 

To find an appropriate order of depth-first search in a logic 
program, the main problem is to develop a function to compute 
the expected search cost and success probability of a clause or a 
subgoal, assuming that the costs and success probabilities of all 
its immediate descendents in the AND/OR-tree representation 
are known. The difficulty lies in the shared variables and clauses 
in different subgoals of a logic program. The search cost of a 
subgoal may depend on the modes of its variables and cannot be 
evaluated as in Eq. (15). For a subgoal with a producer variable, 
it is necessary to generate one (or all) bindings of values for the
given variable; whereas a subgoal with a consumer variable has
to test whether the given binding is TRUE. The latter cost is 
usually larger than the former one. The cost functions are more 
complicated when there are multiple variables. Here, a subgoal 
can have a combination of producer and consumer variables. 

Because producers and consumers are distinguished, and a clause
may be used with its variables set in different modes, the
success probabilities and costs must be defined for all
combinations of modes of variables. For example,
there are four success probabilities and four expected search
costs for a clause with head A(x,y), namely, p_A(x_+,y_+), p_A(x_+,y_-),
p_A(x_-,y_+), p_A(x_-,y_-), c_A(x_+,y_+), c_A(x_+,y_-), c_A(x_-,y_+), and
c_A(x_-,y_-), where a subscript '+' indicates that a variable is a
producer, and '-' indicates that it is a consumer. Let L be the set of
variables in a subgoal, and L_+ and L_- be the subsets of producer
and consumer variables. For a clause with head A(L_+,L_-), all
variables in L_- have been bound (called a binding-set) before
this clause is searched, whereas all variables in L_+ must be
bound after the subtree rooted at clause A has been searched.

In Figure 2 we have shown a Prolog program to query 
granddaughter(*,*). In Table 2 the average search costs for vari- 
ous modes of variables X and Y in granddaughter are shown. 
For different modes, the orders in which the depth-first search 
should be performed may be different. We have shown the 
order that minimizes the search cost for two of these combina- 
tions of modes. The structures for the other two combinations 
are different. The values in Table 2 illustrate that the difference 
in costs between the best and the worst orders can be a factor of 
one to seven. 

For node A(L), p_A(L_+,L_-) is defined as the probability to
successfully generate a binding-set of L_+ under the condition
that the given binding-set of L_- is TRUE, namely,

$$p_A(L_+, L_-) = p_A(L_-)\, p_A(L_+ \mid L_-). \qquad (19)$$


mother(theresa,martha).
mother(jane,martha).
mother(michael,mary).
mother(susan,jane).
mother(edward,jane).
wife(john,martha).
wife(paul,mary).
wife(michael,jane).
female(theresa).
female(susan).
female(X) :- wife(_,X).
father(X,Y) :- mother(X,Z), wife(Y,Z).
parent(X,Y) :- mother(X,Y).
parent(X,Y) :- father(X,Y).
grandparent(X,Y) :- parent(Z,Y), parent(X,Z).
granddaughter(X,Y) :- female(Y), grandparent(Y,X).

Figure 2. Minimum-cost Prolog program on family tree with 
granddaughter(—,+) or granddaughter(—,—) as the goal. 


Modes of X,Y | Minimum | Maximum | Mean | Standard Deviation

Table 2. Average Costs of evaluating granddaughter(X,Y) in 
Figure 2 for all combinations of bindings of variables 
and all possible solutions returned. (Each traversal of 
a subgoal or clause has unit cost. Each producer vari- 
able only produces one binding at a time.) 


p_A(L_-) is the probability that the binding-set of L_- on A is true.
p_A(L_+|L_-) is defined as n/(n+1), where n is the expected number of
binding-sets of L_+ in subgoal A for a given binding-set of L_-. In
this case we are approximating the distribution on the number of
distinct binding-sets of L_+ for a given binding-set of L_- as a
geometric distribution with parameter p. For such a geometric
distribution, its expected value is p/(1-p), which implies that
p = n/(n+1). In the special case when all variables in L are
producers, then p_A(L_+) = m/(m+1), where m is the total number
of generated binding-sets.
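The step from the mean to p = n/(n+1) can be written out as follows. The parameterization below, in which the number N of binding-sets starts at zero and takes the value j with probability (1-p)p^j, is an assumption chosen so that the mean equals the stated p/(1-p); the text itself does not spell it out.

$$\Pr\{N = j\} = (1-p)\,p^{j}, \quad j = 0, 1, 2, \dots \qquad\Longrightarrow\qquad E[N] = \sum_{j \ge 0} j\,(1-p)\,p^{j} = \frac{p}{1-p}.$$

$$E[N] = n \;\Longrightarrow\; p = \frac{n}{n+1}, \qquad \Pr\{N \ge 1\} = 1 - (1-p) = p.$$

Under this reading, p is also the probability that at least one binding-set exists. For instance, an expected n = 3 binding-sets would give p = 3/4.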

For A(L_+,L_-), its expected cost, c_A(L_+,L_-), is defined as the
expected cost of generating a successful binding of variables in
L_+, given the binding of variables in L_-. If all variables in L are
consumers, then c_A(L_-) is the expected cost of testing whether a
binding-set is TRUE.

For clarity, we illustrate a heuristic method to compute the
various costs. In this method all probabilities are assumed to be
independent. For a clause A(x,y) :- B(x,z), C(z,y) with known
costs and probabilities for subgoals B and C, the expected cost of
A can be computed by modeling the test process as an absorbing
Markov chain [26]. If one solution is sought, then the absorbing
Markov chain in Figure 3a is used. The two sink nodes (s_0 and
s_1) represent the states of success and failure. After a finite
number of steps, the process must enter one of these absorbing
states. To find the expected cost, we need to calculate the
expected number of times that the process is in transient states s_2
and s_3. In this example P, the transition matrix, Q, one of its
submatrices denoting the process in the transient states, and R,
another submatrix denoting the transitions from the transient
states to the absorbing states, are
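$$P = \begin{bmatrix} I & 0 \\ R & Q \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & q_2 & 0 & p_2 \\ p_3 & 0 & q_3 & 0 \end{bmatrix}, \quad Q = \begin{bmatrix} 0 & p_2 \\ q_3 & 0 \end{bmatrix}, \quad R = \begin{bmatrix} 0 & q_2 \\ p_3 & 0 \end{bmatrix}. \qquad (20)$$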

Let n_j be the total number of times that the process
is in state s_j, and M_i[n_j] be the mean of n_j when the chain is
started in s_i. From the theory of absorbing Markov chains [26],
N = {M_i[n_j]} = (I - Q)^{-1}. N is called the fundamental matrix. In
our example

$$N = \begin{bmatrix} \dfrac{1}{1 - p_2 q_3} & \dfrac{p_2}{1 - p_2 q_3} \\[2ex] \dfrac{q_3}{1 - p_2 q_3} & \dfrac{1}{1 - p_2 q_3} \end{bmatrix}. \qquad (21)$$

As a result, the expected cost is (c_2 + p_2 c_3)/(1 - p_2 q_3), where c_i is
the cost associated with state s_i. If B is searched before C, then
A has expected cost

$$c_A(x_+,y_+) = \frac{c_B(x_+,z_+) + p_B(x_+,z_+)\, c_C(z_-,y_+)}{1 - p_B(x_+,z_+)\, q_C(z_-,y_+)}. \qquad (22)$$

If subgoal C is searched before B, then A has expected cost

$$c'_A(x_+,y_+) = \frac{c_C(z_+,y_+) + p_C(z_+,y_+)\, c_B(x_+,z_-)}{1 - p_C(z_+,y_+)\, q_B(x_+,z_-)}. \qquad (23)$$

Comparing c_A and c'_A, the order with the smaller cost is used.
Expected costs of clause A with variables in other modes can be
computed similarly.

When all solutions in a subgoal have to be found, the pro- 
cess can also be modeled as an absorbing Markov chain. Figure 
3b shows the absorbing Markov chain for the above example. 

To compute the success probability of a clause, if b_{ij} is the
probability that the process starting in transient state s_i ends up
in absorbing state s_j, then from the theory of absorbing Markov
chains, {b_{ij}} = B = N×R. In our example, b_{2,0} = p_2 p_3/(1 - p_2 q_3),
and b_{2,1} = q_2/(1 - p_2 q_3). If subgoal B is searched first, then the
success probability of node A is

$$p_A(x_+,y_+) = \frac{p_B(x_+,z_+)\, p_C(z_-,y_+)}{1 - p_B(x_+,z_+)\, q_C(z_-,y_+)}. \qquad (24)$$

(a) One solution sought. (b) All solutions sought.

Figure 3. Example to compute the search cost and probability
using an absorbing Markov chain.


In general, if a subgoal has k variables, then 2^k combinations
of probabilities and costs corresponding to all combinations
of modes of the variables have to be found.
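The one-solution chain for a two-subgoal clause is small enough to evaluate directly. The sketch below is an illustration only: the function name `one_solution_chain` and all numeric costs and probabilities are hypothetical. It builds N = (I - Q)^{-1} in closed form for the 2×2 case, forms B = N×R, and returns the expected cost together with the success and failure probabilities when the chain starts in s_2. It assumes p_2 q_3 < 1, so that the chain cannot cycle forever.

```python
def one_solution_chain(p2, c2, p3, c3):
    """Expected cost and absorption probabilities of the chain s2 <-> s3.

    p2, c2: success probability and cost of the subgoal searched first.
    p3, c3: success probability and cost of the subgoal searched second.
    """
    q2, q3 = 1.0 - p2, 1.0 - p3
    det = 1.0 - p2 * q3                      # determinant of (I - Q)
    N = [[1.0 / det, p2 / det],              # fundamental matrix
         [q3 / det, 1.0 / det]]
    R = [[0.0, q2],                          # columns: success s0, failure s1
         [p3, 0.0]]
    B = [[sum(N[i][k] * R[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    cost = N[0][0] * c2 + N[0][1] * c3       # expected visits times unit costs
    return cost, B[0][0], B[0][1]


# Hypothetical mode-dependent values for the two candidate orders.
cost_bc, p_bc, _ = one_solution_chain(p2=0.6, c2=2.0, p3=0.5, c3=3.0)  # B first
cost_cb, p_cb, _ = one_solution_chain(p2=0.8, c2=4.0, p3=0.3, c3=1.0)  # C first
best_order = "B,C" if cost_bc <= cost_cb else "C,B"
```

The two calls use different numbers because the cost and probability of a subgoal depend on the modes of its variables, which change with the search order.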

The above example illustrates the use of an absorbing Mar- 
kov chain to order the search of descendents of an AND-node, 
which represents the evaluation of a clause. In contrast, to order 
the descendents of an OR-node, which represents the selection of 
multiple clauses with the same head, it is observed that once a 
descendent of an OR-node has been searched for a given 
binding-set, it will not be searched again. Unlike descendents of 
AND-nodes, there is no backtracking involved for a given 
binding-set. According to Theorem 2, descendents of an OR-node 
should, therefore, be searched in decreasing order of ratios of 
success probability to cost. The cost and probability of an OR- 
node can be computed in a similar fashion as in Eq. (15) when it 
has at least one consumer variable. When all its variables are 
producers, the average cost is taken as the average cost of each of 
its descendents weighted by the fraction of the total number of 
binding-sets that can be generated. 

The basic idea in a systematic method to determine an
appropriate ordering of the subgoals and clauses is to associate
with each subgoal and clause a table of the expected costs and
probabilities for all combinations of modes of variables, and to
use the appropriate costs and probabilities depending on the
modes set for the variables. The best order with the minimum
expected cost is chosen from all possible permutations of
descendents. The number of permutations may be large. In this case
heuristic information, such as the number of variables in a
subgoal, can be used to eliminate inefficient candidate
permutations. Note that the cost of each node in the AND/OR-tree
representation depends only on the costs and probabilities of its
descendents, provided that the descendents only depend on each
other through shared variables. Here, the selection of the best
order in a given level does not influence the computation of costs
in levels above. That is, the computation of the minimum cost
satisfies the principle of optimality, and the optimal order can be
found by dynamic programming. In practice, subgoals are generally
dependent on each other through shared clauses, which
results in over-estimation of the costs. The proposed scheme is
still applicable as a heuristic method to arrange the order in the
search process. Statistical sampling has to be used to estimate the
cost and probability of a node after the order of its descendents
is determined. This reduces the accumulation of errors as nodes
in higher levels of the AND/OR-tree are ordered.

A final point on the ordering of nodes in the AND/OR-tree 
representation of logic programs is that different orders may be 
found depending on the modes of the variables. Either an ‘aver- 
age’ order may be used or multiple program statements may be 
generated for different cases to reflect the preferred order. 




4. CONCLUSIONS 
In this paper we have studied the computational efficiency 
of parallel and ordered depth-first searches to solve optimization
and decision problems. The performance bounds and conditions
to cope with anomalies in searching optimization problems
represented as OR-trees have been derived and verified by
simulations. Speedups have been found to be related to the problem
complexity and the number of incumbents obtained during the
search process. For a problem with a high complexity and a
small number of incumbents, such as integer programming
problems, near-linear speedups can be achieved with respect to a
large number of processors.

An ordered depth-first search strategy has been studied 
with respect to decision problems represented as AND/OR-trees. 
When the success probabilities and costs of sibling nodes are 
independent of each other, and a node, once searched, will not be 
searched again, the sibling nodes should be ordered according to 
ratios of probability and cost to minimize the expected total 
search cost. Because a Prolog program has shared clauses and
variables and allows backtracking, it is difficult to find the
optimal depth-first search order. An absorbing Markov chain to 
model the effects of backtracking and a dynamic programming 
method to order the search have been developed. 
ABSTRACT


Given an array of size n with entries either marked or
unmarked, we compute the number of marked entries and assign to
each marked entry a unique index between one and the number of
marked entries. If there are k marked entries, then our algorithms
require O(log k) time, which is independent of n, using n processors
and a CRCW model.


1. Introduction 


As the study of parallel algorithms matures, it is important to 
develop efficient basic techniques that can be used as building 
blocks for more complex algorithms. One basic algorithm is to 
compute the size of a subset and pack it into an array. That is, given 
an array with entries either marked or unmarked, compute the 
number of marked entries and assign to each marked entry a
unique index between one and the number of marked entries. 


Assume that there are k marked entries in an array of size n. 
We are given n processors and a CRCW model of parallel computation.
That is, each processor can read or write any cell in shared
memory in a single cycle. If two or more processors try to write to 
the same cell during the same cycle, then one arbitrarily wins. Actu- 
ally, our algorithms require a slightly weaker model in which two 
processors are allowed concurrent write of common values only. 


Using a simple binary summation tree, it is easy to solve this
problem in O(log n) time. On the other hand, using a simple locking
mechanism, it is also easy to solve the problem in time O(k);
each processor is assigned a unique cell of the array and if the entry 
is a one, then the processor locks a counter variable. When the lock 
is granted, the processor just increments the counter by one and 
unlocks the variable. 
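The O(log n) baseline can be simulated sequentially. The sketch below substitutes a doubling prefix-sum scan for the explicit summation tree mentioned above; both take a logarithmic number of parallel rounds, but the scan is shorter to write down. The function name `count_and_index` is hypothetical, and each pass of the inner loop stands in for one parallel step executed by all processors at once.

```python
def count_and_index(marks):
    """Count marked entries and give each a unique index in 1..k.

    marks is a list of 0/1 flags. Returns (k, index, rounds), where
    index[i] is the rank of entry i if it is marked and 0 otherwise.
    """
    n = len(marks)
    prefix = list(marks)
    rounds, step = 0, 1
    while step < n:
        prev = prefix[:]                 # every processor reads before any writes
        for i in range(step, n):         # one simulated parallel step
            prefix[i] = prev[i] + prev[i - step]
        step *= 2
        rounds += 1
    k = prefix[-1] if n else 0
    index = [prefix[i] if marks[i] else 0 for i in range(n)]
    return k, index, rounds


k, index, rounds = count_and_index([0, 1, 1, 0, 1, 0, 0, 1])
```

The number of rounds depends only on n, which is exactly the dependence the algorithms in this paper set out to remove.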


We desire a solution whose running time is independent of n
and is logarithmic in k. Such a solution already exists; however, it
uses randomness and assumes the stronger model of CRCW.
Rudolph and Steiger [2] give a parallel algorithm that terminates in
O(log k) time with high probability. Moreover, they use only k
processors. We present a deterministic solution to this problem with
the same time bounds. Note that a trivial consequence of our result 
yields an algorithm for computing parity. 
The outline of the paper is as follows. We first describe an
algorithm for computing the "prefix maxima" of an array. This
is then used to develop an O(log log n + log k) time algorithm. For
k ≥ log n this algorithm gives the desired result. We then present a
second algorithm for the case of k < log n. Running the two
algorithms in tandem gives the final result.


2. Prefix Maxima 


Given an array A[1..n] of numbers, we define prefix maxima,
MAX_i(A), to be a set of n values, the i-th of which is the
maximum of the first i elements of A:


MAX_i(A) = Maximum {A[1], ..., A[i]} : 1 ≤ i ≤ n.


A simple sequential scan algorithm solves this problem in
O(n) steps; this section gives a parallel algorithm that uses n
processors and terminates within O(log log n) parallel steps.


Lemma 1: The prefix maxima vector can be found in O(1) parallel
steps using n(n^2 - 1)/6 processors and a CRCW model of parallel
computation.


Proof: Note that we can compute all MAX_i(A) in parallel. For the
i-th prefix of A we do the following:


First, i(i-1)/2 processors perform all distinct comparisons
between A[x] and A[y], 1 ≤ x < y ≤ i, in one parallel step. Next, the
same number of processors check, for each j, 1 ≤ j ≤ i, whether
A[j] = Maximum {A[1], ..., A[i]} by parallel computation of an
AND function. Writing the correct value of MAX_i(A) requires one
more step.


The above procedure can be performed in parallel for all
i, 1 ≤ i ≤ n, with F(n) processors, where

$$F(n) = \sum_{i=1}^{n} i(i-1)/2 = n(n^2 - 1)/6.$$
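The constant-time procedure of Lemma 1 can be simulated sequentially to check both the result and the processor count. This is a sketch under assumptions: the names are hypothetical, each comparison stands in for one processor, and the nested loops replace what would be a single parallel step on a CRCW machine.

```python
def sign(a, b):
    """Three-way comparison result: 1, 0, or -1."""
    return (a > b) - (a < b)


def prefix_maxima_all_pairs(A):
    """Prefix maxima by all-pairs comparison, as in the proof of Lemma 1.

    Returns (M, comparisons), where comparisons counts the simulated
    processors, one per distinct pair (x, y) in each prefix.
    """
    n = len(A)
    M = [None] * n
    comparisons = 0
    for i in range(1, n + 1):                      # the i-th prefix A[0..i-1]
        cmp = {}
        for x in range(i):                         # step 1: all distinct pairs
            for y in range(x + 1, i):
                cmp[(x, y)] = sign(A[x], A[y])
                comparisons += 1
        for j in range(i):                         # step 2: AND over results
            no_larger_before = all(cmp[(x, j)] <= 0 for x in range(j))
            no_larger_after = all(cmp[(j, y)] >= 0 for y in range(j + 1, i))
            if no_larger_before and no_larger_after:
                M[i - 1] = A[j]                    # step 3: common-value write
    return M, comparisons


M, used = prefix_maxima_all_pairs([3, 1, 4, 1, 5, 9, 2, 6])
```

When several entries tie for the maximum they all write the same value, which is why concurrent writes of common values suffice here.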


If, on the other hand, there are fewer than a cubic number of
processors, but at least a linear number, i.e. n ≤ p < F(n), an
adaptation of the max computation of Valiant [4] (and as implemented by
Shiloach and Vishkin [3]) can be used. The following algorithm
evaluates the array M[1..n] for the input A such that
M[i] = MAX_i(A):


Procedure PreMax(p,A,n,M)

Step 1: If p ≥ F(n) then use Lemma 1 to compute M[1..n] in
one parallel step and return.

Step 2: Divide the array A[1..n] into a minimal number l of
distinct segments S_1, ..., S_l which satisfy the following
conditions [3]:

(i) | |S_i| - |S_j| | ≤ 1 for 1 ≤ i < j ≤ l
(ii) Σ_{i=1..l} F(|S_i|) ≤ p

Step 3: Let b_i, e_i, n_i be the indices of the beginning, ending,
and size of each segment S_i.

Step 4: For each i in [1..l] do in parallel
PreMax(F(n_i), A[b_i..e_i], n_i, M[b_i..e_i])

Step 5: Let A'[1..l] be the maximum value in each segment:
For each i in [1..l] do in parallel: A'[i] := M[e_i]

Step 6: Apply the algorithm recursively on A':
PreMax(p, A', l, M')

Step 7: Merge results from steps 4 and 6:
For each i in [2..l] and j in [b_i..e_i] do in parallel
M[j] := Maximum(M[j], M'[i-1])
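The structure of PreMax can be followed in a sequential simulation. The sketch below is not a parallel implementation: `itertools.accumulate` stands in for the constant-time step of Lemma 1, the helper names are hypothetical, and the code assumes p ≥ n so that the number of segments is strictly smaller than the array length and the recursion shrinks.

```python
from itertools import accumulate


def F(n):
    """Processor count needed by Lemma 1 for n items."""
    return n * (n * n - 1) // 6


def segment_sizes(n, l):
    """Sizes of l consecutive segments of n items, differing by at most 1."""
    base, extra = divmod(n, l)
    return [base + 1] * extra + [base] * (l - extra)


def premax(p, A):
    """Return the prefix maxima of A, following steps 1 to 7 of PreMax."""
    n = len(A)
    if n <= 1 or p >= F(n):                              # step 1
        return list(accumulate(A, max))
    assert p >= n, "this sketch assumes at least a linear number of processors"
    l = next(l for l in range(1, n + 1)                  # step 2: minimal l
             if sum(F(s) for s in segment_sizes(n, l)) <= p)
    M, bounds, start = [], [], 0
    for size in segment_sizes(n, l):                     # steps 3 and 4
        bounds.append((start, start + size))
        M.extend(accumulate(A[start:start + size], max))
        start += size
    A_prime = [M[e - 1] for _, e in bounds]              # step 5
    M_prime = premax(p, A_prime)                         # step 6
    for i in range(1, l):                                # step 7
        b, e = bounds[i]
        for j in range(b, e):
            M[j] = max(M[j], M_prime[i - 1])
    return M


data = [(i * 37) % 101 for i in range(64)]
result = premax(len(data), data)
```

Only step 6 recurses on the full processor budget, which is the call whose depth the analysis that follows has to bound.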




It is not hard to see that this algorithm terminates in
O(log log n) parallel time: Step 1 requires constant time as shown in
Lemma 1. Finding l and then computing b_i and e_i for all i, 1 ≤ i ≤ l, in
steps 2 and 3 require O(1), i.e. constant, parallel time. The recursive
call in step 4 is terminated in constant time since there are enough
processors for each segment to compute all the prefix maxima in
constant parallel time as in step 1. The exact allocation of processors
in step 4 may require some computational work, but it can be done
by the first l processors in constant time. Steps 5 and 7 again take
constant parallel time.


Note that step 7 correctly evaluates M[1..n]: At the end of step
6, each M[j], b_i ≤ j ≤ e_i, contains the maximal value of
A[b_i], ..., A[j], and M'[i-1] is the largest of M[e_1], ..., M[e_{i-1}].
Since these M values are the maxima of their segments, at the end
of step 7, M[j] is the biggest of A[1..j].


The algorithm requires constant parallel time except for the
recursive call in step 6 on the auxiliary array A'. We need to bound
the number of recursions. The recursion terminates when the
number of processors, p, is larger than the number of items cubed.
Let T(α,n) denote the parallel time for running the algorithm on n
values using αn processors. From the definitions it is easy to check
that

$$\sum_{j=1}^{l} F(|S_j|) \le n^3/l^2 \quad \text{for all } l \le n.$$

We can bound the value of l by l ≤ l_0, where n^3/l_0^2 = p. Thus l, the
size of the array in the next iteration, would be smaller than n/α^{1/2}.
Since the number of items drops by a factor of α^{1/2} and the number
of processors remains constant, the new ratio between processors
and items will be α^{3/2}. Thus we get:

$$T(\alpha, n) = 1 + T(\alpha^{3/2},\, n/\alpha^{1/2}).$$

The recursion is guaranteed to terminate when α is bigger than n^2,
and this will occur within

$$\frac{\log\log n - \log\log \alpha}{\log(3/2)} = O(\log\log n) \text{ steps.}$$
If 1 ≤ α < 2, the above equation is ill defined. We can force α to
be 2 by using virtual processors: some or all of the processors will
act like 2 virtual processors, thereby raising the processor to item
ratio and only doubling the time. A more careful and much more
tedious counting shows that the number of steps required even in the
case of α = 1 is O(log log n).


For p ≤ n divide the array into p almost equal segments.
Assign one processor to each segment and, in parallel, compute the
prefix maxima on each segment sequentially. Then compute the
prefix maxima of the maximum values of each segment; there are p
of them. Computing the prefix maxima of the full array can be done
as in step 7 in the previous algorithm. The running time is
O(n/p) + O(log log p).


3. Computing the Subset Size in Parallel 


The subset size problem is to find the number of elements
in a set which satisfy some predicate. This can be formulated as
follows: Given an array A[1..n] with binary values, compute how
many entries contain the value 1. The subset will be denoted by S
and its size by k. In addition, each entry with a value of 1 is to be
assigned a unique integer in the range 1 to n.


We present two algorithms for solving this problem. One is 
efficient when the number of set elements is large (SubSet_I) and the 
other (SubSet_II) is efficient when it is small. Both these algorithms 
can be executed in parallel to ensure an efficient algorithm for any 
set cardinality. 


3.1. Algorithm SubSet_I


As usual we will assume p ≥ n. First we reduce the problem
of finding subset size to the prefix maxima problem. Then, we
form a linked list of the subset elements. Finally, we apply recursive 
doubling to compute the size and assign ordinal numbers to each 
element. 


The following algorithm outputs k and the array R[1..k],
where D[R[i]] is the i-th element of D that contains a 1. The
arrays A[1..n+1] and P[1..n+1] are used as temporary storage.


Procedure SubSet_I(p,D,n,R,k)

Step 1: Compute the auxiliary array A[1..n+1] as follows:
For each i in [1..n+1] do in parallel:
If i>1 and D[i-1]=1 Then A[i] := i-1
Else A[i] := 0

Step 2: Apply the previous algorithm to produce the linked list:
PreMax(p,A,n,P)
Note that P[i] will come to contain the max value of all the
elements in cells A[j], where j ≤ i. This value is also the
index of the first cell containing a 1 to the left (smaller
index) of D[i].

Step 3: Apply recursive doubling to the linked list represented by
the array P[1..n]. The head of the list is stored in P[n+1].
This will result in the array R of the requested form and will
yield the value k.


It is easy to see that SubSet_I requires O(log log n + log k) time
using n processors: Step 1 requires O(1) time, step 2 requires
O(log log n) time as previously shown, and step 3 can be done in
logarithmic time of the length of the linked list, i.e. O(log k).
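The three steps of SubSet_I can be traced with a sequential simulation. The sketch below is an illustration under assumptions: `subset_one` is a hypothetical name, `itertools.accumulate` stands in for PreMax, and the prefix maxima are taken over all n+1 cells so that the head of the list is available in P[n+1]. Arrays are padded with an unused cell 0 so that the indices match the 1-based pseudocode.

```python
from itertools import accumulate


def subset_one(D):
    """Return (k, R, rounds): the count of ones in D, their 1-based
    positions in increasing order, and the number of doubling rounds."""
    n = len(D)
    # Step 1: A[i] = i-1 if D[i-1] = 1 (1-based), else 0, for i in 1..n+1.
    A = [0] * (n + 2)
    for i in range(2, n + 2):
        if D[i - 2] == 1:                # Python's D[i-2] is the paper's D[i-1]
            A[i] = i - 1
    # Step 2: prefix maxima; P[i] is the nearest marked index below i.
    P = [0] + list(accumulate(A[1:], max))
    # Step 3: recursive doubling (pointer jumping) along the linked list.
    marked = [j for j in range(1, n + 1) if D[j - 1] == 1]
    nxt = {j: P[j] for j in marked}      # 0 plays the role of nil
    rank = {j: 1 for j in marked}
    rounds = 0
    while any(nxt[j] != 0 for j in marked):
        new_rank = {j: rank[j] + (rank[nxt[j]] if nxt[j] else 0) for j in marked}
        new_nxt = {j: (nxt[nxt[j]] if nxt[j] else 0) for j in marked}
        rank, nxt = new_rank, new_nxt    # all cells update simultaneously
        rounds += 1
    head = P[n + 1]
    k = rank[head] if head else 0
    R = [0] * (k + 1)
    for j in marked:
        R[rank[j]] = j
    return k, R[1:], rounds


k1, R1, rounds1 = subset_one([0, 1, 1, 0, 1, 0, 0, 1])
```

The number of doubling rounds grows with the logarithm of the list length k, not with n, which is where the O(log k) term comes from.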


3.2. Algorithm SubSet_II 


When k ≤ log n the log log n term in the running time of SubSet_I
dominates. Fortunately, when k is this small, we can create the
linked list faster. The basic idea is to compact the set elements into
a smaller array so that more processors can be used to quickly solve
the problem. Algorithm SubSet_II gives the precise details of the
computation of k and R[1..k] as before. The algorithm uses the
temporary arrays D', D'', R', R'', the sizes of which will be determined at
run time.


Procedure SubSet_II(p,D,n,R,k)

Step 1: If p ≥ F(n) then apply SubSet_I to D:
SubSet_I(p,D,n,R,k)

Step 2: Find the maximal l for which p ≥ F(l).

Step 3: Split the array D[1..n] into l disjoint consecutive
segments S_j which may vary in size by no more than 1.

Step 4: Compute the array D'[1..l] as follows:
for each j in [1..l] and i in [1..|S_j|] do in parallel:
If S_j[i]=1 Then D'[j] := 1 Else D'[j] := 0

Step 5: Apply algorithm SubSet_I to D':
SubSet_I(p,D',l,P',k')

Step 6: Compress the array D into D'', using the ordinal
number of non-empty segments, as found in step 5.
D''[1..k'⌈n/l⌉] will contain only the segments S_j of D
before step 6 that have at least one element set to 1 in them.

Step 7: Apply SubSet_II recursively to D'':
SubSet_II(p,D'',k'⌈n/l⌉,R'',k)

Step 8: Map R'' into R to fit the original indexes of D (before
the compression in step 6).
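One level of the compression in steps 3 to 6, together with the index map needed for step 8, can be sketched as follows. The function name `compress_once`, the zero padding of short segments to a common width, and the `origin` list used to map positions back are assumptions of this example; the paper does not specify these details.

```python
def compress_once(D, l):
    """Keep only the non-empty segments of D (steps 3 to 6).

    Returns (D2, origin), where D2 is the compressed array and origin[i]
    is the index in D that D2[i] came from (None for padding cells).
    """
    n = len(D)
    base, extra = divmod(n, l)
    bounds, start = [], 0
    for j in range(l):                                   # step 3: split
        size = base + (1 if j < extra else 0)
        bounds.append((start, start + size))
        start += size
    flags = [1 if any(D[a:b]) else 0 for a, b in bounds]  # step 4: OR per segment
    kept = [j for j in range(l) if flags[j]]             # step 5: non-empty ones
    width = -(-n // l)                                   # ceil(n / l)
    D2, origin = [], []
    for j in kept:                                       # step 6: compress
        a, b = bounds[j]
        pad = width - (b - a)
        D2.extend(D[a:b] + [0] * pad)
        origin.extend(list(range(a, b)) + [None] * pad)
    return D2, origin


D = [0] * 20
D[3] = D[17] = 1
D2, origin = compress_once(D, 5)
# Step 8 analogue: map the ones found in D2 back to positions in D.
recovered = [origin[i] for i, v in enumerate(D2) if v == 1]
```

Here only 2 of the 5 segments survive, so the recursive call works on an array of 8 cells instead of 20 while the processor count stays the same.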


As before, steps 2 and 3 take constant parallel time. Step 4 is
really a computation of an OR function which can be done in parallel
in constant time. In steps 6 and 8 we make a compressed copy of
an array and compute the opposite transformation. Since the ordinal
number of each segment copied is given in the array P', those steps
can be done in parallel using n processors in constant time.


In steps 1 and 5 we apply algorithm SubSet_I; there are always
enough processors to do the linking phase in it in constant time.
The counting in SubSet_I takes O(log k) in step 1, and O(log k') in
step 5. Since k' ≤ k we can conclude that the total running time of
SubSet_II is O(log k) except for the recursive call in step 7.


Since F (1) S$ 1° and p2n we have l<n "3 ot step 2. The algo- 
rithm is applied recursively in step 7 to an array sized k ’ [(n/1) |. 
Denote the size of the array after the i recursive iteration by n;. It 
is clear that n; <n for all i>0, now n p=n,21,Skny . noSkn ha 
A simple induction gives: 
The recursive calls are guaranteed to stop if n,<ne Sp. For
k<n’<n° the recursion will terminate in the 
log(1/3 - 3β) / log(2/3) step. The total running time of the
algorithm is therefore O(C_β log k), where C_β is a constant
depending only on β. For example, taking β = 0.07 would give C_β = 5.
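
The quoted constant can be checked numerically. The sketch below assumes that the step count printed above is the ratio log(1/3 - 3β) / log(2/3); this is a reading of a damaged formula, supported only by the fact that it reproduces the stated value for β = 0.07. The function name is hypothetical.

```python
import math


def recursion_depth_constant(beta):
    """Assumed form of C_beta: log(1/3 - 3*beta) / log(2/3).
    Only meaningful while 1/3 - 3*beta > 0, i.e. beta < 1/9."""
    return math.log(1 / 3 - 3 * beta) / math.log(2 / 3)


c = recursion_depth_constant(0.07)
print(round(c, 2))
```

For β = 0.07 the value lies between 5 and 5.5, which agrees with the constant 5 given in the text.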


Note that asymptotically log n < n^β for any β. This algorithm can
be used in those applications where k is known in advance to be
smaller than log n. When no estimate can be made about k, one
would run the two previous algorithms in parallel, and terminate as
soon as one of them terminates, requiring a total time of O(log k) for
any size k.


For p < n, creating the linked list and computing the size of the
subset can be done by dividing the array into p segments. Each
processor makes a linked list and counts the number of elements
in its segment. The SubSet_I and SubSet_II algorithms are then
used for chaining the linked lists of all segments which contain
one or more elements from the subset. The total running time is
therefore O(n/p) + O(log k).


4. Conclusion 


We view the algorithms presented as building blocks for more
complex ones. One application is when there is a multiway
predicate, and subsets of the processors are to go off and solve
independent subproblems. Before they can work together, however,
they must know how many processors are cooperating, and it is
advantageous if each has a unique index.


The algorithm can be improved somewhat, but only by a con- 
stant factor, by using a technique to compute the maximum of n 
items, all in the range 1 to n with m processors in constant time. 
(Note that in this paper we require about n processors to compute 
the max.) This result can be found in a fuller version of this paper 
Abstract

The problem of selecting the K-th smallest element of a
set of N elements distributed among d sites of a
communication network is examined. A distributed reduction
technique for selection is a distributed algorithm which
transforms this problem to an equivalent one where either K
or N (or both) are reduced. A collection of distributed
reduction techniques is presented; the combined use of these
algorithms yields new solutions for the selection problem for
both point-to-point and shout-echo networks. In particular,
O(d + δ log(κ/d)) and O(log κ) upper bounds on the
distributed selection problem are derived for point-to-point
networks having a spanning star graph (e.g., complete
networks) and shout-echo networks, respectively, where
κ ≤ Δ = Min{K, N-K+1} ≤ K and δ ≤ d. These results represent
an improvement on the existing bounds for those networks;
furthermore, the existing selection algorithms for other
point-to-point networks (e.g., rings, meshes, trees) can be
made more efficient by using one of the proposed reduction
techniques as a pre-processing phase.


1. Introduction 

The classical problem of selecting the K-th smallest 
element of a set F drawn from a totally ordered set has been 
extensively studied in serial and parallel environments. In a 
distributed context, it has different formulations and 
complexity measures. 

A communication network of size d is a set
S = {S_1, ..., S_d} of sites, where each site has a local
non-shared memory as well as processing and
communication capabilities.

A file of cardinality N is a set F = {f_1, ..., f_N} of
records, where each record f ∈ F contains a unique key k(f)
drawn from a totally ordered set; for convenience,
k(f_i) < k(f_j) shall be denoted by f_i < f_j.

A distribution of F on S is a d-tuple
X = <X_1, ..., X_d> where X_i ⊆ F is a subfile stored at site S_i,
X_i ∩ X_j = ∅ for i ≠ j, and ∪_i X_i = F.
Order-statistics queries about F can be originated at
any site and will activate a query resolution process at that 
site. Since only a subset of F is available at each site, the 
resolution of a query will in general require the cooperation 
of several (possibly all) sites according to some 
predetermined algorithm. Since local processing time is 
usually negligible when compared with transmission and 
queueing delays, the goal is to design resolution algorithms 
which minimize the amount of communication activity rather 
than the amount of processing activity. 

The distributed selection problem is the general
problem of resolving a query for locating the K-th smallest
element of F. The tuple <N, K, N[1], ..., N[d]> is called the
problem configuration, and Δ = Min{K, N-K+1} is called
the problem size, where N[i] is the cardinality of X_i. Any
efficient solution to this problem can be employed as a
building block for a distributed sorting algorithm [9].
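
As a small illustration of these definitions, the following sketch builds the problem configuration from a distribution and computes the problem size Min{K, N-K+1}. The function names and the sample data are hypothetical and serve only to make the notation concrete.

```python
def problem_configuration(subfiles, K):
    """Return the tuple <N, K, N[1], ..., N[d]> for a distribution
    given as a list of per-site subfiles."""
    sizes = [len(x) for x in subfiles]
    return (sum(sizes), K, *sizes)


def problem_size(N, K):
    """Problem size Min{K, N - K + 1}: the smaller of the rank
    counted from the bottom and the rank counted from the top."""
    return min(K, N - K + 1)


# Hypothetical distribution of 9 keys over 3 sites.
subfiles = [[1, 5, 9], [2, 3], [4, 6, 7, 8]]
config = problem_configuration(subfiles, K=7)
print(config, problem_size(config[0], config[1]))
```

Selecting the 7th smallest of 9 keys is the same as selecting the 3rd largest, so the problem size here is 3.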

The complexity of this problem (i.e., the number of
communication activities required to resolve an
order-statistics query) depends on many parameters,
including the number of sites, d, the size N of the file, the
number N[i] of elements stored at site S_i, the rank K of the
element being sought, and the topology and type of the
network.

Solutions to this problem have been developed for
both the point-to-point and the shout-echo models of
computation.

In point-to-point networks, associated with S is a
set L ⊆ S×S of direct communication lines between sites; if
(S_i, S_j) ∈ L, S_i and S_j are said to be neighbours. Sites
communicate by sending messages; a message can only be 
sent to and received from a neighbour. The couple G=(S,L) 
can be thought of as an undirected graph; hence, 
graph-theoretical notation can be employed in the design and 
analysis of distributed algorithms in the point-to-point model. 
Different solutions and bounds exist for the distributed 
selection problem in the point-to-point network depending on 
the topology of the network [2,6,15]. In particular, 


| Frederickson [2] presented algorithms which require O(d 


(log d)? log N) messages in a ring, O(d (log d)!/? log N) _ 


messages in a mesh, and O(d log(2N/d)) messages in a 
complete binary tree. All these bounds apply to the worst
case; the analysis of the expected communication complexity
has been carried out in [11,15].

In shout-echo networks, each site can broadcast a
message and reply to a received message; the process of
broadcasting a message and receiving a (possibly distinct)
reply from all sites constitutes a basic communication activity
called shout-echo. Several papers investigating the selection
problem in this network have appeared [1,5,7,8,10-14]; an
O(log Δ) bound has been established by Marberg and Gafni
[5], who also prove an Ω(log n_2) lower bound if the messages
are constrained to carry only file elements, where n_2 is the
second largest number of elements stored at the sites.


In this paper, a selection procedure based on various
reduction techniques is proposed, and used to obtain efficient
solutions both in the shout-echo and in the point-to-point
models. The algorithm is based on the serial technique
developed by Frederickson and Johnson for selection in an
array with sorted columns [3], and employs some of the
existing distributed selection algorithms as subroutines.
Using the proposed algorithm, the following upper bounds
on the distributed selection problem are obtained:
1) 9.64 δ log(κ/d) + O(d) messages in networks having
a spanning star graph (e.g., star networks, complete
networks);
2) 6.82 log κ + O(1) basic communication activities in a
shout-echo network;
where κ ≤ Δ and δ ≤ ρ = Min{Δ, d} are, respectively, the
rank of the desired element and the number of sites under
consideration in the problem configuration produced from the
execution of Procedure SET-UP (see section 2.1). These
results improve the existing bounds of 19.28 ρ log Δ - 9.64 ρ
log d + O(d) and 10.64 log Δ + O(1) for these networks,
respectively. Note that, while κ ≤ Δ always holds, sometimes
κ << Δ. Consider for example the configuration
<10032, 5937, 10000, 20, 5, 5, 2>; in this case κ = 33 while
Δ = Min{K, N-K+1} = 4096.

Furthermore, it is observed that selective application of 
the proposed reduction techniques can also yield slight 
improvements to the existing bounds for rings, meshes, and 
complete binary trees established in [2]. 

The paper is organized as follows. In the next section a 
collection of reduction techniques is presented and analysed; 
the combined use of these techniques yields a new selection
procedure. In section 3, this procedure is employed to obtain 
the new bounds in the shout-echo networks. Finally, in 
section 4, the application of some of the reduction techniques 
to point-to-point networks is discussed. 


2. A selection procedure

In this section, a selection algorithm is proposed which 
can be used to obtain efficient solutions both in the 
shout-echo and in the point-to-point models. 

The algorithm consists of four phases: set-up, cut, 
filtration, and termination; the first three phases are 
reduction techniques. The set-up phase (procedure SET-UP) 
enforces a set of constraints on the problem configuration; 
this might require a modification of the entire problem 
configuration. The cut phase (procedures CUT1 and CUT2) 
has the overall effect of further reducing the problem 
configuration so that the cardinality of the set of elements 
among which the sought element is to be found is linear in 
the rank of this element. Finally, the filtration phase 
(procedure FILTER) iteratively reduces the problem under 
consideration to one that can be efficiently solved at just one 
site by the termination phase. Assume, without loss of 
generality, that the original problem is finding the K-th 
smallest element (a "min" problem). The symmetrical 
problem of finding the K-th largest (a "max" problem) can be 
treated in a similar manner. 


2.1 Set-up phase 

A problem configuration <N, K, N[1], ..., N[d]> is
said to be regular if N[i] ≤ Δ for 1 ≤ i ≤ d, where
Δ = Min{K, N-K+1} is the problem size. The goal of the
SET-UP procedure is to transform the initial problem
configuration to a regular one.

Let C = <N, K, N[1], ..., N[d]> be the initial problem
configuration and, without loss of generality, let Δ = K. If
this configuration is not regular, then for all sites S_i with
N[i] > K the N[i]-K largest elements can obviously be
removed from consideration without altering the problem
integrity (i.e., the K-th smallest element will have the same
rank among the remaining elements); this observation has
also been made in [3,5]. After this simplification, the
resulting configuration will be C' = <N', K, N'[1], ..., N'[d]>
where N'[i] = Min{N[i], K} and N' = Σ_i N'[i].

It is possible, however, that this new configuration is
not regular once we change the problem type (i.e., select the
(N'-K+1)-th largest instead of the K-th smallest element).
For example, it may occur that Δ' = Min{K, N'-K+1} < K and
that N'[i] > Δ' for some i; in this case, the N'[i]-Δ' smallest
elements can still be removed from consideration from that
set without altering the problem integrity (i.e., the
(N'-K+1)-th largest among the original elements will be the
Δ'-th largest of the remaining elements). Notice that such a
modification will change the value of N' and thus possibly
K'; hence, the new configuration might again be not regular.

In the following, the variable T_j will represent the type
of the problem at iteration j: T_j = "max" ("min") will indicate
that the K_j-th largest (smallest) element is to be determined; the
function other on problem types is defined as follows:
other["max"] = "min" and other["min"] = "max".


PROCEDURE SET-UP
initialization:
    N_0 := N; K_0 := K; N_{i,0} := N[i]; T_0 := "min";
j-th iteration (j ≥ 1):
    K_j := Min{N_{j-1} - K_{j-1} + 1, K_{j-1}};
    if K_j = K_{j-1} then T_j := T_{j-1} else T_j := other[T_{j-1}];
    N_{i,j} := Min{N_{i,j-1}, K_j};
        <Meaning: remove from consideration
        the N_{i,j-1} - K_j largest (if T_j = "min") or
        smallest (if T_j = "max") elements still
        under consideration at site S_i>
    N_j := Σ_i N_{i,j};
    β_j := |{i : N_{i,j-1} > K_j}|;
    if β_j > 0 perform the next iteration, otherwise
    execute the termination step.
termination:
    d := Min{d, K_j}.
        <Meaning: If d > K_j and T_j = "min" ("max"), consider
        the smallest (largest) element still under
        consideration at each site; among these elements,
        identify the d - K_j largest (smallest) and remove
        from consideration the corresponding sites>
END SET-UP
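
The effect of SET-UP on a configuration can be traced with a sketch that works on cardinalities only. It is an illustration under stated assumptions, not the distributed procedure: no elements or messages are modelled, the function name and the returned dictionary are hypothetical, and the termination step is reduced to computing Min{d, K_j}.

```python
def set_up(site_sizes, K):
    """Simulate SET-UP on the cardinalities N[1..d] and the rank K.
    Returns the final rank, sizes, problem type and iteration count."""
    sizes = list(site_sizes)
    N, kind, iterations = sum(sizes), "min", 0
    while True:
        iterations += 1
        K_new = min(N - K + 1, K)
        if K_new != K:                      # a "flip": problem type changes
            kind = "max" if kind == "min" else "min"
        # beta_j counts the sites holding more than K_j elements
        beta = sum(1 for s in sizes if s > K_new)
        sizes = [min(s, K_new) for s in sizes]
        N, K = sum(sizes), K_new
        if beta == 0:
            break
    return {"K": K, "N": N, "sizes": sizes, "type": kind,
            "sites": min(len(sizes), K), "iterations": iterations}


result = set_up([10000, 20, 5, 5, 2], K=5937)
print(result)
```

For the configuration <10032, 5937, 10000, 20, 5, 5, 2> the rank goes from 5937 to 4096 and then to 33, and the procedure stops after three iterations with 65 elements left.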


It will now be shown that the procedure will iterate a
constant number of times. The j-th iteration will be called a
flip if K_j = N_{j-1} - K_{j-1} + 1 < K_{j-1}; i.e., if the type of
the problem is changed.
Property 1    Let j ≥ 1. If the (j+1)-th iteration is not a
flip, the procedure terminates at that iteration.

Proof. If the (j+1)-th iteration is not a flip,
K_{j+1} = K_j ≥ N_{i,j} for every site S_i; hence β_{j+1} = 0
and the algorithm terminates.


Property 2    Let j > 1. If the (j+1)-th iteration is a flip,
then β_j = 1.

Proof. Let the (j+1)-th iteration be a flip; then β_j > 0,
otherwise the algorithm would have terminated at the j-th
iteration. Let β_j > 1, and let S_x and S_y be two sites with
N_{x,j-1} > K_j and N_{y,j-1} > K_j; then N_j ≥ N_{x,j} + N_{y,j} = 2K_j.
That is, N_j - K_j + 1 ≥ K_j + 1 > K_j, which would imply
K_{j+1} = Min{K_j, N_j - K_j + 1} = K_j, contradicting the fact
that the (j+1)-th iteration was a flip. Thus β_j = 1.


Property 3    Let j > 1. Then the j-th and the (j+1)-th
iterations cannot both be flips.

Proof. By contradiction, let both the j-th and the (j+1)-th
iterations be flips; that is

    K_j = N_{j-1} - K_{j-1} + 1 < K_{j-1}          (1)

and

    K_{j+1} = N_j - K_j + 1 < K_j                  (2)

By property 2, β_{j-1} = 1; that is, only one site, say S_x, had
been reduced in the (j-1)-th iteration. Since N_{x,j-1} = K_{j-1},
site S_x will be reduced also in the j-th iteration; and since
iteration (j+1) is a flip, then (by property 2) β_j = 1; that is,
S_x will be the only site to be reduced in the j-th iteration.
Thus, N_j = N_{j-1} - (N_{x,j-1} - K_j) = N_{j-1} - K_{j-1} + K_j; by
substituting this expression for N_j in inequality (2), it
follows that K_j > N_j - K_j + 1 = N_{j-1} - K_{j-1} + K_j - K_j + 1
= N_{j-1} - K_{j-1} + 1, contradicting relation (1).


Theorem 1    Procedure SET-UP performs at most three
iterations.

Proof. Let it perform at least two iterations. If the second
iteration is not a flip, (by property 1) the algorithm
terminates at the end of this iteration; on the other hand, if
the second iteration is a flip then (by property 3) the third
iteration is not a flip and (by property 1) the procedure
terminates after the third iteration.


Theorem 2    Let κ and δ be, respectively, the rank of
the desired element and the number of sites under
consideration in the problem configuration produced by
Procedure SET-UP. Then there will be at most κδ ≤ κ^2
elements still left under consideration.

Proof. The configuration obtained from Procedure
SET-UP is regular; i.e., at most κ elements are still under
consideration at each site. Since in the termination step of
the procedure δ ≤ κ sites are kept under consideration, the
theorem follows.


2.2 Cut phase 
Let <N, K, N[1], ..., N[d]> be the problem
configuration. If this configuration is the result of the
SET-UP phase, then N[i] ≤ K = κ, 1 ≤ i ≤ d, and d = δ ≤ κ.
The CUT phase is the distributed translation of the CUT
procedure of the algorithm by Frederickson and Johnson for
selection in matrices with sorted columns [3]. This phase is
composed of two sub-phases. The initial sub-phase CUT1
reduces the number of elements under consideration to at
most Min{K log d, K log n}, where n = Max{N[i] | 1 ≤ i ≤
d}; sub-phase CUT2 guarantees that at most 4.5K elements
remain.

The j-th iteration of procedure CUT1 is applied only to
a subset, E^j, of the sites. The i_j = ⌈2^j (K+1)/d⌉-th smallest
element from each of these sites is collected. If a site has fewer
than i_j elements then +∞ will be used. Among these values,
the median is found and all values at all sites greater than or
equal to the median are removed from consideration. The
next iteration is then applied to half the sites participating in
this iteration.

Using the values i_j produced by CUT1 (0 ≤ j ≤ q),
each subfile X_m of site S_m can be broken down into the
subsections D_{m,j} = {x | x ∈ X_m, S_m[i_j] ≤ x < S_m[i_{j+1}]},
where S_m[w] denotes the w-th smallest element at S_m and
i_0 = 1. Assign the weight i_{j+1} - i_j to the initial subsection
element S_m[i_j]. Procedure CUT2 operates on this set of initial
subsection elements {S_m[i_j] | 1 ≤ m ≤ d, 0 ≤ j ≤ q}; among
these, it finds an element x* whose overall rank is at least K
and at most 4.5K, using a weighted selection algorithm
(e.g., see [4]). Finally all elements larger than x* are
removed from consideration.


PROCEDURE CUT1
(to be performed only if N > 4.5K)

1. Initialization
    E^0 := S; i_0 := 1.

2. Stage j (1 ≤ j ≤ q)

2.0 IF ⌈2^j (K+1)/d⌉ > n or |E^{j-1}| ≤ 1 THEN
    go to the termination step.

2.1 i_j := ⌈2^j (K+1)/d⌉

2.2 Let V^j denote the set of all S_m[i_j], the i_j-th
    ranked element of site S_m, where S_m ∈ E^{j-1}.
    Find the median v of V^j.

2.3 Send the median v to all sites.
    Upon receipt of this median v, each site
    eliminates all elements not smaller than v.

2.4 Let E^j = {S_m | S_m ∈ E^{j-1} and S_m[i_j] < v}.
    Perform the next iteration by repeating step 2.

3. Termination
    Let i_{j+1} := n + 1.

END CUT1
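
The sample ranks used by CUT1 and its two stopping conditions can be tabulated with a short sketch. It assumes the rank formula i_j = ceil(2^j (K+1)/d) and, as a simplification, that the participating set halves exactly at every stage; it does not simulate the elimination of elements. The function name is hypothetical.

```python
def cut1_ranks(K, d, n):
    """List the ranks i_1, i_2, ... sampled by CUT1, stopping when the
    rank exceeds the largest site size n or when (assuming exact
    halving) at most one site would still participate."""
    ranks = []
    sites = d                # assumed size of the participating set
    j = 1
    while True:
        i_j = (2 ** j * (K + 1) + d - 1) // d   # integer ceiling
        if i_j > n or sites <= 1:
            break
        ranks.append(i_j)
        sites //= 2
        j += 1
    return ranks


print(cut1_ranks(K=33, d=5, n=33))
print(cut1_ranks(K=10, d=8, n=100))
```

In the first call the rank condition stops the procedure; in the second the halving of the participating set does, after log d = 3 stages.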


Property 4    The K-th element of the original
problem is also the K-th element of the
result of CUT1.

Proof. At step j, V^j contains at least d/2^{j-1} elements, each
having rank at least 2^j (K+1)/d. Thus the median of V^j has
overall rank at least (2^j (K+1)/d)(d/2^j) = K+1; that is, the
K-th smallest element must be smaller than the median of
V^j. Since we only remove elements which are greater than or
equal to the median, the property follows.


Property 5    The number of stages of procedure
CUT1 is at most Min{⌈log d⌉, ⌈log n⌉}.

Proof. There are two conditions for termination in step
2.0. The first condition, ⌈2^j (K+1)/d⌉ > n, guarantees that
at most j ≤ ⌈log n⌉ iterations are performed. Since |E^j| ≤
⌊|E^{j-1}|/2⌋ for j ≥ 1, the second condition, |E^{j-1}| ≤ 1, ensures
at most ⌈log d⌉ iterations.


PROCEDURE CUT2
(to be performed only if N > 4.5K)
Let V^0 denote the i_0-th element from every site S_m ∈ E^0.

1 Select x* from the elements ∪_{0≤j≤q} V^j by a 4K-th
  weighted selection, where an element x ∈ V^j
  has weight i_{j+1} - i_j.

2 Send the element x* to all sites. Upon receipt of
  this element x*, each site eliminates all elements
  larger than x*.
END CUT2
Property 6    If K is the rank of the element to be
selected, then CUT2 retains at most
4.5K elements, and the K-th element
among these elements is also the K-th
element of the input problem to CUT2.

Proof. By Lemma 2 in [3].
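
Weighted selection, the primitive used by CUT2 (and later by FILTER), can be sketched as follows. This is a plain sequential definition by sorting, not the algorithm of [4]; the function name and the sample data are hypothetical.

```python
def weighted_select(items, target):
    """items: iterable of (value, weight) pairs with positive weights.
    Return the smallest value v such that the total weight of all
    values <= v is at least target."""
    total = 0
    for value, weight in sorted(items):
        total += weight
        if total >= target:
            return value
    raise ValueError("target exceeds the total weight")


# Three representatives standing for 2, 5 and 3 elements respectively.
print(weighted_select([(10, 3), (4, 2), (7, 5)], target=6))
```

Each representative stands in for a whole subsection, so its weight is the number of elements it represents; the call above returns 7 because the values 4 and 7 together account for 7 ≥ 6 units of weight.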


2.3 Filtration phase 


The filtration stage is the last reduction technique used
by our algorithm. It consists of a sequence of iterations;
each iteration reduces the number of elements under
consideration by a constant factor until the number of
candidate elements is less than or equal to a desired number,
N_f. The choice of N_f depends on the type of network under
consideration. Let <N, K, N[1], ..., N[d]> be the problem
configuration. If this configuration is the result of the CUT
phase, then N ≤ 4.5K.

The chosen filtration procedure is the technique 
developed in [5]. It should be noted that the distributed 
translation of the serial procedure REDUCE in [3] could also 
be used as a filtration technique; the choice here of the former 
technique rests on the fact that it guarantees a larger fraction 
of elements to be removed at each iteration. A full description 
and analysis of this technique can be found in [5]; an 
informal description is provided below. 


PROCEDURE FILTER
(Iterate until either the sought element is found or N ≤ N_f)

1. Among the medians of the elements still
   under consideration at each site, find the
   (N/2)-th weighted element x using
   weighted selection.

2. Determine the overall rank r of x. If r = K
   then terminate (x is the sought element).
   Otherwise, eliminate from consideration
   the elements not greater than x (if r < K) or
   not smaller than x (if r > K), and adjust K
   (if r < K) and N. If N ≤ N_f, terminate,
   otherwise proceed with the next iteration.

END FILTER
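
The two steps of FILTER can be simulated centrally as in the sketch below. It assumes distinct keys, takes the lower median at each site, weights each median by the number of elements at its site, and performs the final selection locally once at most N_f elements remain. The function name and these choices are assumptions for illustration; no communication is modelled.

```python
from bisect import bisect_left, bisect_right


def filter_select(sites, K, N_f=1):
    """Return the K-th smallest key (1-based) of the union of sites."""
    sites = [sorted(s) for s in sites if s]
    while sum(len(s) for s in sites) > N_f:
        N = sum(len(s) for s in sites)
        # Step 1: weighted median of the local medians.
        medians = sorted((s[(len(s) - 1) // 2], len(s)) for s in sites)
        total = 0
        for x, weight in medians:
            total += weight
            if total >= N / 2:
                break
        # Step 2: overall rank of x, then discard one side.
        r = sum(bisect_right(s, x) for s in sites)
        if r == K:
            return x
        if r < K:                         # drop elements <= x
            sites = [s[bisect_right(s, x):] for s in sites]
            K -= r
        else:                             # drop elements >= x
            sites = [s[:bisect_left(s, x)] for s in sites]
        sites = [s for s in sites if s]
    # Termination phase: collect what is left and select locally.
    rest = sorted(v for s in sites for v in s)
    return rest[K - 1]


print(filter_select([[1, 8, 15], [3, 4, 20, 22], [2, 9]], K=5))
```

Every iteration discards at least the pivot x itself, so the loop always makes progress; the 1/4 fraction quoted in Property 7 comes from the analysis in [5] and is not re-derived here.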


Property 7    The number of iterations required by
procedure FILTER to reduce N
elements to N_f elements is
2.41 log(N/N_f).

Proof. Since at least 1/4 of the elements are removed from
consideration after each iteration (see [5]), the property
follows.
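
The constant 2.41 follows from the 1/4 fraction: if at most 3/4 of the elements survive each iteration, then log_{4/3}(N/N_f) iterations suffice, and log_{4/3} x = log_2 x / log_2(4/3). The sketch below checks the arithmetic, assuming base-2 logarithms throughout; the function name is hypothetical.

```python
import math

# 1 / log2(4/3) is the factor that converts log2 into log base 4/3.
FILTER_CONSTANT = 1 / math.log2(4 / 3)


def filter_iterations(N, N_f):
    """Upper bound on FILTER iterations when each iteration keeps at
    most 3/4 of the elements still under consideration."""
    return math.ceil(math.log(N / N_f) / math.log(4 / 3))


print(round(FILTER_CONSTANT, 2))
```

For instance, going from 1024 elements to at most 4 takes no more than 20 iterations under this bound.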
2.4 Termination phase

Once the filtration phase is concluded, if the sought 
element has not yet been found, the N, elements still under 
consideration can be collected at a site where the final 
selection is performed. 

To perform this phase efficiently, the actual value N_f (a
parameter in the FILTER procedure) will depend on the type
of network under consideration. As shown later, the choice
for shout-echo is log κ (where κ is the value determined after
the set-up phase), and for point-to-point networks the choice
is d, where d is the original number of sites.


3. Shout-echo networks 

In shout-echo networks, each site can broadcast a
message and reply to a received message; the process of a
site broadcasting a message (the "shout") and receiving a
reply from all other sites (the "echo") constitutes a primitive
communication activity, called a shout-echo. Marberg and
Gafni presented a selection algorithm (procedure FILTER in
this paper); they also assume that a preliminary reduction
(equivalent to the first iteration of the set-up phase followed
by the termination stage) has already been performed (i.e.,
N ≤ Δ^2). Their algorithm, in the worst case, requires
4.82 log Δ^2 + log Δ = 10.64 log Δ shout-echoes [5].

An improvement in this bound for selection in
shout-echo networks can be obtained by using the selection
procedure described in the previous section. All phases can
be easily implemented using shout-echo primitives. The
procedure can actually be simplified in this implementation;
in fact, it is not necessary to perform the specific element
elimination steps of CUT1 (steps 2.2 - 2.4) in the
shout-echo model: by taking E^j = S for all j, CUT2 may be
performed immediately. However, an increase of local
storage space for file elements from O(d) to O(d log d) is
needed at the initiating site, where d is the number of sites
still under consideration when CUT is executed. By
choosing N_f = log κ in the FILTER procedure, the following
result is obtained.


Theorem 3    The number of shout-echoes used in the
selection algorithm is at most 6.82 log κ
+ O(1), where κ ≤ Δ is the problem size
obtained after the set-up phase.


Proof. The initialization and termination stages, as well as
each iteration of procedure SET-UP, can be implemented
using only a constant number of shout-echoes. Since the
procedure SET-UP can be performed in a constant number of
iterations (by Theorem 1), SET-UP requires O(1)
shout-echoes. After this procedure is executed, the size of the
problem is κ ≤ K; the number of sites still under consideration
is δ ≤ κ; and the maximum number of elements at any one site
is n ≤ κ.

During each iteration of CUT1, the collection of the
elements V^j, 1 ≤ j ≤ q, and the broadcast of the median can be
done with one shout-echo. Since the number of iterations in
CUT1 is bounded by Min{⌈log δ⌉, ⌈log n⌉} ≤ log κ + O(1), the
total number of shout-echoes needed to perform CUT1 is at most
log κ + O(1). Since the only communication activities needed
in CUT2 are the collection of the elements V^0 and the
broadcast of x*, only a constant number of shout-echoes are
needed to perform this procedure. Therefore at most
log κ + O(1) shout-echoes are needed to perform the procedure
CUT. By Property 6, the CUT procedure allows at most
4.5κ elements to remain. By Property 7, the number of
iterations used by the FILTER procedure is at most
2.41 log κ + O(1). Since each iteration uses 2 shout-echoes,
the number of shout-echoes used by the FILTER procedure
is at most 4.82 log κ + O(1). Since CUT takes at most
log κ + O(1) shout-echoes, and transferring the remaining
log κ elements takes no more than log κ shout-echoes, the
number of shout-echoes used in the selection algorithm is at
most 6.82 log κ + O(1).
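
The tally behind the constant 6.82 can be written out phase by phase. The display below only restates the counts given in the proof; it assumes the amsmath package for the align* environment.

```latex
\begin{align*}
\underbrace{O(1)}_{\text{SET-UP}}
 + \underbrace{\log\kappa + O(1)}_{\text{CUT}}
 + \underbrace{2 \times 2.41\,\log\kappa + O(1)}_{\text{FILTER}}
 + \underbrace{\log\kappa}_{\text{termination}}
 &= (1 + 4.82 + 1)\,\log\kappa + O(1) \\
 &= 6.82\,\log\kappa + O(1).
\end{align*}
```

The same accounting applied to the earlier algorithm, which runs FILTER on up to Δ^2 elements, gives 4.82 · 2 log Δ + log Δ = 10.64 log Δ.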


4. Point-to-point networks 

In sec. 4.1, a point-to-point selection algorithm for the
star graph shall be presented; this algorithm can obviously be
employed in any other network having a spanning star graph 
(e.g., complete networks). In sec. 4.2, the application of the 
SET-UP procedure to the existing algorithms for rings, 
meshes, and (complete binary) trees shall then be examined. 


4.1 Star networks 

In point-to-point networks, the communication
primitive is the transmission of a message to a neighbouring
site. For networks whose topology is a star (or contains a
spanning star subgraph), the shout-echo algorithm by
Marberg and Gafni [5] can be transformed so as to employ
19.28 ρ log Δ - 9.64 ρ log d + O(d) messages, where ρ =
Min{K, d}. An improvement in this bound is obtained by
executing all four phases (as for shout-echo networks) and
choosing N_f = d in the filtration phase, where d is the original
number of sites.

Theorem 4    The number of messages used in the
selection algorithm in a star network is
at most 9.64 δ log(κ/d) + O(d), where
κ ≤ Δ is the rank of the desired element
and δ ≤ ρ is the number of remaining
sites, respectively, in the problem
configuration produced from the
execution of Procedure SET-UP.


Proof. Since the procedure SET-UP can be performed in a
constant number of iterations (by Theorem 1), SET-UP can
be performed in O(d) messages. During each iteration of
CUT1, the collection of the elements V^j, 1 ≤ j ≤ q, and the
broadcast of the median can be done with 2|E^{j-1}| messages.
Since |E^j| ≤ |E^{j-1}|/2 by step 2.4 and |E^0| = δ ≤ d by step 1,
CUT1 uses at most O(δ) messages. Furthermore, since the
only communication activities in CUT2 are the collection of
the elements of set V^0 and the broadcast of x*, at most O(δ)
messages are needed for procedure CUT.

By Property 6, the CUT procedure allows at most 4.5κ
elements to remain. Therefore by Property 7, the number of
iterations used by the FILTER procedure is at most
2.41 log(κ/d) + O(1). Since each iteration of FILTER uses
4δ messages, the number of messages used by the FILTER
procedure is bounded by 9.64 δ log(κ/d) + O(δ). As the
transmission of the remaining d file elements in the termination
phase requires at most O(d) messages, the theorem follows.
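
To get a feel for the size of the improvement, the leading terms of the two star-network bounds can be compared numerically. The sketch below assumes base-2 logarithms and ignores the O(d) terms; the function names are hypothetical, and the sample values are those of the configuration <10032, 5937, 10000, 20, 5, 5, 2> discussed in the introduction.

```python
import math


def star_bound_new(kappa, delta, d):
    """Leading term 9.64 * delta * log(kappa / d) of Theorem 4."""
    return 9.64 * delta * math.log2(kappa / d)


def star_bound_old(Delta, rho, d):
    """Leading terms 19.28 * rho * log(Delta) - 9.64 * rho * log(d)
    of the bound for the transformed shout-echo algorithm."""
    return 19.28 * rho * math.log2(Delta) - 9.64 * rho * math.log2(d)


# kappa = 33, delta = rho = d = 5, Delta = 4096
new = star_bound_new(33, 5, 5)
old = star_bound_old(4096, 5, 5)
print(new < old)
```

The gap is large in this instance because κ is far smaller than Δ; when κ is close to Δ the two leading terms differ mainly by the constant factor.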
4.2 Other topologies

The existing algorithms for selection in other networks 
[2,6] can be made more efficient if the SET-UP procedure is 
used prior to their execution. 

The initialization and each iteration step of procedure
SET-UP can be easily implemented in rings, meshes and
complete binary trees using O(d) messages; the termination
step, however, should be implemented using the existing
topology-dependent algorithms for selection when there is
only one element at each site (e.g., [2]).

This implementation of procedure SET-UP will thus 
require O(d (log d)>) messages in the ring topology to reduce 
the number of elements, N, to at most «5 < x2. If the 
existing ring selection algorithm is then applied to the 
remaining elements, then at most O(d(logd)*log«) messages 
are needed. The resulting O(d (logd)*log(Max{k,d}) 
improving the existing bound of O(d(logd)*logN). 


Similarly in the mesh, SET-UP will take O(d(logd)>) 
messages to execute, thus enabling the existing algorithm to 
terminate in O(d(logd)!“logk) messages; the resulting O(d 
(log d)!/2log(Max{«,d}) improves on the existing 
O(d(logd)!/“logN) bound. 

In the complete binary tree, SET-UP will require O(d)
messages to reduce N to at most κδ, whereupon the existing
algorithm is called to complete the selection in O(d log κ)
messages. The resulting O(d log κ) bound improves on the
existing O(d log(2N/d)) bound whenever κ < N/d.


5. Conclusions 

The problem of selecting the K-th smallest element of a
set of N elements distributed among d sites of a
communication network has been examined. A collection of
distributed reduction techniques has been presented; the
combined use of these algorithms has been shown to yield
new solutions for the selection problem for both
point-to-point and shout-echo networks. The complexity of
these solutions has been analysed and it has been shown to
represent an improvement on the existing bounds.

There are still many open problems. For example, an 
efficient algorithm for arbitrary tree networks has not yet 
been developed. Another important open problem is to 
determine a lower bound for the distributed selection problem 
in point-to-point networks; the only existing lower bound is 
for complete binary trees [2]. 
Constructing a Balanced m-way Search Tree

Abstract : We present parallel algorithms
for constructing balanced m-way search
trees. These parallel algorithms have
time complexity O(1) for an n-processor
configuration.


1. INTRODUCTION 


The idea of using tree structures to
represent symbol tables, dictionaries or
directories has been extensively
studied [K]. In all these structures we
have a collection of records that are to
be manipulated with regard to a certain
key field in the record. Common
operations on these structures are SEARCH,
INSERT and DELETE. The tree structure
supports efficient INSERT, SEARCH and
DELETE operations. In some
implementations the operations are designed
so that a balanced tree is maintained
throughout the process. Another approach
is to periodically rebalance the tree. In
our discussion we refer to these
structures as dictionaries. The operations
INSERT, SEARCH and DELETE will be
referred to as basic dictionary operations.


Many types of balanced m-way search
trees are reported in the literature
[K,HS]. In our discussion we refer to the
following:
Def 1.1 : An m-way search tree, T, is a
tree in which all internal nodes are of
degree ≤ m. If T is empty then T is an
m-way search tree. When T is not empty it
has the following properties:
(1) T is a node of type
A_0, (K_1,A_1), (K_2,A_2), ..., (K_{m-1},A_{m-1}),
where the A_i, 0 ≤ i < m, are pointers to
the subtrees of T and the K_i, 1 ≤ i < m,
are key values.
(2) K_i < K_{i+1}, 1 ≤ i < m-1.
(3) All key values in subtree A_i are
less than the key value K_{i+1},
0 ≤ i < m-1.
(4) All key values in the subtree
A_{m-1} are greater than K_{m-1}.
(5) The subtrees A_i, 0 ≤ i ≤ m-1, are
also m-way search trees.
Def 1.2 : A balanced m-way search tree is
an m-way search tree with minimal height.

While many parallel architectures
have been proposed and studied, we deal
directly with only the MIMD model. We
assume there is a large common memory
that is shared by all processors. Any
processor can access any word in common
memory, but simultaneous access to the
same memory word is not allowed.
Recently, Moitra and Iyengar [MI]
explored a technique of transforming a
sequential algorithm for balancing a
binary search tree into an efficient
parallel algorithm. Furthermore, they
have shown that the resulting parallel
algorithm has a time complexity O(1) when
a tree with N elements is balanced with N
PEs. An O(log N) time setup overhead is
incurred when a new N is considered. In
this paper, we generalize the results of
[MI] to a general m-way search tree. We
also improve the computation such that no
setup overhead is required.


This paper is organized as follows. 
In section 2, we review some properties 
of m-way search trees. In section 3, we 
develop the general m-way search tree 
rebalancing/construction algorithm. In 
section 4, we discuss the implementation 
of these algorithms on an MIMD machine. 


2. M-WAY SEARCH TREES 


Dictionary searches are more efficient
when they are done on a balanced
tree. In order to keep the tree balanced,
the insertion and deletion algorithms are
designed to leave the tree balanced.
Because of this requirement it is more
convenient to consider balanced m-way
search trees that are not necessarily of
minimal height (e.g. B-trees). These
trees lend themselves to easier splitting
and combining. While this is true in the
serial case, we show that in the parallel
case the complete m-way tree proves to be
an efficient choice.

Def 2.1: A level labeling of an m-way
search tree is a labeling in which nodes
are numbered serially, top down left to
right. The root node is always labeled 1.
Lemma 2.1 : When the nodes of a complete
m-way tree are labeled by level labeling
then the m children of node i (if they
exist) are labeled $(i-1) \cdot m + 2 + j$,
$0 \le j \le m-1$.
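
Lemma 2.1 can be checked mechanically. The following is a minimal sketch, not part of the original paper: `children_labels` applies the formula of the lemma, and the hypothetical helper `bfs_labels` numbers a full m-way tree top down, left to right, so that the two can be compared.

```python
from collections import deque


def children_labels(i, m):
    """Level labels of the m children of node i (Lemma 2.1); the root is 1."""
    return [(i - 1) * m + 2 + j for j in range(m)]


def bfs_labels(m, levels):
    """Number a full m-way tree with the given number of levels serially,
    top down and left to right, and record the labels of each node's children."""
    next_label = 2
    kids = {}
    queue = deque([(1, 1)])  # (label, level)
    while queue:
        label, level = queue.popleft()
        if level == levels:
            continue  # last level: no children
        kids[label] = list(range(next_label, next_label + m))
        next_label += m
        queue.extend((child, level + 1) for child in kids[label])
    return kids
```

For example, with m = 3 the children of the root are labeled 2, 3, 4 and the children of node 2 are labeled 5, 6, 7, in agreement with the serial numbering.
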


Def 2.2 : In an m-way tree T, the
(i,j)'th node refers to the jth node from
the left on the ith level, if it exists.
We refer to this indexing method for
m-way search trees as two dimensional
indexing.
Def 2.3 : Inorder traversal of an m-way
search tree is defined by the following
recursive procedure:
procedure MINORDER (T)
{* T is an m-way tree as defined in
def. 1.1 *}
if T <> nil then
begin
  call MINORDER (A_0);
  for i = 1 to m-1 do
  begin
    if no K_i key then exit;
    visit(K_i);
    call MINORDER (A_i)
  end;
end;
end. {* MINORDER *}
Def 2.4 : An index can be associated with 
each key in an m-way search tree. If this 
index corresponds to the order in which 
MINORDER visits the keys we call the 
indexing, Inorder indexing. 
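
As an illustration of Def 2.3 and Def 2.4, the sketch below transcribes MINORDER into Python and derives the inorder index of every key. The `Node` class and the function names are assumptions made for this example only; they do not appear in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical node layout: keys K_1..K_k and pointers A_0..A_k."""
    keys: List[int]
    children: List[Optional["Node"]] = field(default_factory=list)


def minorder(t, visit):
    """Inorder traversal of an m-way search tree (Def 2.3)."""
    if t is None:
        return
    # A node without explicit children is treated as a leaf.
    kids = t.children or [None] * (len(t.keys) + 1)
    minorder(kids[0], visit)
    for i, key in enumerate(t.keys, start=1):
        visit(key)
        minorder(kids[i], visit)


def inorder_index(t):
    """Map each key to its inorder index (Def 2.4), starting from 1."""
    order = []
    minorder(t, order.append)
    return {key: pos for pos, key in enumerate(order, start=1)}
```

For a 3-way tree whose root holds 30 and 60 over the leaves (10, 20), (40, 50) and (70), the traversal visits 10, 20, 30, 40, 50, 60, 70, so key 40 has inorder index 4.
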
Lemma 2.2 Suppose node I is at level r of
a full m-way search tree of height n. Let
the inorder index associated with key
$K_i$ in I be $X_i$. Then
$X_{i+1} - X_i = m^{(n-r)}$ for $1 \le i < m-1$.
Lemma 2.3 a) Let nodes I and J be two
adjacent sibling nodes at level r of a
full m-way search tree of height n. Let
$I.X_{m-1}$ be the inorder index of the last
key in node I and $J.X_1$ the inorder
index of the first key in node J. Then
$J.X_1 - I.X_{m-1} = 2 \cdot m^{(n-r)}$.
b) The above claim is true for any two
adjacent nodes at level r.
Theorem 2.1 Consider a full m-way search
tree of height n. The inorder indexes for
keys in node i at level r (node (r,i))
are: $(i-1) \cdot m^{(n-r+1)} + j \cdot m^{(n-r)}$,
where $1 \le i \le m^{(r-1)}$, $1 \le j \le m-1$.


Corollary 2.1 A key with inorder index t
is at level r of a full m-way search
tree if and only if
$t \bmod m^{(n-r+1)} \ne 0$ and $t \bmod m^{(n-r)} = 0$.

Corollary 2.2 Let t be the inorder index
associated with a key at level r of an
m-way tree of height n and let
$q = \lfloor t / m^{(n-r+1)} \rfloor$. Keys with the same r
value (Cor. 2.1) and q value are in the
same node in the m-way search tree.
Moreover the two dimensional indexing of
this node will be (r,q+1).

Corollary 2.3 Let t be the inorder index
associated with a key at level r of an
m-way tree of height n. The position of
the key within the node is given by s,
$s = \lfloor (t \bmod m^{(n-r+1)}) / m^{(n-r)} \rfloor$,
$1 \le s \le m-1$.
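
Theorem 2.1 and Corollaries 2.1 to 2.3 describe a mapping and its inverse, which the following sketch makes concrete. It assumes a full m-way tree of height n with levels numbered from 1; the function names `key_index` and `locate` are illustrative and not taken from the paper.

```python
def key_index(n, m, r, i, j):
    """Inorder index of the j-th key of node (r, i) (Theorem 2.1)."""
    return (i - 1) * m ** (n - r + 1) + j * m ** (n - r)


def locate(n, m, t):
    """Inverse mapping: return (r, node position, s) for inorder index t."""
    for r in range(1, n + 1):
        # Corollary 2.1: t is a multiple of m^(n-r) but not of m^(n-r+1).
        if t % m ** (n - r) == 0 and t % m ** (n - r + 1) != 0:
            q = t // m ** (n - r + 1)                     # Corollary 2.2
            s = (t % m ** (n - r + 1)) // m ** (n - r)    # Corollary 2.3
            return r, q + 1, s
    raise ValueError("t is not an inorder index of this tree")
```

With n = 3 and m = 3 the tree holds the indexes 1 to 26; for instance `key_index(3, 3, 2, 2, 1)` is 12 and `locate(3, 3, 12)` returns `(2, 2, 1)`.
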
Lemma 2.4 Let n be the highest level in a
complete m-way tree. The inorder indexes
associated with keys in node (n,p) are
$(p-1) \cdot m + j$, where $1 \le j \le u$, (n,v) is
the last node in this level, $v \le m^{(n-1)}$,
and $u = m-1$ for all nodes except (n,v).


3. ALGORITHMS FOR REBALANCING AN M-WAY
SEARCH TREE


We associate a PE with each node in
the m-way balanced search tree. Let each
PE calculate the inorder indexes for
keys that should reside in the node. We
consider a full m-way search tree and
assume that each PE knows the height and
degree of the tree.


ALGORITHM 1
(* Assume that there are (m^h - 1)/(m-1)
PEs. The number of keys that are to be
associated with this tree is m^h - 1. *)
(I): (* each PE computes the two dimensional
index of the node it is associated
with, i.e. there is a mapping from i, the
PE index, to (q,r), 1 <= i <= n/(m-1),
1 <= q <= m^(r-1), 1 <= r <= h *)
for each PE i do
begin
  j <- floor(log_m i) ; (* i is the PE index *)
  if i > (m^(j+1) - 1)/(m-1)
    then r <- j+2 ;
         q <- i - (m^(j+1) - 1)/(m-1) ;
    else r <- j+1 ;
         q <- i - (m^j - 1)/(m-1) ;
end.
(II): (* Each PE represents a node (r,q).
Node (r,q) has m-1 inorder indexes associated
with it. Each inorder index is
associated with a unique key. The inorder
index for key K_s in node (r,q) is held in
X_s, 1 <= s <= m-1. *)

for each node (r,q) do
  for s <- 1 to m-1 do
    X_s <- ((q-1)*m+s) * m^(h-r) ;

(III): (* The m pointers for node (r,q)
are stored at A_s, 0 <= s <= m-1 *)

for each node (r,q) do
  for s <- 0 to m-1 do
    if r < h then
      A_s <- node ((q-1)*m+s+1,r+1)
    else
      A_s <- null
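
The three steps of Algorithm 1 can be simulated sequentially. The sketch below is one possible reading of the algorithm, not the authors' implementation: each loop iteration stands for one PE, the names `node_position`, `build_full_tree` and `inorder` are hypothetical, and pointers are stored as (level, position) pairs, whereas the listing writes the position first.

```python
def node_position(i, m):
    """Step (I): map the PE index i (a level label) to (level r, position q)."""
    j = 0
    while m ** (j + 1) <= i:      # j = floor(log_m i), in integer arithmetic
        j += 1
    full = (m ** (j + 1) - 1) // (m - 1)   # nodes in levels 1..j+1
    if i > full:
        return j + 2, i - full
    return j + 1, i - (m ** j - 1) // (m - 1)


def build_full_tree(m, h):
    """Steps (II) and (III) for a full m-way tree of height h."""
    pes = (m ** h - 1) // (m - 1)
    tree = {}
    for i in range(1, pes + 1):   # independent per PE, so O(1) in parallel
        r, q = node_position(i, m)
        keys = [((q - 1) * m + s) * m ** (h - r) for s in range(1, m)]
        if r < h:
            ptrs = [(r + 1, (q - 1) * m + s + 1) for s in range(m)]
        else:
            ptrs = [None] * m
        tree[(r, q)] = (keys, ptrs)
    return tree


def inorder(tree, pos=(1, 1)):
    """Collect the stored inorder indexes by an inorder traversal."""
    if pos is None:
        return []
    keys, ptrs = tree[pos]
    out = inorder(tree, ptrs[0])
    for s, key in enumerate(keys, start=1):
        out.append(key)
        out.extend(inorder(tree, ptrs[s]))
    return out
```

If the construction is right, traversing the result in inorder must list the indexes 1 to m^h - 1 in increasing order, which gives a simple sanity check for small m and h.
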


Theorem 3.1 Algorithm 1 correctly constructs
the required full m-way search
tree within O(1) time if O(n) PEs are
available.


We now consider the general case
where the number of keys can be any positive
integer. Algorithm 2 will produce a
complete m-way search tree for the given
number of keys. Only the last node in the
tree (greatest level label) can have fewer
than m-1 keys associated with it. We
assume that each PE knows the number of
keys and the degree of the search tree.
These values might be passed to the PE as
procedure parameters.


ALGORITHM 2
(* The input keys have inorder indexes in
the range [1 : n], n can be any positive
integer *)

(I): Compute the two dimensional index of
each node as in (I) of algorithm 1.

(II): (* Compute the parameters u, v and
w of the tree. They stand for the
number of keys in the last node, the
number of the full nodes in the last
level and the inorder index of the right
most key in the last level. *)

for each PE i do
begin
  h <- floor(log_m n) ;
  if n = m^(h+1) - 1 then (* full tree *)
    execute II and III of algorithm 1 and
    stop ;
  c <- n - (m^h - 1) ;
  u <- c mod (m-1) ;
  v <- floor(c/(m-1)) ;
  if u = 0 then w <- v*m - 1 else w <- v*m + u ;
end.
(III): (* Compute the inorder indexes
that are directly influenced by the last
level. These will be associated with
nodes in the left part of the tree *)
for each node (q,r) do
begin
  s <- 1 ;
  loop
    temp <- ((q-1)*m+s) * m^(h-r+1) ;
    if temp > w or s = m then exit ;
    X_s <- temp ;
    s <- s+1 ;
  forever
end.
(IV): (* Compute the inorder indexes for
the rest of the tree *)
for each node (q,r) with r < h+1 do
begin
  s <- m-1 ;
  loop
    temp <- ((q-1)*m+s) * m^(h-r) + c ;
    if temp <= w or s = 0 then exit ;
    X_s <- temp ;
    s <- s-1 ;
  forever
end.
(V): (* Compute pointer values *)
for each node (q,r) do
begin
  if u = 0 then q_1 <- v else q_1 <- v+1 ;
  (* node (q_1,h+1) contains w *)
  q_2 <- floor((q_1 - 1)/m) + 1 ;
  (* (q_2,h) is the parent of (q_1,h+1) *)
  j <- (q_1 - 1) mod m ;
  (* A_j of (q_2,h) points to (q_1,h+1) *)
  for s <- 0 to m-1 do
    if r < h or (r = h and q < q_2)
       or (r = h and q = q_2 and s <= j)
    then A_s <- node ((q-1)*m+s+1,r+1)
    else A_s <- null ;
end.
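
Algorithm 2 can be simulated in the same way. This is a sketch under stated assumptions rather than the authors' code: the two loops stand in for the per-node PEs, `build_complete_tree` and `inorder` are hypothetical names, pointers are stored as (level, position) pairs, the exit test in step (IV) is read as temp <= w, and the full-tree special case is not branched on because the general formulas produce the same result there.

```python
def build_complete_tree(m, n):
    """Complete m-way search tree over the inorder indexes 1..n."""
    h = 0
    while m ** (h + 1) <= n:          # h = floor(log_m n)
        h += 1
    # Step (II): parameters of the last level (level h+1).
    c = n - (m ** h - 1)              # keys stored in the last level
    u = c % (m - 1)
    v = c // (m - 1)
    w = v * m - 1 if u == 0 else v * m + u
    # Shared part of step (V): locate the parent of the last node.
    q1 = v if u == 0 else v + 1
    q2 = (q1 - 1) // m + 1
    j = (q1 - 1) % m
    tree = {}
    for r in range(1, h + 2):
        width = m ** (r - 1) if r <= h else q1
        for q in range(1, width + 1):         # one PE per node
            x = {}
            for s in range(1, m):             # step (III)
                temp = ((q - 1) * m + s) * m ** (h - r + 1)
                if temp > w:
                    break
                x[s] = temp
            if r < h + 1:                     # step (IV)
                for s in range(m - 1, 0, -1):
                    temp = ((q - 1) * m + s) * m ** (h - r) + c
                    if temp <= w:
                        break
                    x[s] = temp
            ptrs = []                         # step (V)
            for s in range(m):
                if r < h or (r == h and q < q2) or (r == h and q == q2 and s <= j):
                    ptrs.append((r + 1, (q - 1) * m + s + 1))
                else:
                    ptrs.append(None)
            tree[(r, q)] = ([x[s] for s in sorted(x)], ptrs)
    return tree


def inorder(tree, pos=(1, 1)):
    """Collect the stored inorder indexes by an inorder traversal."""
    if pos is None:
        return []
    keys, ptrs = tree[pos]
    out = inorder(tree, ptrs[0])
    for s, key in enumerate(keys, start=1):
        out.append(key)
        out.extend(inorder(tree, ptrs[s]))
    return out
```

For m = 3 and n = 12 the sketch yields h = 2, c = 4, u = 0, v = 2 and w = 5: the last level holds the nodes (1, 2) and (4, 5), their parent holds (3, 6), and the root holds (7, 10).
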


Lemma 3.1 The following are true for a
complete m-way search tree : (i) The
right most node of level h+1 is
node (v,h+1) when u = 0 and node (v+1,h+1)
otherwise. (ii) The last element of the
right most node at level h+1 is w.
Theorem 3.2 Algorithm 2 correctly constructs
the required complete m-way
search tree within O(1) time if O(n) PEs
are available.


4. IMPLEMENTATION AND CONCLUSIONS 


In section 3, we have presented
optimal parallel algorithms for rebalancing
or constructing balanced m-way search
trees. If n keys are to be associated
with the tree then the construction can
be carried out in O(1) time using O(n)
PEs. A more detailed discussion can be
found in [DPI]. When the dictionary is
stored in external memory, the storage
structure is chosen so that the number of
I/O operations is minimized. An m-way
search tree is a popular choice; m is
selected to fit the physical characteristics
of the external storage [HS].


The above observations can be
translated quite effectively to practice
in our MIMD environment. The system can
initiate any number of searches in a
"pipelined" fashion. Each search is conducted
using only one PE, leaving one
machine cycle between consecutive
requests. Search results can be obtained
in a pipeline interval of O(1). While
some PEs are conducting searches other
PEs are free to perform other tasks.


Our solution is applicable to a
general purpose machine environment. The
m-way search tree is kept in external
storage. At any point of time k PEs are
available, 0 < k < P (P is the maximal
number of PEs available on the
machine). In such a machine the operating
system can be instructed to allocate
only one processor for a search operation
and as many PEs as are available or required
(whichever is the minimum) in case a
new tree has to be constructed or an
existing tree rebalanced.


ABSTRACT


This paper presents a stream-oriented 
parallel processing scheme for relational database 
operations (relational operations). By combining 
the stream processing principle with demand-driven 
control, this scheme exploits the pipeline 
concurrency inherent in a query and makes it 
possible to execute relational operations with 
limited resources. This paper presents a basic 
algorithm for this stream-oriented parallel 
processing scheme and discusses its features and 
effectiveness. 


1. Introduction


The relational data model [1] proposed by E. 
F. Codd has received much attention as a practical 
data model based on set theory. However, many
problems remain to be solved in applying the model 
to practical applications. In particular, 
improvement of the performance in executing 
relational operations is essential in implementing 
a relational database system. 

Most conventional parallel processing 
schemes for relational operations were designed 
to improve the execution performance of an 
individual relational operation, such as the join, 
selection or projection operations ([4],[6],[7],
[8] et al.).

This paper presents a novel parallel
processing scheme called the stream-oriented
parallel processing scheme (stream scheme). This
scheme is based on the stream processing principle
[2], and exploits the parallelism inherent in a
query. The stream scheme is attractive not only
for the design of a dedicated database machine but
also for the design of a distributed query
processing system [10]. In particular, when memory
and processor resources are limited, this scheme
makes efficient use of those resources. The
conventional schemes do not alleviate problems due
to the complexity of resource management, such as
memory overflow when executing relational
operations. For example, processing performance
decreases drastically when the amount of main
memory available is less than that required to
process the operand relations. By combining the
stream processing principle with demand-driven
control, the stream scheme enables relational
operations to be executed with limited available
memory and to effectively exploit the parallelism
of pipeline processing.

This scheme can also be applied to query
processing in a sequential machine. In such a
case, relational operations are described as
concurrent processes, and each relational
operation for query processing is executed in
quasi-parallel.

In the stream scheme, since the scheme to 
execute an individual relational operation is not 
restricted, efficient conventional schemes can be 
employed as the internal processing scheme for a 
relational operation. 

The main features of the stream scheme are as 
follows:

(1) Intermediate data does not cause memory 
overflow even in query processing in which the 
join operation, Cartesian-product operation or 
union operation creates a large intermediate 
relation. 

(2) Concurrent pipeline processing is performed 
between relational operations in a query. The 
scheme exploits parallelism by adapting to 
existing system resources. 

(3) This scheme enables a relational operation to 
be activated when a subset (several tuples) of 
each operand relation is ready to be processed. As 
a result, the response time required to obtain the 
earlier parts of resulting tuples of a query can 
be shortened.


2. An Algorithm for Relational Operations 


A query can be thought of as a tree whose
nodes represent a set of relational operations.
The leaves of the tree only reference source
relations of the database. In the stream scheme, a
query is executed as shown in Fig.1. We define the
consuming node of intermediate tuples produced by
a node as the "upper node" (upper relational
operation node), and the producing node of
intermediate tuples as the "lower node" (lower
relational operation node). Each node has one or
two input buffers. When a relational operation
node receives a demand from the upper node, it
accesses a page in its input buffer and then
executes the relational operation until it
completes the production of one resulting page of
tuples in the output buffer. The output buffer is
then treated as the input buffer for the upper
node. Here, "access a page" means that a node is
ready to perform a relational operation on a page
in its input buffer.

Individual relational operation nodes do not
create a whole intermediate relation for a single
demand. Each one creates only one page of tuples
for a single demand. Individual buffers do not
require the capacity to store an entire
intermediate relation. The size of each page is
set to half of the corresponding buffer size,
that is, each buffer has two storage areas so that
two pages can be stored. As the sizes of buffers
assigned to relational operation nodes are
different, the page sizes are also different.
Immediately after a node begins to access a page
in one of two areas of its input buffer, it sends
a demand to the lower node to refill the other
area of the input buffer with the subsequent
page. In this algorithm, it is assumed that the
double buffering mechanism is supported in every
buffer. That is, while the node accesses a page in
one area of its input buffer to execute a
relational operation, the lower node can store a
page in the other area of the same buffer at the
same time.

As a result, concurrent pipeline processing
is performed between relational operation nodes.
By using demand-driven control, unary relational
operations (the selection, restriction and
projection operations), and binary relational
operations (the join, union, intersection,
difference and Cartesian-product operations) can
be concurrently executed with limited memory
resources. In particular, this algorithm shows
attractive advantages in executing the join, union
and Cartesian-product operations, which are the
most time- and resource-consuming operations.




2.1 Unary Relational Operations 


A unary relational operation node has one 
input buffer and one output buffer. Unary 
relational operations are executed by the 
following algorithm. 


Step (1) When the unary relational operation node 
receives a demand from the upper node, it 
repeatedly executes Steps (2) and (3) until one 
page made up of the resulting tuples is created 
and stored in the output buffer. 


Step (2) A single page is accessed in one area 
of the input buffer. At this time, the other area 
of this input buffer becomes available and a 
demand is pre-issued to the lower node to refill 
the other area of that buffer with the subsequent 
page. As a result, pipeline concurrent processing 
is implemented between this unary relational 
operation node and the lower node. 


Step (3) The relational operation is executed on 
the page that has just been accessed in Step (2), 
and the resulting tuples are stored in the output 
buffer. Once a page in the input buffer is 
manipulated in this step, the page is deleted from 
the buffer. If the output buffer is filled with a 
page of resulting tuples, the execution is 
suspended at this point and the node waits for the 
next demand from the upper node. Otherwise, Steps 
(2) and (3) are executed repeatedly. If the page 
being manipulated is the last one of the operand 
relation, the execution of this relational 
operation is terminated. 


For the projection operation, if the primary 
key attribute is not included as an operand 
attribute, it is necessary to eliminate duplicate 
tuples. Therefore, the resulting tuples in the 
output buffer of the projection node must not be 
deleted even after the upper node has completed
manipulating those tuples. That is, the projection 
node must reserve the output buffer area to 
provide storage for all of the tuples resulting 
from the projection operation. 
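
The demand-driven behaviour of Steps (1) to (3) maps naturally onto Python generators, where each `next()` call by the consumer plays the role of one demand. The following is a simplified sketch and not the paper's implementation: the names `pages` and `select_node` are hypothetical, the selection operation is used as the example, and the pre-issued demand with double buffering is not modelled because a generator pipeline runs sequentially.

```python
def pages(tuples, page_size):
    """Lower node: produce the relation one page at a time, only on demand."""
    for start in range(0, len(tuples), page_size):
        yield tuples[start:start + page_size]


def select_node(lower, predicate, out_page_size):
    """Unary node: produce one output page per demand from the upper node."""
    out = []
    for page in lower:                 # Step (2): access the next input page
        for t in page:                 # Step (3): execute the operation
            if predicate(t):
                out.append(t)
                if len(out) == out_page_size:
                    yield out          # output page full: suspend until the next demand
                    out = []
    if out:                            # last, possibly partial, output page
        yield out
```

Calling `next()` once on `select_node(pages(list(range(100)), 10), lambda t: t % 3 == 0, 4)` reads only the first input page and returns `[0, 3, 6, 9]`; no further input pages are requested until the next demand.
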


Fig. 1  An Overview of the Stream Scheme
(legend: relational operation node; the tuples
that have been already produced; buffer (double
buffering mechanism is supported); direction of
stream)


2.2 Binary Relational Operations


In binary relational operation nodes, the 
operation is carried out by comparing each page of 
the outer-relation with all of the pages of the 
inner-relation. That is, the nested-loop algorithm 
is applied to the page-level granularity. The 
comparison algorithm applied to the tuple-level 
granularity within a page is not restricted. 

The sort-merge, nested-loop or hashing 
algorithms can be used to compare the tuples of 
two different pages. Even dedicated 
multiprocessors can also be utilized to compare 
two operand pages. Each binary relational 
operation node has two input buffers and one 
output buffer. One of the input buffers is used 
for storing outer-relation pages and the other 
for storing inner-relation pages. Binary 
relational operations are executed by the 
following algorithm. 


Step (1) When the binary relational operation node 
receives a demand from the upper node, it executes 
Steps (2), (3) and (4). 


Step (2) One outer-relation page is accessed in
one of two areas of the input buffer. Then, a
demand is pre-issued to the lower node which
produces outer-relation pages to refill the other
area of that buffer with the subsequent
outer-relation page. By issuing the demand in
advance before executing the relational operation,
the production of the following operand page in
the lower node overlaps with the execution in this
relational operation node.


Step (3) One inner-relation page is accessed in
one of two areas of the input buffer. Then, a
demand is pre-issued to the lower node to refill
the other area of the input buffer.


Step (4) The relational operation is executed by 
comparing the outer-relation page accessed in (2) 
with the inner-relation page accessed in (3). The 
resulting tuples are stored in the output buffer. 
When one of two areas of the output-buffer is 
filled as the result of the demand received in 
(1), the execution is suspended until the next 
demand is issued from the upper node. Otherwise, 
(3) and (4) are executed repeatedly. If the inner- 
relation page being compared with the outer- 
relation page is the last page and the outer- 
relation page is not the last one, then the lower 
node which creates inner-relation pages is 
initialized and control returns to (2). (This 
means the reproduction of the inner-relation by 
the lower node. The inner-relation is reproduced 
in order to compare all inner-relation pages with 
each outer-relation page. If all of the outer- 
relation pages are stored in the input buffer, 
this reproduction is not required.) If both of the 
pages being compared with each other are last 
pages, the execution of the relational operation 
is terminated. 
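
Steps (1) to (4) amount to a nested loop at page granularity in which the inner relation is reproduced for every outer page. The sketch below illustrates this with generators; it is an assumption-laden simplification, not the authors' code. The names `pages` and `join_node` are hypothetical, `make_inner_pages` is a callable that restarts the lower node producing the inner relation, and pre-issued demands are again not modelled.

```python
def pages(tuples, page_size):
    """Lower node: produce a relation one page at a time, only on demand."""
    for start in range(0, len(tuples), page_size):
        yield tuples[start:start + page_size]


def join_node(outer_pages, make_inner_pages, match, out_page_size):
    """Binary node: one output page per demand, nested loop over pages."""
    out = []
    for outer in outer_pages:                # Step (2): next outer-relation page
        for inner in make_inner_pages():     # Step (3): inner relation reproduced
            for a in outer:                  # Step (4): tuple-level nested loop
                for b in inner:
                    if match(a, b):
                        out.append((a, b))
                        if len(out) == out_page_size:
                            yield out        # suspend until the next demand
                            out = []
    if out:
        yield out
```

The tuple-level comparison inside Step (4) is a plain nested loop here; as the text notes, a sort-merge or hashing comparison could be substituted without changing the page-level control.
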


3. Design Considerations 


3.1 Stream Processing

This section discusses the basic concepts and
properties of the stream scheme. In Fig.1, the
relational operation nodes (node-1, node-3 and
node-4) for a query can be executed in parallel
because these relational operations are
independent. Furthermore, the concurrency among
relational operation nodes which are vertically
connected can also be exploited in the stream
scheme.

A stream is an ordered sequence of data 
which are arranged according to the order of 
production. In this scheme, each element of a 
stream corresponds to a tuple or a page in a
relation. Therefore, the ordered sequence of 
tuples or pages is manipulated as a stream. In 
processing relational operations, it is not 
necessary to wait for the production of complete 
operand relations. Processing of relational 
operations can start immediately after an 
earlier part of the stream is constructed. 

Criteria for exploiting parallelism are given
here. We will consider a subquery consisting of
two join operation nodes as shown in Fig.2. It is
assumed that the two operations are executed in
different processors and each operation is
executed by using the nested-loop algorithm for
tuple-level granularity in comparing two operand
pages. For the sake of simplicity, it is also
assumed that communication overhead between
processors does not exist. This is a reasonable
assumption because the execution process of a
relational operation can overlap the communication
process. Furthermore, the execution time is much
longer than the communication time when a single
processor is used in each node. If a
multiprocessor is allocated to each node, the
communication overhead may affect the parallelism
of stream processing.

Fig. 2  Pipeline Concurrency in a Subquery


If the join-1 node completes producing the
next operand page while the join-2 node executes
the comparison of two operand pages currently
stored in the input buffers, pipeline delay will
be eliminated. That is, in the join-2 node, the
suspension of the relational operation execution
which occurs due to the absence of the next
inner-relation page causes pipeline delay. The
criterion is given by the following formula.
"jsf1" represents the join selectivity factor of
the join-1 node. The number of tuples of an
intermediate relation is jsf1*(the number of
tuples in the outer-relation)*(the number of
tuples in the inner-relation). "bs1" and "bs2" are
the sizes (the number of tuples) of buffers to
store an outer-relation page and an inner-relation
page in the join-2 node.

1) The number of times the tuples in an
inner-relation page and the tuples in an
outer-relation page are compared in the join-2
node is


bs1*bs2. (1)


2) The number of comparisons in the join-1 node
needed to produce an inner-relation page for
the join-2 node using the nested-loop algorithm
(1/jsf1 is the average number of comparisons to
generate one tuple by the nested-loop algorithm)
is


bs2/jsf1. (2)


3) The condition to continue processing in the
join-2 node without suspension is


(1) = (2), that is,
bs1 = 1/jsf1. (3)


In general, the criterion to exploit the 
maximum parallelism between two relational 
operation nodes is 


buffer size for an outer-relation page 
= 1/join selectivity factor of the lower node. (4) 


If this criterion is satisfied between two 
nodes in two different processors, highly 
concurrent processing can be realized. 

If the sort-merge algorithm or the hashing 
algorithm is used in the comparison between an 
inner-relation page and an outer-relation page, 
approximately the same criterion as that used for 
the nested-loop algorithm is given. If a multi- 
processor is used to implement the nested-loop 
algorithm in each node and the communication 
overhead among processors is not considered, the 
criterion is 


bs1 = P2/(jsf1*P1). (5)
(P1 and P2 are the numbers of processors allocated
to the join-1 and join-2 nodes, respectively.)
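
The criteria (1) to (5) are simple enough to evaluate directly. The following sketch only restates the formulas above; the function names are hypothetical, and the multiprocessor case assumes, as the text does, that communication overhead is ignored and that the comparison work divides evenly among the processors of a node.

```python
def comparisons_join2(bs1, bs2):
    """Formula (1): comparisons in join-2 for one outer and one inner page."""
    return bs1 * bs2


def comparisons_join1(bs2, jsf1):
    """Formula (2): comparisons in join-1 needed to produce bs2 tuples."""
    return bs2 / jsf1


def balanced_outer_buffer(jsf1, p1=1, p2=1):
    """Formulas (3) and (5): outer-relation page size bs1 that balances the
    two nodes, with p1 and p2 processors in join-1 and join-2."""
    return p2 / (jsf1 * p1)
```

For example, with jsf1 = 0.001 and one processor per node the balanced outer-relation page size is about 1000 tuples; giving join-1 two processors halves it to about 500.
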


3.2 Demand-driven Control


The stream constructed by the pipeline 
processing can be controlled according to the 
buffer capacity by demand-driven control. The 
stream flowing on the pipeline is limited and the 
computation on the stream can be carried out in a 


limited resource environment. That is, buffer 
overflow does not occur even if the join, 
Cartesian-product or union operations create a 
large intermediate relation. However, if a whole 
outer-relation is not stored in the input buffer 
in processing a binary relational operation, the 
inner-relation must be reproduced as described in 
Subsection 2.2. In this case, a source relation, 
or an intermediate relation produced by applying 
selection and projection operations prior to the 
binary relation, must be re-accessed. Therefore, 
when the reproduction of an intermediate relation 
causes a heavy overhead, the input buffer for 
outer-relation pages should have enough capacity 
to store the whole outer-relation. In such a case, 
we may change Step (4) of the algorithm described 
in Subsection 2.2 as follows: 


To avoid having to reproduce the inner-relation, 
all of the outer relation pages are produced all 
at once in Step (2). At this time, the outer- 
relation pages overflowing from the input buffer 
are stored in secondary storage. In Step (4), each 
of the outer-relation pages is compared with the 
inner-relation page, which was accessed in Step 
(3), until the output-buffer is filled. When the 
output buffer is full, the execution is suspended 
until the next demand is issued from an upper 
node. If the outer-relation page being compared is 
the last page, control returns to Step (3). If 
both pages being compared are last pages, the 
execution of the relational operation is 
terminated. 


3.3 Design Approaches 


Two approaches to the design of a relational 
database system based on the stream scheme can be 
considered. One approach is based on developing a 
distributed query processing system on 
conventional computers connected to a local area 
network. In general, in the local area network 
environment, the number of processors used for 
database processing is restricted. Also, the 
capacity of main memory allocated to each 
processor is restricted. The stream scheme is 
attractive in such a resource-limited environment.
This approach is presented in [10]. 

The other approach is based on developing a
dedicated machine for the stream scheme. One
candidate for a dedicated machine is a
dataflow machine with a demand-driven control
mechanism. It is well known that dataflow
computation is suitable for the exploitation of
the parallelism inherent in a functional program.
Relational operations can be described in a
functional programming language providing the
specifications of eager and lazy evaluation
mechanisms [9]. Furthermore, the eager and lazy
evaluation mechanisms are useful for realizing
pipeline processing and demand-driven control [11].
Therefore, by specifying these mechanisms in
describing the programs of relational operations,
the stream scheme can be implemented on a dataflow
machine supporting eager and lazy evaluation
mechanisms.


4. Performance Considerations


4.1 Effects of Stream Processing 


The stream scheme was experimentally examined
under UNIX 4.2BSD on a Sun-2
workstation [14]. The experimental environment for
stream processing is summarized as follows:


(1) Each relational operation node is described as 
a coroutine, and pipeline processing among 
relational operation nodes is simulated. 

(2) A whole outer-relation is retained in an input
buffer, that is, an inner-relation does not have to
be reproduced.

(3) In the join node, the sort-merge algorithm is 
used to compare an inner-relation page with an 
outer-relation page. An inner-relation page and an 
outer-relation page are first sorted on joining 
attributes, respectively. After that, these pages 
are joined by using the merge operation. It is 
assumed that the sort-merge algorithm is 
implemented by a sequential process. 

(4) In this environment, the situation in which 
the demand-driven control is not employed can also 
be simulated. We call this situation "non-demand- 
driven case". In the non-demand-driven case, 
though relational operation nodes are activated 
on the basis of the page-level granularity in the 
same manner as in the demand-driven case, the
production of pages is not controlled by the 
demand-driven mechanism. Therefore, although 
pipeline concurrency can be exploited, 
intermediate pages may cause buffer overflow. 

(5) The communication time between processors is 
set to 16.6 msec/2k bytes. It is assumed that 
communications are performed exclusively. That is, 
when two processors are communicating, no other 
processors can communicate with each other. 

The sample query shown in Fig.3 is chosen 
from the benchmarks described in [13]. This query 
consists of two join operations and two selection 
operations. The parameter settings of source 
relation sizes, join selectivity factors (jsf), 
selection selectivity factors (ssf) and buffer 
sizes are shown in Table 1. Tuples in all
relations are 64 bytes long, each including two 4- 
byte integer attributes as joining attributes. 
All integer attributes have uniformly distributed 
values, but the range of their distributions 
varies in order to provide different join 
selectivity factors. 

The experimental results are shown in Table 2.
As the join selectivity factors (jsf1 and jsf2) of
the join nodes (join-1, join-2) increase, the first
response time (the time to obtain the first
resulting page of a query) becomes shorter in the
stream scheme. On the other hand, the response
time (the time to obtain all of the resulting
pages) becomes longer as the jsf increases
because the number of tuples in the intermediate
relation manipulated by the join-1 node becomes
larger. However, in comparing the case of
jsf1=jsf2=0.002 with the case of jsf1=jsf2=0.001,
though the number of tuples in the intermediate
relation manipulated by the join-1 node doubles,
the increase of the response time is smaller. This
is because join-1 and join-2 processing overlap
due to the pipeline effect.

In comparing the experiment-1 (buffer1 =
buffer3 = buffer5 = 10(tuples)) with the
experiment-2 (buffer1 = buffer3 = buffer5 =
100(tuples)), the first response time in the
experiment-1 is shorter than that in the
experiment-2. This is because the time to activate
the join-2 and join-1 nodes can be shortened in the
small buffer case. On the other hand, the response


Fig. 3  Sample Query
Table 1  Parameter Settings

source relation size : R1 = 1000(tuples),
    R2 = R3 = 10000(tuples)
tuple size : 64 bytes
buffer size : buffer2 = buffer4 = 1000(tuples)
    buffer1 = buffer3 = buffer5 = 10(tuples) (experiment-1)
    buffer1 = buffer3 = buffer5 = 100(tuples) (experiment-2)
selection selectivity factor (ssf) :
    ssf1(selection1) = ssf2(selection2) = 0.1
    ((intermediate relation size) = ssf*(relation size))
join selectivity factor (jsf) : jsf1(join1) = jsf2(join2)

jsf (jsf1=jsf2)    intermediate relation size
                   (result relation size of each operation)
                   1000  1000  100   10
                   1000  1000  1000
                   1000  1000  2000

Table 2 Response Time 


Experiment-1 Experiment-2


(buffer-1,3,5 = 10(tuples)) (buffer-1,3,5 = 100(tuples)) 


Non-demand- Demand-driven Non-demand- 
| driven driven 
Dat 
7.4 


24 ‘ : 5.8 5.8 5.7 
6 ae 4.4 7.4 4.4 
Se, : 3.8 11.0 3.8 11.0 


FR Time : First Response Time (sec) 
R Time : Response Time (sec) 


time in the experiment-1 becomes longer than that
in the experiment-2. This is because the
communication between two nodes causes
overhead in the smaller granularity case.
Furthermore, since the sort-merge algorithm is
used in comparing two operand pages, the total
number of comparisons increases in the
smaller granularity case.

In comapring the response time in the demand- 
driven case with that in the non-demand-driven 
case in Table 2, the response times of both cases 
are almost the same. This result shows that the 
overhead due to the transfers of demands did not 
affect to the response time. In general, the time 
to transfer a demand between processors is much 
shorter than the time to transfer a page. 
Therefore, the transfers of demands do not cause 
heavy overhead. 1 

Another major interest is in comparing the
memory requirement in the demand-driven case with
that in the non-demand-driven case. Fig. 4 shows
the memory requirement for the inner-relation
pages of each join node in experiment-1, and
Fig. 5 shows that in experiment-2. In the
demand-driven case, a query can be executed with
fixed buffer sizes. On the other hand, in
the non-demand-driven case, the memory requirement
for the buffers increases as the join
selectivity factor increases. Though the response
times in both cases are approximately the same, the
memory requirement in the demand-driven case is
much smaller than that in the non-demand-driven
case. In particular, in the non-demand-driven case,
when the time to produce a page in the lower node is
much shorter than the time to consume the page in
the upper node, the input buffer of the upper node
must have enough capacity to store a large
number of intermediate pages. Since the
selection operation node (selection-2), which is
the lower node producing the inner-relation pages
of the join operation (join-2 node), produces pages
very fast, the input buffer (buffer-5) of the
join-2 node must store many inner-relation pages. In
the join-1 node, as the join-2 node produces the
intermediate pages of the inner-relation of the
join-1 node faster as the join selectivity factor
increases, the memory requirement becomes
larger. In those situations, the superiority of
the stream scheme is pronounced.

Fig. 4 Memory Requirement (Experiment-1)
(memory requirement in tuples versus join
selectivity factor, 0.0001 to 0.002, for the
demand-driven and non-demand-driven cases)

Fig. 5 Memory Requirement (Experiment-2)
(memory requirement in tuples versus join
selectivity factor, 0.0001 to 0.002, for the
demand-driven and non-demand-driven cases)
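The buffer behaviour described above can be illustrated with a toy simulation. This is a minimal sketch and not the authors' implementation. It assumes a single producer node and a single consumer node, fixed per-page production and consumption times measured in ticks, and a hypothetical function `peak_buffer`. Passing `capacity=None` models the non-demand-driven case, and an integer capacity models demand-driven control.

```python
def peak_buffer(n_pages, produce_time, consume_time, capacity=None):
    """Return the peak number of pages held in the consumer's input buffer.

    capacity=None : non-demand-driven, the producer never waits.
    capacity=k    : demand-driven, a page is produced only when the
                    buffer holds fewer than k pages.
    """
    produced = consumed = 0
    buffered = peak = 0
    next_ready = produce_time   # earliest tick at which the next page is ready
    busy_until = None           # tick at which the consumer finishes its page
    tick = 0
    while consumed < n_pages:
        tick += 1
        # The consumer finishes the page it was working on.
        if busy_until is not None and tick >= busy_until:
            consumed += 1
            busy_until = None
        # The producer delivers a page if one is ready and there is room.
        has_room = capacity is None or buffered < capacity
        if produced < n_pages and tick >= next_ready and has_room:
            buffered += 1
            produced += 1
            next_ready = tick + produce_time
        peak = max(peak, buffered)
        # An idle consumer takes the next page from the buffer.
        if busy_until is None and buffered > 0:
            buffered -= 1
            busy_until = tick + consume_time
    return peak


if __name__ == "__main__":
    print("non-demand-driven:", peak_buffer(100, 1, 5))
    print("demand-driven    :", peak_buffer(100, 1, 5, capacity=10))
```

With a fast producer and a slow consumer, the unbounded buffer grows with the number of pages, while the bounded buffer never exceeds its fixed capacity.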


4.2 Comparisons with Conventional Schemes 


In this evaluation, it is assumed that each
relational operation node is constructed with a
multiprocessor. It is also assumed that the source
relations, or intermediate relations whose sizes
were reduced by applying selection or projection
operations prior to the join operation, are
retained in the main memory. The performance of
the stream scheme in executing the join operation
is represented using the following parameters:
T1 and T2 are the numbers of tuples in the
outer-relation and the inner-relation of the join
node, respectively; Im is the number of tuples in
the outer-relation of the upper relational
operation node if the upper node is a binary
relational operation, and otherwise Im is set to
B; B is the number of tuples that can be stored in
a buffer (it is assumed that the size of each
buffer is the same); "jsf" is the join
selectivity factor of the join node being
evaluated; "jsf2" is the join selectivity factor
of the lower join node; P1 is the number of
processors allocated to the join node being
evaluated; P2 is the number of processors
allocated to the lower join node; P is P1+P2; Cm
is the time required to broadcast a tuple to the
processors; Ro is the comparison time between two
tuples; Sw is the swapping time of a tuple; and Wr
is the storing time of a resulting tuple.

The total number of times tuples are compared
is

T1*T2*(Im/B).

The memory requirement is

3*B.

The swapping time of tuples in operand
intermediate relations is

0.

The processing time of the join in the
multiprocessor node is

Cm*B + Ro*B/(jsf2*P2) + Wr*B/P2
+ (T1/B)*(T2/B)*max(Cm*B + Ro*(B*B/P1) + Wr*jsf*B*B/P1,
                    Cm*B + Ro*B/(jsf2*P2) + Wr*B/P2).


In most conventional schemes for relational 


operations, the granularity for the operand data 
of each relational operation is set to a relation. 
In this case, after the relational operation node 
completely produces a whole intermediate relation, 
then the subsequent relational operation is 
activated. Therefore, parallelism of pipeline 
processing between relational operations is not 
exploited, and furthermore memory swapping is 
necessary between main storage and secondary 
storage when memory overflow occurs due to 
intermediate relations. If the granularity of data 
in activating a relational operation is set to a 
page, the advantages of pipeline concurrency 
inherent in a query can be utilized[3],[12]. An 


execution scheme that implements this pipeline 
processing by using the data-driven control 
mechanism has been proposed[5]. However, pipeline 
processing using data-driven control may cause 
heavy overhead when executing a query. 

A binary relational operation node is 
required to compare each outer-relation page with 
all of the inner-relation pages. Therefore, even 
if only one inner-relation page and only one 
outer-relation page remain to be produced, all of 
the other produced outer-relation pages and all of 
other produced inner-relation pages must be kept 
in storage. As a result, every node of a query 
may be forced to keep large amounts of 
intermediate data in storage. 

We compare the stream scheme with a typical
conventional scheme which employs relation-
level granularity when activating a relational
operation. We assume this conventional scheme
implements a relational operation using the
parallel nested-loop algorithm[4]. This algorithm
is employed in many database machines. If it is
employed, each tuple of the outer-relation is
compared with every tuple of the inner-relation.
In the conventional scheme, the total number of
times tuples are compared is

T1*T2.

The memory requirement is

T1+T2+jsf*T1*T2.


The swapping time of tuples in intermediate 
relations is 


Sw*((T1-B)+(T1/B)*(T2-B) + (jsf*T1*T2-B)). 


The processing time of the join operation in a 
multiprocessor is 


Cm*T1 + Ro*T1*T2/P + Wr*jsf*T1*T2/P. 
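The two cost models can be evaluated side by side. The sketch below transcribes the expressions of this subsection into Python. The function names and the dictionary layout are assumptions made for illustration, and the parameter values in the example are arbitrary rather than measurements from the paper.

```python
def stream_scheme(T1, T2, Im, B, jsf, jsf2, P1, P2, Cm, Ro, Wr):
    """Cost expressions of the stream scheme, as given in the text."""
    lower = Cm * B + Ro * B / (jsf2 * P2) + Wr * B / P2
    upper = Cm * B + Ro * (B * B / P1) + Wr * jsf * B * B / P1
    return {
        "comparisons": T1 * T2 * (Im / B),
        "memory": 3 * B,
        "swapping": 0.0,
        "processing": lower + (T1 / B) * (T2 / B) * max(upper, lower),
    }


def conventional_scheme(T1, T2, B, jsf, P, Cm, Ro, Wr, Sw):
    """Cost expressions of the relation-granularity nested-loop scheme."""
    result = jsf * T1 * T2  # size of the join result in tuples
    return {
        "comparisons": T1 * T2,
        "memory": T1 + T2 + result,
        "swapping": Sw * ((T1 - B) + (T1 / B) * (T2 - B) + (result - B)),
        "processing": Cm * T1 + Ro * T1 * T2 / P + Wr * result / P,
    }


if __name__ == "__main__":
    # Arbitrary illustrative values, not taken from the paper.
    s = stream_scheme(T1=1000, T2=1000, Im=100, B=100, jsf=0.001,
                      jsf2=0.001, P1=4, P2=4, Cm=1.0, Ro=1.0, Wr=1.0)
    c = conventional_scheme(T1=1000, T2=1000, B=100, jsf=0.001, P=8,
                            Cm=1.0, Ro=1.0, Wr=1.0, Sw=10.0)
    print("stream      :", s)
    print("conventional:", c)
```

Varying `jsf` and `B` in such a script makes the memory and swapping trade-off discussed in the following paragraphs easy to inspect.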


The total number of times tuples are compared 
in the conventional scheme is O(T1*T2) and that in 
the stream scheme is O(T1*T2*Im/B). "Im/B" 
indicates the number of times the inner-relation 
is reproduced when the whole outer-relation is not 
retained in the input buffer. The stream scheme 
overcomes this overhead because of utilization of 
pipeline processing, and the prevention of memory 
overflow. 

As described in Section 2, memory overflow
due to intermediate relations does not occur in
the stream scheme. In the conventional scheme, on
the other hand, if the intermediate relation
becomes larger than the buffer size, memory
swapping must be done between secondary storage
and the buffer. O(T1*(T2-B)/B) tuple swaps are
required. The number of swaps is almost the same
as the number of times tuples are compared in the
nested-loop algorithm. In general, since the tuple
swapping time is much longer than the tuple
comparison time, the memory overflow causes heavy
overhead. That is, the superiority of the stream
scheme is pronounced when memory overflow occurs
in the conventional scheme.
The advantage of the stream scheme increases
as the join selectivity factor increases. This is
because the stream of tuples continuously flows
through the pipeline among the relational
operation nodes in the larger jsf case. On the
other hand, in the conventional scheme, the size
of the intermediate relation increases with the
join selectivity factor. As a result, the
subsequent relational operation node must
manipulate a larger intermediate relation, and the
processing time of the subsequent node becomes
longer. Therefore, when the join selectivity
factor is large, the stream scheme is more
effective than other schemes. Furthermore, even
when the join selectivity factor is small, the
processing time required in the stream scheme to
execute a query is almost the same as in the
conventional scheme. In such a case, the number of
reproduction times (Im/B) becomes smaller because
Im becomes smaller as the join selectivity factor
decreases. That is, when the join selectivity
factor is small, the stream scheme executes a
query in almost the same fashion as the
conventional scheme.


5. Conclusions 


We have presented a stream-oriented execution
scheme for relational database operations. This
scheme is based on the stream processing
principle. By combining stream processing with
demand-driven control, this scheme exploits
pipeline concurrency and makes it possible to
execute relational operations with limited
resources while avoiding the complexity of
resource management. In this paper, we have
presented a basic algorithm for the stream-oriented
parallel processing scheme and have also
discussed the features and the effectiveness of
the stream scheme.

As mentioned in Subsection 3.3, two
implementation approaches for the stream scheme
can be considered. A distributed query processing
system based on the stream scheme is currently
being developed in a local area network
environment[10].

We believe that concepts included in the 
proposed stream-oriented parallel processing 
scheme will contribute to the development of 
relational database systems. 
Abstract


A partition and replicate strategy for processing
distributed queries referencing no fragmented
relation is sketched. An optimal algorithm is given
to decide which relation is to be partitioned into
fragments, which copy of the relation is to be used,
how the relation is to be partitioned and where the
fragments are to be sent for processing. The time
complexity of the algorithm is
$O(N_q m \log m + r \log r)$, where $N_q$ is the
total number of copies of the relations referenced
by the query, $m$ is the number of sites and $r$ is
the number of relations. When all sites have the
same processing speed and the cost functions for
data transmission and local processing are linear,
the time complexity of the algorithm is reduced to
$O(r\,|AS(Q)| \log |AS(Q)| + r \log r)$, where
$|AS(Q)|$ is the number of sites having a copy of a
relation referenced by the query.


1. Introduction 

Distributed query processing is an important
factor for the performance of a system with the
databases distributed in a network. Many
distributed query processing algorithms [ApHY, BeCh,
BeGo, BlLu, CDFG, ChBH, Chan1, Chan2, ChHo, ChHu,
ChLi1, ChLi2, ChLi3, EpSW, GoSh, HeYa, KeYa, Luk2,
SDD1V1, SDD1V2, Woka, Wong, WOHL, Yaos, YCTB,
YLCC, YuOz] have been proposed. Most of these
algorithms assume that the data communication cost
is dominant and make use of semijoins to reduce
the amount of data transfer. While such an
assumption is reasonable for long-haul networks
where data communication costs are high, it may
not be valid for fast local networks. In contrast,
the "fragment and replicate" query processing
strategy (a fragmented relation remains fragmented,
while other relations are replicated at the sites
of the fragmented relation) was used in
distributed Ingres [EpSW]. Its main feature is to
allow parallel processing. However, for many
queries, substantial data transfer is required
before parallel processing can take place unless
many relations are duplicated in many sites.


We have developed two algorithms. The first
algorithm, the semijoin algorithm [YCTB, YCTBL],
is an extension to the SDD-1 algorithm [SDD1V2],
which assumes that the most important cost is the
number of bytes transmitted between network nodes.
The algorithm was extended to support fragmented
and replicated relations. The other algorithm,
the replicate algorithm [BrTY, YGCC], is derived
from distributed Ingres [EpSW]. It assumes that
local processing cost dominates network costs and
uses fragmented relations to maximize the amount
of parallelism in operations.


We have carried out a performance analysis of
these two algorithms [BrTY, TBCD]. We found that
the replicate algorithm outperforms the semijoin
algorithm in a fast local network environment.
Thus, in a fast local network, it is preferable to
use a fragment and replicate algorithm. However,
if no relation referenced by a query is fragmented,
it is necessary to decide the relation and its
copy to be partitioned into fragments, how the
relation is to be partitioned and where the
fragments are to be sent for processing. This paper
provides answers to the above questions and gives
an optimal algorithm in time
$O(N_q m \log m + r \log r)$, where $N_q$ is the
total number of copies of the relations referenced
by the query, $m$ is the number of sites and $r$ is
the number of relations. In the special case that
all sites have the same processing speed and the
cost functions for data transmission and local
processing are linear, the time complexity reduces
to $O(r\,|AS(Q)| \log |AS(Q)| + r \log r)$, where
$|AS(Q)|$ is the number of sites having a copy of a
relation referenced by the query.


2. Problem definition and notations 


A relation $R_i$ is partitioned into a number of
fragments $\{F_{ij}\}$ if $|R_i| = \sum_j |F_{ij}|$.


Suppose $\{R_1, R_2, \ldots, R_n\}$ are the
relations to be joined as specified in a query and
none of them is fragmented. One of them, say $R_f$,
will be chosen to be partitioned into $d$ fragments,
for some integer $d$. Then, the fragments will be
assigned to a set of $d$ sites. The other relations,
if not present at each of the $d$ sites, will be
replicated at those sites. The answer is the union
of the joins of the fragments with the other
relations at the $d$ sites. The problem is to decide
(i) the relation to be partitioned, i.e. $R_f$,
(ii) the number of fragments into which the relation
is to be partitioned, i.e. $d$, (iii) the size of
each of the $d$ fragments, (iv) the sites where the
fragments are to be assigned, and (v) the copy of
the partitioned relation to be used, if multiple
copies exist, such that the response time (in terms
of both local processing cost and communication
cost) is minimized.


The following notations are used with respect
to the given query Q.

AS(Q): Accessed Sites; the set of sites containing
at least a copy of a relation referenced by Q.

NAS(Q): Non-Accessed Sites; the set of sites
containing no relations referenced by Q.

$CS(R_f)$: The set of sites each containing a copy
of the relation $R_f$ to be partitioned.

As in [YGCC, YGZT], the cost of joining relations
$R_1, \ldots, R_n$ is given by
$h(R_1) + \cdots + h(R_n)$, where

$$h(X) = \begin{cases} f(X) & \text{if there exists a fast access path (e.g. index) for the joining operation involving } X \\ n(X) & \text{otherwise,} \end{cases}$$

with $f(X) \ll n(X)$. This is a rough approximation,
with the implicit assumption that the join is
restrictive.

$W(R_f,p)$: The weight of site p. This includes the
cost to transmit required data into the site from
other sites and the processing cost at the site,
but not the costs associated with relation $R_f$,
which is chosen to be partitioned. The weight

$$W(R_f,p) = \sum_{D_{ij} \text{ in } p,\ i \neq f} \frac{n(D_{ij})}{Sp(p)} + \sum_{D_{ij} \text{ not in } p,\ i \neq f} \left[ \frac{t(D_{ij})}{Tspeed} + \frac{n(D_{ij})}{Sp(p)} \right]$$

where $D_{ij}$ is a copy of relation i at site j,
Tspeed is the communication speed of the network,
Sp(p) is the processing speed of site p, and t() is
the cost function for data transfer. The first term
represents the local processing cost for those data
already at site p, taking into consideration the
processing speed of the site. The second term gives
the local processing cost and the data transmission
cost for the data to be transferred into site p.

d-partition: A relation is partitioned into d
fragments.

$par(R_f)$: The cost function to partition $R_f$.
This is assumed to depend only on the size of the
relation $R_f$.

$PS(R_f,d)$: The set of the processing sites each of
which is assigned one fragment of $R_f$ if the
partition is a d-partition, i.e. $|PS(R_f,d)| = d$.

$PT(PS(R_f,d),s)$: A d-partition of $R_f$ by using
the copy at site s (if multiple copies of the
relation exist) and the processing sites
$PS(R_f,d)$.

$F(R_f,d,p)$: The size of the fragment that will be
assigned to site p for a d-partition of $R_f$.

$Assign(R_f,d,p,s)$: The cost to assign and to
process a fragment of size $F(R_f,d,p)$ at site p,
$p \in PS(R_f,d)$, by using the copy of $R_f$ at
site s for partition.

$$Assign(R_f,d,p,s) = \begin{cases} t(F(R_f,d,p))/Tspeed + n(F(R_f,d,p))/Sp(p) & \text{if } p \neq s \\ n(F(R_f,d,p))/Sp(p) & \text{if } p = s. \end{cases}$$

It is assumed that after the partitioning process,
fast access paths, if they existed earlier, will be
lost. Thus, the cost function for not having a fast
access path, n(), is used.

$cost(R_f,d,p,s)$: The total (processing and
communication) cost at site p when the copy of
$R_f$ at site s is chosen to be partitioned into d
fragments, and the fragment of size $F(R_f,d,p)$ is
assigned to site p.

$$cost(R_f,d,p,s) = W(R_f,p) + Assign(R_f,d,p,s) + par(R_f)/Sp(s).$$

Note that since each processing site has to wait
for the completion of the partitioning of $R_f$,
$par(R_f)/Sp(s)$ is incurred at each processing
site.

All cost functions f(), t(), h(), and par()
are monotonically increasing functions, i.e.
if the relation/fragment is larger, a higher
cost will be incurred for local processing
and/or data transfer.

$Res(R_f,d,s)$: The response time obtained at a
d-partition when the copy of $R_f$ at site s is
chosen to be partitioned into d fragments and the
processing sites are $PS(R_f,d)$.

$$Res(R_f,d,s) = \max_{p \in PS(R_f,d)} \{ cost(R_f,d,p,s) \}$$

We are interested in minimizing $Res(R_f,d,s)$.
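The definitions above compose directly into a response-time calculation. The sketch below is an illustration rather than the authors' code. The function names, the dictionary-based site tables and the linear cost functions in the example are all assumptions. The site weights W(R_f,p) are taken as given inputs rather than computed from the relation copies.

```python
def assign_cost(frag, p, s, sp, tspeed, t, n):
    """Assign(R_f,d,p,s): cost to ship (if p != s) and process a fragment."""
    processing = n(frag) / sp[p]
    if p == s:
        return processing                  # the copy is local, no transfer
    return t(frag) / tspeed + processing


def response_time(weights, frags, s, sp, tspeed, par_cost, t, n):
    """Res(R_f,d,s): the largest total cost over the processing sites.

    weights : {site: W(R_f, site)}, assumed precomputed
    frags   : {site: F(R_f, d, site)} for the d processing sites
    par_cost: par(R_f), the cost of partitioning the chosen copy
    """
    wait = par_cost / sp[s]                # every site waits for partitioning
    return max(
        weights[p] + assign_cost(f, p, s, sp, tspeed, t, n) + wait
        for p, f in frags.items()
    )


if __name__ == "__main__":
    sp = {1: 1.0, 2: 2.0}                  # processing speeds (illustrative)
    weights = {1: 5.0, 2: 3.0}
    frags = {1: 40.0, 2: 60.0}
    linear = lambda x: x                   # assumed linear t() and n()
    print(response_time(weights, frags, s=1, sp=sp, tspeed=10.0,
                        par_cost=4.0, t=linear, n=linear))
```

In this example site 1 costs 5 + 40 + 4 = 49 and site 2 costs 3 + (6 + 30) + 4 = 43, so the response time is determined by site 1.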


3. Partition strategy 


Given a relation and its copy to be parti- 
tioned, we want to determine (1) the processing 
sites (and therefore the number of partitions) and 
(2) how the relation is to be partitioned and as- 
signed to the processing sites. 


Proposition 1 below shows that an optimal way
to assign fragments to processing sites is to give
each processing site the same total cost. Thus, it
remains to determine the processing sites. We will
show in Propositions 3 and 4 that (i) if a site is
not used as a processing site, then those sites
with weights higher than or equal to that of the
site should not be used as processing sites and
(ii) if a site is used as a processing site, then
those sites with weights smaller than or equal to
that of the site should be used as processing
sites. Based on results (i) and (ii), all sites are
arranged in ascending order of weight. Consider
site t. If site t is not a processing site, then
sites (t+1) to the last site need not be
considered; otherwise, site 1 to site t and
possibly the next few sites will be processing
sites.


Proposition 1: Let $Res(R_f,d,s)$ and
$Res'(R_f,d,s)$ be the response times obtained by
the d-partitions $PT(PS(R_f,d),s)$ and
$PT'(PS(R_f,d),s)$ respectively. If the d-partition
$PT(PS(R_f,d),s)$ makes
$cost(R_f,d,p,s) = cost(R_f,d,j,s)$ for every pair
$p, j \in PS(R_f,d)$, then
$Res(R_f,d,s) \le Res'(R_f,d,s)$.


Definition: S is an optimal set of processing
sites for the copy of relation $R_f$ at site t if
the copy is partitioned into fragments and assigned
for processing at each of the sites in S to yield
the optimal response time. Each site s in S is an
optimal processing site.


Lemma 2: Let $Res(R_f,d,s)$ be the response time
obtained by the d-partition $PT(PS(R_f,d),s)$.
If $W(R_f,j) + par(R_f)/Sp(s) < Res(R_f,d,s)$, with
j not in $PS(R_f,d)$, then there exists a
(d+1)-partition, $PT(PS(R_f,d+1),s)$, such that
$Res(R_f,d+1,s) < Res(R_f,d,s)$, where
$Res(R_f,d+1,s)$ is the response time obtained by
including site j as a processing site.


Proposition 3: If site p is an optimal processing
site and site j satisfies $W(R_f,j) \le W(R_f,p)$,
then site j is also an optimal processing site.

Proposition 4: If site p is not an optimal
processing site and site j satisfies
$W(R_f,j) \ge W(R_f,p)$, then site j is also not an
optimal processing site.

Suppose $R_f$ is the relation to be partitioned.
After the sites are arranged in ascending order of
weights, the best solution is obtained by assigning
the fragments of $R_f$ using Proposition 1 to the
first d sites, for some d. We now seek to determine
d.


One simple-minded way is to compute the
response time for a given d by assigning the
fragments of $R_f$ to the first d sites using
Proposition 1. Then d is increased by 1 and the
process is repeated until an increase in the number
of sites yields an increase in the response time or
the total number of sites, m, is reached. For a
given d, finding the response time for the first d
sites takes O(d) time. Since the process may be
repeated up to m times, the time complexity is
$O(m^2)$.


A more efficient process is as follows. We
first consider the middle site, the m/2-th site.
If this is not an optimal processing site
(determined by Proposition 5), then it is
sufficient to consider the first (m/2 - 1) sites.
We then repeat the process by checking the m/4-th
site. If the m/2-th site is an optimal processing
site, then the 3m/4-th site will be examined. In
other words, a binary search is performed on the
set of sites. The process of determining whether
the i-th site is an optimal processing site will be
shown to take at most O(i) time. Since the binary
search process is an O(log m) process, the time
complexity for determining the optimal number of
sites is at most O(m log m). It remains to
determine whether a given site is an optimal
processing site.


Suppose the given site is the i-th site.
Even if this given site is assigned a fragment of
size zero (i.e. the site is not a processing
site), the time incurred at this site is
$T = W(R_f,i) + par(R_f)/Sp(s)$. Suppose each of the
preceding (i-1) sites is assigned some part of
$R_f$ (the parts of $R_f$ assigned to the sites need
not be disjoint) such that the response time is T.
The next proposition states that site i is not an
optimal processing site if and only if the sum of
the sizes of the parts of $R_f$ assigned to the
first (i-1) sites is greater than or equal to the
size of $R_f$.

Proposition 5: Suppose size(j) is the size of the
part of $R_f$ assigned to site j such that the total
cost of site j,
$cost(R_f,i-1,j,s) = W(R_f,i) + par(R_f)/Sp(s)$, for
each site j, $1 \le j \le i-1$. Then site i is not
an optimal processing site iff

$$size = \sum_{j=1}^{i-1} size(j) \ge size(R_f) \qquad (1)$$

where $size(R_f)$ is the size of relation $R_f$.


The process of determining the last processing
site is given as follows.

BINSEARCH(low,high,result)
/* The relation to be partitioned is R_f; the copy
   of R_f to be partitioned is at site s. The optimal
   set of processing sites is 1 to result. */
If ( low > high ) then result = high
else {
   MID = ( low + high )/2 ; MID1 = MID - 1
   For i := 1 to MID1
      { Assign part of R_f to site i such that
        cost(R_f,MID1,i,s)
           = W(R_f,MID) + par(R_f)/Sp(s) }
   If SUM(i = 1 to MID1) size(i) < size(R_f)
      then BINSEARCH(MID+1,high,result)
           /* site MID is needed: search right half */
      else BINSEARCH(low,MID-1,result)
           /* site MID is not needed: search left half */
}
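The binary search can be sketched in runnable form under a simplifying assumption. The version below is iterative rather than recursive, and it assumes linear assignment costs, so that assigning a part of size x to site i costs `b[i] * x` (the per-tuple coefficient `b` is a hypothetical input, in the spirit of the linear case treated in Section 5). Sites are assumed to be already sorted in ascending order of weight. The test for each candidate site follows Proposition 5.

```python
def last_processing_site(weights, b, par_wait, size_rf):
    """Return d such that sites 1..d (1-based, by ascending weight) are used.

    weights : W(R_f, i) for each site, sorted in ascending order
    b       : assumed linear cost per tuple of assigning to site i
    par_wait: par(R_f)/Sp(s), the partitioning delay seen by every site
    size_rf : size of the relation to be partitioned
    """
    low, high = 1, len(weights)
    while low <= high:
        mid = (low + high) // 2
        # Cost at site `mid` when it receives an empty fragment.
        T = weights[mid - 1] + par_wait
        # Largest parts that sites 1..mid-1 can absorb at total cost T.
        absorbed = sum(
            (T - (weights[i] + par_wait)) / b[i] for i in range(mid - 1)
        )
        if absorbed < size_rf:
            low = mid + 1    # earlier sites cannot finish by T: mid is needed
        else:
            high = mid - 1   # earlier sites suffice: mid is not needed
    return high


if __name__ == "__main__":
    weights = [0.0, 10.0, 20.0, 100.0]
    b = [1.0, 1.0, 1.0, 1.0]
    print(last_processing_site(weights, b, par_wait=0.0, size_rf=25.0))
```

For these weights and a relation of 25 tuples, two sites can finish at cost 17.5, which is below the weight of the third site, so only the first two sites are used.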


When there is more than one copy of the
relation to be partitioned, we need to determine
which copy should be used to obtain the best
response time. Suppose we use the copy $CO_u$ at the
site u that has the fastest processing speed among
the sites that have a copy of the relation. Let
the set of the optimal processing sites and the
optimal response time for this copy be $PS_u$ and
$OP_u$ respectively. Proposition 7 below states that
any other copy which is not in one of the optimal
processing sites, $PS_u$, can not yield a better
response time than $OP_u$ and should not be chosen
for partitioning.

Suppose there is a copy $CO_v$ at site v which
is within $PS_u$. If the set of the optimal
processing sites obtained for $CO_v$ is not a subset
of $PS_u$, then the response time obtained by using
$CO_v$ can not be less than $OP_u$; this is shown in
Lemma 6. Thus, we do not need to consider any site
outside $PS_u$ as a processing site for copy $CO_v$.


Lemma 6: Let $Res(R_f,d,u)$ and $Res(R_f,d',v)$ be
the optimal response times obtained by the optimal
partitions $PT(PS(R_f,d),u)$ and $PT(PS(R_f,d'),v)$,
respectively, by using the copies $CO_u$ and $CO_v$
of $R_f$ at sites u and v, respectively.
If $Sp(u) \ge Sp(v)$ and
$PS(R_f,d') \not\subset PS(R_f,d)$, then
$Res(R_f,d',v) \ge Res(R_f,d,u)$.


Proposition 7: Let $Res(R_f,d,u)$ and
$Res(R_f,d',v)$ be the optimal response times
obtained by the optimal partitions $PT(PS(R_f,d),u)$
and $PT(PS(R_f,d'),v)$, respectively, by using the
copies $CO_u$ and $CO_v$ of $R_f$ at sites u and v,
respectively.
If $Sp(u) \ge Sp(v)$ and $v \notin PS(R_f,d)$,
then $Res(R_f,d,u) \le Res(R_f,d',v)$.


When there is more than one copy of the
relation to be partitioned, one method is to first
use the copy at the site that has the fastest
processing speed among the sites that have a copy of
the relation, then discard some unnecessary copies
by applying Proposition 7. The remaining copies
are processed in descending order of processing
speed of their residing sites. For each such
copy, the processing sites will be restricted to
be the intersection of the processing sites of the
previous copies since, by Lemma 6, no copy can
yield a better response time by using any site
outside the processing sites of a previous copy.
We then compare all such cases and the copy
yielding the least response time is chosen. Suppose
instead we choose the copy $CO_u$ at the site u that
has the fastest processing speed among the sites
that contain a copy of the relation to be
partitioned, and obtain the response time for this
copy. Let the size of the fragment assigned to
site j, where a copy $CO_j$ of the relation exists,
be $F_j$. (If no copy of the relation exists in a
processing site of $CO_u$, then, by Proposition 7,
the optimal solution for the relation is obtained.)
If the copy $CO_j$ were partitioned instead, we
would incur $par(R_f)/Sp(j) - par(R_f)/Sp(u)$
additional time at each processing site, but we
would save the transmission cost of $t(F_j)/Tspeed$
for site j. If the copy $CO_j$ is chosen to be
partitioned, we can reduce the response time by at
most
$t(F_j)/Tspeed - ( par(R_f)/Sp(j) - par(R_f)/Sp(u) )$.
In fast local area networks, the transmission cost
is supposed to be small in comparison with
processing cost. Thus, the copy at the site with
the fastest processing speed should be a good
choice.


3.3 Discard of certain non-accessed sites

The following result (Proposition 9) says
that if a site j is a non-accessed site and the
weight of this site for partitioning relation $R_f$
is greater than or equal to the optimal response
time obtained by partitioning $R_f$, then this site
should not be considered as one of the processing
sites for any partitioned relation $R_r$, where
$|R_f| \ge |R_r|$, since it will result in a higher
response time. Thus, if we order the relations in
descending order of size, a non-accessed site
discarded in the processing of a relation is
automatically discarded when a later relation is
processed. Suppose site j is discarded. Any
non-accessed site whose weight is greater than or
equal to that of site j with respect to relation
$R_f$ satisfies the same condition as site j. Thus,
it can also be discarded.


Lemma 8: If $j \in NAS(Q)$ and $|R_f| \ge |R_r|$,
then $W(R_f,j) \le W(R_r,j)$.


Proposition 9: Let $Res(R_f,d,u)$ and
$Res(R_r,d',v)$ be the optimal response times
obtained by partitioning relations $R_f$ and $R_r$,
respectively. Let $PS(R_r,d')$ be the set of the
optimal processing sites for partitioning relation
$R_r$.
If $j \in NAS(Q)$, $j \in PS(R_r,d')$,
$|R_f| \ge |R_r|$ and $W(R_f,j) \ge Res(R_f,d,u)$,
then $Res(R_r,d',v) \ge Res(R_f,d,u)$.


We now give an algorithm to find an optimal
partition strategy.

Let R(Q) be the set of relations referenced by
query Q.

The set of sites $S = AS(Q) \cup NAS(Q)$, i.e. the
union of the accessed and the non-accessed sites of
Q.

Among the non-accessed sites in NAS(Q), let s be
the site having the fastest processing speed.
ALGORITHM PARTITION

(1) Estimate the total time if the query is
    processed in a single site t,
    $t \in AS(Q) \cup \{s\}$.
    Set Bound = the best estimated response time
    obtained.

(2) Arrange the relations in R(Q) in descending
    order of size. Let the arranged relations be
    $R_i$, $1 \le i \le n$.

(3) For i = 1 to n /* from the largest relation to
                      the smallest */

3.1 From the remaining non-accessed sites, discard
    those sites p with $W(R_i,p) \ge$ Bound.
    /* Once a non-accessed site is discarded for a
       relation, it is discarded for all smaller
       relations by Proposition 9 */

3.2 Arrange all remaining sites p in ascending
    order of weight $W(R_i,p)$.

3.3 Discard those sites p with weight
    $W(R_i,p) + par(R_i)/Sp(u) \ge$ Bound,
    where u is the site having the fastest
    processing speed among the sites that contain a
    copy of $R_i$.
    /* Unlike 3.1 this discard is for the relation
       $R_i$ only */

3.4 Use binary search to determine the optimal
    processing sites for the copy of $R_i$ at site
    u.
    low = 1; high = the number of sites remaining
    after step 3.3.
    BINSEARCH(low,high,result)

3.5 Compute response time Res using the processing
    sites 1 to result.

3.6 Order the remaining copies of $R_i$ that reside
    in a site j between 1 and result in descending
    order of processing speed of their residing
    sites. (Let $k_i'$ be the number of these
    remaining copies.) Let the arranged copies be
    $CO_t$, $1 \le t \le k_i'$.
    $result_0$ = result.
    For t = 1 to $k_i'$
    { low = 1; high = $result_{t-1}$
      If ( $1 \le site(CO_t) \le$ high )
      /* By Proposition 7, if $site(CO_t)$, the site
         where $CO_t$ resides, is beyond high, it can
         not yield a response time better than
         $res_{t-1}$ */
      { BINSEARCH(low, high, $result_t$)
        Compute the response time $res_t$ for the
        processing sites 1 to $result_t$.
        Res = MIN { Res, $res_t$ }
      }
    }

3.7 Bound = MIN { Bound, Res }.


By Proposition 1, the optimal assignment is
to make each processing site have the same total
cost. Thus, for each optimal processing site i,

$$W(R_f,i) + par(R_f)/Sp(s) + assign(R_f,d,i,s) = c$$

where c is the total cost (and hence the response
time) to be determined and $F(R_f,d,i)$ is the
fragment of $R_f$ to be assigned to site i.

$$F(R_f,d,i) = assign^{-1}\big( c - W(R_f,i) - par(R_f)/Sp(s) \big) \qquad (2)$$

where the assign function is in terms of the
transmission cost function t() and the join
processing function n().
Since $\sum_i F(R_f,d,i) = |R_f|$, we can solve for
c by substituting (2) into it and therefore
$F(R_f,d,i)$ can be solved. We can solve
Equation (2) for $F(R_f,d,i)$ as a function of c in
constant time. We need to repeat this for d
fragments. Thus, c can be solved in O(d) if the
relation is partitioned into d fragments and the
sizes of fragments can also be computed in O(d).
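For a concrete instance of Equation (2), assume the assignment cost is linear, so that assigning a fragment of size x to site i costs `b[i] * x`. This linear form is an assumption made here for illustration, and the names below are hypothetical. Then (2) becomes F_i = (c - W_i - par/Sp(s)) / b_i, and summing over the d sites gives a closed form for c. The sketch assumes the d sites passed in are optimal processing sites, so that every fragment size comes out non-negative.

```python
def equal_cost_fragments(weights, b, par_wait, size_rf):
    """Solve for the common cost c and the fragment sizes in O(d).

    weights : W(R_f, i) for the d processing sites
    b       : assumed linear assignment cost per tuple at each site
    par_wait: par(R_f)/Sp(s)
    size_rf : |R_f|
    """
    base = [w + par_wait for w in weights]    # cost with an empty fragment
    inv = [1.0 / bi for bi in b]
    # Sum_i (c - base_i) / b_i = |R_f|, solved for c.
    c = (size_rf + sum(k * v for k, v in zip(base, inv))) / sum(inv)
    frags = [(c - k) * v for k, v in zip(base, inv)]
    return c, frags


if __name__ == "__main__":
    c, frags = equal_cost_fragments([0.0, 10.0], [1.0, 1.0], 0.0, 25.0)
    print(c, frags)   # both sites finish at the same cost c
```

Each site's total cost `base[i] + b[i] * frags[i]` equals c by construction, which is the equal-cost condition of Proposition 1.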


4. Time complexity 


In the algorithm PARTITION, it takes $O(N_q)$ to
obtain the best response time for Step (1),
where $N_q$ is the total number of the copies of the
relations. At Step (2), it takes O(r log r) to
arrange the r relations. We know that the higher
the processing speed of a non-accessed site, the
smaller the weight of the site. Thus, we can
pre-arrange all sites in ascending order of weight
(or descending order of processing speed) as if
they were non-accessed sites. For a given site, we
check if the site is non-accessed or not and, if it
is, determine whether its weight is greater than
or equal to the Bound. A binary search of sites
can be employed here, since the sites are in
ascending order of weight. Thus, the non-accessed
sites with weight $\ge$ Bound can be determined in
O(log m) in Step 3.1, where m is the number of
sites. For Step 3.2, it takes
$O(|AS(Q)| \log |AS(Q)|)$ to order the accessed
sites and then takes $O(|AS(Q)| + |NAS(Q)|)$ to
merge the non-accessed and the accessed sites, since
the non-accessed sites are arranged in ascending
order of weight. At Step 3.3, binary search can be
used to discard sites; it takes O(log m'), where m'
is the total number of the remaining non-accessed
and accessed sites. It takes at most O(m' log m')
for the BINSEARCH algorithm for Step 3.4. At Step
3.5, it takes at most O(m'). Thus, the time
complexity for Steps 3.1-3.5 is O(m log m). Suppose
there are $k_i$ copies of relation $R_i$. At Step
3.6, Steps 3.1 to 3.5 are repeated at most $k_i$
times with time $O(k_i m \log m)$. We will repeat
this process for every relation. Thus, it needs
$O(N_q m \log m + r \log r)$ for the algorithm
PARTITION.


5. Special situation 


Suppose both the transmission cost function and the join cost function are linear, i.e. $n(X) = a_1 X$ and $t(X) = a_2 X$ for some constants $a_1$ and $a_2$. Then

$assign(R_f,d,j,s) = a_1 F(R_f,d,j)/Sp(j) + a_2 F(R_f,d,j)/Tspeed$  if $j \ne s$,
$assign(R_f,d,j,s) = a_1 F(R_f,d,j)/Sp(j)$  if $j = s$,

where s is the site where the copy of $R_f$ is to be partitioned. We can rewrite it as

$assign(R_f,d,j,s) = b_{js}\, F(R_f,d,j)$, where

$b_{js} = a_1/Sp(j) + a_2/Tspeed$  if $j \ne s$,
$b_{js} = a_1/Sp(j)$  if $j = s$.


If we make use of this special property for 
the algorithm BINSEARCH, the complexity of this 
algorithm can be reduced to O(m) instead of O(m 
log m). 
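To make the linear case concrete, the equal-cost condition of Proposition 1 can be solved in closed form once assign is linear. The following is only a minimal sketch under stated assumptions: the function name `solve_equal_cost` and its argument layout are hypothetical, and it assumes all d sites passed in really are processing sites (so every fragment comes out non-negative).

```python
def solve_equal_cost(weights, b, par_cost, relation_size):
    """Solve for the common cost c and the fragment sizes in the linear case.

    weights[j]     : W(R_f, j), the weight of processing site j
    b[j]           : b_js, the linear coefficient of the assign function
    par_cost       : par(R_f) / Sp(s), the partitioning cost at site s
    relation_size  : |R_f|
    """
    inv_b = [1.0 / bj for bj in b]
    # Equation (2) in the linear case: F_j = (c - W_j - par_cost) / b_j.
    # Summing F_j over all sites and setting the sum to |R_f| gives c.
    weighted = sum((w + par_cost) * ib for w, ib in zip(weights, inv_b))
    c = (relation_size + weighted) / sum(inv_b)
    fragments = [(c - w - par_cost) * ib for w, ib in zip(weights, inv_b)]
    return c, fragments


c, fragments = solve_equal_cost([0.0, 2.0], [1.0, 1.0], 0.0, 10.0)
```

Both c and the fragment sizes are obtained with a constant number of passes over the d sites, which matches the O(d) bound stated in the text.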


In algorithm BINSEARCH, in order to determine whether site (d+1) is a processing site using Proposition 5, we will set $cost_1(R_f,d,j,s) = W(R_f,d+1) + par(R_f)/Sp(s)$ and compute $F_1(R_f,d,j)$ for every site j, $1 \le j \le d$. If we substitute the assign function into $cost_1(R_f,d,j,s)$, we have

$cost_1(R_f,d,j,s) = W(R_f,j) + par(R_f)/Sp(s) + b_{js}\, F_1(R_f,d,j) = W(R_f,d+1) + par(R_f)/Sp(s)$

If we solve for the size of the fragment, we have

$F_1(R_f,d,j) = (W(R_f,d+1) - W(R_f,j))/b_{js}$  for $1 \le j \le d$.

The summation of the sizes of the fragments is

$R(d) = \sum_{j=1}^{d} F_1(R_f,d,j) = \sum_{j=1}^{d} (W(R_f,d+1) - W(R_f,j))/b_{js}$.

After obtaining R(d), we will continue to search the right half or the left half.


Case 1: Search the right half
In this case, we will determine whether the (d'+1)-th site should be an optimal processing site, where d' > d. We need to compute

$R(d') = \sum_{j=1}^{d'} F_1(R_f,d',j)$

$= \sum_{j=1}^{d'} (W(R_f,d'+1) - W(R_f,j))/b_{js}$

$= \sum_{j=1}^{d} (W(R_f,d'+1) - W(R_f,j))/b_{js} + \sum_{j=d+1}^{d'} (W(R_f,d'+1) - W(R_f,j))/b_{js}$

$= \sum_{j=1}^{d} (W(R_f,d'+1) - W(R_f,d+1))/b_{js} + \sum_{j=1}^{d} (W(R_f,d+1) - W(R_f,j))/b_{js} + \sum_{j=d+1}^{d'} (W(R_f,d'+1) - W(R_f,j))/b_{js}$

$= (W(R_f,d'+1) - W(R_f,d+1)) \sum_{j=1}^{d} 1/b_{js} + R(d) + \sum_{j=d+1}^{d'} (W(R_f,d'+1) - W(R_f,j))/b_{js}$

If $\sum 1/b_{js}$, $1 \le j \le d$, has been accumulated, the first term can be computed in constant time. What remains to be computed is the size of the fragment that will be assigned to site j, $d+1 \le j \le d'$, and their summation. (Note that $F_1(R_f,d',j) = (W(R_f,d'+1) - W(R_f,j))/b_{js}$.) It needs only (d'-d) steps instead of d' steps.


Case 2: Search the left half
In this case, we will determine whether the (d"+1)-th site should be an optimal processing site, where d" < d. We need to compute

$R(d'') = \sum_{j=1}^{d''} F_1(R_f,d'',j)$

$= \sum_{j=1}^{d''} (W(R_f,d''+1) - W(R_f,j))/b_{js}$

$= \sum_{j=1}^{d} (W(R_f,d''+1) - W(R_f,j) + W(R_f,d+1) - W(R_f,d+1))/b_{js} - \sum_{j=d''+1}^{d} (W(R_f,d''+1) - W(R_f,j))/b_{js}$

$= \sum_{j=1}^{d} (W(R_f,d+1) - W(R_f,j))/b_{js} - (W(R_f,d+1) - W(R_f,d''+1)) \sum_{j=1}^{d} 1/b_{js} + \sum_{j=d''+1}^{d} (W(R_f,j) - W(R_f,d''+1))/b_{js}$

$= R(d) - (W(R_f,d+1) - W(R_f,d''+1)) \sum_{j=1}^{d} 1/b_{js} + \sum_{j=d''+1}^{d} (W(R_f,j) - W(R_f,d''+1))/b_{js}$

Again, if R(d) is obtained and $\sum 1/b_{js}$, $1 \le j \le d$, is accumulated, then R(d") can be computed in (d-d") steps.


Suppose there are m sites. In binary search, we start at the m/2-th site; this takes m/2 steps to compute R(m/2); then the 3m/4-th site is checked if the right half is searched, or the m/4-th site if the left half is searched. But no matter which half is searched, according to the above argument, it takes only m/4 steps to compute R(3m/4) or R(m/4). Thus, it will take m/2, then m/4, m/8 steps and so forth for the sequence of searches. Since the binary search process terminates in (log m) steps, and

$\sum_{i=1}^{\log m} m/2^i = m - 1$,

the algorithm BINSEARCH takes O(m) instead of O(m log m) time.
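The incremental evaluation of R(d) used in Cases 1 and 2 can be sketched as follows. This is an illustrative sketch, not the paper's own code: the class name `IncrementalR` and the 0-based list layout (`W[d]` holds the weight of the (d+1)-th site, `b[j]` holds $b_{js}$ for a fixed s) are assumptions made for the example.

```python
def r_direct(W, b, d):
    """R(d) computed from scratch in d steps (sites are 0-indexed)."""
    return sum((W[d] - W[j]) / b[j] for j in range(d))


class IncrementalR:
    """Keeps R(d) and the accumulated sum of 1/b_js so a move costs |d' - d| steps."""

    def __init__(self, W, b, d):
        self.W, self.b, self.d = W, b, d
        self.inv_sum = sum(1.0 / b[j] for j in range(d))
        self.R = r_direct(W, b, d)

    def move(self, new_d):
        W, b, d = self.W, self.b, self.d
        if new_d > d:
            # Case 1 (right half): shift the old terms, then add the new sites.
            tail = range(d, new_d)
            self.R += (W[new_d] - W[d]) * self.inv_sum
            self.R += sum((W[new_d] - W[j]) / b[j] for j in tail)
            self.inv_sum += sum(1.0 / b[j] for j in tail)
        elif new_d < d:
            # Case 2 (left half): shift all old terms, then cancel the dropped sites.
            tail = range(new_d, d)
            self.R -= (W[d] - W[new_d]) * self.inv_sum
            self.R += sum((W[j] - W[new_d]) / b[j] for j in tail)
            self.inv_sum -= sum(1.0 / b[j] for j in tail)
        self.d = new_d
        return self.R


W = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]   # site weights in ascending order
b = [1.0, 2.0, 0.5, 4.0, 1.0, 3.0]     # linear cost coefficients
inc = IncrementalR(W, b, 3)
right = inc.move(5)   # touches only sites 4 and 5
left = inc.move(2)    # touches only sites 3, 4 and 5
```

Because each move touches only the sites between the old and the new probe position, the total work over a binary search follows the m/2 + m/4 + ... pattern described above.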


Suppose not only the cost functions n() and 
t() are linear, but all sites have the same pro- 
cessing speed. Then, the time complexity is 
further reduced. 


When all sites have the same processing speed, those non-accessed sites will have the same weight associated with the relation to be partitioned. Therefore, by Propositions 3 and 4, if a non-accessed site should (not) be an optimal processing site, then all non-accessed sites should (not) be optimal processing sites. Thus, we only need to use those accessed sites and one non-accessed site to determine the optimal processing sites in algorithm BINSEARCH. The complexity of the algorithm BINSEARCH becomes O(|AS(Q)|+1). Furthermore, we do not need to repeat the search for the optimal processing sites for every copy of the relation to be partitioned. The copy at the site which has the smallest weight among those sites having a copy of the relation to be partitioned will yield the best response time. This will be shown in Proposition 12.


Lemma 10: If $a_1 > a_2 > 0$ and $x_1 \ge x_2 > 0$ then $a_2 x_1 + a_1 x_2 \le a_1 x_1 + a_2 x_2$.

Proof: $a_2 x_1 + a_1 x_2 - a_1 x_1 - a_2 x_2 = (a_2 - a_1)(x_1 - x_2) \le 0$, since $a_2 < a_1$ and $x_1 \ge x_2$. □

Lemma 11: If $a_1 > a_2 > 0$ and $x_1 \ge x_2 > 0$ then $x_1/a_1 + x_2/a_2 \le x_1/a_2 + x_2/a_1$.

Proof: $x_1/a_1 + x_2/a_2$
$= (a_2 x_1 + a_1 x_2)/(a_1 a_2)$
$\le (a_1 x_1 + a_2 x_2)/(a_1 a_2)$    (Lemma 10)
$= x_1/a_2 + x_2/a_1$ □


Proposition 12: Let t be the site having the smallest weight among all sites containing copies of the relation, $R_f$, to be partitioned. Let $Res(R_f,p_t,t)$ and $Res(R_f,p_v,v)$ be the response times for using the copy at site t and for using a different copy at site v, respectively. If all sites have the same processing speed, then

$Res(R_f,p_t,t) \le Res(R_f,p_v,v)$.

Proof:
Case 1: $p_t > p_v$.
By Lemma 6, $Res(R_f,p_v,v) \ge Res(R_f,p_t,t)$.

Case 2: $p_t \le p_v$.
Suppose $Res(R_f,p_t,t) > Res(R_f,p_v,v)$. Let $F_{jt}$ be the size of the fragment assigned to a processing site j, $1 \le j \le p_t$, by partitioning the copy at site t. We know that $F_{jt} = (Res(R_f,p_t,t) - W(R_f,j) - par(R_f)/Sp(t))/b_{jt}$. Since all sites have the same processing speed, $b_{jt} = b_{jv}$ for $j \ne t$ and $j \ne v$. Since $b_{jt} = b_{jv}$ for $j \ne t$ and $j \ne v$, we have that $F_{jv} < F_{jt}$ for $j \ne t$ and $j \ne v$, because $Res(R_f,p_v,v) < Res(R_f,p_t,t)$ and Sp(t) = Sp(v). Since all sites have the same processing speed, $b_{tv} = b_{vt}$, $b_{vv} = b_{tt}$ and $b_{tt} < b_{vt}$. We can have

$F_{tv} + F_{vv} \le F_{tt} + F_{vt}$.

Thus,

$\sum_{j=1}^{p_v} F_{jv} < \sum_{j=1}^{p_v} F_{jt} \le \sum_{j=1}^{p_t} F_{jt} = |R_f|$,

a contradiction that the copy of $R_f$ at site v is partitioned. Thus, $Res(R_f,p_v,v)$ can not be less than $Res(R_f,p_t,t)$. □


By Proposition 12, if we use the copy of the partitioned relation at the site with the smallest weight to determine the optimal processing sites, the optimal response time of partitioning the relation is obtained. This takes only O(|AS(Q)|). It takes O(|AS(Q)| log |AS(Q)|) to arrange the accessed sites in ascending order of weight. Thus, it takes O(r |AS(Q)| log |AS(Q)|) at Step 3 of the algorithm PARTITION if there are r relations. Therefore, the complexity of the algorithm is reduced to O(r |AS(Q)| log |AS(Q)| + r log r).


Since the number of relations referenced by a query and the number of accessed sites are usually very small, the algorithm will run very fast. In situations where the join cost function is difficult to estimate accurately, a linear function may serve as a first order approximation. The processing speed of a site may depend on many factors, such as the number of users at a given time, the query type, the job mix etc. In the absence of accurate strategies, the assumption that all sites have the same processing speed is applicable.


6. Conclusion 


In a fast local network, it is preferable to 
use a fragment and replicate strategy to process 
queries. However, if no relation referenced by a 
query is fragmented, it is necessary to decide 
which relation is to be partitioned into frag- 
ments, which copy of the relation should be used, 
how the relation is to be partitioned and where 
the fragments are to be sent for processing. An optimal algorithm in time $O(N_c\, m \log m + r \log r)$ has been given to provide answers to the above questions. Under the special situation that all sites have the same processing speed and the cost functions for data transmission and local processing are linear, the time complexity reduces to O(r |AS(Q)| log |AS(Q)| + r log r).


A COMPILER-ASSISTED CACHE COHERENCE SOLUTION
FOR MULTIPROCESSORS


Alexander V. Veidenbaum 


Center for Supercomputing Research and Development 
University of Illinois at Urbana-Champaign 
Urbana, Illinois, 61801 


Abstract -- The existing solutions to the multiprocessor
cache coherence problem are not suitable, in our
opinion, for systems with a large number of processors.
We propose a new solution in which a compiler gen- 
erates cache management instructions. Conditions 
necessary for cache coherence violation are defined. 
The structure of a program and its dependence graph 
are used to detect when these conditions become true, 
and the instructions to enforce coherence are generated. 
No communication between processors is required at 
run-time to enforce coherence. The correctness of the 
solution is proved. 


Introduction 


Several multiprocessor architectures have been 
built or are being built that can truly be called large- 
scale ([4], [5], [14], [2]). They can have several hundred 
processors sharing memory and all working on solving a 
single problem. Such multiprocessors are characterized 
by a long memory access time making the use of cache 
memories very important. However, a cache coherence 
problem makes the use of caches difficult. In this paper 
we discuss the proposed solutions to the cache coher- 
ence problem and why they may not be suitable for a 
large-scale multiprocessor. We propose a different solu- 
tion that relies on a compiler to manage the caches. We 
define conditions necessary for a data incoherence to 
occur. A restructuring compiler, such as Parafrase [9], 
can be used to detect when such conditions arise and to 
generate cache management instructions to enforce 
coherence. 


Existing solutions to the cache coherence problem 


Let us examine the proposed solutions to the cache 
coherence problem (see [18]) for use on large-scale mul- 
tiprocessors. The solutions can be divided into the fol- 
lowing groups: 


This work was supported in part by the National Science Foundation under Grants No. US NSF DCR84-10110 and DCR8-I-06916, and the US Department of Energy under Grant No. US DOE DE-FG02-85ER25001.


0190-3918/86/0000/1029 $01.00 © 1986 IEEE 


(1) Solutions based on a single shared resource. 
These include a central directory scheme of [15] 
and a shared cache scheme of [17]. 
A large number of processors can not access the 
shared resource without severe performance degra- 
dation. That’s why these solutions are not extend- 
able to large-scale multiprocessors. 


(2) Bus-based solutions. 

An example of such a solution is that of Goodman 
[6]. 

These solutions also use a shared resource,
the bus, and could have been considered in the first
group. We put them in a separate group because
they seem to us capable of supporting a larger
number of processors than the solutions in the first
group, but not hundreds of processors on a single
bus.


(3) Cacheability attribute processed during virtual 
address translation. 
This scheme was used in C.mmp [3]. Again a central
resource, shared page tables, can be identified
in this solution. The overhead of changing the
cacheability attribute seems large to us for a truly
extendable solution.

In addition, most modern processors contain 
a translation buffer (TLB) acting as a cache for 
page table entries. When a cache attribute needs 
to be changed all of the TLBs need to be updated. 
Therefore, we may have solved the data cache 
coherence problem, but we have created a TLB 
coherence problem. 

A variant of this solution has been proposed 
which allows the cacheability of read-only data. In 
this case the attribute is never changed during the 
execution of a program. This solution is restrictive 
and may not allow a large number of references to 
be cached. 


Multiprocessor architecture 


The architecture we are interested in is a shared- 
memory multiprocessor. It consists of a global shared 
memory, a global interconnection network, and proces- 


sors with private caches. The only way of exchanging 
data between processors is through the shared memory. 
The block diagram of this architecture is given in Figure 1.


Figure 1. Multiprocessor architecture. 


Interconnection network 


We assume the interconnection network to be of 
the shuffle-exchange type. The network has a unique 
path from any input to any output port. The order of 
accesses by a given processor to a given memory loca- 
tion is preserved by the network. 


Cache organization 


We assume that a cache is a physical address 
cache, using the "Write-through" policy, and having a 
block size of one. We also assume that on context 
switch the system ensures the correctness of cache con- 
tents. Other details of cache organization will be 
presented in the following sections. 


Definitions 


The following notation is used in this paper: 


(1) $S_i$ - statement i of a program,
(2) $P_j$ - processor j of a multiprocessor,
(3) $C_j$ - private cache of processor $P_j$,
(4) M[X] - the contents of a memory location of a variable X,
(5) $C_j[X]$ - the value of X contained in the cache of processor $P_j$,
(6) $X^s$ - a store of X,
(7) $X^f$ - a fetch of X,
(8) $X^a$ - an access to X, i.e. either a fetch or a store. In some cases we may qualify the access type by a processor that performed the access as follows: $X^{a(P_j)}$,
(9) $(Y)^*$ - zero or more repetitions of Y.


Data dependencies 


The definitions of this section follow those of [8] 
and [16] which should be consulted for any additional 
information on dependence analysis. 

For each statement $S_i$ we define two sets, IN and OUT.

-- $IN(S_i)$ is a set of variables the statement uses.

-- $OUT(S_i)$ is a set of variables the statement generates.

The following types of dependencies are defined between two statements $S_i$ and $S_j$ if there is a flow of control path from $S_i$ to $S_j$ and $S_i$ precedes $S_j$ on that path.

-- $S_j$ is data flow dependent on $S_i$ if the set $IN(S_j) \cap OUT(S_i)$ is not empty.

-- $S_j$ is data antidependent on $S_i$ if the set $IN(S_i) \cap OUT(S_j)$ is not empty.

-- $S_j$ is output dependent on $S_i$ if the set $OUT(S_i) \cap OUT(S_j)$ is not empty.

-- $S_j$ is control dependent on a conditional statement $S_i$ if its execution depends on the execution of $S_i$ and the path chosen after that.

If the two statements are both enclosed by DO loops, a dependence may exist on some but not all iterations of the enclosing loops. Also, a dependence may exist between $S_i$ executed in one iteration and $S_j$ executed in another iteration of a loop. Such a dependence is called a cross-iteration dependence. The data dependence information can be represented by a graph on the statements of a program, called a data dependence graph.
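The three data dependence types reduce to set intersections on IN and OUT. A minimal sketch, assuming each statement is summarized by two Python sets of variable names (the function name `data_dependences` is hypothetical, and control dependence and loop iterations are ignored):

```python
def data_dependences(in_i, out_i, in_j, out_j):
    """Dependences of S_j on S_i, where S_i precedes S_j on a control flow path."""
    kinds = set()
    if out_i & in_j:      # S_j uses a value S_i generates
        kinds.add("flow")
    if in_i & out_j:      # S_j overwrites a value S_i uses
        kinds.add("anti")
    if out_i & out_j:     # both statements generate the same variable
        kinds.add("output")
    return kinds


# S_i:  A = B + C      S_j:  B = A * 2
deps = data_dependences({"B", "C"}, {"A"}, {"A"}, {"B"})
```

Here `deps` contains a flow dependence (through A) and an antidependence (through B), but no output dependence.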


Loop types 


A program can have any of the following four loop 
types: 

(1) A DOALL loop - a loop which does not have any 
cross-iteration dependencies [11]. 

(2) An R(N,1) loop - a loop solving a first-order linear 
recurrence [8]. 

(3) A DOACROSS loop - a loop that has a dependence 
graph cycle but can be executed in a pipelined 
fashion [13]. 

(4) A serial loop - a loop with a dependence graph 
cycle where pipelined execution is not possible. 

The type of a loop is determined by the depen- 
dence graph of statements nested in it. The first three 
types of loops can be executed on multiple processors to 
achieve performance improvement. 


A memory reference sequence 


A memory reference sequence $\mathbf{X}$ for a memory location of a variable X is an ordered sequence of accesses generated by a program as observed at the memory:

$\mathbf{X} = (X^s (X^f)^*)^*$

Any access $X^a$ generated by a program can be identified by a sequence number i giving the position of $X^a$ in $\mathbf{X}$. We use a subscript i to indicate the sequence number, $X_i^a$. Let us assume that the sequence number is known a priori for any access.


For a serial program the sequence is unique for a given set of input data. For a deterministic parallel program the sequence is unique except for possible permutations in any of the fetch subsequences.


The model of computation we use assumes that the following two conditions are satisfied at any time T when a processor generates $X_i^a$:


A1. $M[X] = X_j^s$, where j < i and $\forall k$, j < k < i: $X_k^a$ was a fetch,


A2. If $X_i^a$ is a store then $\forall j$, j < i: $X_j^a$ has been performed.


In other words, the value in the shared memory is always current. All the stores preceding the current access have been completed but none of the following stores have been completed. This is necessary to keep things "simple" in the system where the only way of communicating between processors is through shared memory. These conditions are equivalent to enforcing the data dependence constraints between the statements. We assume that the ordering is enforced through the use of synchronization primitives.


Cache incoherence 


We define an incoherence as a condition when:
1. a processor $P_n$ performs a memory fetch $X_i^f$,
and
2. $M[X] \ne C_n[X]$ at such time.
An incoherence cannot occur if $X_i$ is a store. Note that we require a processor to try to fetch X, otherwise the fact that the memory and the cache have different values is not an error.


Using the notion of the reference sequence let us define the necessary conditions for the cache incoherence to occur.


Lemma 1 


The two conditions necessary for a cache incoherence to occur at processor $P_n$ issuing $X_i^f$ are:


C1. $C_n[X] = X_l^{a(P_n)}$, where l < i and $X_l^a$ was the last access to X by $P_n$.

C2. $\exists k$, l < k < i: $M[X] = X_k^{s(P_j)}$, where $j \ne n$. (If more than one such store took place, let $X_k^s$ be the one with the largest sequence number.)

That is, a value of X is present in the cache of $P_n$ (C1), and a new value has been stored in the shared memory by another processor since the last access by $P_n$ (C2).


Proof 


(1) The first condition must be true, for if it is not true the value is fetched out of global memory (GM) and is therefore correct.

(2) Let us assume that an incoherence occurred on $X_i^f$, i.e. $M[X] \ne C_n[X]$ at that time, but C2 is not true. Let us also assume that this is the first time any of the conditions {A1, A2, C1, C2} we specified are not satisfied.

C2 being false means:

$M[X] = X_k^{s(P_j)}$, where k > i or k < l or j = n.

If j = n then the value in the cache is that stored by processor j. Since it was the last store before $X_i^f$ we have $M[X] = C_n[X]$, which is not true. Therefore $j \ne n$.

If k > i then condition A2 has not been satisfied when the store of $X_k$ occurred, since not all the preceding accesses have been made. We assumed that this was the first time any of the conditions were false, but found that A2 must have been false earlier. Therefore, k < i.

k cannot be equal to i since $X_k$ is a store and $X_i$ is a fetch.

k cannot be equal to l since this will imply j = n, which we have shown is false.

For the case of k < l we have two possibilities: $X_l$ is a store or $X_l$ is a fetch. If $X_l$ is a store then $M[X] = C_n[X]$ when $X_i$ is issued, since no other processors store into X after $X_l$ and neither does $P_n$. If $X_l$ is a fetch we again have $M[X] = C_n[X]$. Since the contents of memory and the cache are not equal, k cannot be less than l.

It follows then that C2 is a necessary condition for the incoherence to occur. □


Corollary 


Cache incoherence cannot occur on a variable that is 
assigned only once during program execution. 


The proof follows from the fact that the conditions 
C1 and C2 of Lemma 1 cannot be true at the same 
time. 
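The two conditions of Lemma 1 can be reproduced with a toy model of write-through private caches with a block size of one. This is only an illustrative sketch: the class `PrivateCache` and its method names are hypothetical, and the model ignores timing, the interconnection network, and synchronization.

```python
class PrivateCache:
    """Write-through private cache over a shared memory (a plain dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def fetch(self, x):
        # A miss reads the shared memory; a hit returns the cached copy.
        if x not in self.lines:
            self.lines[x] = self.memory[x]
        return self.lines[x]

    def store(self, x, value):
        # Write-through: update the local copy and the shared memory.
        self.lines[x] = value
        self.memory[x] = value

    def flush(self):
        # Invalidate the entire contents of the cache.
        self.lines.clear()


memory = {"X": 0}
cache_n, cache_j = PrivateCache(memory), PrivateCache(memory)

cache_n.fetch("X")              # C1: X is now present in the cache of P_n
cache_j.store("X", 1)           # C2: another processor stores a new value
stale = cache_n.fetch("X")      # incoherent fetch: cached value differs from M[X]
cache_n.flush()
fresh = cache_n.fetch("X")      # after a flush the value comes from memory
```

In this run `stale` is the old value 0 while memory holds 1, and `fresh` is 1. If X were assigned only once, the store by the other processor could not follow a cached copy of a previous value, which is the situation the Corollary rules out.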


Detecting the necessary conditions 


Our goal is to have the restructuring compiler 


detect the conditions of Lemma 1 and generate the 
cache management instructions. The power of restruc- 
turing compilers is in the ability to perform the data 
dependence analysis. Therefore, let us express the con- 
ditions necessary for cache incoherence to occur in 
terms of data dependencies. It follows from Lemma 1 
that: 


(1) The value of X has to be in the cache of processor $P_j$. This implies that a statement $S_{i_1}$ was executed by $P_j$ such that X belongs to $IN(S_{i_1})$ or X belongs to $OUT(S_{i_1})$.

(2) A different processor has computed a new value of X. This implies that a statement $S_{i_2}$ was executed by $P_k$ such that X belongs to $OUT(S_{i_2})$, $k \ne j$.

(3) Finally, an incoherence can only occur if $P_j$ is fetching X. This implies a statement $S_{i_3}$ is executed by $P_j$ such that X belongs to $IN(S_{i_3})$.

From the above, $S_{i_3}$ depends on statement $S_{i_2}$ and statement $S_{i_2}$ depends on statement $S_{i_1}$. One of the following two dependence graphs is possible:


In addition to the above dependence structure, it is necessary that $S_{i_2}$ be executed on a different processor.


How can we detect that? We propose to use the loop type information for this purpose. (If other types of parallelism are being exploited they can be taken care of in a similar fashion.) Recall that only three of the four types of loops we are dealing with can be executed on multiple processors. Let us concentrate on DOALL and DOACROSS loops. We assume that one or more consecutive iterations of such a loop are assigned to a processor. By definition, any dependence between two statements inside a DOALL loop is not across iterations, but there are cross-iteration dependences in DOACROSS. It follows that a statement $S_j$ in a DOALL dependent on a statement $S_i$ in the same loop is executed on the same processor as $S_i$.


In a DOACROSS loop, two statements with a cross-iteration dependence are executed on different processors, while statements with a dependence on the same iteration are executed on the same processor.


Using the above we first construct a simplified algorithm using only the loop type information. We then extend it to consider the dependence structure and the flow of control.


A cache management algorithm 


Let us assume that the following instructions are 
available for cache management: 


Flush.      This instruction invalidates the entire contents of a cache.

Cache_on.   This instruction causes all global memory references to be routed through the cache.

Cache_off.  This instruction causes all global memory references to by-pass the cache and go directly to memory.


In addition, the cache state, on or off, must be part of 
the processor state and has to be saved/restored on con- 
text switch. Processes are created in the cache-off state. 


The algorithm uses loop types for its analysis as follows:


(1) A DOALL loop does not have any dependencies between statements executed on different processors. Therefore condition C2 of Lemma 1 is false, and any shared memory access in such a loop can be cached.


(2) A serial loop is executed by a single processor, and hence condition C2 of Lemma 1 cannot be true.


(3) A DOACROSS or an R(N,1) loop does have cross-iteration dependencies. Therefore condition C2 can be true. Initially, let us just turn caches off in such loops.


The algorithm is shown in Figure 2. The algo- 
rithm turns cacheing on and off, depending on the type 
of loop a program enters. (Note that cache 
management instructions inserted in parallel loops are 
executed once by every participating processor.) Condi- 
tions for incoherence are checked at loop boundaries. 
In addition, procedure and function calls which may 
have parallel loops in called routines are considered. 
The algorithm does not really consider individual 
dependencies or look for the dependence structure satis- 
fying conditions of Lemma 1. It states that within cer- 
tain loop types (or nests of this type) the conditions 
cannot be satisfied or cacheing is not allowed. 


The algorithm is simple enough to be executed either at compile time or at run time. At compile time we know which loops can be executed on multiple processors, but whether they will be executed as such depends on run-time processor allocation. At run time we know which loops get multiple processors. For example, if a DOACROSS is serialized we can turn the cacheing on during the whole time the loop is executed.


cache_state := on
insert Cache_on, Flush
For every statement in a program do
  case(statement type)
    of(DO)
      case(DO type)
        of(DOALL)
          insert Cache_on, Flush after the DO statement
          push(cache_state)
          cache_state := on
        of(serial)
          if cache_state = off
            then insert Cache_on, Flush before the DO statement
          fi
          push(cache_state)
          cache_state := on
        of(DOACROSS, R(N,1))
          insert Cache_off after the DO statement
          push(cache_state)
          cache_state := off
      endcase
    of(DOEND)
      do_type := type of DO for which this is DOEND
      old_cstate := cache_state
      cache_state := pop()
      if cache_state = on AND old_cstate = off
        then insert Cache_on after DOEND
      fi
      if cache_state = on AND do_type ∈ {DOALL, DOACROSS, R(N,1)}
        then insert Flush after DOEND
      fi
    of(CALL)
      insert Cache_off before the CALL statement
      if cache_state = on
        then insert Cache_on, Flush after the CALL statement
      fi
  endcase
enddo


Figure 2. The cache management algorithm 
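As a rough illustration, the algorithm of Figure 2 can be transcribed into a single pass over a flattened statement list. The sketch below is an assumption-laden rendering, not the paper's implementation: the program representation (a list of `(kind, arg)` tuples with kinds `"DO"`, `"DOEND"`, `"CALL"`, `"STMT"`) and the function name `insert_cache_instructions` are hypothetical.

```python
PARALLEL = {"DOALL", "DOACROSS", "R(N,1)"}


def insert_cache_instructions(program):
    """Return the statement list with cache management instructions inserted.

    program: list of (kind, arg); for "DO", arg is the loop type
    ("DOALL", "serial", "DOACROSS" or "R(N,1)").
    """
    out = ["Cache_on", "Flush"]
    cache_on = True
    state_stack = []   # cache states saved at each DO
    type_stack = []    # loop types, to match each DOEND with its DO
    for kind, arg in program:
        if kind == "DO":
            if arg == "DOALL":
                out += [f"DO {arg}", "Cache_on", "Flush"]
                new_state = True
            elif arg == "serial":
                if not cache_on:
                    out += ["Cache_on", "Flush"]   # inserted before the DO
                out.append(f"DO {arg}")
                new_state = True
            else:                                  # DOACROSS or R(N,1)
                out += [f"DO {arg}", "Cache_off"]
                new_state = False
            state_stack.append(cache_on)
            type_stack.append(arg)
            cache_on = new_state
        elif kind == "DOEND":
            do_type = type_stack.pop()
            old_state = cache_on
            cache_on = state_stack.pop()
            out.append("DOEND")
            if cache_on and not old_state:
                out.append("Cache_on")
            if cache_on and do_type in PARALLEL:
                out.append("Flush")
        elif kind == "CALL":
            out += ["Cache_off", f"CALL {arg}"]
            if cache_on:
                out += ["Cache_on", "Flush"]
        else:
            out.append(arg)
    return out


result = insert_cache_instructions(
    [("DO", "DOACROSS"), ("STMT", "S1"), ("DOEND", None), ("CALL", "f")]
)
```

For this input the DOACROSS body runs with the cache off, and the cache is turned back on and flushed after the loop and again after the call.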


Correctness proof 


We will prove the correctness of the algorithm by showing that the conditions necessary for an incoherence to occur are not satisfied in programs processed by the algorithm. The conditions necessary for an incoherence on a variable X to occur when executing a statement $S_j$ are:


CC1: Statement $S_j$ depends on statement $S_i$ through a variable X.


CC2: $S_i$ is executed by processor $P_l$ and $S_j$ is executed by processor $P_m$, $l \ne m$.


CC3: X is in the cache $C_m$ prior to the execution of $S_j$ and $C_m[X] \ne X^{s(P_l)}$.


A general loop structure enclosing statements i and j is shown in Figure 3. Note that a statement not in any loop can be represented by a statement enclosed in a serial loop with one iteration. Therefore some of the loops shown in Figure 3 can be removed to obtain simpler cases.


DO a
  DO b
    DO c
      $S_i$
    DOEND c
  DOEND b
  DO d
    DO e
      $S_j$
    DOEND e
  DOEND d
DOEND a


Figure 3. Program structure


We have to show that for every path in the control flow graph of our program between $S_i$ and $S_j$ one of the conditions above is false for any $S_i$. A note about loops with GOTOs exiting the loop: we assume such loops have a single exit path regardless of how the loop was exited. This is usually required for correct synchronization, while it also simplifies the analysis of control flow between loops in our proof.


Let us consider the innermost loop $DO_e$ enclosing $S_j$. Three cases are possible:

(1) $DO_e$ is a DOACROSS or an R(N,1) loop.
In this case a Cache_off instruction is issued by every processor executing the iterations of $DO_e$. Therefore, X will be fetched from memory when $S_j$ is executed. This is equivalent to CC3 being false.

(2) $DO_e$ is a DOALL loop.
In this case a Flush instruction is issued by every processor executing iterations of $DO_e$. Therefore CC3 is false.

(3) $DO_e$ is a serial loop.
Consider all the loops enclosing $S_j$ between $DO_e$ and $DO_d$. Let us skip all loops e, ..., k+1 that are serial, up to $DO_k$, $k \ge d$, such that:

  A. $DO_k$ is a DOACROSS.
  In this case $S_j$ is enclosed by a nested set of serial loops $DO_e$ through $DO_{k+1}$. The nested set is executed on one processor and a Flush instruction is executed before $DO_{k+1}$ according to Algorithm 1. Therefore CC3 is false.

  B. $DO_k$ is a DOALL.
  In this case a Flush instruction is executed by every processor executing iterations of $DO_k$. Since all of the loops $DO_{k+1}$ through $DO_e$ are serial, every processor executing $S_j$ has executed a Flush instruction. Therefore CC3 is false.

  C. Otherwise $DO_k$ is serial, i.e. all of the loops $DO_d$ through $DO_e$ are serial.
  Now let us consider the loop type of $DO_a$:

    α. $DO_a$ is a DOACROSS.
    In this case a Flush instruction is executed before $DO_d$ according to Algorithm 1. Since $S_j$ is executed by the same processor, CC3 is false.

    β. $DO_a$ is a DOALL or a serial loop.
    In this case we have to consider the loops enclosing $S_i$.

      I) $DO_b$ through $DO_c$ are all serial.
      In this case both loop nests $DO_b$ through $DO_c$ and $DO_d$ through $DO_e$ are serial. Since they are nested in a DOALL or a serial loop they will be executed on the same processor. Hence condition CC2 cannot be true for statements $S_i$ and $S_j$.

      II) Otherwise consider the outermost loop $DO_k$, $b \le k \le c$, that is not serial, i.e. it is a DOALL, a DOACROSS, or an R(N,1).
      In this case a Flush instruction is executed after $DOEND_k$ by the same processor that is to execute $S_j$. Therefore CC3 is false.


Finally, let us consider CALL statements. Suppose $S_i$ is a subroutine or function CALL. In this case a called routine may have loops in it enclosing the statement actually generating X. In addition, the called routine may not have any cache management instruction in it (that is why we turn cacheing off before a call). However, in all but one of the cases considered in the proof a Flush or Cache_off instruction is performed by a processor that executes $S_j$. Therefore, CC3 is false when $S_j$ is executed regardless of the type of statement of $S_i$.


The one case that does not have the cache flushed is C.β.I. In that case $S_i$ and $S_j$ are executed by the same processor, and there is no Flush on any path between the two statements. The algorithm inserts a Flush instruction after a CALL statement, taking care of this case. □


Improving the cache management algorithm 


In this section we describe the extensions of the
cache management algorithm. The first one allows
cacheing to be used in DOACROSS loops. The second
and third attempt to reduce the number of cache
flushes by doing a more detailed dependence and flow
analysis.


(1) Cacheing of data inside DOACROSS loops. 
A DOACROSS loop is executed by assigning successive
iterations to different processors (mod the
number of processors available). The cross-iteration
dependencies that exist in a DOACROSS
are thus between statements executed by different
processors. Synchronization primitives have to be
used between these processors to ensure that
dependencies are satisfied (say, the classical P and
V primitives).
A straightforward solution is to issue a Flush
instruction after the V by each processor executing 
a statement depending on a statement executed by 
another processor. Since the shared memory has 
the current value after the V instruction and the 
cache does not have anything, the value will be 
fetched out of global memory. Otherwise the 
DOACROSS loops can be treated the same way as 
the DOALL loops by the cache management algo- 
rithm. 
A more interesting solution is possible for architec- 
tures that support memory access combined with 
synchronization, such as [7] or [5]. In this case we 
can identify exactly which word requires synchroni- 
zation and invalidate just one word, not the entire 
cache. Cache controller design has to be modified 
to allow invalidation of individual words, prefer- 
ably as part of a synchronized fetch. The correct 
value is fetched from memory and can be put in 
the cache. Any subsequent unsynchronized fetch 
will use the value out of the cache. 


The processor performing a synchronized store does 
not have to do anything special, just write the data 
through to shared memory. 

The last solution requires a synchronized data 
access for every variable on which a cross-iteration 
dependence exists. Any attempt to minimize the 
number of synchronization primitives, as proposed 
by [12], or use of implied synchronization may 
result in an incorrect execution of a program. 


(2) The simplified algorithm we presented does not
really look for the dependence structure implied by 
the Lemma 1 conditions. Specifically, it does not 
check the existence of a statement bringing a vari- 
able into a cache prior to the execution of the two 
statements with a dependence on two different pro- 
cessors. An incoherence cannot occur if such a 
statement does not exist. In such a case it is not 
necessary to invalidate the contents of a cache. 
Consider a DOACROSS loop with cacheing 
enabled. Assume each processor executing this 
loop performed a Flush instruction just after it 
entered the loop. Let us now consider a statement
S_j that uses a variable X generated in another
iteration of the DOACROSS. If we examine all the
flow of control paths from the first statement in a
loop to S_j and determine there are no generations
or uses of X on any of them then we do not have to
invalidate X in the cache before S_j (a single assignment
condition). If the above is true for all the
cross-iteration dependencies in the loop we do not
need a Flush instruction in this DOACROSS.
This technique can be extended to analyze the 
whole program to avoid Flushing after every paral- 
lel loop. 
(3) The algorithm uses data dependence information 
indirectly, through loop types. A beginning and 
end of a parallel loop are synchronization points it 
detects and uses to issue cache management 
instructions. This synchronizes all dependencies 
from statements in such loops to statements out- 
side of such loops. However, this synchronization 
point may be located much earlier than the 
statement using the data. Another synchronization 
point may exist later in the program that takes 
care of an earlier one. For example, consider the 
program segment in Figure 4. If no statement in
DOALL b has dependence arcs to statements in
DOALL c or any code between the two loops, and
the flow of control always goes through DOALL c
after passing through DOALL b, a Flush is not
necessary after DOALL b.


The correctness proof can be easily extended to include
the algorithm improvements shown in this section.

DO a
    DOALL b
    DOEND b
    DOALL c
    DOEND c
DOEND a


Figure 4. Program example. 
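
The path condition of extension (2) can be illustrated with a small reachability check. This is only a sketch under stated assumptions, not the compiler's algorithm: the loop body is modelled as a hypothetical control-flow graph (`cfg`, a dict of successor lists), `touches_x` is the set of statements that generate or use X, and the function name `can_skip_invalidate` is invented for illustration.

```python
def can_skip_invalidate(cfg, entry, sink, touches_x):
    """Return True if no statement on any entry->sink path touches X.

    cfg: dict mapping each statement label to a list of successor labels
         (hypothetical representation of one loop iteration's flow graph).
    """
    # Forward pass: statements reachable from the entry before the sink.
    forward, stack = set(), [entry]
    while stack:
        node = stack.pop()
        if node in forward or node == sink:
            continue
        forward.add(node)
        stack.extend(cfg.get(node, []))

    # Backward pass: statements from which the sink can be reached.
    preds = {}
    for node, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(node)
    backward, stack = set(), list(preds.get(sink, []))
    while stack:
        node = stack.pop()
        if node in backward or node == sink:
            continue
        backward.add(node)
        stack.extend(preds.get(node, []))

    # Statements lying on at least one entry->sink path (sink excluded).
    on_path = forward & backward
    return not (on_path & set(touches_x))


# A is the first statement of the loop, S the statement using X.
cfg = {"A": ["B", "C"], "B": ["S"], "C": ["S"], "S": ["D"], "D": []}
print(can_skip_invalidate(cfg, "A", "S", {"D"}))  # True: D comes after S
print(can_skip_invalidate(cfg, "A", "S", {"C"}))  # False: C precedes S
```

If the check succeeds for every cross-iteration dependence in the loop, the Flush inside that loop could be dropped, as the text argues.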


Conclusions 


We have presented a solution to the cache coher- 
ence problem in which a restructuring compiler gen- 
erates cache management instructions. An algorithm to 
do that is presented and its correctness proved. In this 
solution each processor manages its own cache without 
any additional communication with other processors. 
The cache management instructions are very simple, 
affect only the processor issuing them and have a small 
fixed cost. Finally, the total number of such instruc- 
tions issued by any processor is a function of loop 
bounds and loop structure of a program, not of the 
number of processors used or the number of stores in 
the program. That is why we believe the solution is 
scalable to multiprocessors of any size.


IN LOCALLY DISTRIBUTED COMPUTER SYSTEMS

ABSTRACT. This paper presents a heuristic algorithm for allocating
the tasks in a task force to a set of execution sites in a locally distributed
computer system. The algorithm dynamically allocates tasks on
arrival with the objectives of balancing the load of the system and of 
minimizing the communications costs for the task force to the extent pos- 
sible within the constraints imposed by load balancing. Results from 
experimenting with the algorithm on pipelined task forces arising from 
distributed database queries indicate that it finds the optimal allocation in 
most cases, and also that load-balanced task allocation effectively 
improves system performance. 


1. INTRODUCTION 


Task allocation and load balancing are major issues associated 
with the design of distributed computing systems. When a task or set of 
tasks are to be executed in a distributed system, they must be properly 
allocated to sites in the system. The problem of selecting appropriate 
sites is known as the task allocation problem, and a number of solution 
methods have been developed for this problem [2,4,10,12,13]. A com- 
mon trait of these task allocation schemes is that they operate using static 
information. That is, they ignore an important dynamic characteristic of 
distributed computer systems: the probability of one or more sites being 
idle while tasks are waiting at other sites can be remarkably high [6]. 
Load balancing aims to reduce this probability via task migration, which 
is the process of transferring tasks from one site to another (idle or less 
heavily loaded) site to either begin or continue execution there. In the 
class of task migration schemes known as sender-initiated schemes [3,6],
tasks are migrated upon arrival in the system by the site at which they 
arrive. This is similar in some ways to more traditional task allocation 
schemes, but the allocation decision is made dynamically and the load 
status of the system is considered. To distinguish this dynamic, load- 
based approach to task allocation from static approaches, we refer to it as 
load-balanced task allocation. 


In this paper we present a heuristic algorithm for solving an 
instance of the load-balanced task allocation problem. The particular 
problem that we consider here is unique as compared to most previous 
load balancing work, as we address the problem of selecting execution 
sites for an entire distributed program consisting of a number of tasks 
which are to be executed concurrently and which communicate with one 
another. Such distributed programs are known as task forces [5]. The 
particular class of task force that motivated this study is distributed data- 
base queries. In distributed relational database systems, queries are usu- 
ally decomposed into sequences of data moves and subqueries. The 
subqueries may be executed concurrently at different sites in the system 
in a pipelined fashion to achieve good performance [7,8]. Note that the 
problem of allocating subqueries to sites in a locally distributed database 
system in a load-balanced fashion is a variant of the load-balanced task 
allocation problem. 


The load-balanced task allocation problem addressed in this paper 
differs from previous work on task allocation in several ways. First, and 
most importantly, load-balanced task allocation is a dynamic problem. It 
takes a task force and the current load status of the system as inputs, and 
its main objective is to achieve a load-balanced system dynamically 
through proper task allocation. Second, given load balancing as the pri- 
mary objective, the communications cost for executing the task force is 
minimized as a secondary objective. Third, each task in the task force is 
assumed to have a feasible assignment set that specifies the sites to which 
the task may be allocated. This constraint arises in our application (dis- 
tributed query processing) because each relation is stored at only a subset 
of the sites in the system. Finally, since load-balanced task allocation is
dynamic in nature and must be accomplished quickly at runtime, we
require a solution method that is capable of solving the problem quickly. 


The remainder of this paper presents our heuristic algorithm and 
summarizes the results of its evaluation. Section 2 defines the load- 
balanced task allocation problem more precisely, and Section 3 describes 
the details of our algorithm. In Section 4, we present some performance 
evaluation results plus two enhancements to the algorithm. Finally, Sec- 
tion 5 outlines our plans for future work. 


2. THE LOAD-BALANCED TASK ALLOCATION PROBLEM 


In many research efforts, the load of a processing site s_i is
represented by the number of tasks currently being served or awaiting
service at that site, N(s_i) [6,11]. That is, LD(s_i) = N(s_i). In order to
quantitatively measure the degree of "balancedness" of a system, we
extend Livny's definition of the load unbalance factor [6], UBF, to be
the variance of the system's load distribution:

$$UBF = \frac{\sum_{i=1}^{n} \left(LD(s_i) - \overline{LD}\right)^2}{n} = \frac{\sum_{i=1}^{n} \left(N(s_i) - \bar{N}\right)^2}{n}$$

N(s_i) is the number of tasks at site s_i and $\bar{N}$ is the average number of
tasks per site after all query units are allocated. Using this definition, the
load-balanced task allocation problem for a locally distributed system
with n sites, {s_1, ..., s_n}, can now be expressed as follows:


Given: 

(1) A number of tasks {t_1, ..., t_m}, to be executed concurrently as a
task force.

(2) A feasible assignment set S_i = {s_i1, ..., s_ik} for each task t_i
(1 ≤ i ≤ m).


(3) An initial load vector specifying the initial load at each site s_j,
1 ≤ j ≤ n, as given by LD(s_j).


Find: An allocation of tasks to processing sites such that: 
(1) The unbalance factor under this allocation is minimized. 


(2) The total communications cost, measured as the sum of the total 
data transfers between communicating tasks that are allocated to 
different sites, is also minimized to the extent possible without 
increasing the unbalance factor of the allocation.
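
As a quick illustration of the primary objective, the unbalance factor can be computed directly from a load vector. The sketch below assumes the UBF is the population variance of the per-site task counts, as defined above; the function name `unbalance_factor` is hypothetical.

```python
def unbalance_factor(loads):
    """Population variance of the per-site task counts (the UBF)."""
    n = len(loads)
    mean = sum(loads) / n
    return sum((x - mean) ** 2 for x in loads) / n


# Post-allocation load vector reported in the example of Section 3.2.
print(round(unbalance_factor([4, 2, 1, 2, 2, 2, 2, 2]), 2))  # 0.61
```

The printed value agrees with the UBF of 0.61 quoted for that example.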


In the following discussion, we assume that the m tasks are executed
in a pipelined fashion because of our application [1,9]. In this
case, communication only occurs between pairs of adjacent tasks in the
pipeline, so the total communications cost can be roughly approximated
by the number of nonlocal task pairs (i.e., pairs of adjacent subtasks t_i
and t_{i+1} that are allocated to different sites). However, this assumption
does not affect the generality of our algorithm, and only relatively minor 
modifications are required in order to apply the algorithm to concurrent 
task forces with more general communication patterns. 
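
The pipeline cost approximation amounts to counting adjacent task pairs that sit on different sites. A minimal sketch, assuming the assignment is given as a list of site names in pipeline order (the name `communication_cost` is hypothetical):

```python
def communication_cost(sites):
    """Count adjacent pipeline pairs placed on different sites.

    sites[i] is the site chosen for the i-th task of the pipeline.
    """
    return sum(1 for a, b in zip(sites, sites[1:]) if a != b)


# Five pipelined tasks on three sites: two nonlocal pairs.
print(communication_cost(["s1", "s1", "s2", "s2", "s3"]))  # 2
```

For task forces with a more general communication pattern, the same idea would sum over the edges of the communication graph rather than over consecutive positions.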


3. A HEURISTIC APPROACH 


The fact that task allocation is to be performed at runtime implies
that the allocation algorithm should introduce as little overhead as possible.
Heuristic methods requiring less computational effort while providing
near-optimality are therefore preferable to exhaustive (optimal) solution
methods. In this section, we present a heuristic algorithm for solving
the load-balanced task allocation problem as defined in the previous
section.


3.1. Task Allocation Heuristics 


Three main heuristics are employed by the algorithm — a heuristic 
to control the order in which tasks are considered for allocation, a heuris- 
tic to avoid considering sites with particularly heavy loads, and a heuris- 
tic to help ensure that a good allocation site is selected for each task. 


Heuristic 1: Order of Allocation. The notion of the degree of 
freedom of a task is used in our algorithm to determine the order in 
which tasks are allocated to sites. This is an important consideration, as 
our algorithm avoids backtracking or reconsidering previous allocation 
decisions. The algorithm uses two freedom-related metrics associated 
with each task: (1) static freedom — the number of sites where the task 
can be allocated (i.e., the cardinality of its feasible assignment set). A 
task with a higher static freedom value will be relatively more flexible 
than one with a lower value because there are more site choices avail- 
able. (2) dynamic freedom — the sum of the current load of the sites in 
the task’s feasible assignment set. 


Using these freedom measures, tasks are allocated by our heuristic 
algorithm one by one in order of their degree of freedom. That is, tasks 
with fewer site choices are allocated earlier, and the degree of static free- 
dom is the primary consideration. In the event of a tie, the task with the 
largest dynamic freedom measure is chosen, as its candidate sites are 
more heavily loaded. 
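
The ordering rule can be expressed as a sort on a two-part key. The sketch below orders a snapshot of the unallocated tasks; the actual algorithm recomputes the metrics after each allocation. The names `allocation_order`, `feasible` and `load`, and the final tie-break on the task name, are assumptions made for illustration.

```python
def allocation_order(feasible, load):
    """Order tasks by static freedom (ascending), breaking ties by the
    dynamic freedom measure (descending), as in Heuristic 1.

    feasible: dict task -> set of candidate sites
    load:     dict site -> current load
    """
    def key(task):
        static = len(feasible[task])
        dynamic = sum(load[s] for s in feasible[task])
        # The task name is only a deterministic last-resort tie-break.
        return (static, -dynamic, task)

    return sorted(feasible, key=key)


load = {"s2": 1, "s4": 2, "s5": 2, "s6": 0, "s7": 1}
feasible = {
    "t1": {"s2", "s5", "s7"},
    "t2": {"s2", "s4", "s5"},
    "t4": {"s6"},
}
print(allocation_order(feasible, load))  # ['t4', 't2', 't1']
```

Here t4 comes first because it has a single candidate site, and t2 precedes t1 because its candidate sites carry more load in total (5 versus 4).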


Heuristic 2: Elimination of Full Sites. A site with a current load 
which is larger than the expected post-allocation average load is con- 
sidered to be full. A site that becomes full will not be assigned any 
further tasks (except if it is the only site in the feasible assignment set for 
a given task). Thus, whenever a site is full, whether due to its initial load 
or to the assignment of a new task to the site, it is deleted from the feasi- 
ble assignment set of each task that also has some other site in its feasible 
assignment set. 
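
Heuristic 2 can be sketched as a pruning pass over the feasible assignment sets. Two points are assumptions rather than statements from the text: the expected average is rounded up (as done in the example of Section 3.2), and a task whose candidate sites are all full keeps its original set. The name `prune_full_sites` is hypothetical.

```python
import math


def prune_full_sites(feasible, load, num_new_tasks):
    """Remove full sites from every feasible set that has an alternative.

    feasible:      dict task -> set of candidate sites
    load:          dict site -> current load
    num_new_tasks: number of tasks in the task force being allocated
    """
    expected = math.ceil((sum(load.values()) + num_new_tasks) / len(load))
    full = {site for site, l in load.items() if l > expected}
    pruned = {}
    for task, sites in feasible.items():
        remaining = sites - full
        # Assumption: keep the original set if pruning would empty it.
        pruned[task] = remaining if remaining else set(sites)
    return pruned, expected


load = {"s1": 4, "s2": 1, "s3": 1, "s4": 2, "s5": 2, "s6": 0, "s7": 1, "s8": 1}
feasible = {"t1": {"s1", "s2", "s5", "s7"}, "t4": {"s1", "s6"}}
pruned, expected = prune_full_sites(feasible, load, num_new_tasks=5)
print(expected)              # 3
print(sorted(pruned["t1"]))  # ['s2', 's5', 's7']
print(sorted(pruned["t4"]))  # ['s6']
```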


Heuristic 3: Site Selection Criteria. After determining the task 
to be considered next, four metrics are used to select a site for the task 
from its feasible assignment set. These are: 


(1) The current load of a candidate site: This is the total number of 
tasks currently assigned to the site, including those tasks in the ini- 
tial load of the site. 


(2) The potential load of a candidate site: This is the number of unas- 
signed tasks that have the site in their feasible assignment set. 


(3) The benefit of a possible assignment: This is the communications 
cost that can be avoided by this assignment. In our case, since 
communications occur only between adjacent tasks in the pipeline, 
the benefit of assigning task t_i to site s_j is the number of adjacent
tasks (e.g., t_{i-1} or t_{i+1}) that have already been allocated to the site
(either 0, 1, or 2). 


(4) The potential benefit of a possible assignment: This measures the
communications cost that might be eliminated by this assignment.
In our case, this is the number of unassigned tasks which have the
site in their feasible assignment set and which would form a consecutive
sequence of tasks including the task being allocated if
they too were assigned to the site.


Since load balancing is the main objective, an attempt is first made 
to allocate task t_i to the site with the minimum current load among its
feasible assignment sites. In the event of a tie, the benefit of assigning t_i
to each site with the minimum load is calculated, and the site with the 
maximum benefit is selected. In case of another tie, the site with the 
minimum potential load is chosen, with maximum potential benefit being 
used if there are several sites with the same minimum potential load. 
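
The four-level tie-breaking rule maps naturally onto a tuple comparison. The following is a sketch for the pipelined case only; tasks are identified by their integer position in the pipeline, and the names `select_site`, `feasible`, `assigned` and `load`, as well as the final tie-break on the site name, are assumptions made for illustration.

```python
def select_site(i, feasible, assigned, load):
    """Pick a site for pipeline task i using the four metrics of Heuristic 3.

    feasible: dict task index -> set of candidate sites
    assigned: dict task index -> site, for tasks already placed
    load:     dict site -> current load
    """
    unassigned = [t for t in feasible if t not in assigned and t != i]

    def benefit(site):
        # Adjacent tasks already placed on this site (0, 1 or 2).
        return sum(1 for nb in (i - 1, i + 1) if assigned.get(nb) == site)

    def potential_load(site):
        return sum(1 for t in unassigned if site in feasible[t])

    def potential_benefit(site):
        count = 0
        for step in (-1, 1):
            t = i + step
            # Extend the run of consecutive unassigned tasks that could
            # also be placed on this site.
            while t in feasible and t not in assigned and site in feasible[t]:
                count += 1
                t += step
        return count

    def key(site):
        return (load[site], -benefit(site), potential_load(site),
                -potential_benefit(site), site)

    return min(feasible[i], key=key)


load = {"a": 1, "b": 1, "c": 2}
feasible = {2: {"a", "b", "c"}, 3: {"b"}}
assigned = {1: "a"}
print(select_site(2, feasible, assigned, load))  # a
```

Sites a and b tie on current load, and a wins on benefit because the neighbouring task 1 is already placed there.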


3.2. The Basic Load-Balanced Task Allocation Algorithm 


The algorithm’s input includes a feasible assignment set for each 
task plus an initial load vector that specifies the number of existing tasks 
at each site when the new tasks are to be allocated. The originating site 
and result site for the task force are also specified as input so that the 
communications costs for initiating tasks at remote sites and for returning 
results to the result site (assuming that a user is awaiting results at that 
site) will be considered by the allocation algorithm. These are handled 
using two dummy tasks, t_0 and t_{m+1}, which are assumed not to introduce
any real load at their sites, but whose communications costs play a role in
influencing allocation decisions. The basic load-balanced task allocation
algorithm can now be described as follows:


(1) Allocate dummy task t_0 to the task force's site of origin, and allocate
dummy task t_{m+1} to the result site for the task force.


(2) Compute the static and dynamic freedom metrics for each task t_i,
1 ≤ i ≤ m.

(3) Select the next task to be allocated as the one with the least assignment
flexibility using the degree of freedom metrics.


(4) Select an allocation site s_j for this task (t_i) by choosing the site
from its feasible assignment set with the least current load, considering
the benefit, potential load, and potential benefit metrics in
turn as necessary (to break any ties).


(5) Increment the load of site s_j, and recompute the freedom metrics
of any unallocated tasks that have s_j in their feasible assignment
set. (If s_j's load now exceeds the expected average, simply delete
s_j from their feasible assignment sets.)


(6) If any unassigned tasks remain, go to step (3). 


EXAMPLE: Consider a task force consisting of 5 tasks, and sup- 
pose that there are 8 sites in the system. Assume that the originating and 
result sites for the task force are both s3, that the initial system load vec- 
tor is {4, 1, 1, 2, 2, 0, 1, 1}, and that the feasible assignment sets for the 
tasks are: 


t1: {s1, s2, s5, s7}        t2: {s2, s4, s5}
t3: {s3, s5, s7, s8}        t4: {s1, s6}
t5: {s3, s4, s5, s6, s7}

Our algorithm would then operate as follows: 


i. The total initial load is 12 and the expected average load after
allocation (rounded up) is 3. Site s1 is therefore full initially, so
it is deleted from the feasible assignment sets of t1 and t4.


ii. The static degrees of freedom for the tasks are {1, 3, 3, 4, 1, 5,
1}. The dummy tasks t0 and t6 are allocated to s3 first.


iii. t4 is allocated to s6, the only site in its feasible assignment set.


iv. Both t1 and t2 have the same static degree of freedom value (of
3). Since t2 has less dynamic freedom, it is allocated next.


v. t2 is allocated to site s2 because s2 is the site with the minimum
load (of 1) in its feasible assignment set.


vi. t1 is now allocated to site s7 for the same reason as in v.


vii. Since t3 is next in increasing order of static freedom, it is considered
next. It is allocated to site s8, which has the smallest
potential load.


viii. Finally, t5 is allocated to site s6, which has the largest benefit
among the sites in t5's feasible assignment set.


ix. The final task assignment is:


task            (t0)  t1  t2  t3  t4  t5  (t6)
execution site  (s3)  s7  s2  s8  s6  s6  (s3)


After assignment, the load vector is {4, 2, 1, 2, 2, 2, 2, 2}. The 
UBF for this assignment is 0.61, and the total communications cost is 5. 
It turns out that this is actually an optimal allocation under the problem 
definition. 
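
The walk-through can be replayed with a compact sketch of the basic algorithm. Several points are assumptions rather than statements from the text: the feasible assignment sets are those read off the example (the set for t2 is partly inferred), the dummy tasks are ignored when scoring the benefit of a site (the text does not spell out how they enter the metrics, and this choice reproduces step viii), and remaining ties are broken by site name. The function name `allocate` is hypothetical.

```python
import math


def allocate(feasible, load, origin, result):
    """Greedy load-balanced allocation for a pipeline t_1..t_m (a sketch).

    feasible: dict {task index 1..m: set of site names}
    load:     dict {site name: initial load}
    Returns (assignment including dummy tasks 0 and m+1, final loads).
    """
    m = len(feasible)
    load = dict(load)
    feasible = {t: set(s) for t, s in feasible.items()}
    assigned = {0: origin, m + 1: result}           # step (1): dummy tasks
    expected = math.ceil((sum(load.values()) + m) / len(load))

    def prune():                                    # Heuristic 2
        full = {s for s, l in load.items() if l > expected}
        for t, sites in feasible.items():
            if t not in assigned and sites - full:
                feasible[t] = sites - full

    def order_key(t):                               # Heuristic 1
        return (len(feasible[t]), -sum(load[s] for s in feasible[t]), t)

    def site_key(t, s):                             # Heuristic 3
        pending = [u for u in feasible if u not in assigned and u != t]
        # Assumption: dummy neighbours (0 and m+1) do not count here.
        benefit = sum(1 for nb in (t - 1, t + 1)
                      if 1 <= nb <= m and assigned.get(nb) == s)
        pot_load = sum(1 for u in pending if s in feasible[u])
        pot_benefit = 0
        for step in (-1, 1):
            u = t + step
            while u in feasible and u not in assigned and s in feasible[u]:
                pot_benefit += 1
                u += step
        return (load[s], -benefit, pot_load, -pot_benefit, s)

    prune()
    while len(assigned) < m + 2:                    # steps (3) to (6)
        t = min((u for u in feasible if u not in assigned), key=order_key)
        s = min(feasible[t], key=lambda site: site_key(t, site))
        assigned[t] = s
        load[s] += 1
        prune()
    return assigned, load


feasible = {
    1: {"s1", "s2", "s5", "s7"},
    2: {"s2", "s4", "s5"},
    3: {"s3", "s5", "s7", "s8"},
    4: {"s1", "s6"},
    5: {"s3", "s4", "s5", "s6", "s7"},
}
load = {"s1": 4, "s2": 1, "s3": 1, "s4": 2, "s5": 2, "s6": 0, "s7": 1, "s8": 1}
assigned, final = allocate(feasible, load, origin="s3", result="s3")
print([assigned[t] for t in range(1, 6)])    # ['s7', 's2', 's8', 's6', 's6']
print([final[f"s{k}"] for k in range(1, 9)])  # [4, 2, 1, 2, 2, 2, 2, 2]
```

Under these assumptions the sketch arrives at the same assignment and final load vector as the walk-through.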


4. PERFORMANCE AND ENHANCEMENTS


The heuristic algorithm described in the last section is a greedy 
algorithm in the sense that tasks are assigned one by one, without looka- 
head or backtracking. In order to evaluate the quality of the allocations 
generated by our algorithm, we conducted a study in which we compared 
its allocation decisions to those of an exhaustive search method which 
always finds the optimal allocation [9]. This study led to the design of 
two enhancements for the basic algorithm. The algorithm was also used 
in a simulation study to investigate the usefulness of load-balanced task 
allocation for distributed database systems [1,9]. We briefly summarize 
all of these results here. 


4.1. Optimality of the Algorithm 


Three groups of tests were conducted to investigate the optimality 
of the heuristic algorithm [9]. The first group of tests studied the general 
behavior of the algorithm, varying the number of tasks and the number of
sites in the system. In the second group of tests, the initial "unbalancedness"
of the system's load was varied. In the last group of tests, the
tasks’ feasible assignment set sizes were varied. 


The results of these tests showed that, in most cases, the heuristic 
algorithm finds the optimal allocation. The percentage of optimal alloca- 
tions generated was always at least 75%, and typically higher, for runs in 
the first group of tests. About 95% of the allocations obtained in this 
group were optimal if the unbalance factor alone was considered, which 
is encouraging since load balancing is the main objective. Also, more 
than 80% of the allocations had the same (or lower) communications 
costs as the optimal allocation. As the numbers of sites and tasks were 
increased, however, the number of optimal allocations found by the 
heuristic algorithm decreased. The explanation for this is that larger 
numbers of sites and tasks increase the number of possible allocations, so 
the probability of selecting an optimal allocation using a greedy algo- 
rithm decreases. It is then more likely that a near-optimal allocation will 
be chosen instead of the true optimum. This was especially clear in a 
case where each task was capable of being executed at any site, where 
our heuristically obtained allocations frequently resulted in higher com- 
munications costs. 


4.2. Enhancing the Basic Algorithm 


The results of these tests motivated the design of two enhance- 
ments to further reduce communications costs, improving the overall 
optimality of the resulting allocation of tasks to sites. 


Enhancement 1. Enhancement 1 is applied to each task individually. If a
task t_i has been assigned to site s_j, its feasible assignment set is searched
to find another site s_k so that, if t_i is instead assigned to s_k, the unbalance
factor of the system is not affected but the communications cost
decreases.
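
A minimal sketch of this post-pass for the pipelined case follows. It relies on one observation that is not stated in the text but follows from the variance definition of the UBF: moving a single task from a site with load a to a site with load b leaves the UBF unchanged exactly when b = a - 1, because the two sites then simply trade load values. The names `enhancement_1`, `order`, `assigned`, `feasible` and `load` are hypothetical.

```python
def enhancement_1(order, assigned, feasible, load):
    """Move single tasks to sites that lower the communications cost
    while leaving the UBF unchanged.

    order:    task ids in pipeline order, dummy tasks included
    assigned: dict task -> site (updated in place)
    feasible: dict task -> set of candidate sites (real tasks only)
    load:     dict site -> load (updated in place)
    """
    def cost():
        sites = [assigned[t] for t in order]
        return sum(1 for a, b in zip(sites, sites[1:]) if a != b)

    for task in feasible:
        src = assigned[task]
        for dst in sorted(feasible[task] - {src}):
            # UBF is preserved only if src and dst trade load values.
            if load[dst] + 1 != load[src]:
                continue
            before = cost()
            assigned[task] = dst
            if cost() < before:
                load[src] -= 1
                load[dst] += 1
                break
            assigned[task] = src  # undo: no communications gain
    return assigned


order = [0, 1, 2, 3]                      # tasks 0 and 3 are dummies
assigned = {0: "a", 1: "b", 2: "a", 3: "a"}
feasible = {1: {"a", "b"}, 2: {"a"}}
load = {"a": 1, "b": 2}
enhancement_1(order, assigned, feasible, load)
print(assigned[1], load)  # a {'a': 2, 'b': 1}
```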


Enhancement 2. Enhancement 2 tries to group as many tasks together at
the same site as possible without affecting the UBF. It looks at each pair
of adjacent tasks (t_i, t_{i+1}), where t_i and t_{i+1} have been allocated to two
different sites s_k and s_l, and it tries to find a third task t_j (j > i+1) which
was also allocated to site s_k (i.e., to the same site as t_i); it considers the
possibility of reversing the choice of sites for t_{i+1} and t_j without affecting
the UBF but eliminating the communications cost between t_i and
t_{i+1}.

The tests described earlier were repeated with Enhancements 1 and 
2 employed, and the results indicated that they indeed improve the algo- 
rithm [9]. Enhancement 2 was especially helpful for the previously trou- 
blesome case where every task is able to run at any site; the optimal 
allocation was always found in this case using the algorithm with 
Enhancement 2. 
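
The site exchange of Enhancement 2 might be sketched as follows for the pipelined case. Because two tasks trade sites, every site load (and hence the UBF) is unchanged by construction. The acceptance test used here, a strict drop in total communications cost, is an assumption that is slightly more conservative than the description above; the names `enhancement_2`, `order`, `assigned` and `feasible` are hypothetical.

```python
def enhancement_2(order, assigned, feasible):
    """Swap the sites of t_{i+1} and a later task t_j that shares t_i's
    site, when both moves are feasible and the total cost drops.

    order:    task ids in pipeline order, dummy tasks included
    assigned: dict task -> site (updated in place)
    feasible: dict task -> set of candidate sites (real tasks only)
    """
    def cost():
        sites = [assigned[t] for t in order]
        return sum(1 for a, b in zip(sites, sites[1:]) if a != b)

    for i in range(len(order) - 2):
        t_i, t_next = order[i], order[i + 1]
        if assigned[t_i] == assigned[t_next] or t_next not in feasible:
            continue
        for t_j in order[i + 2:]:
            if t_j not in feasible or assigned[t_j] != assigned[t_i]:
                continue
            s_i, s_next = assigned[t_i], assigned[t_next]
            if s_i not in feasible[t_next] or s_next not in feasible[t_j]:
                continue
            before = cost()
            assigned[t_next], assigned[t_j] = s_i, s_next
            if cost() < before:
                break  # keep the swap; site loads are unchanged
            assigned[t_next], assigned[t_j] = s_next, s_i  # undo
    return assigned


order = [0, 1, 2, 3, 4]                   # tasks 0 and 4 are dummies
assigned = {0: "a", 1: "a", 2: "b", 3: "a", 4: "b"}
feasible = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"a", "b"}}
enhancement_2(order, assigned, feasible)
print([assigned[t] for t in order])  # ['a', 'a', 'a', 'b', 'b']
```

In this small case the swap of tasks 2 and 3 reduces the number of nonlocal pairs from 3 to 1.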


4.3. Execution Time and Algorithm Complexity 


The execution time and complexity of the heuristic algorithm are 
also important concerns. Tests were run using an unoptimized version of 
the algorithm, and its execution time was indeed found to be fairly small 
[9]. For instance, with 3-5 tasks in a task force and 8 sites in the system, 
its execution time on a VAX 11/780 was in the 15-40 millisecond range. 
In our database application [1,9], query task forces are typically small 
like this, and the time required to initiate a compiled query will easily 
dominate such task allocation times. As the numbers of tasks and sites 
were increased, the execution time for our heuristic algorithm was 
observed to be basically linear in mn, whereas the elapsed time for an 
exhaustive search was seen to rise dramatically. This agrees with a com- 
plexity analysis of the basic algorithm which determined its cost to be 
O(max(mn, m^2 log_2 m)) for m tasks with a total of n candidate execution
sites.


4.4. An Example Application 


As mentioned in Section 1, the load-balanced task allocation
algorithm presented here was motivated by the load balancing problem 
for locally distributed database systems. We have designed and 
evaluated a load-balanced approach to query processing for locally distri- 
buted database systems, and this work is reported in [1,9]. In particular, 
a simulation study was conducted to address the impact of using our 
load-balanced query allocation algorithm. The results of this study indi- 
cate that load-balanced task allocation provides better performance than 
either static or random task allocation strategies, improving the response
time and waiting time for queries, and even improving overall system
throughput in many cases. Waiting time reductions of 50% or more were 
typical under moderate CPU loads, and throughput improvements in 
some cases reached 10-30%. 


5. CONCLUSIONS AND FUTURE WORK 


This paper presented a new algorithm for allocating task forces to 
sites in a locally distributed computer system. The algorithm is novel in 
that it takes the current system load into account, using load balancing as 
the primary objective driving the task allocation procedure, although 
communications costs are also considered (as a secondary objective). 


Because the algorithms and ideas presented in this paper resulted 
from work on load balancing for distributed database systems, our dis- 
cussions of task forces and our algorithm descriptions have assumed a 
linear, pipelined task force structure. The ideas discussed here are appli- 
cable to more general task force topologies, however. Future work is 
needed to evaluate the optimality of the plans generated via our heuristic 
algorithm (and also the complexity of the algorithm) for other task force 
topologies. It would also be interesting to study the performance 
improvements provided for various task force topologies using our 
approach to dynamic task allocation. Finally, we have not addressed the 
question of how the necessary load information is to be exchanged 
among the sites, making the integration of our ideas with an appropriate 
load information exchange policy a problem for future work as well.


ABSTRACT


We consider the problem of uniformly distributing the load of a 
parallel program over a multiprocessor system. We describe and 
analyze four strategies for load balancing. The performance of each of 
these strategies is compared on a set of problems whose structure per- 
mits the use of all four strategies. 


The four strategies are (1) the optimal static assignment algorithm,
which is guaranteed to yield the best static solution, (2) the static binary
dissection method, which is very fast but sub-optimal, (3) the greedy
algorithm, a static fully polynomial time approximation scheme, which
estimates the optimal solution to arbitrary accuracy, and (4) the predictive
dynamic load balancing heuristic, which uses information on the
precedence relationships within the program and outperforms any of the
static methods.


It is also shown that the overhead incurred by the dynamic heuris- 
tic (4) is reduced considerably if it is started off with a static assignment 
provided by either (1), (2) or (3). 


1. Introduction 


Efficient utilization of parallel computer systems requires that the 
task or job being executed be partitioned over the system in an optimal 
or near-optimal fashion. In the general partitioning problem, one is 
given a multicomputer system with a specific interconnection pattern as 
well as a parallel task or job composed of modules that communicate 
with each other in a specified pattern. One is required to assign the 
modules to the processors in such a way that the total execution time of 
the job is minimized. 

An assignment is said to be static if modules stay on the proces- 
sors to which they are assigned for the lifetime of the program. A 
dynamic assignment, on the other hand, moves modules between proces- 
sors from time to time whenever this leads to improved efficiency. 


Given an arbitrarily interconnected multicomputer system and an 
arbitrarily interconnected parallel task, the problem of finding the 
optimal static partition is very difficult and can be shown to be computa- 
tionally equivalent to the notoriously intractable NP-Complete problems 
[1]. However, many practical problems have special structure that permits
the optimal solution to be found very efficiently.

Cations in Science and Engineering (ICASE), NASA Langley Research
Center.


In this paper we will consider four methods of balancing load. The 
first three produce static mappings of modules to processors. These 
static methods are: (1) the calculation of the optimal static load balance, 
(2) a suboptimal but very inexpensive static load balancing method, and 
(3) a fully polynomial time approximation scheme, the solution of which
can be made to approach the optimal load balance. A dynamic load 
balancing method is also considered - this method allows the mapping of 
modules to processors to vary. These methods for balancing load are 
suitable for distinct but overlapping varieties of problems. These prob- 
lems can arise, among other places, in signal processing and image 
analysis, in the solution of systems of linear equations using point or 
block iterative methods, in problems of adaptive mesh refinements, as 
well as in time driven discrete event simulation. We describe our experi- 
ence with four different algorithms that we have used to solve a problem 
for which all these methods are applicable. 


The first method finds the optimal static assignment using the 
bottleneck path algorithm described in [2]. This algorithm captures the 
execution costs of the modules or processes of the task as edge weights 
in an assignment graph. A minimum bottleneck path in this graph then 
yields the optimal assignment. This algorithm has moderate complexity 
and is guaranteed to yield the optimal static assignment. 


The second method that we evaluate is the binary dissection algo- 
rithm which is derived from the work of Berger and Bokhari [3],[4]. 
This algorithm is very fast but does not always yield the optimal static 
solution. 


The third scheme that we consider is based on a widely used 
greedy method described in [5], which when combined with a binary 
search yields an approximate solution to the static partitioning problem. 
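
The combination of a greedy pass with a binary search can be illustrated by a standard probe-and-bisect sketch. It is offered only as an illustration under loud assumptions and is not necessarily the method of [5]: modules form a chain as in Section 2, each module has a fixed weight per step, processors are identical, each processor receives a contiguous block of modules, and communication costs are ignored. The names `probe`, `approx_bottleneck` and `eps` are hypothetical.

```python
def probe(weights, p, bound):
    """Greedy check: can the chain be cut into at most p contiguous
    blocks, each with total weight <= bound?"""
    blocks, current = 1, 0
    for w in weights:
        if w > bound:
            return False
        if current + w > bound:
            blocks += 1        # start a new block on the next processor
            current = w
        else:
            current += w
    return blocks <= p


def approx_bottleneck(weights, p, eps=1e-6):
    """Bisect on the bottleneck value until the search interval is
    narrower than eps; the returned value is always achievable."""
    lo, hi = max(weights), sum(weights)
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(weights, p, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Five modules on three processors; the best contiguous split has load 9.
print(round(approx_bottleneck([2, 3, 4, 5, 6], 3), 3))  # 9.0
```

Shrinking `eps` tightens the estimate at the cost of more probes, which is the sense in which such a scheme can approach the optimal static balance.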


Finally we evaluate the predictive dynamic load balancing method
developed by Saltz [6]. This is a dynamic algorithm in that modules are
reassigned from time to time during the course of execution of the parallel
program. This heuristic takes the precedence relationships of the
subtasks into account when deciding whether and when to relocate
modules. This additional information and the capability to relocate
dynamically permit this algorithm to usually allow for higher processor
utilizations than the optimal static algorithm. Each of the load balancing
strategies incurs an overhead. The overheads of the static algorithms are
incurred prior to the execution of the program and consist of the calculations
required to decide on the assignment of modules to processors. In
contrast, the overhead of the dynamic load balancing method is incurred
throughout the execution of the program.


The following section discusses in detail the problem addressed in 
this research. Sections 3, 4, and 5 describe the optimal static, binary 
dissection and greedy static load balancing schemes, along with an
analysis of the costs of performing the calculations necessary. A proof
is presented demonstrating that the greedy algorithm produces an 
approximate solution to the static partitioning problem which can be 
made to converge to the optimal solution. Section 6 contains a descrip- 
tion of the dynamic algorithm along with descriptions of data structures 
required for the efficient execution of this method. Section 7 compares 
the performance of these four algorithms. The factors influencing the 
run time overhead of the dynamic load balancing method are explored. 
An investigation is made of the performance and overheads that result 
from initially balancing load using the various static load balancing 
methods, and then dynamically balancing load. 


2. Formulation of Problem 


The methods of load balancing discussed here are applicable to 
distinct but overlapping sets of problems. The following is a description 
of a class of problems for which all of the load balancing methods 
presented here are applicable. We consider the partitioning on a mul- 
tiprocessor system of problems which are composed of m computational 
modules, numbered 1 to m, with a chain-like pattern of inter-module data 
dependencies, such that module i is connected only to modules i + 1 and 
i - 1. The computation is divided into steps, and each module requires 
data from neighboring modules at step s - 1 to begin the computations 
required for step s. 


When the relevant domain is partitioned into strips, explicit 
schemes for solving time dependent partial differential equations [7], 
problems in discrete event simulation [8], time driven discrete event 
simulation (a), as well as Jacobi and block Jacobi iterative methods [6], 
[9] used to solve a discretized partial differential equation can exhibit 
this pattern of data dependence. Pipelined algorithms in signal process- 
ing and image analysis [2] can also take this form. A similar but 
slightly more complex pattern of data dependence occurs in red black 
and multicolor SOR [10]. 


The importance of good load balancing strategies is accentuated 
when the work involved in solving a problem separates naturally into a 
number of subunits that is relatively small compared to the number of 
processors utilized, and when partitioning any one of these subunits 
across several processors is inconvenient or expensive. 


Consider the simulation of physical processes, either by means of 
solving a partial differential equation or by means of a discrete event 
simulation. The computations relating to a particular spatial strip may be 
assigned to a specific process which handles all computations describing 
events occurring in that region. Interactions between strips are local; 
each strip is coupled only to its immediate neighbors. It should be 
noted that in adaptive methods for the explicit solution of time depen- 
dent partial differential equations [11], block iterative methods, and 
discrete event simulation, the subunits of computation into which the 
problems naturally separate may each involve substantial amounts of 
computation. 


In signal processing, a fixed sequence of operations or transforms 
is applied to a long series of inputs. For example, each arriving packet 
of data may have to be Fourier transformed, multiplied by a fixed fre- 
quency, filtered, clipped and inverse transformed. This type of applica- 
tion has a chain like structure and lends itself naturally to pipelining[12]. 
We will index successive arriving data packets or inputs by a positive 
integer i. Module 1 processes packet i and sends the results to module 2. Each 
module j, 2 ≤ j ≤ m - 1, receives the modified packet i from module j - 1, 
further processes it and sends it off to module j + 1. We will say that 
module j is computing step s = i + j when it is working on calculations 
pertaining to packet i = s - j. When module j advances from step s to 
s + 1, the module processes the modified packet i - j. 


From the above, we see that in order to advance to step s, module 
j must obtain information from module j - 1 at step s - 1. While 
module j - 1 does not require information from module j, it is useful to 
impose the condition that module j - 1 must be no more than one step 
ahead of module j. By doing this, we ensure that no more than one 
modified data packet need be stored per module. 


(a) D. Nicol and J. Saltz, "A Statistical Methodology for the Con- 
trol of Dynamic Load Balancing," to be published as an ICASE Report. 
The static load balancing methods yield a mapping of modules to 
processors. The time required to complete a problem is determined by 
the processor with the heaviest load. With the dynamic load balancing 
method, each module may proceed at a rate constrained only by the 
local availability of computational resources and its data dependence on 
other modules. Load balancing is performed in a way that is explicitly 
designed to prevent processor inactivity due to a lack of data availabil- 
ity. 

The performance of the static and dynamic load balancing 
methods is compared through a variety of simulations. The performance 
of the dynamic load balancing method may be expected to 
depend to some extent on the initial balance of load at the time dynamic 
load balancing is initiated. One would expect the performance of the 
dynamic load balancing method to be favorably influenced by the use of 
static load balancing to improve the initial load balance. 


The performance of each load balancing strategy discussed here is 
compared on a set of problems whose structure permits the use of all 
four strategies. Dynamic load balancing becomes particularly desirable 
in problems in which the time needed for a process to complete one step 
is difficult to determine before the problem is mapped onto a machine, 
or when the time required to complete a step changes during the 
problem’s execution. 


In the case of discrete event simulations, the time required by a 
process to complete a step is difficult to determine a priori. Moreover, in 
both discrete event simulations and methods that solve time 
dependent partial differential equations using an adaptive grid as part of 
an explicit timestepping scheme, the activity in a given region may vary 
during the course of the solution of the problem. 


3. The Optimal Static Algorithm 


In this section we discuss briefly Bokhari’s algorithm for optimally 
partitioning a chain structured parallel or pipelined program over a chain 
of processors [2]. We assume that a chain structured program is made up 
of m modules numbered 1..m and has an intercommunication pattern 
such that module i can communicate only with modules i+1 and i-1 as 
shown in Fig. 1. Similarly, we assume that the multiprocessor of size 
n<m also has a chain like architecture. We work under the constraint 
that each processor has a contiguous subchain of program modules 
assigned to it. Thus the partitions of the chains have to be such that 
modules i and i+1 are assigned to the same or adjacent processors. This 
is known as the contiguity constraint. The optimal partitioning would 
then be the assignment of subchains of program modules to processors 
that minimizes the load on the most heavily loaded processor. 


The above problem is solved by first drawing a layered graph (Fig. 
2) in which every layer corresponds to a processor and the label on 
each node corresponds to a subchain of modules. Every layer in this 
graph contains all subchains of modules, i.e. all pairs <i,j> such that 
1 ≤ i ≤ j ≤ m. A node labeled <i,j> is connected to all nodes <j+1,q> in the 
layer below it for all j except 1 and m. All nodes <1,i> in the first layer 
are connected to the starting node s while all nodes <i,m> in every layer 
are connected to the terminating node t. Any path connecting nodes s 
and t corresponds to an assignment of modules to processors. For example 
the thick edges in Fig. 2 correspond to the assignment of Fig. 1. 


Weights can now be added to the edges of this layered graph as 
follows. In layer k, each edge emanating downwards from node <i,j> is 
weighted with the time required for processor k to process modules i 
through j and this accounts for the total computation time on processor 
k. It is clear now that there is a path in this graph corresponding to 
every possible contiguous subchain assignment and the weight of the 
heaviest edge in a path corresponds to the time required by the most 
heavily loaded processor to finish. Thus to find the optimal assignment, 
we have to find the path in the layered graph in which the heaviest edge 
has minimum weight — the bottleneck path. 


The bottleneck path can be found by using the following labeling 
procedure. Initially all nodes are given labels L(i) = ∞ except in the first 
layer, in which all nodes are labeled zero. Then starting at the top and 
working downwards we examine each edge e emanating downwards 
from a layer. If this edge connects node a (above) to node b (below) 
then replace L(b) by min(L(b), max(W(e), L(a))), where W(e) is the weight 
associated with edge e. Once the graph has been labeled, we then find 
the edge incident on node t which has maximum weight. Suppose the 
edge joining node <i,m> of layer k with node t has maximum weight, 
then it means that the bottleneck path would contain the node <i,m> of 
layer k and thus modules i through m would be assigned to processor k. 
The rest of the bottleneck path can be found in the same manner by 
working upwards from layer k to the top. 


The number of nodes per layer in the layered graph is $O(m^2)$ and 
thus the total number of nodes in the graph is $O(m^2 n)$. The number of 
edges emanating from a node is at most m, thus the total number of 
edges would be $O(m^3 n)$. As the labeling algorithm looks at each edge 
once, the space as well as the time required by this algorithm is 
$O(m^3 n)$. 
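
As an illustration, the labeling procedure can be sketched in Python. This is a minimal sketch and not the implementation of [2]: it assumes identical processors, so that the weight of an edge leaving node <i,j> is simply the sum of the weights of modules i through j, and it keeps one label per (layer, last module) pair instead of building the layered graph explicitly. The names optimal_chain_partition, label and back are hypothetical.

```python
from itertools import accumulate


def optimal_chain_partition(weights, n):
    """Split a chain into at most n contiguous subchains, minimizing the
    heaviest subchain. Returns (bottleneck, subchains) where subchains is
    a list of 1-based inclusive (i, j) module ranges.
    Assumption: all processors are identical."""
    m = len(weights)
    prefix = [0] + list(accumulate(weights))

    def cost(i, j):
        # weight of modules i..j (1-based, inclusive)
        return prefix[j] - prefix[i - 1]

    inf = float("inf")
    # label[k][j]: smallest bottleneck when modules 1..j occupy exactly
    # k processors; this plays the role of L(<i,j>) in layer k.
    label = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    label[0][0] = 0
    for k in range(1, n + 1):
        for j in range(1, m + 1):
            for i in range(1, j + 1):
                candidate = max(label[k - 1][i - 1], cost(i, j))
                if candidate < label[k][j]:
                    label[k][j] = candidate
                    back[k][j] = i

    # Nodes <i,m> of every layer are joined to the terminal node, so the
    # path may end in any layer: pick the layer with the smallest label.
    best_k = min(range(1, n + 1), key=lambda k: label[k][m])

    # Work upwards from the chosen layer to recover the subchains.
    subchains = []
    j, k = m, best_k
    while k > 0:
        i = back[k][j]
        subchains.append((i, j))
        j, k = i - 1, k - 1
    subchains.reverse()
    return label[best_k][m], subchains
```

For the hypothetical weights [2, 3, 4, 1, 5, 2] and n = 3 the sketch reports a bottleneck of 7. Because it uses prefix sums and one label per (layer, last module) pair, this variant runs in time proportional to m^2 n, which is a property of the sketch and not a claim about the layered graph formulation.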


4. The Binary Dissection Method 


The binary dissection approach to the solution of the basic parti- 
tioning problem addressed in this paper is very efficient in terms of run 
time and gives solutions that are very close to optimal. This algorithm 
is a one dimensional version of the two dimensional partitioning strategy 
developed by Berger and Bokhari [3],[4]. The two dimensional binary 
dissection method gives a heuristic for partitioning a two dimensional 
grid of modules, and [3] , [4] give methods for mapping this partitioning 
onto a variety of architectures. 


The algorithm proceeds as follows. The given chain of m modules 
is split up into two halves such that the difference of the sums of execu- 
tion costs in each half is minimum. The two halves are then recursively 
subdivided as many times as desired. Clearly, the number of pieces into 
which the chain can be partitioned must be exactly $2^k$, where the integer 
k represents the depth of partitioning. 


Thus this algorithm is useful for problems in which the number of 
processors is a power of 2. The time required by this algorithm is 
$O(m \log n)$ for a problem with m modules and n processors since there 
can be no more than $\log n$ levels of partitioning with each level requiring 
at most one access to each module weight. 
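
The recursive halving described above can be sketched as follows. This is only an illustrative sketch of the one dimensional case, not the code of [3],[4]; the function name binary_dissection, the half-open index ranges and the tie-breaking rule (the leftmost best cut wins) are assumptions made here.

```python
def binary_dissection(weights, depth):
    """Split the chain into 2**depth contiguous pieces by recursive halving.
    Returns half-open (lo, hi) index ranges into `weights`."""

    def split(lo, hi, d):
        if d == 0:
            return [(lo, hi)]
        total = sum(weights[lo:hi])
        best_cut, best_diff = lo, float("inf")
        left = 0  # weight of weights[lo:cut]
        for cut in range(lo, hi + 1):
            diff = abs(total - 2 * left)  # |right half - left half|
            if diff < best_diff:
                best_cut, best_diff = cut, diff
            if cut < hi:
                left += weights[cut]
        # Recurse on the two halves with one level less to go.
        return split(lo, best_cut, d - 1) + split(best_cut, hi, d - 1)

    return split(0, len(weights), depth)
```

With the hypothetical weights [2, 3, 4, 1, 5, 2] and depth 2, the pieces carry loads 5, 4, 6 and 2. Each level touches every module weight once, in line with the cost argument above.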


At first sight this algorithm may seem capable of yielding the 
optimal solution. This is not always so, as the example in Fig. 3 
demonstrates. In the next paragraph we will find an upper bound on the 
difference between the optimal solution and the solution yielded by the 
binary dissection method. 


Let $W_T$ represent the sum of the weights of all m modules. A 
lower bound on the weight of the heaviest subchain $W_{OPT}$ in the optimal 
partition will be $W_T/n$, attained in the special case when all the n processors 
are uniformly loaded. Let us designate the weight of the heaviest 
module by $w_{max}$ and the weight of the heaviest subchain assigned to a 
processor using the techniques of binary dissection by $W_{MAX}$. Then 
whenever a chain is divided into two parts, the maximum difference 
between the two halves will be bounded by $w_{max}$. Thus if n = 2 then 
$W_{MAX} \le W_T/2 + w_{max}/2$. Similarly if there are n processors then an upper 
bound on $W_{MAX}$ will be: 


$$W_{MAX} \le W_T/n + w_{max}\left(\tfrac{1}{2} + \tfrac{1}{4} + \cdots + \tfrac{1}{n}\right)$$


$$\le W_T/n + w_{max}(n-1)/n$$


Thus the maximum difference between $W_{MAX}$ and $W_{OPT}$ will be 
given by the following equation under the assumption that m > n. 


$$W_{MAX} - W_{OPT} \le w_{max}(n-1)/n \qquad (1)$$
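
The bound rests on a geometric sum. The short derivation below spells out that step, assuming $n = 2^k$ so that the dissection has exactly k levels. The intermediate symbol $W^{(j)}$, the heaviest piece after j levels of halving, is introduced here only for illustration.

```latex
W^{(j)} \le \tfrac{1}{2}\left(W^{(j-1)} + w_{max}\right), \qquad W^{(0)} = W_T
\;\Longrightarrow\;
W_{MAX} = W^{(k)} \le \frac{W_T}{2^{k}} + w_{max}\sum_{j=1}^{k} 2^{-j}
= \frac{W_T}{n} + w_{max}\,\frac{n-1}{n}
```

The first inequality holds because the two halves of a piece sum to its weight and differ by at most $w_{max}$, so the heavier half is at most half of the piece plus $w_{max}/2$.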


5. The Greedy Algorithm 


The Greedy algorithm yields an approximate solution to the 
optimal partition of a chain structured parallel or pipelined program over 
a chain of processors. This algorithm is based on a greedy method, 
which is a widely used technique and is applied to a variety of problems 
[5]. Sahni [1] has devised a polynomial time approximation scheme to 
solve the knapsack problem using a greedy method while Kernighan 
uses a similar approach [13] for finding optimal sequential partitions of 
graphs. Utilizing this method one can devise an algorithm which works 
in stages and at each stage a decision is made regarding whether or not 
the next input should be included in the partially constructed solution. If the 
inclusion of the next input will result in an infeasible solution then this 
is not added to the partial solution. Greedy methods may not necessarily 
provide optimal answers. For example consider the binpacking problem: 
Given a finite set $W=\{w_1,w_2,\ldots,w_m\}$ of m different weights, find a 
partition of W into n disjoint subsets $W_1,W_2,\ldots,W_n$ such that n is minimum 
and the sum of the weights in each subset $W_i$ is no more than a fixed 
constant. The First Fit algorithm for the above problem is essentially a 
greedy method in the sense that it tries to place each weight in the 
lowest indexed subset as far as possible, but this does not result in the 
optimal solution [1]. If however we put an extra condition on the problem 
that weights $w_i$ and $w_{i+1}$ are to be placed in either the same subset 
or subsets $W_j$ and $W_{j+1}$ respectively then the same greedy approach will 
be able to find the optimal solution. 


The greedy algorithm is based on the function PROBE (described 
below) and takes advantage of the fact that the weight assigned to the 
most heavily loaded processor in the optimal partition lies somewhere 
between $W_T/n$ and $W_T/n + w_{max}$ as discussed in the previous section. 
The algorithm selects a trial weight w in the above range and then uses 
the function PROBE. The function PROBE(w) returns true if it is possi- 
ble to partition the chain of modules into subchains such that the weight 
of each subchain is less than or equal to w, and false otherwise. The par- 
tition that results from a successful attempt to partition the chain so that 
the weight of each subchain is no more than w is called the greedy 
partition(w). 


function PROBE(Processors[1..n], Modules[1..m], w): boolean; 


begin 
    i = 1; j = 1; p = 1; 
    while p ≤ n do 
    begin 
        repeat 
            j = j+1; 
        until weight of subchain Modules[i..j] > w or j = m; 
        if j = m (all modules have been assigned) then return(true); 
        Assign the subchain Modules[i..j-1] to processor p; 
        i = j; p = p+1; 
    end; 
    return(false); 
end. 


The greedy algorithm then makes a binary search in the range 
[$W_T/n$, $W_T/n + w_{max}$] using the above function to find the partition for which 
the weight of the heaviest subchain is minimum. For each trial weight w 
the function PROBE has to look at each module only once. If the 
above range is resolved to an accuracy of $\epsilon$ then the greedy algorithm 
will find a greedy partition(w) in time proportional to $O(m \log_2(w_{max}/\epsilon))$ 
with the assurance that $W_{OPT} \le w \le W_{OPT} + \epsilon$ where, as before, $W_{OPT}$ is the 
weight of the heaviest subchain in the optimal assignment. It is important 
to note that the order of the greedy algorithm is proportional to 
$\log(w_{max}/\epsilon)$ unlike other fully polynomial time approximation schemes in 
which the time complexity is polynomial in $1/\epsilon$ as described in [1]. 
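
The probe and the binary search can be sketched together in Python. This is a hedged sketch rather than a transcription of PROBE: the boundary handling (a single module heavier than w makes the probe fail, and the last subchain is checked against w as well) is an interpretation adopted here, and the names probe and greedy_partition_weight are hypothetical.

```python
def probe(weights, n, w):
    """Return True if the chain can be cut into at most n contiguous
    subchains, each of weight <= w, filling each subchain greedily."""
    used, load = 1, 0
    for x in weights:
        if x > w:
            return False          # a single module already exceeds w
        if load + x > w:
            used += 1             # close the current subchain, open the next
            load = x
            if used > n:
                return False
        else:
            load += x
    return True


def greedy_partition_weight(weights, n, eps):
    """Binary search for the smallest feasible trial weight, to accuracy eps.
    The result w satisfies W_OPT <= w <= W_OPT + eps."""
    total, w_max = sum(weights), max(weights)
    lo, hi = total / n, total / n + w_max   # probe(hi) always succeeds
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if probe(weights, n, mid):
            hi = mid
        else:
            lo = mid
    return hi
```

The upper end of the range is always feasible: every subchain that the greedy rule closes weighs more than w - w_max = W_T/n, so fewer than n subchains can be closed before the modules run out. Each probe reads every module weight once and the search halves an interval of width w_max until it is narrower than eps, which matches the operation count quoted above.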


In the following paragraphs we will prove that if there exists an 
assignment with the weight of its heaviest subchain equal to w then the 
procedure PROBE will always find that or an equivalent assignment 
assuming that subchains with no modules in them (empty subchains) are 
allowed. 


Definition: The weight of a partition is the weight of its heaviest subchain. 

Notation: 

$\pi_{w,n}$: a partition with weight w and n subchains. 

$\gamma_{w,n}$: a greedy partition with weight w and n subchains. 


$\mu_{w,n,k}$: a mixed partition with weight w and n subchains in which the 
partition up to the first k subchains is greedy and the remaining 
partition may or may not be greedy. 


Observe that $\mu_{w,n,0}=\pi_{w,n}$ and $\mu_{w,n,n}=\gamma_{w,n}$. 


Claim 1: $\mu_{w,n,k}$ can always be transformed into $\mu_{w,n,k+1}$. 
Proof: Move the right hand partition of subchain k+1 to the right until 
any further movement would cause the weight of subchain k+1 to 
exceed w or exhaust the modules. 


1. If this is possible without disturbing the right hand partition of 
subchain k+2 then $\mu_{w,n,k}$ has been transformed into $\mu_{w,n,k+1}$ and the claim is 
correct. 


2. If during the course of this movement the r.h. partition of subchain 
k+1 coincides with the r.h. partition of subchain k+2, this means that 
subchain k+2 is now empty (which is permitted). Continue movement 
of both partitions together, combining with any further partitions that 
may be encountered. When the threshold point is reached, $\mu_{w,n,k}$ has been 
transformed into $\mu_{w,n,k+1}$. One or more subchains to the right of k+1 
are empty but the claim is still correct. □ 


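Claim 2: If there exists a $\pi_{w,n}$ then there must also exist a $\gamma_{w,n}$. 

Proof: Recall that $\pi_{w,n}=\mu_{w,n,0}$. 

By repeatedly applying the transformation of Claim 1 we can transform: 
$\pi_{w,n} = \mu_{w,n,0} \to \mu_{w,n,1} \to \cdots \to \mu_{w,n,k} \to \cdots \to \mu_{w,n,n} = \gamma_{w,n}$. □ 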

Result: If there exists an assignment of weight w then the procedure 
PROBE will find that or an assignment of equal weight. 


6. The Predictive Dynamic Load Balancing Method 


We assume that a computation is composed of a fixed number of 
computational processes or modules. The computation is divided into 
steps, and each module requires data from a set of other modules at step 
s—1 to begin the computations required for step s. Each module may 
proceed at a rate constrained only by the time required for the processor 
to perform the computations required by the module, the local availabil- 
ity of computational resources and data dependence on other modules. 
Hence at a given point in time, the system may contain modules 
advanced to a variety of different step numbers. In the predictive 
dynamic load balancing method, the data dependencies between step s—1 
and s are arbitrary, although in the section following this only chain 
structured data dependencies will be considered. 


At any given point in the computations, the data dependency 
requirements place limitations on the number of steps a module may 
advance. As stated before, in order to advance to step s, a module 
requires data from step s—1 from a specific set of associated modules. 
The largest step reachable by each module is the largest step number to 
which a module may be advanced while maintaining the required data 
dependency relationships. Load balancing is performed in a way that is 
explicitly designed to prevent processor inactivity due to a lack of data 
availability. 


The potential work of a processor is defined as the amount of time 
that will be required to advance all modules in a processor as many 
steps as possible given the data currently available from other proces- 
sors. The parallel efficiency of a processor may be defined as the per- 
centage of time a processor spends performing the computations 
required by the modules assigned to it. Transfers of modules between 
processors impact parallel efficiencies in a machine dependent way. 
The communication time required to transfer a module from one proces- 
sor to another along with the degree to which that communication can 
be masked with computation are essential factors in this dependency. 


In the predictive dynamic load balancing method to be discussed 
here, load is shifted between processors in a way that attempts to equal- 
ize the potential work in each processor. When the potential work of a 
processor falls below a predetermined threshold, load balancing is con- 
sidered. A module is shifted from a neighboring processor when the 
neighboring processor has stored an amount of potential work greater 
than or equal to the threshold plus a pre-determined safety factor. If 
more than one neighboring processor fits this criterion, the processor 
with the largest potential work contributes a module. In the chain struc- 
tured one dimensional problem considered here, there is only one 
module that a given processor can contribute at a given point in the 
computations. In general, the choice of the module to be contributed 
may not be straightforward [6]. 
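
The shifting rule just described can be sketched for a chain of processors as follows. This is a minimal sketch under stated assumptions: potential work is supplied as a list indexed by processor, the neighbors of processor p are p - 1 and p + 1, and the name choose_donor is hypothetical. How the threshold and safety factor are chosen is not addressed here.

```python
def choose_donor(potential_work, p, threshold, safety):
    """Return the index of the neighbor that should contribute a module
    to processor p, or None if no shift is called for."""
    if potential_work[p] >= threshold:
        return None  # p still has enough work; balancing is not considered
    neighbors = [q for q in (p - 1, p + 1) if 0 <= q < len(potential_work)]
    # A neighbor qualifies only if it holds at least threshold + safety.
    eligible = [q for q in neighbors
                if potential_work[q] >= threshold + safety]
    if not eligible:
        return None
    # If both neighbors qualify, the one with more potential work donates.
    return max(eligible, key=lambda q: potential_work[q])
```

For example, with potential work [5.0, 0.5, 3.0], threshold 1.0 and safety factor 1.5, processor 1 receives a module from processor 0.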
The ability to efficiently calculate the potential work in a processor 
is central to the usefulness of this method. Simple and inexpensive 
methods for calculating potential work will now be described. The 
potential work stored in a processor may have to be calculated from 
scratch in some situations. When the computations involved in solving 
a problem are initiated or when modules are shifted in or out of a pro- 
cessor after load balancing, one must take into account both the pattern 
of data dependencies within a processor and the availability of data 
from other processors in order to calculate potential work. Given a pro- 
cessor which has assigned to it a value for potential work, a simpler set 
of computations can be performed to update the value of potential work 
in response to the receipt of a new datum from another processor. 


It is useful at this point to describe in more detail the interaction 
between step numbers achievable by the modules assigned to a proces- 
sor and the external data available to the processor. A linked data 
structure representing an undirected graph DEPEND with weighted 
vertices is defined for each processor P. The I internal vertices represent 
the modules in P while the B boundary vertices represent modules in 
other processors directly coupled to modules in P. Let $z_j$, $1 \le j \le B$, 
represent boundary vertices and let $v_i$, $1 \le i \le I$, represent vertices within the 
processor. The weight $w_i$ of each vertex $v_i$ represents the largest step 
reachable by each module, given the currently available boundary 
information. The weight $q_j$ of each of the vertices $z_j$ represents the step of 
the largest available boundary variable data for the module. 


The largest step reachable by a vertex $v_i$ in the processor given 
currently available boundary data is determined by adding one to the 
minimum of: (1) the largest steps reachable by all internal vertices $v_j$ 
linked to $v_i$ and (2) the step number of the latest available boundary data 
for the boundary vertices $z_l$ linked to $v_i$. The weight assigned to $v_i$ may 
be written as 


$$w_i = \min(w_j, q_l) + 1 \qquad (2)$$

where $v_j$ and $z_l$ are linked to $v_i$. 


Denote the current step number of $v_i$ as $s_i$ and the time required to 
advance $v_i$ one step as $t_i$. The potential work associated with P at a given 
point in the computations may be written as 


$$\sum_i (w_i - s_i)\, t_i \qquad (3)$$


where the sum is over all i corresponding to $v_i$ in P. For each boundary 
vertex $z_k$ the graph DEPEND may be divided into equivalence classes 
based on the minimum number of edges that have to be traversed to get 
to $z_k$. We define $r_{k,i}$ as the equivalence class of $z_k$ to which $v_i$ belongs. 
Note that each internal vertex belongs to B different equivalence 
classes, one corresponding to each boundary vertex $z_k$, $1 \le k \le B$. The 
proposition below states a sort of superposition principle that holds for the 
determination of the maximum achievable step for internal vertices in 
response to constraints arising from boundary vertices. 


Proposition: The weight of $v_i$ is given by 


$$w_i = \min_k\,(q_k + r_{k,i}) \qquad (4)$$


Proof: The proof is carried out by substituting the postulated solution 
into (2). Fix attention on an internal vertex $v_i$. Corresponding to each 
$r_{k,i}$ where $r_{k,i} \ge 2$ there must be an internal vertex $v_j$ linked to $v_i$ with 
$r_{k,j} = r_{k,i} - 1$. If there were not, it would not be possible to find a shortest 
path from $v_i$ to $z_k$ consisting of $r_{k,i}$ edges. Moreover, there cannot be an 
internal vertex $v_j$ connected to $v_i$ with $r_{k,j} < r_{k,i} - 1$; if there were, then $v_i$ 
would have a shortest path to $z_k$ consisting of fewer than $r_{k,i}$ edges. 
Corresponding to each $r_{k,i}$ where $r_{k,i} = 1$ there is a direct edge from $v_i$ to 
$z_k$. 

Now substituting (4) for each $w_j$ into (2) yields 


$$w_i = \min\left[\min_k\,(q_k + r_{k,j}),\; q_l\right] + 1 \qquad (5)$$


for all j, l such that $v_j$ and $z_l$ are linked to $v_i$. Equation 5 may be 
rewritten as 


$$w_i = \min_k\left[\min_j\,(q_k + r_{k,j}),\; q_l\right] + 1$$


For each k, there exists an internal vertex $v_j$ with $r_{k,j} = r_{k,i} - 1$ connected to 
$v_i$ and there cannot be a vertex $v_j$ where $r_{k,j} < r_{k,i} - 1$. Hence from (5) we 
obtain (6) 


$$w_i = \min_k\left[(q_k + r_{k,i} - 1),\; q_l\right] + 1 \qquad (6)$$


For boundary vertices $z_l$ to which $v_i$ is directly connected, $r_{l,i} = 1$. Since 
all quantities involved are positive in sign, we obtain from (6) the 
equation (4) for $v_i$ as desired. □ 


We are now in a position to calculate the potential work from 
scratch, given values of s; and t; corresponding to all vertices v; in P. 
For each v; in P one may calculate w; from (4) in O(B) operations per 
vertex. Since there are I vertices the calculation of potential work from 
scratch requires O(IB) operations. 
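
A from-scratch calculation along these lines can be sketched in Python. The sketch assumes that DEPEND is supplied as two adjacency dictionaries (hypothetical names internal_adj and boundary_adj), that shortest paths to a boundary vertex run through internal vertices only, and that every internal vertex can reach at least one boundary vertex. It applies equation (4) by a breadth first search from each boundary vertex and then evaluates the sum (3).

```python
from collections import deque


def reachable_steps(internal_adj, boundary_adj, q):
    """Largest reachable step w[v] for every internal vertex, via (4).
    internal_adj: internal vertex -> list of internal neighbors
    boundary_adj: boundary vertex -> list of internal vertices linked to it
    q:            boundary vertex -> step of its latest available data"""
    w = {v: float("inf") for v in internal_adj}
    for z, starts in boundary_adj.items():
        # dist[v] is the equivalence class of v with respect to z.
        dist = {v: 1 for v in starts}
        queue = deque(starts)
        while queue:
            v = queue.popleft()
            for u in internal_adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        for v, r in dist.items():
            w[v] = min(w[v], q[z] + r)
    return w


def potential_work(w, s, t):
    """Sum of (w_i - s_i) * t_i over the internal vertices, as in (3)."""
    return sum((w[v] - s[v]) * t[v] for v in w)
```

For a processor holding a chain of three modules whose outer neighbors have data available at steps 4 and 7, the reachable steps come out as 5, 6 and 7.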


If a processor has a value of potential work assigned to it the 
potential work may be updated in response to the receipt of a boundary 
datum. One finds the weights for each vertex $v_i$ in P in the following 
way. By equation (4) incrementing the weight of a single boundary 
vertex can either leave the weight of interior vertices unchanged or increase 
the weight by one unit. Moreover, only interior vertices currently constrained 
by the incremented boundary vertex will have their weights 
incremented. 


In response to an increment in a boundary vertex $z_k$, the weights 
in equivalence classes may be adjusted in order of increasing 
equivalence class number with only one pass necessary. Assume that $z_k$ 
has had its weight incremented from $q_k - 1$ to $q_k$. Before $z_k$ was 
incremented, the constraint on the weight of vertices in equivalence class 
$r_k = n$ was $q_k - 1 + n$. The constraint on the weight of vertices in 
equivalence class $r_k = n-1$ after $z_k$ is incremented is $q_k + (n-1)$. The 
adjusting of equivalence class $r_k = n$ will have no effect on the 
adjustment of equivalence class $r_k = n-1$. 


If a vertex in equivalence class $r_k = n$ has a weight of less than 
$q_k + n - 1$ before being considered for readjustment, it is not being 
constrained by $z_k$. Incrementing $z_k$'s weight will consequently not affect the 
vertex. Since the only vertices which can possibly have their weights 
incremented have weights $q_k + n - 1$, the order in which vertices in an 
equivalence class are considered is unimportant. 


Updating DEPEND may proceed as follows. The weight of the 
vertex in DEPEND representing $z_k$ is first incremented. In a breadth 
first manner beginning with the vertex representing $z_k$, DEPEND is 
searched for vertices whose weights must be incremented. When a ver- 
tex v is found that does not require a weight increment, the search does 
not continue to examine other vertices linked to v. 
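
The update can be sketched as a breadth first worklist. This is one possible reading, not necessarily the data structure of [6]: each visited vertex is re-evaluated with equation (2) from the current weights of its neighbors, and a vertex whose weight does not rise is not expanded further. The dictionary names internal_adj, boundary_of and boundary_adj are hypothetical, and every internal vertex is assumed to have at least one neighbor.

```python
from collections import deque


def increment_boundary(z, q, w, internal_adj, boundary_of, boundary_adj):
    """Raise q[z] by one step and repair the weights w in place.
    internal_adj[v]: internal neighbors of internal vertex v
    boundary_of[v]:  boundary vertices linked to internal vertex v
    boundary_adj[z]: internal vertices linked to boundary vertex z
    Returns the set of internal vertices whose weight was incremented."""
    q[z] += 1
    changed = set()
    queue = deque(boundary_adj[z])
    while queue:
        v = queue.popleft()
        limits = [w[u] for u in internal_adj[v]]
        limits += [q[b] for b in boundary_of[v]]
        new_w = 1 + min(limits)          # equation (2) evaluated at v
        if new_w > w[v]:
            w[v] = new_w
            changed.add(v)
            queue.extend(internal_adj[v])
        # Otherwise v is not constrained by z, so the search stops at v.
    return changed
```

Continuing the three module example with boundary steps 4 and 7 and weights 5, 6, 7: incrementing the left boundary raises all three weights by one, while a subsequent increment of the right boundary changes nothing because the modules are constrained from the left.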


In the model problem, the time and space requirements of this 
updating algorithm are O(mn) and O(m), where m is the 
number of modules in the problem and n is the number of steps over 
which advancement is to proceed. 


7. Comparison of Results 


We have compared the performance of both the static load balanc- 
ing methods and the predictive dynamic method through a variety of 
simulations. Note that with minimal computational effort, on a set of 
weights consisting of single precision floating point numbers, the greedy 
approximation scheme produces a balance identical to the optimal load 
balance. Thus, the performance obtained through the use of the optimal 
method and the greedy approximation scheme were identical, and in this 
section we shall simply refer to the performance of the optimal load 
balancing method. 


Static and dynamic methods can be combined; a static load 
balancing may be performed before beginning work on a problem, and a 
dynamic load balancing policy may be utilized once work on the prob- 
lem has begun. It is found that the initial use of static load balancing 
policies can enhance the performance of the dynamic policy when com- 
pared to the performance obtained through the initial assignment of an 
equal number of contiguous modules to each processor. Furthermore, 
both the optimal and the binary dissection static load balancing methods 
yield rather comparable performance when used with the dynamic 
predictive load balancing method. Used without a dynamic load balancing 
method, the optimal load balance was found to be notably superior 
to binary dissection, while there was hardly any difference between the 
optimal load balance and the greedy load balance on the test problems 
described here. 


It is important to note that the performance measured in the case 
of the dynamic load balancing method is average processor utilization; 
the overhead required to move modules from one processor to another is 
accounted for through counting the average number of module shifts per 
step per processor. The time required for these shifts is an architecture 
dependent variable. 


We consider a system with 16 processors and a fixed number of 
modules. In each trial, random deviates representing the weights of 
modules are drawn from a truncated normal distribution. For each set 
of random deviates, both the optimal static load balance and the binary 
dissection balance are calculated and the performance is tabulated. 
Simulations utilizing the predictive dynamic policy are also run using 
the same set of random deviates. These simulations utilize both static 
policies and the assignment of a fixed number of modules to each pro- 
cessor as starting conditions. Performance is measured by calculating 
the average percentage of time processors are occupied advancing 
modules over the course of the simulations. Performance results are 
averaged over 50 trials differing only in the values of the random 
deviates generated. 
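
The setup of one trial can be sketched as follows. Several details are assumptions made only for illustration: the normal distribution is truncated at zero by rejection sampling (the truncation point is not stated above), the utilization of a static assignment is taken to be the total load divided by the number of processors times the heaviest load, and the helper names are hypothetical.

```python
import random


def truncated_normal(mean, variance, low, rng):
    """Draw from a normal distribution, rejecting values <= low.
    Assumption: truncation from below only, at `low`."""
    sd = variance ** 0.5
    while True:
        x = rng.gauss(mean, sd)
        if x > low:
            return x


def static_utilization(loads):
    """Average fraction of time processors are busy when every step
    lasts as long as the most heavily loaded processor needs."""
    return sum(loads) / (len(loads) * max(loads))


def equal_count_loads(weights, n):
    """Loads when each processor receives len(weights) // n contiguous
    modules (assumes n divides the number of modules)."""
    size = len(weights) // n
    return [sum(weights[p * size:(p + 1) * size]) for p in range(n)]


rng = random.Random(0)
weights = [truncated_normal(1.0, 0.5, 0.0, rng) for _ in range(64)]
utilization = static_utilization(equal_count_loads(weights, 16))
```

Replacing equal_count_loads by the loads of a static partition gives the corresponding utilization for the same set of deviates.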


In Fig. 4 and Fig. 5 the performance obtained through the use of 
the static and dynamic policies is depicted. In these figures, the perfor- 
mance of the policies is plotted against the variance of the truncated 
normal distributions from which the module weights were drawn. In the 
experiments depicted in the above figures, the weights for the modules 
were drawn from truncated normal distributions with variances of 0.5, 
1.0 and 2.0 and mean 1, and the problem was assumed to run for 200 
steps. In Fig. 4 during each trial 64 modules were assigned to the system 
while in Fig. 5 96 modules were assigned to the system. In both of 
these cases, for all variances tested, the dynamic load balancing method 
outperformed both static load balancing methods. Note however, that 
this measure of performance does not take into account the machine 
dependent cost of shifting modules between processors, a cost that will 
be studied in more detail below. The binary dissection static method 
was in all cases noticeably inferior to the optimal static load balance. 
The use of a static load balancing method initially had a relatively 
minor positive impact on performance in the experiments with 96 
modules, and no discernible impact at all in experiments with 64 
modules. The performance impact of the initial use of a static load 
balancing method is quite dependent on the number of steps required to 
solve a problem. It will be seen later that for problems that continue for 
a relatively small number of steps, the initial use of a static load balanc- 
ing method can markedly improve performance. 


In the dynamic load balancing method, the moving of modules 
from one processor to another will exact a cost that will depend on the 
details of the machine's interprocessor communication network. We can 
calculate an approximate expression that estimates the overhead that 
results from the module transfers required by the predictive dynamic 
method. Let y be the ratio between the average cost of transferring a 
module and the average computation cost C of advancing a module one 
step. Letting P represent the number of processors in the system, t be 
the average module transfers per step per processor, and recall that m is 
defined as the number of modules in the a The average time 


spent computing per processor per step is ——, and the average time 


7 
spent transferring modules per processor per step is yCr. Under the pes- 
simistic assumption that no communication can be masked by computa- 
tion, the ratio of the time spent transferring modules to the time spent in 


computation is =. 
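
The overhead estimate can be checked numerically with a small sketch. The function name and the sample values are hypothetical. The choice of 16 processors is an assumption, picked only so that 96 modules give six modules per processor.

```python
def transfer_overhead_ratio(gamma, tau, processors, modules):
    """Ratio of module-transfer time to computation time per step.

    gamma:      cost of transferring a module relative to advancing it one step
    tau:        average module transfers per step per processor
    processors: P, the number of processors
    modules:    m, the number of modules
    The per-step compute cost C cancels: (gamma * C * tau) / (m * C / P).
    """
    return gamma * tau * processors / modules


# Hypothetical setting: 96 modules on 16 processors, so m / P = 6.
# With gamma = 1 the ratio reduces to tau / 6.
ratio = transfer_overhead_ratio(gamma=1.0, tau=0.03, processors=16, modules=96)
print(ratio)
```

With a transfer cost equal to the cost of one step, the ratio is just the transfer rate divided by the number of modules per processor.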


In Fig. 6 and Fig. 7 the average number of modules that must be
moved from one processor to a neighbor per step of the computation is
plotted against performance for a range of values of the dynamic
method’s safety factor. The average processor utilizations obtained
through the use of optimal static load balancing and through the use of
binary dissection static load balancing are depicted in these figures for
the sake of comparison; these involve no module shifts. In each of the
two figures, the use of static load balancing does play a notable role in
increasing performance and decreasing the frequency with which 
modules have to be shifted. On each curve in Fig. 6 and Fig. 7 both the 
cost and performance were strictly decreasing functions of the safety 
factor used. As the safety factor increases, modules are moved increas- 
ingly frequently and the performance and the overhead cost both 
increase. If the time required to transfer a module is equal to the time
required to advance a module one step, the ratio of the time spent
transferring modules to the time spent in computation in Fig. 6 and Fig. 7
is simply the average number of module shifts per step per processor
divided by six. This ratio ranged from 0.005 to .0045 in the plots depicted
in Figures 6 and 7.


The number of steps advanced is varied and the performance and
the overhead in modules moved per step are depicted for the dynamic 
load balancing method in Fig. 8 and Fig. 9 respectively. In both figures, 
the effects of using the two static load balancing methods as well as 
using no load balancing at the beginning of the computation are com- 
pared. In all cases, the performance increases with the number of steps 
advanced. 

For problems that do not require a large number of steps, the
performance obtained by starting out with a static load balancing method is
superior to that arising from the dynamic load balancing method without 
initial static load balancing. Perhaps somewhat counter-intuitively, ini- 
tially balancing load with binary dissection leads to better performance 
than initially performing an optimal balance for problems requiring over 
10 steps. The optimal static load balance is not necessarily the initial 
load distribution that best allows the dynamic load balancing method to 
move modules so that processor idleness is avoided. As the number of 
steps increases, the performance differences obtained through the use of
different initial load distributions become less marked.


The initial use of static load balancing also leads to a marked
reduction in module transfer overhead as depicted in Fig. 9. In this figure the
overhead per step generally increases with the number of steps. For 
problems with very large numbers of steps, the overheads for the initial 
load distributions all approach a single value. When no initial static 
load balancing is used in a problem that is advanced a small number of 
steps, both low performance and relatively high costs in number of 
modules transferred are incurred. It is noted that in Figure 9, when
initial static load balancing was not used, the number of modules
transferred reaches a local maximum for problems of 10 steps, and then
declines briefly before resuming its long term increase. This phenomenon
has been observed in a number of similar experiments; its cause is
unclear.


The performance obtained through the use of binary dissection as 
a static load balancing method was notably poorer than that produced by
the optimal balance. We have observed in these and other experiments 
that initial static load balancing used along with the predictive dynamic 
load balancing method improves performance and reduces the frequency 
with which modules must be moved. The choice of method used to ini- 
tially balance load does not appear to have a marked impact on perfor- 
mance or cost. 


8. Conclusions 


The experimental results presented here revealed that the predic- 
tive dynamic load balancing method led to processor utilizations that 
were consistently above those obtained by the optimal static load 
balancing method, when the delays caused by the overhead of transferring
modules are negligible. As one would expect, the optimal static
load balancing method, in turn, consistently outperformed the binary
dissection method.


The initial partitioning of load at the point dynamic load balanc- 
ing was initiated proved to have a marked effect on the performance of 
the dynamic load balancing algorithm. All three static load balancing 
methods used in conjunction with the dynamic load balancing method 
led to a substantial improvement in performance and a decrease in
module transfer overhead when compared to a uniform initialization of 
modules to processors. The magnitude of these effects depended on the 
number of steps the problem is advanced, being most pronounced when 
a problem is finished after relatively few steps. It is interesting to note 
that the binary dissection algorithm appeared under some circumstances 
to consistently lead to results that were superior to optimal load
balancing when used in conjunction with dynamic load balancing.


One of the principal costs of the predictive dynamic load balanc- 
ing method is expected to be the machine dependent cost of transferring 
the computational modules between processors. The effect of initial load 
distribution on this cost was examined and it was found that the fre- 
quency with which modules were transferred between processors was 
markedly reduced when either form of static load balancing was initially 
employed. 


The initial distribution of load in a multiprocessor system is 
clearly an important determinant of the performance gains achievable by 
the dynamic load balancing policy; this initial distribution also has a 
strong influence on the overhead costs of the dynamic policy. 


The four load balancing methods discussed in this paper each 
have their own distinct advantages and disadvantages. Finding an 
optimal static load balancing is in general an NP-Complete problem 
unless special structure is present to permit a low order polynomial solu- 
tion. For the test problems that we have considered, the greedy algo- 
rithm was an order of magnitude faster than the optimal load balancing 
algorithm and it provided results as good as the optimal solutions. The 
binary dissection method and the predictive dynamic load balancing 
algorithms are both quite useful in situations in which low order polyno- 
mial solutions to the optimal static load balancing problem do not 
appear to be available. 
Fig. 2. The layered graph for a problem with 9 modules and 4 processors.
Abstract


High level tasking constructs and a rendezvous
mechanism are employed to experiment with parallel
processing of program components on nodes of a local
network. In contrast to the instruction level
synchronization requirements of SIMD machines, our study
centers on interaction point level synchronization
in a distributed MIMD machine. Program decomposition
issues are discussed, and a strategy for allocating
tasks to an optimum number of processing stations,
together with experimental performance results, is
presented.


1. Introduction 


Local area networks have generally been used for 
connecting a number of computing modules within a 
limited geographical space. The primary objective has 
been to enable users on different hosts to communi- 
cate and share access to expensive peripheral devices 
and remote disc files. There has also been much work 
on distributed processing aspects of local networks, 
with emphasis on intertask communication primitives 
and remote call implementation [1,2,3]. In this paper
we address parallel processing potentials of local area 
networks. 


Parallel processing within a local network is 
attractive for two main reasons. First, it enables
utilization of the spare processing capacity of the
increasingly powerful personal workstations by distributing
computationally intensive jobs among them.
Secondly, a fast local network can be used as an inex- 
pensive way of connecting processors in a stand-alone 
multiprocessor computer. The basic issues of parallel 
processing in a local network include: the choice of the 
programming language, suitable granularity of paral- 
lelism, and especially program partitioning and assign- 
ment strategies. 


“This research was funded by the Natural Sciences and En- 
In the following sections we will discuss the above
issues and present the results of our experiments with 
task level parallelism. The experiments were carried 
out to measure effects of the frequency of interaction 
points (rendezvous) between constituent task
modules, as well as the number of processing stations
employed, on the execution time of a given job. 
Finally we present a heuristic method which can be 
used to estimate the optimum number of workstations 
which should be used for parallel execution of com- 
ponents of a given job. 


2. System Support Components 


Since parallelism at task level was our main 
objective, a high-level language with concurrent task- 
ing facilities was the logical choice. The programming 
language used in this project is Martlet [4] which is a 
Pascal derivative with concurrent tasking facilities 
added on. The details of Martlet and its implementa- 
tion have been given in references [5,6]. Briefly, a 
Martlet program generally consists of a number of 
cooperating tasks which communicate using the 
Entry_Call Accept rendezvous mechanism similar to
Ada [7]. The compiled tasks are assigned to the pro- 
cessing stations by means of a system generation pro- 
gram called SETUP which creates the final core 
images for the processing stations. The criteria for 
allocation of tasks to processing stations will be dis- 
cussed in the section dealing with program decomposi- 
tion. Martlet has proven very suitable for distributed 
and parallel programming applications and was ported 
to a network of PC/XTs for this project [8]. An 
example of a Martlet program is given in Appendix I.


The target network consists of a number of 
workstations (PC/XT/AT’s) connected by 10 Mbps 
Ethernet [9] as shown in Figure 1. Martlet’s run-time 
support system consists of a distributed multitasking 
kernel and an interpreter, with a copy of each in every
station of the local network. The station kernel is
responsible for managing the processor and the tasks
in the station. Interstation communication between
tasks is handled transparently by pseudo transport
level tasks which are created at system SETUP time.


Figure 1 The Local Network Configuration: PC/XT/AT
workstations connected by a 10 Mbps Ethernet


3. Program Decomposition 


The problem of decomposing a large program 
into concurrently executable sub-tasks has been stu- 
died for some time [10,11]. Partitioning (i.e. division 
of a program into procedures, modules, and tasks) 
together with assignment which deals with allocation 
of these units to the available processing units, are the 
two most important issues in distributed and parallel 
processing systems. In a network or distributed
environment, the actual placement of the computational
objects has a substantial effect on the overall
performance of the system. This is mainly because of
additional communication delays when communicat- 
ing tasks are located in different stations of a local 
network [3]. 


Minimizing interprocessor communications as 
well as load balancing are the two main criteria for 
allocating tasks to the available processors in a sys- 
tem. Exact partitioning of a set of communicating 
tasks into a distributed multiple processor system is 
an NP-complete problem. However, heuristic
algorithms have been developed for optimum allocation of
computational objects in a distributed system. These 
algorithms generally define a graphical model of the 
communications structure of the tasks, then use a 
heuristic approach to ‘cut’ the graph into distinct 
components such that the sum of the weights 
(representing the frequency of communications) of the 
edges in the resulting edgecut is a minimum [12]. 
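
The quantity these heuristics try to minimize can be sketched as follows. This is only an illustration of the edge-cut objective, not any of the cited algorithms. The task names, weights, and station assignments are hypothetical.

```python
def edge_cut_weight(edges, station_of):
    """Sum the weights of edges whose endpoints sit on different stations.

    edges:      {(task_a, task_b): communication frequency}
    station_of: {task: station index}
    """
    return sum(weight
               for (a, b), weight in edges.items()
               if station_of[a] != station_of[b])


# Hypothetical communication graph: four tasks, weights are message counts.
edges = {("t1", "t2"): 10, ("t2", "t3"): 1, ("t3", "t4"): 8, ("t1", "t4"): 2}

split_a = {"t1": 0, "t2": 0, "t3": 1, "t4": 1}  # cuts t2-t3 and t1-t4
split_b = {"t1": 0, "t2": 1, "t3": 0, "t4": 1}  # cuts every edge

print(edge_cut_weight(edges, split_a))  # 3
print(edge_cut_weight(edges, split_b))  # 21
```

Keeping heavily communicating tasks on the same station gives the smaller cut. Note that this objective treats in-station communication as free.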


However, we have to be cautious when trying to 
apply the above allocation algorithms in a local net- 
work because these algorithms assume: (1) the cost of 
in-station communications between tasks to be zero 
and (2) the cost per inter-station communication to be 
the same for all configurations. Our experimental 
results show that both of these assumptions are far 
from being acceptable when using high level intertask 
communication constructs in a broadcast type local 
network. To achieve higher execution speed we must 
distribute the load among additional stations. The 
question is how many stations should be employed
before the benefits of additional processing power are 
offset by the increased communication delays. We try 
to answer this question in the next section. 


4. The Task Allocation Method 


Here, we are concerned with the class of problems 
which can be implemented using task level parallel- 
ism. Of particular interest are synchronized algorithms 
where parallel tasks must exchange data and syn- 
chronize their operations at some interaction points. 
As an example, a number of producer tasks may per- 
form many iterations of a simple function and then 
communicate the results to a consumer task at specific 
points within the computation. Therefore we are deal- 
ing with a MIMD environment where operations are 
synchronized after some intervals; this is in contrast 
to a SIMD environment where instruction level 
synchronization of the processing units is necessary. 


We now present the heuristic allocation strategy 
which was developed for the class of problems which 
can be translated into a set of producers-consumer 
tasks as stated above. The strategy is based on the 
simple guideline that the actual processing speed-up 
due to an additional workstation should be at least
equal to the corresponding communication
overhead which must be handled by the new
station. To find the formula for estimating the 
optimum number of stations we define: 


Ts = Total execution time for a job when all component
tasks are assigned to a single station.

Tr = Average delay overhead of a network rendezvous
(message exchange).

n = Total number of rendezvous during the execution
of the entire job.

N = Maximum number of stations to be used for a
given job.

Assuming balanced load distribution, then for an
additional station to have a positive effect we should
have:

$$\frac{T_s - nT_r}{N-1} - \frac{T_s - nT_r}{N} \geq \frac{nT_r}{N}$$

Therefore:

$$N \leq \frac{T_s}{nT_r}$$


Ts can easily be obtained from the single station
execution of a job in the class of problems concerned.
Tr can also be estimated from the communication
delay overheads when all tasks are run in a single
station (see Figure 3). Once N is estimated from the
formula, the component tasks are divided evenly among
N stations. Figures 2(a) and 2(b) support the validity
of such a strategy.
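
A minimal sketch of the estimate follows. The function names are hypothetical. Ts = 20.5 s and n = 80 are values that appear in Table I, but pairing them is an assumption. Tr = 0.05 s is purely illustrative and is not a measured figure.

```python
import math


def optimum_stations(ts, n, tr):
    """Largest N satisfying N <= Ts / (n * Tr), never less than one."""
    return max(1, math.floor(ts / (n * tr)))


def extra_station_pays_off(ts, n, tr, stations):
    """Check the guideline for going from stations - 1 to stations (>= 2)."""
    work = ts - n * tr
    gain = work / (stations - 1) - work / stations
    return gain >= n * tr / stations


# Assumed inputs: Ts and n of the size reported in Table I, Tr hypothetical.
ts, n, tr = 20.5, 80, 0.05
print(optimum_stations(ts, n, tr))           # 5
print(extra_station_pays_off(ts, n, tr, 5))  # True
print(extra_station_pays_off(ts, n, tr, 6))  # False
```

The second function evaluates the inequality directly, so the two calls show that the fifth station still satisfies the guideline while the sixth does not.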


5. Experiments and Results 


The experiments were conducted for 
configurations which ranged from a single station to 9 
stations in the local network. The entire job consists 
of one consumer and eight instances of a function 
(producer) task as given in Appendix I. Each producer 
performs an identical amount of computation and 
returns a single result to the consumer. The consumer 
then computes the sum of eight results. Each func- 
tion is a simple iterative loop from 1 to 100. At run- 
time, each function is given a parameter called Force 
= 1..200 telling the function how many 1..100 itera- 
tions it should perform before returning a value. The 
entire system is given a parameter called WORK = 
1..10, telling the system how many times the consu- 
mer should calculate the sum. WORK determines the 
number of rendezvous during the test. Figures 2(a) 
and 2(b) show the speed-up in execution time as the 
number of stations is increased. The total workload 
and the number of rendezvous are used as parameters. 
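
The structure of the test job can be approximated in Python as a sketch. This is not the Martlet program. One-slot queues stand in for the rendezvous entries, which is an assumption that lets a producer run one round ahead of the consumer. Python threads also do not give real parallel speed-up, so the sketch only illustrates how WORK and FORCE set the workload and the rendezvous count. The multiplication of each sum by the round number follows the listing in Appendix I.

```python
import queue
import threading

PRODUCERS = 8  # the eight producer instances described above


def producer(instance, work, force, channel):
    for _ in range(work):
        busy = 0
        for _ in range(force):
            for _ in range(100):   # the inner 1..100 loop
                busy += 1
        channel.put(instance)      # hand one value to the consumer


def run_job(work, force):
    """Return the consumer's per-round results and the rendezvous count."""
    # One single-slot queue per producer approximates a rendezvous entry.
    channels = [queue.Queue(maxsize=1) for _ in range(PRODUCERS)]
    threads = [
        threading.Thread(target=producer, args=(i, work, force, channels[i]))
        for i in range(PRODUCERS)
    ]
    for t in threads:
        t.start()
    results, rendezvous = [], 0
    for w in range(1, work + 1):
        coeffs = [channels[i].get() for i in range(PRODUCERS)]
        rendezvous += PRODUCERS
        results.append(sum(coeffs) * w)
    for t in threads:
        t.join()
    return results, rendezvous


# WORK = 10 and FORCE = 5 give 8 * 10 * 5 * 100 = 40,000 loop iterations.
results, rendezvous = run_job(work=10, force=5)
print(rendezvous)   # 80
print(results[0])   # 28
```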


Figure 3 gives the variation in rendezvous delay
overhead as a function of the number of stations taking
part in execution of the job. We should point out
here that due to the interpretive nature of the run-time
system the execution and delay figures are larger
than normally encountered in non-interpretive systems.
However, this does not affect the relative
nature of our comparisons.


Figure 3 Variations in Rendezvous Delay


Table I shows the maximum number of stations
which should be employed for different workloads and
frequencies of rendezvous as calculated using our
proposed allocation method. These numbers are in close
agreement with the actual number of stations at
which execution speed-up levels off, as shown in
Figures 2(a) and 2(b).


Table I Optimum No. of Workstations for a Given Job

Columns: No. of Iterations; Execution time on a single
station, Ts (sec); No. of Rendezvous (n); Optimum No. of
workstations using N = Ts/nTr and load balancing.

Execution time on a single station (Ts): 20.5, 24.5, 34.8, 39.2 sec
No. of Rendezvous (n): 80, 40, 80


Figure 2 Execution Time vs No. of Workstations
(a) Total Iterations = 40,000
Horizontal axis: Number of Workstations (processors), 0 to 9
Vertical axis: Time in Seconds
6. Conclusions


We have shown that a particular class of problems,
namely those which require interaction point level
synchronization similar to the given test problem, is suitable for
parallel processing on stations of a local network. We
have also presented a heuristic method which uses 
the data from a single station execution of a set of 
cooperating tasks to estimate the optimum number of 
stations that should be used for their parallel execu- 
tion. 


Further work includes finding methods by which 
a general computation intensive problem can be 
decomposed into a set of producers-consumer tasks 
which can then be handled by the method described 
in this paper. 
APPENDIX I: Test Programs Written in Martlet


(* Consumer *)


task consumer;
  type coeff = integer;
  entry sync[0..7](var w: integer; var f: integer);
  entry fn[0..7](x: coeff);
close;


task body consumer;
var
  WORK, FORCE, w: integer;
  a, b, c, d, e, f, g, h: coeff;
  results: array [1..200] of integer;
body

  (* Synchronize tasks & pass them work assignments *)
  accept sync[0](var w: integer; var f: integer)
  then begin w := WORK; f := FORCE;
    accept sync[1](var w: integer; var f: integer)
    then begin w := WORK; f := FORCE;
      . . .
      accept sync[7](var w: integer; var f: integer)
      then begin w := WORK; f := FORCE
      end
      . . .
    end
  end;

  (* Collect the coefficients *)
  for w := 1 to WORK do
  begin
    accept fn[0](x: coeff) then a := x;
    accept fn[1](x: coeff) then b := x;
    . . .
    accept fn[7](x: coeff) then h := x;

    (* The coefficients have been computed *)
    (* The result can be calculated & saved *)

    results[w] := (a+b+c+d+e+f+g+h) * w;
  end;
close;


(* Producers *)


task producers[0..7];
close;


task body producers;
use consumer;
var
  WORK, FORCE, w, f, i, j: integer;
body

  sync[myinstance](WORK, FORCE);   (* Synchronize *)
  for w := 1 to WORK do
  begin
    for f := 1 to FORCE do
    begin
      i := 0;
      for j := 1 to 100 do         (* 100X the FORCE value *)
        i := succ(i)               (* keep busy within loop *)
    end;
    (* Return producer instance as value *)
    fn[myinstance](myinstance)
  end
close;


